🗓️Evaluation

Functional Achievement Assessment

Systematically evaluates whether a model or agent achieves its preset goals, i.e., whether its completion of a web task meets the expected requirements. The evaluation centers on verifying the final task state, such as confirming an order's processing status or the accuracy of inventory updates.
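
This check can be sketched as a comparison between the expected final state declared by a task and the state actually observed in the environment. The field names below (`order_status`, `inventory`) are illustrative assumptions, not any benchmark's actual schema.

```python
# Hypothetical sketch: functional achievement is judged by comparing the
# environment's final state against the expected state fields of the task.
# Field names are illustrative assumptions.

def check_functional_achievement(expected: dict, observed: dict) -> bool:
    """Return True only if every expected key appears in the observed
    final state with a matching value."""
    return all(observed.get(k) == v for k, v in expected.items())

# Example: verifying an order was placed and inventory was decremented.
expected = {"order_status": "confirmed", "inventory": {"sku-42": 9}}
observed = {"order_status": "confirmed", "inventory": {"sku-42": 9}, "cart": []}
print(check_functional_achievement(expected, observed))  # True
```

Extra observed fields (like `cart` above) are ignored; only the expected fields must match.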

Multi-level Task Assessment

Focuses on the execution of compound tasks involving multiple steps, especially cross-platform operation scenarios. Benchmarks such as Mind2Web and WebArena contain sequential tasks that demand precise execution across multiple stages; WebArena in particular tracks the success rate of state transitions throughout a task and supports in-depth analysis of the diverse paths by which a task can be completed.
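
A minimal sketch of this idea: record each stage of an episode as a success/failure flag, then report both the lenient per-stage rate and the stricter whole-episode rate (every stage must succeed). This mirrors, in simplified form, how stage-level transitions differ from end-to-end task success; it is not any benchmark's actual scoring code.

```python
# Hypothetical sketch: a compound task is a sequence of stages, each of
# which either succeeds (True) or fails (False).

def stage_success_rate(episodes: list) -> float:
    """Fraction of individual stages that succeeded, across all episodes."""
    stages = [s for ep in episodes for s in ep]
    return sum(stages) / len(stages)

def episode_success_rate(episodes: list) -> float:
    """Fraction of episodes in which every stage succeeded."""
    return sum(all(ep) for ep in episodes) / len(episodes)

# Three episodes with three stages each; only the first fully succeeds.
episodes = [[True, True, True], [True, True, False], [True, False, False]]
print(stage_success_rate(episodes))    # 6/9 ~= 0.667
print(episode_success_rate(episodes))  # 1/3 ~= 0.333
```

The gap between the two numbers shows how much partial progress the stricter episode-level metric hides.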

Adaptability Assessment

In-depth evaluation of model performance in new environments. Mind2Web, for example, is used to assess how well a system adapts to unseen webpage architectures and new task requirements. Multi-dimensional scenario testing across domains such as travel, commerce, and services provides a comprehensive picture of the model's practical capabilities.
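
One simple way to surface generalization gaps in such multi-domain testing is to break success rates down by domain. The sketch below assumes results arrive as (domain, succeeded) pairs; this format is an illustrative assumption.

```python
# Hypothetical sketch: group task outcomes by domain to expose
# generalization gaps across unseen site categories.
from collections import defaultdict

def per_domain_success(results: list) -> dict:
    """results is a list of (domain, succeeded) pairs."""
    totals, wins = defaultdict(int), defaultdict(int)
    for domain, ok in results:
        totals[domain] += 1
        wins[domain] += ok  # bool counts as 0 or 1
    return {d: wins[d] / totals[d] for d in totals}

results = [("travel", True), ("travel", False),
           ("shopping", True), ("service", True), ("service", True)]
print(per_domain_success(results))
# {'travel': 0.5, 'shopping': 1.0, 'service': 1.0}
```

A uniform aggregate score can mask a weak domain; the per-domain view makes that visible.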

Logic Transparency Assessment

In-depth analysis of the rationality of the model's decision process. Benchmarks such as HotpotQA and FEVER provide detailed factual support (gold evidence), making it possible to verify the logical basis of a model's conclusions. Reasoning accuracy is evaluated systematically by comparing the model's stated reasoning basis against the reference answers.
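
One common way to score this comparison is set-level F1 between the evidence a model cites and the gold supporting facts. The sketch below assumes evidence items are (title, sentence_index) pairs, in the spirit of HotpotQA-style supporting facts; the exact identifier format is an assumption.

```python
# Hypothetical sketch: F1 overlap between cited evidence and gold
# supporting facts. Evidence identifiers are illustrative assumptions.

def evidence_f1(predicted: set, gold: set) -> float:
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)  # evidence items cited correctly
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = {("Paris", 0), ("France", 2)}
pred = {("Paris", 0), ("France", 2), ("Lyon", 1)}
print(evidence_f1(pred, gold))  # P=2/3, R=1.0 -> F1 = 0.8
```

F1 rewards models that cite exactly the needed evidence and penalizes both missing facts and superfluous citations.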

Comprehensive Reasoning Assessment

Evaluates the system's ability to integrate multi-source information to complete complex tasks. HotpotQA in particular requires deep, multi-hop reasoning over multiple information sources.
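
Answers produced by multi-hop reasoning are often scored with token-overlap F1 (as in HotpotQA-style evaluation), which gives partial credit for partially correct answers. The sketch below uses deliberately minimal normalization (lowercasing and whitespace splitting), which is a simplifying assumption.

```python
# Hypothetical sketch: token-overlap F1 between a predicted answer and
# the gold answer, crediting partial matches in multi-hop QA.
from collections import Counter

def answer_f1(prediction: str, gold: str) -> float:
    pred_toks = prediction.lower().split()
    gold_toks = gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)  # multiset overlap
    tp = sum(common.values())
    if tp == 0:
        return 0.0
    precision = tp / len(pred_toks)
    recall = tp / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(answer_f1("the Eiffel Tower", "Eiffel Tower"))  # 0.8
```

Full evaluation pipelines normally add further normalization (punctuation and article stripping) before tokenizing.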

Environmental Response Assessment

In dynamic platforms like WebArena, focuses on evaluating the system's adaptability to real-time changes, including navigation efficiency, interaction quality, and exception handling capability. Even under such changing conditions, the final execution state is still checked against the task's high-level intent, for example verifying a successful order placement or an inventory update.
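
Navigation efficiency, one of the dimensions above, can be approximated by comparing the number of actions an agent took against a reference (shortest-known) path. This ratio metric is an illustrative assumption, not a benchmark's actual scoring rule.

```python
# Hypothetical sketch: navigation efficiency as the ratio of the
# reference path length to the steps the agent actually took.

def navigation_efficiency(steps_taken: int, optimal_steps: int) -> float:
    """1.0 means the agent matched the reference path length; lower
    values mean the agent wandered. Capped at 1.0 in case the agent
    finds a shorter path than the reference."""
    if steps_taken <= 0:
        return 0.0
    return min(1.0, optimal_steps / steps_taken)

print(navigation_efficiency(10, 5))  # 0.5
print(navigation_efficiency(4, 5))   # 1.0 (capped)
```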
