Chapter 19: Evaluation and Monitoring

第 19 章:评估和监控

This chapter examines methodologies that allow intelligent agents to systematically assess their performance, monitor progress toward goals, and detect operational anomalies. While Chapter 11 outlines goal setting and monitoring, and Chapter 17 addresses Reasoning mechanisms, this chapter focuses on the continuous, often external, measurement of an agent’s effectiveness, efficiency, and compliance with requirements. This includes defining metrics, establishing feedback loops, and implementing reporting systems to ensure agent performance aligns with expectations in operational environments (see Fig.1) 本章探讨使智能体能够系统地评估其性能、监控目标进展以及检测操作异常的方法论。虽然第 11 章概述了目标设定和监控,第 17 章讨论了推理机制,但本章侧重于对智能体效率、效能和合规性要求进行持续的(通常是外部的)测量。这包括定义指标、建立反馈循环以及实施报告系统,以确保智能体的性能在操作环境中与期望保持一致(见图 1)。

Fig:1. Best practices for evaluation and monitoring 图 1:评估和监控的最佳实践

Practical Applications & Use Cases

实际应用与用例

Most Common Applications and Use Cases: 最常见的应用和用例:

Example usage

agent_response = “The capital of France is Paris.” ground_truth = “Paris is the capital of France.” score = evaluate_response_accuracy(agent_response, ground_truth) print(f”Response accuracy: {score}”)

1
2
3
4
5
6
7
8
9
10
11
```python
def evaluate_response_accuracy(agent_output: str, expected_output: str) -> float:
    """计算智能体响应的简单准确度分数。"""
    # 这是一个非常基本的精确匹配;现实世界会使用更复杂的指标
    return 1.0 if agent_output.strip().lower() == expected_output.strip().lower() else 0.0

## 示例使用
agent_response = "The capital of France is Paris."
ground_truth = "Paris is the capital of France."
score = evaluate_response_accuracy(agent_response, ground_truth)
print(f"Response accuracy: {score}")

The Python function `evaluate_response_accuracy` calculates a basic accuracy score for an AI agent’s response by performing an exact, case-insensitive comparison between the agent’s output and the expected output, after removing leading or trailing whitespace. It returns a score of 1.0 for an exact match and 0.0 otherwise, representing a binary correct or incorrect evaluation. This method, while straightforward for simple checks, does not account for variations like paraphrasing or semantic equivalence.

Python 函数 evaluate_response_accuracy 通过对智能体输出和期望输出进行精确的、不区分大小写的比较(在去除前导和尾随空格后),计算 AI 智能体响应的基本准确度分数。对于精确匹配,它返回 1.0 分,否则返回 0.0,表示二元的正确或不正确评估。这种方法虽然适合简单检查,但不考虑释义或语义等价性等变化。 The problem lies in its method of comparison. The function performs a strict, character-for-character comparison of the two strings. In the example provided: 问题在于其比较方法。该函数执行两个字符串的严格逐字符比较。在提供的示例中:

Example usage

monitor = LLMInteractionMonitor() monitor.record_interaction(“What is the capital of France?”, “The capital of France is Paris.”) monitor.record_interaction(“Tell me a joke.”, “Why don’t scientists trust atoms? Because they make up everything!”) input_t, output_t = monitor.get_total_tokens() print(f”Total input tokens: {input_t}, Total output tokens: {output_t}”)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
```python
## 这是概念性的,因为实际的 token 计数取决于 LLM API
class LLMInteractionMonitor:
    def __init__(self):
        self.total_input_tokens = 0
        self.total_output_tokens = 0

    def record_interaction(self, prompt: str, response: str):
        # 在真实场景中,使用 LLM API 的 token 计数器或 tokenizer
        input_tokens = len(prompt.split())  # 占位符
        output_tokens = len(response.split())  # 占位符
        self.total_input_tokens += input_tokens
        self.total_output_tokens += output_tokens
        print(f"已记录交互:输入 tokens={input_tokens},输出 tokens={output_tokens}")

    def get_total_tokens(self):
        return self.total_input_tokens, self.total_output_tokens

## 示例使用
monitor = LLMInteractionMonitor()
monitor.record_interaction("What is the capital of France?", "The capital of France is Paris.")
monitor.record_interaction("Tell me a joke.", "Why don't scientists trust atoms? Because they make up everything!")
input_t, output_t = monitor.get_total_tokens()
print(f"总输入 tokens:{input_t},总输出 tokens:{output_t}")

This section introduces a conceptual Python class, `LLMInteractionMonitor`, developed to track token usage in large language model interactions. The class incorporates counters for both input and output tokens. Its `record_interaction` method simulates token counting by splitting the prompt and response strings. In a practical implementation, specific LLM API tokenizers would be employed for precise token counts. As interactions occur, the monitor accumulates the total input and output token counts. The `get_total_tokens` method provides access to these cumulative totals, essential for cost management and optimization of LLM usage.

本节介绍了一个概念性的 Python 类 LLMInteractionMonitor,旨在跟踪大语言模型交互中的 token 使用量。该类包含输入和输出 token 的计数器。其 record_interaction 方法通过分割提示词和响应字符串来模拟 token 计数。在实际实现中,将使用特定的 LLM API tokenizer 进行精确的 token 计数。随着交互的发生,监视器累积总的输入和输出 token 计数。get_total_tokens 方法提供对这些累积总数的访问,这对于成本管理和 LLM 使用优化至关重要。 Custom Metric for “Helpfulness” using LLM-as-a-Judge: Evaluating subjective qualities like an AI agent’s “helpfulness” presents challenges beyond standard objective metrics. A potential framework involves using an LLM as an evaluator. This LLM-as-a-Judge approach assesses another AI agent’s output based on predefined criteria for “helpfulness.” Leveraging the advanced linguistic capabilities of LLMs, this method offers nuanced, human-like evaluations of subjective qualities, surpassing simple keyword matching or rule-based assessments. Though in development, this technique shows promise for automating and scaling qualitative evaluations. 使用 LLM-as-a-Judge 的”有用性”自定义指标: 评估 AI 智能体”有用性”等主观品质带来了超越标准客观指标的挑战。一个潜在的框架涉及使用 LLM 作为评估者。这种 LLM-as-a-Judge 方法根据预定义的”有用性”标准评估另一个 AI 智能体的输出。利用 LLM 的高级语言能力,此方法提供细微的、类人的主观品质评估,超越了简单的关键词匹配或基于规则的评估。虽然仍在发展中,但这项技术显示出自动化和扩展定性评估的前景。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
import google.generativeai as genai
import os
import json
import logging
from typing import Optional

## --- Configuration ---
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

## Set your API key as an environment variable to run this script
## For example, in your terminal: export GOOGLE_API_KEY='your_key_here'
try:
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
except KeyError:
    logging.error("Error: GOOGLE_API_KEY environment variable not set.")
    exit(1)

## --- LLM-as-a-Judge Rubric for Legal Survey Quality ---
LEGAL_SURVEY_RUBRIC = """
You are an expert legal survey methodologist and a critical legal reviewer.
Your task is to evaluate the quality of a given legal survey question.
Provide a score from 1 to 5 for overall quality, along with a detailed rationale and specific feedback.
Focus on the following criteria:

1.  **Clarity & Precision (Score 1-5):**
    * 1: Extremely vague, highly ambiguous, or confusing.
    * 3: Moderately clear, but could be more precise.
    * 5: Perfectly clear, unambiguous, and precise in its legal terminology (if applicable) and intent.
2.  **Neutrality & Bias (Score 1-5):**
    * 1: Highly leading or biased, clearly influencing the respondent towards a specific answer.
    * 3: Slightly suggestive or could be interpreted as leading.
    * 5: Completely neutral, objective, and free from any leading language or loaded terms.
3.  **Relevance & Focus (Score 1-5):**
    * 1: Irrelevant to the stated survey topic or out of scope.
    * 3: Loosely related but could be more focused.
    * 5: Directly relevant to the survey's objectives and well-focused on a single concept.
4.  **Completeness (Score 1-5):**
    * 1: Omits critical information needed to answer accurately or provides insufficient context.
    * 3: Mostly complete, but minor details are missing.
    * 5: Provides all necessary context and information for the respondent to answer thoroughly.
5.  **Appropriateness for Audience (Score 1-5):**
    * 1: Uses jargon inaccessible to the target audience or is overly simplistic for experts.
    * 3: Generally appropriate, but some terms might be challenging or oversimplified.
    * 5: Perfectly tailored to the assumed legal knowledge and background of the target survey audience.

**Output Format:**
Your response MUST be a JSON object with the following keys:
* `overall_score`: An integer from 1 to 5 (average of criterion scores, or your holistic judgment).
* `rationale`: A concise summary of why this score was given, highlighting major strengths and weaknesses.
* `detailed_feedback`: A bullet-point list detailing feedback for each criterion (Clarity, Neutrality, Relevance, Completeness, Audience Appropriateness). Suggest specific improvements.
* `concerns`: A list of any specific legal, ethical, or methodological concerns.
* `recommended_action`: A brief recommendation (e.g., "Revise for neutrality", "Approve as is", "Clarify scope").
"""

class LLMJudgeForLegalSurvey:
    """A class to evaluate legal survey questions using a generative AI model."""
    def __init__(self, model_name: str = 'gemini-1.5-flash-latest', temperature: float = 0.2):
        """
        Initializes the LLM Judge.

        Args:
            model_name (str): The name of the Gemini model to use.
                              'gemini-1.5-flash-latest' is recommended for speed and cost.
                              'gemini-1.5-pro-latest' offers the highest quality.
            temperature (float): The generation temperature. Lower is better for deterministic evaluation.
        """
        self.model = genai.GenerativeModel(model_name)
        self.temperature = temperature

    def _generate_prompt(self, survey_question: str) -> str:
        """Constructs the full prompt for the LLM judge."""
        return f"{LEGAL_SURVEY_RUBRIC}\n\n---\n**LEGAL SURVEY QUESTION TO EVALUATE:**\n{survey_question}\n---"

    def judge_survey_question(self, survey_question: str) -> Optional[dict]:
        """
        Judges the quality of a single legal survey question using the LLM.
        Args:
            survey_question (str): The legal survey question to be evaluated.
        Returns:
            Optional[dict]: A dictionary containing the LLM's judgment, or None if an error occurs.
        """
        full_prompt = self._generate_prompt(survey_question)

        try:
            logging.info(f"Sending request to '{self.model.model_name}' for judgment...")
            response = self.model.generate_content(
                full_prompt,
                generation_config=genai.types.GenerationConfig(
                    temperature=self.temperature,
                    response_mime_type="application/json"
                )
            )
            # Check for content moderation or other reasons for an empty response.
            if not response.parts:
                safety_ratings = response.prompt_feedback.safety_ratings
                logging.error(f"LLM response was empty or blocked. Safety Ratings: {safety_ratings}")
                return None

            return json.loads(response.text)
        except json.JSONDecodeError:
            logging.error(f"Failed to decode LLM response as JSON. Raw response: {response.text}")
            return None
        except Exception as e:
            logging.error(f"An unexpected error occurred during LLM judgment: {e}")
            return None

## --- Example Usage ---
if __name__ == "__main__":
    judge = LLMJudgeForLegalSurvey()

    # --- Good Example ---
    good_legal_survey_question = """
    To what extent do you agree or disagree that current intellectual property laws in Switzerland adequately protect emerging AI-generated content, assuming the content meets the originality criteria established by the Federal Supreme Court?
    (Select one: Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree)
    """
    print("\n--- Evaluating Good Legal Survey Question ---")
    judgment_good = judge.judge_survey_question(good_legal_survey_question)
    if judgment_good:
        print(json.dumps(judgment_good, indent=2))

    # --- Biased/Poor Example ---
    biased_legal_survey_question = """
    Don't you agree that overly restrictive data privacy laws like the FADP are hindering essential technological innovation and economic growth in Switzerland?
    (Select one: Yes, No)
    """
    print("\n--- Evaluating Biased Legal Survey Question ---")
    judgment_biased = judge.judge_survey_question(biased_legal_survey_question)
    if judgment_biased:
        print(json.dumps(judgment_biased, indent=2))

    # --- Ambiguous/Vague Example ---
    vague_legal_survey_question = """
    What are your thoughts on legal tech?
    """
    print("\n--- Evaluating Vague Legal Survey Question ---")
    judgment_vague = judge.judge_survey_question(vague_legal_survey_question)
    if judgment_vague:
        print(json.dumps(judgment_vague, indent=2))
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
import google.generativeai as genai
import os
import json
import logging
from typing import Optional

## --- 配置 ---
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

## 将您的 API 密钥设置为环境变量以运行此脚本
## 例如,在您的终端中:export GOOGLE_API_KEY='your_key_here'
try:
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
except KeyError:
    logging.error("错误:GOOGLE_API_KEY 环境变量未设置。")
    exit(1)

## --- 法律调查质量的 LLM-as-a-Judge 评分标准 ---
LEGAL_SURVEY_RUBRIC = """
您是一位专家法律调查方法学家和严格的法律审查员。您的任务是评估给定法律调查问题的质量。为整体质量提供 1 到 5 的分数,以及详细的理由和具体反馈。重点关注以下标准:

1.  **清晰性和精确性(分数 1-5):**
    * 1:极度模糊、高度歧义或令人困惑。
    * 3:中等清晰,但可以更精确。
    * 5:完全清晰、无歧义,在法律术语(如适用)和意图上精确。
2.  **中立性和偏见(分数 1-5):**
    * 1:高度引导性或有偏见,明确影响受访者偏向特定答案。
    * 3:略微暗示性或可能被解释为引导性。
    * 5:完全中立、客观,没有任何引导性语言或带有倾向性的术语。
3.  **相关性和焦点(分数 1-5):**
    * 1:与声明的调查主题无关或超出范围。
    * 3:松散相关,但可以更集中。
    * 5:与调查目标直接相关,并且集中于单一概念。
4.  **完整性(分数 1-5):**
    * 1:遗漏了准确回答所需的关键信息或提供的上下文不足。
    * 3:基本完整,但缺少次要细节。
    * 5:提供受访者彻底回答所需的所有必要上下文和信息。
5.  **受众适当性(分数 1-5):**
    * 1:使用目标受众无法理解的术语或对专家来说过于简单。
    * 3:通常适当,但某些术语可能具有挑战性或过于简化。
    * 5:完全适合目标调查受众的假定法律知识和背景。

**输出格式:** 您的响应必须是具有以下键的 JSON 对象:
* `overall_score`:一个从 1 到 5 的整数(标准分数的平均值或您的整体判断)。
* `rationale`:给出此分数原因的简明摘要,突出主要优势和劣势。
* `detailed_feedback`:详细说明每个标准(清晰性、中立性、相关性、完整性、受众适当性)反馈的要点列表。建议具体改进。
* `concerns`:任何具体的法律、道德或方法学问题的列表。
* `recommended_action`:简短的建议(例如"修改以保持中立","按原样批准","明确范围")。
"""

class LLMJudgeForLegalSurvey:
    """使用生成式 AI 模型评估法律调查问题的类。"""
    def __init__(self, model_name: str = 'gemini-1.5-flash-latest', temperature: float = 0.2):
        """
        初始化 LLM Judge。

        Args:
            model_name (str):要使用的 Gemini 模型名称。
                               推荐使用 'gemini-1.5-flash-latest' 以获得速度和成本效益。
                               'gemini-1.5-pro-latest' 提供最高质量。
            temperature (float):生成温度。较低的温度更适合确定性评估。
        """
        self.model = genai.GenerativeModel(model_name)
        self.temperature = temperature

    def _generate_prompt(self, survey_question: str) -> str:
        """为 LLM judge 构建完整提示词。"""
        return f"{LEGAL_SURVEY_RUBRIC}\n\n---\n**要评估的法律调查问题:**\n{survey_question}\n---"

    def judge_survey_question(self, survey_question: str) -> Optional[dict]:
        """
        使用 LLM 判断单个法律调查问题的质量。

        Args:
            survey_question (str):要评估的法律调查问题。
        Returns:
            Optional[dict]:包含 LLM 判断的字典,如果发生错误则返回 None。
        """
        full_prompt = self._generate_prompt(survey_question)

        try:
            logging.info(f"向 '{self.model.model_name}' 发送判断请求...")
            response = self.model.generate_content(
                full_prompt,
                generation_config=genai.types.GenerationConfig(
                    temperature=self.temperature,
                    response_mime_type="application/json"
                )
            )
            # 检查内容审核或其他导致响应为空的原因。
            if not response.parts:
                safety_ratings = response.prompt_feedback.safety_ratings
                logging.error(f"LLM 响应为空或被阻止。安全评级:{safety_ratings}")
                return None

            return json.loads(response.text)
        except json.JSONDecodeError:
            logging.error(f"无法将 LLM 响应解码为 JSON。原始响应:{response.text}")
            return None
        except Exception as e:
            logging.error(f"LLM 判断期间发生意外错误:{e}")
            return None

## --- 示例使用 ---
if __name__ == "__main__":
    judge = LLMJudgeForLegalSurvey()

    # --- 好的示例 ---
    good_legal_survey_question = """
    在多大程度上您同意或不同意瑞士当前的知识产权法充分保护新兴的 AI 生成内容,假设该内容满足联邦最高法院确立的原创性标准?
    (选择一项:强烈不同意、不同意、中立、同意、强烈同意)
    """
    print("\n--- 评估好的法律调查问题 ---")
    judgment_good = judge.judge_survey_question(good_legal_survey_question)
    if judgment_good:
        print(json.dumps(judgment_good, indent=2))

    # --- 有偏见/差的示例 ---
    biased_legal_survey_question = """
    难道您不同意像 FADP 这样过度限制性的数据隐私法正在阻碍瑞士的基本技术创新和经济增长吗?
    (选择一项:是、否)
    """
    print("\n--- 评估有偏见的法律调查问题 ---")
    judgment_biased = judge.judge_survey_question(biased_legal_survey_question)
    if judgment_biased:
        print(json.dumps(judgment_biased, indent=2))

    # --- 模糊/含糊的示例 ---
    vague_legal_survey_question = """
    您对法律科技有什么想法?
    """
    print("\n--- 评估含糊的法律调查问题 ---")
    judgment_vague = judge.judge_survey_question(vague_legal_survey_question)
    if judgment_vague:
        print(json.dumps(judgment_vague, indent=2))

The Python code defines a class LLMJudgeForLegalSurvey designed to evaluate the quality of legal survey questions using a generative AI model. It utilizes the google.generativeai library to interact with Gemini models.

Python 代码定义了一个名为 LLMJudgeForLegalSurvey 的类,旨在使用生成式 AI 模型评估法律调查问题的质量。它利用 google.generativeai 库与 Gemini 模型交互。 The core functionality involves sending a survey question to the model along with a detailed rubric for evaluation. The rubric specifies five criteria for judging survey questions: Clarity & Precision, Neutrality & Bias, Relevance & Focus, Completeness, and Appropriateness for Audience. For each criterion, a score from 1 to 5 is assigned, and a detailed rationale and feedback are required in the output. The code constructs a prompt that includes the rubric and the survey question to be evaluated. 核心功能涉及将调查问题与详细的评估标准一起发送给模型。评分标准指定了判断调查问题的五个标准:清晰性和精确性、中立性和偏见、相关性和焦点、完整性以及受众适当性。对于每个标准,分配 1 到 5 的分数,并要求在输出中提供详细的理由和反馈。代码构建一个提示词,其中包括标准和要评估的调查问题。 The judge_survey_question method sends this prompt to the configured Gemini model, requesting a JSON response formatted according to the defined structure. The expected output JSON includes an overall score, a summary rationale, detailed feedback for each criterion, a list of concerns, and a recommended action. The class handles potential errors during the AI model interaction, such as JSON decoding issues or empty responses. The script demonstrates its operation by evaluating examples of legal survey questions, illustrating how the AI assesses quality based on the predefined criteria. judge_survey_question 方法将此提示词发送到配置的 Gemini 模型,请求根据定义的结构格式化的 JSON 响应。期望的输出 JSON 包括整体分数、摘要理由、每个标准的详细反馈、问题列表和推荐的操作。该类处理 AI 模型交互期间的潜在错误,例如 JSON 解码问题或空响应。脚本通过评估法律调查问题的示例来演示其操作,说明 AI 如何根据预定义的标准评估质量。 Before we conclude, let’s examine various evaluation methods, considering their strengths and weaknesses. 在结束之前,让我们检查各种评估方法,考虑它们的优势和劣势。 | Evaluation Method | Strengths | Weaknesses | | :—- | :—- | :—- | | Human Evaluation | Captures subtle behavior | Difficult to scale, expensive, and time-consuming, as it considers subjective human factors. | | LLM-as-a-Judge | Consistent, efficient, and scalable. | Intermediate steps may be overlooked. Limited by LLM capabilities. | | Automated Metrics | Scalable, efficient, and objective | Potential limitation in capturing complete capabilities. | | 评估方法 | 优势 | 劣势 | | :—- | :—- | :—- | | 人工评估 | 捕获细微行为 | 难以扩展、昂贵且耗时,因为它考虑主观的人为因素。 | | LLM-as-a-Judge | 一致、高效且可扩展。 | 可能忽略中间步骤。受 LLM 能力限制。 | | 自动化指标 | 可扩展、高效且客观 | 在捕获完整能力方面可能存在限制。 |

Agents trajectories

智能体轨迹

Evaluating agents’ trajectories is essential, as traditional software tests are insufficient. Standard code yields predictable pass/fail results, whereas agents operate probabilistically, necessitating qualitative assessment of both the final output and the agent’s trajectory—the sequence of steps taken to reach a solution. Evaluating multi-agent systems is challenging because they are constantly in flux. This requires developing sophisticated metrics that go beyond individual performance to measure the effectiveness of communication and teamwork. Moreover, the environments themselves are not static, demanding that evaluation methods, including test cases, adapt over time. 评估智能体的轨迹至关重要,因为传统的软件测试是不够的。标准代码产生可预测的通过/失败结果,而智能体以概率方式运行,需要对最终输出和智能体的轨迹——即达到解决方案所采取的步骤序列——进行定性评估。评估多智能体系统具有挑战性,因为它们不断变化。这需要开发超越个体性能的复杂指标来衡量沟通和团队合作的有效性。此外,环境本身不是静态的,要求评估方法(包括测试用例)随时间适应。 This involves examining the quality of decisions, the reasoning process, and the overall outcome. Implementing automated evaluations is valuable, particularly for development beyond the prototype stage. Analyzing trajectory and tool use includes evaluating the steps an agent employs to achieve a goal, such as tool selection, strategies, and task efficiency. For example, an agent addressing a customer’s product query might ideally follow a trajectory involving intent determination, database search tool use, result review, and report generation. The agent’s actual actions are compared to this expected, or ground truth, trajectory to identify errors and inefficiencies. Comparison methods include exact match (requiring a perfect match to the ideal sequence), in-order match (correct actions in order, allowing extra steps), any-order match (correct actions in any order, allowing extra steps), precision (measuring the relevance of predicted actions), recall (measuring how many essential actions are captured), and single-tool use (checking for a specific action). Metric selection depends on specific agent requirements, with high-stakes scenarios potentially demanding an exact match, while more flexible situations might use an in-order or any-order match. 这涉及检查决策的质量、推理过程和整体结果。实施自动化评估很有价值,特别是对于超越原型阶段的开发。分析轨迹和工具使用包括评估智能体实现目标的步骤,例如工具选择、策略和任务效率。例如,处理客户产品查询的智能体可能理想地遵循涉及意图确定、数据库搜索工具使用、结果审查和报告生成的轨迹。将智能体的实际操作与这个预期的或真实的轨迹进行比较,以识别错误和低效率。比较方法包括精确匹配(要求与理想序列完美匹配)、按序匹配(正确的操作按顺序,允许额外步骤)、任意顺序匹配(正确的操作以任何顺序,允许额外步骤)、精确度(测量预测操作的相关性)、召回率(测量捕获了多少基本操作)和单工具使用(检查特定操作)。指标选择取决于特定的智能体要求,高风险场景可能要求精确匹配,而更灵活的情况可能使用按序或任意顺序匹配。 Evaluation of AI agents involves two primary approaches: using test files and using evalset files. Test files, in JSON format, represent single, simple agent-model interactions or sessions and are ideal for unit testing during active development, focusing on rapid execution and simple session complexity. Each test file contains a single session with multiple turns, where a turn is a user-agent interaction including the user’s query, expected tool use trajectory, intermediate agent responses, and final response. For example, a test file might detail a user request to “Turn off device_2 in the Bedroom,” specifying the agent’s use of a set_device_info tool with parameters like location: Bedroom, device_id: device_2, and status: OFF, and an expected final response of “I have set the device_2 status to off.” Test files can be organized into folders and may include a test_config.json file to define evaluation criteria. Evalset files utilize a dataset called an “evalset” to evaluate interactions, containing multiple potentially lengthy sessions suited for simulating complex, multi-turn conversations and integration tests. An evalset file comprises multiple “evals,” each representing a distinct session with one or more “turns” that include user queries, expected tool use, intermediate responses, and a reference final response. An example evalset might include a session where the user first asks “What can you do?” and then says “Roll a 10 sided dice twice and then check if 9 is a prime or not,” defining expected roll\_die tool calls and a check_prime tool call, along with the final response summarizing the dice rolls and the prime check. AI 智能体评估涉及两种主要方法:使用测试文件和使用评估集文件。测试文件以 JSON 格式表示单个、简单的智能体-模型交互或会话,非常适合在活跃开发期间进行单元测试,侧重于快速执行和简单的会话复杂性。每个测试文件包含一个带有多个回合的单个会话,其中回合是一个用户-智能体交互,包括用户的查询、预期的工具使用轨迹、中间智能体响应和最终响应。例如,测试文件可能详细说明用户请求”关闭卧室中的 device_2”,指定智能体使用带有 location: Bedroom、device_id: device_2 和 status: OFF 等参数的 set_device_info 工具,以及预期的最终响应”我已将 device_2 状态设置为关闭”。测试文件可以组织到文件夹中,并可能包含 test_config.json 文件来定义评估标准。评估集文件利用称为”evalset”的数据集来评估交互,包含多个可能很长的会话,适合模拟复杂的多回合对话和集成测试。评估集文件包含多个”评估”,每个评估代表一个具有一个或多个”回合”的不同会话,其中包括用户查询、预期的工具使用、中间响应和参考最终响应。示例评估集可能包括一个会话,其中用户首先询问”你能做什么?”然后说”掷两次 10 面骰子,然后检查 9 是否是质数”,定义预期的 roll_die 工具调用和 check_prime 工具调用,以及总结骰子结果和质数检查的最终响应。 Multi-agents: Evaluating a complex AI system with multiple agents is much like assessing a team project. Because there are many steps and handoffs, its complexity is an advantage, allowing you to check the quality of work at each stage. You can examine how well each individual “agent” performs its specific job, but you must also evaluate how the entire system is performing as a whole. 多智能体: 评估具有多个智能体的复杂 AI 系统非常像评估团队项目。因为有许多步骤和交接,其复杂性是一个优势,允许您检查每个阶段的工作质量。您可以检查每个单独的”智能体”如何执行其特定工作,但您还必须评估整个系统作为一个整体的表现如何。 To do this, you ask key questions about the team’s dynamics, supported by concrete examples: 为此,您需要提出关于团队动态的关键问题,并以具体示例作为支持:

最后,第四个支柱是通过子合约的层次化分解。对于复杂度很高的任务,主承包商智能体可以充当项目经理,将主要目标分解为更小、更易管理的子任务。它通过生成新的、正式的”子合约”来实现这一点。例如,”构建电子商务移动应用程序”的主合约可以由主智能体分解为”设计 UI/UX”、”开发用户身份验证模块”、”创建产品数据库架构”和”集成支付网关”的子合约。这些子合约中的每一个都是一个完整的、独立的合约,具有自己的可交付成果和规格,可以分配给其他专门的智能体。这种结构化的分解使系统能够以高度组织和可扩展的方式处理巨大的、多方面的项目,标志着 AI 从简单工具向真正自主和可靠的问题解决引擎的转变。

Ultimately, this contractor framework reimagines AI interaction by embedding principles of formal specification, negotiation, and verifiable execution directly into the agent’s core logic. This methodical approach elevates artificial intelligence from a promising but often unpredictable assistant into a dependable system capable of autonomously managing complex projects with auditable precision. By solving the critical challenges of ambiguity and reliability, this model paves the way for deploying AI in mission-critical domains where trust and accountability are paramount.

最终,这个承包商框架通过将正式规范、协商和可验证执行的原则直接嵌入到智能体的核心逻辑中,重新构想了 AI 交互。这种系统化的方法将人工智能从一个有前途但经常不可预测的助手提升为能够以可审计的精度自主管理复杂项目的可靠系统。通过解决歧义和可靠性的关键挑战,该模型为在信任和问责至关重要的关键任务领域部署 AI 铺平了道路。

Google’s ADK

Google 的 ADK

Before concluding, let’s look at a concrete example of a framework that supports evaluation. Agent evaluation with Google’s ADK (see Fig.3) can be conducted via three methods: web-based UI (adk web) for interactive evaluation and dataset generation, programmatic integration using pytest for incorporation into testing pipelines, and direct command-line interface (adk eval) for automated evaluations suitable for regular build generation and verification processes.

在结束之前,让我们看一个支持评估的框架的具体示例。使用 Google 的 ADK 进行智能体评估(见图 3)可以通过三种方法进行:基于 Web 的 UI(adk web)用于交互式评估和数据集生成、使用 pytest 的编程集成用于整合到测试管道中,以及直接命令行界面(adk eval)用于适合定期构建生成和验证过程的自动化评估。

Fig.3: Evaluation Support for Google ADK

图 3:Google ADK 的评估支持

The web-based UI enables interactive session creation and saving into existing or new eval sets, displaying evaluation status. Pytest integration allows running test files as part of integration tests by calling AgentEvaluator.evaluate, specifying the agent module and test file path.

基于 Web 的 UI 支持交互式会话创建并保存到现有或新的评估集中,显示评估状态。Pytest 集成允许通过调用 AgentEvaluator.evaluate 并指定智能体模块和测试文件路径,将测试文件作为集成测试的一部分运行。

The command-line interface facilitates automated evaluation by providing the agent module path and eval set file, with options to specify a configuration file or print detailed results. Specific evals within a larger eval set can be selected for execution by listing them after the eval set filename, separated by commas.

命令行界面通过提供智能体模块路径和评估集文件来促进自动化评估,并具有指定配置文件或打印详细结果的选项。可以通过在评估集文件名后列出(用逗号分隔)来选择较大评估集中的特定评估以供执行。

At a Glance

概览

What: Agentic systems and LLMs operate in complex, dynamic environments where their performance can degrade over time. Their probabilistic and non-deterministic nature means that traditional software testing is insufficient for ensuring reliability. Evaluating dynamic multi-agent systems is a significant challenge because their constantly changing nature and that of their environments demand the development of adaptive testing methods and sophisticated metrics that can measure collaborative success beyond individual performance. Problems like data drift, unexpected interactions, tool calling, and deviations from intended goals can arise after deployment. Continuous assessment is therefore necessary to measure an agent’s effectiveness, efficiency, and adherence to operational and safety requirements.

内容: Agentic 系统和 LLM 在复杂的动态环境中运行,它们的性能可能会随时间退化。它们的概率性和非确定性本质意味着传统的软件测试不足以确保可靠性。评估动态多智能体系统是一项重大挑战,因为它们不断变化的性质以及其环境的性质要求开发适应性测试方法和能够测量超越个体性能的协作成功的复杂指标。部署后可能出现数据漂移、意外交互、工具调用和偏离预期目标等问题。因此,持续评估对于测量智能体的效能、效率以及对操作和安全要求的遵守是必要的。

Why: A standardized evaluation and monitoring framework provides a systematic way to assess and ensure the ongoing performance of intelligent agents. This involves defining clear metrics for accuracy, latency, and resource consumption, like token usage for LLMs. It also includes advanced techniques such as analyzing agentic trajectories to understand the reasoning process and employing an LLM-as-a-Judge for nuanced, qualitative assessments. By establishing feedback loops and reporting systems, this framework allows for continuous improvement, A/B testing, and the detection of anomalies or performance drift, ensuring the agent remains aligned with its objectives.

原因: 标准化的评估和监控框架提供了一种系统的方式来评估和确保智能体的持续性能。这涉及为准确性、延迟和资源消耗(如 LLM 的 token 使用量)定义明确的指标。它还包括高级技术,例如分析智能体轨迹以理解推理过程,以及采用 LLM-as-a-Judge 进行细微的、定性的评估。通过建立反馈循环和报告系统,该框架允许持续改进、A/B 测试以及检测异常或性能漂移,确保智能体与目标保持一致。

Rule of thumb: Use this pattern when deploying agents in live, production environments where real-time performance and reliability are critical. Additionally, use it when needing to systematically compare different versions of an agent or its underlying models to drive improvements, and when operating in regulated or high-stakes domains requiring compliance, safety, and ethical audits. This pattern is also suitable when an agent’s performance may degrade over time due to changes in data or the environment (drift), or when evaluating complex agentic behavior, including the sequence of actions (trajectory) and the quality of subjective outputs like helpfulness.

经验法则: 在部署智能体到实时性能和可靠性至关重要的生产环境时使用此模式。此外,当需要系统地比较智能体或其底层模型的不同版本以推动改进时,以及在需要合规性、安全性和道德审计的受监管或高风险领域运营时使用它。当智能体的性能可能由于数据或环境变化(漂移)而随时间退化时,或者在评估复杂的智能体行为时(包括操作序列(轨迹)和主观输出(如有用性)的质量),此模式也适用。

Visual summary

可视化摘要

Fig.4: Evaluation and Monitoring design pattern

图 4:评估和监控设计模式

Key Takeaways

关键要点

Conclusions

结论

In conclusion, effectively evaluating AI agents requires moving beyond simple accuracy checks to a continuous, multi-faceted assessment of their performance in dynamic environments. This involves practical monitoring of metrics like latency and resource consumption, as well as sophisticated analysis of an agent’s decision-making process through its trajectory. For nuanced qualities like helpfulness, innovative methods such as the LLM-as-a-Judge are becoming essential, while frameworks like Google’s ADK provide structured tools for both unit and integration testing. The challenge intensifies with multi-agent systems, where the focus shifts to evaluating collaborative success and effective cooperation.

总之,有效评估 AI 智能体需要超越简单的准确性检查,对其在动态环境中的性能进行持续的、多方面的评估。这涉及实际监控延迟和资源消耗等指标,以及通过其轨迹对智能体决策过程进行复杂分析。对于有用性等细微品质,诸如 LLM-as-a-Judge 之类的创新方法变得必不可少,而像 Google 的 ADK 这样的框架为单元和集成测试提供了结构化的工具。对于多智能体系统,挑战加剧,重点转向评估协作成功和有效合作。

To ensure reliability in critical applications, the paradigm is shifting from simple, prompt-driven agents to advanced “contractors” bound by formal agreements. These contractor agents operate on explicit, verifiable terms, allowing them to negotiate, decompose tasks, and self-validate their work to meet rigorous quality standards. This structured approach transforms agents from unpredictable tools into accountable systems capable of handling complex, high-stakes tasks. Ultimately, this evolution is crucial for building the trust required to deploy sophisticated agentic AI in mission-critical domains.

为了确保关键应用中的可靠性,范式正在从简单的、提示词驱动的智能体转向受正式协议约束的高级”承包商”。这些承包商智能体在明确的、可验证的条款上运作,允许它们协商、分解任务并自我验证其工作以满足严格的质量标准。这种结构化方法将智能体从不可预测的工具转变为能够处理复杂的、高风险任务的可问责系统。最终,这种演变对于建立在关键任务领域部署复杂的 Agentic AI 所需的信任至关重要。

References

参考文献

Relevant research includes:

相关研究包括:

  1. ADK Web: https://github.com/google/adk-web
  2. ADK Evaluate: https://google.github.io/adk-docs/evaluate/
  3. Survey on Evaluation of LLM-based Agents, https://arxiv.org/abs/2503.16416
  4. Agent-as-a-Judge: Evaluate Agents with Agents, https://arxiv.org/abs/2410.10934
  5. Agent Companion, gulli et al: https://www.kaggle.com/whitepaper-agent-companion

Finally, the fourth pillar is Hierarchical Decomposition via Subcontracts. For tasks of significant complexity, a primary contractor agent can act as a project manager, breaking the main goal into smaller, more manageable sub-tasks. It achieves this by generating new, formal “subcontracts.” For example, a master contract to “build an e-commerce mobile application” could be decomposed by the primary agent into subcontracts for “designing the UI/UX,” “developing the user authentication module,” “creating the product database schema,” and “integrating a payment gateway.” Each of these subcontracts is a complete, independent contract with its own deliverables and specifications, which could be assigned to other specialized agents. This structured decomposition allows the system to tackle immense, multifaceted projects in a highly organized and scalable manner, marking the transition of AI from a simple tool to a truly autonomous and reliable problem-solving engine. 最后,第四个支柱是通过子合约的层次化分解。对于复杂度很高的任务,主承包商智能体可以充当项目经理,将主要目标分解为更小的、更易管理的子任务。它通过生成新的、正式的”子合约”来实现这一点。例如,”构建电子商务移动应用程序”的主合约可以由主智能体分解为”设计 UI/UX”、”开发用户身份验证模块”、”创建产品数据库架构”和”集成支付网关”的子合约。这些子合约中的每一个都是一个完整的、独立的合约,具有自己的可交付成果和规格,可以分配给其他专门的智能体。这种结构化的分解使系统能够以高度组织和可扩展的方式处理巨大的、多方面的项目,标志着 AI 从简单工具向真正自主和可靠的问题解决引擎的转变。 Ultimately, this contractor framework reimagines AI interaction by embedding principles of formal specification, negotiation, and verifiable execution directly into the agent’s core logic. This methodical approach elevates artificial intelligence from a promising but often unpredictable assistant into a dependable system capable of autonomously managing complex projects with auditable precision. By solving the critical challenges of ambiguity and reliability, this model paves the way for deploying AI in mission-critical domains where trust and accountability are paramount. 最终,这个承包商框架通过将正式规范、协商和可验证执行的原则直接嵌入到智能体的核心逻辑中,重新构想了 AI 交互。这种系统化的方法将人工智能从一个有前途但经常不可预测的助手提升为能够以可审计的精度自主管理复杂项目的可靠系统。通过解决歧义和可靠性的关键挑战,该模型为在信任和问责至关重要的关键任务领域部署 AI 铺平了道路。

Google’s ADK

Google 的 ADK

Before concluding, let’s look at a concrete example of a framework that supports evaluation. Agent evaluation with Google’s ADK (see Fig.3) can be conducted via three methods: web-based UI (adk web) for interactive evaluation and dataset generation, programmatic integration using pytest for incorporation into testing pipelines, and direct command-line interface (adk eval) for automated evaluations suitable for regular build generation and verification processes. 在结束之前,让我们看一个支持评估的框架的具体示例。使用 Google 的 ADK 进行智能体评估(见图 3)可以通过三种方法进行:基于 Web 的 UI(adk web)用于交互式评估和数据集生成、使用 pytest 的编程集成用于整合到测试管道中,以及直接命令行界面(adk eval)用于适合定期构建生成和验证过程的自动化评估。 Fig.3: Evaluation Support for Google ADK 图 3:Google ADK 的评估支持 The web-based UI enables interactive session creation and saving into existing or new eval sets, displaying evaluation status. Pytest integration allows running test files as part of integration tests by calling AgentEvaluator.evaluate, specifying the agent module and test file path. 基于 Web 的 UI 支持交互式会话创建并保存到现有或新的评估集中,显示评估状态。Pytest 集成允许通过调用 AgentEvaluator.evaluate 并指定智能体模块和测试文件路径,将测试文件作为集成测试的一部分运行。 The command-line interface facilitates automated evaluation by providing the agent module path and eval set file, with options to specify a configuration file or print detailed results. Specific evals within a larger eval set can be selected for execution by listing them after the eval set filename, separated by commas. 命令行界面通过提供智能体模块路径和评估集文件来促进自动化评估,并具有指定配置文件或打印详细结果的选项。可以通过在评估集文件名后列出(用逗号分隔)来选择较大评估集中的特定评估以供执行。

At a Glance

概览

What: Agentic systems and LLMs operate in complex, dynamic environments where their performance can degrade over time. Their probabilistic and non-deterministic nature means that traditional software testing is insufficient for ensuring reliability. Evaluating dynamic multi-agent systems is a significant challenge because their constantly changing nature and that of their environments demand the development of adaptive testing methods and sophisticated metrics that can measure collaborative success beyond individual performance. Problems like data drift, unexpected interactions, tool calling, and deviations from intended goals can arise after deployment. Continuous assessment is therefore necessary to measure an agent’s effectiveness, efficiency, and adherence to operational and safety requirements. 内容: Agentic 系统和 LLM 在复杂的动态环境中运行,它们的性能可能会随时间退化。它们的概率性和非确定性本质意味着传统的软件测试不足以确保可靠性。评估动态多智能体系统是一项重大挑战,因为它们不断变化的性质以及其环境的性质要求开发适应性测试方法和能够测量超越个体性能的协作成功的复杂指标。部署后可能出现数据漂移、意外交互、工具调用和偏离预期目标等问题。因此,持续评估对于测量智能体的效能、效率以及对操作和安全要求的遵守是必要的。 Why: A standardized evaluation and monitoring framework provides a systematic way to assess and ensure the ongoing performance of intelligent agents. This involves defining clear metrics for accuracy, latency, and resource consumption, like token usage for LLMs. It also includes advanced techniques such as analyzing agentic trajectories to understand the reasoning process and employing an LLM-as-a-Judge for nuanced, qualitative assessments. By establishing feedback loops and reporting systems, this framework allows for continuous improvement, A/B testing, and the detection of anomalies or performance drift, ensuring the agent remains aligned with its objectives. 原因: 标准化的评估和监控框架提供了一种系统的方式来评估和确保智能体的持续性能。这涉及为准确性、延迟和资源消耗(如 LLM 的 token 使用量)定义明确的指标。它还包括高级技术,例如分析智能体轨迹以理解推理过程,以及采用 LLM-as-a-Judge 进行细微的、定性的评估。通过建立反馈循环和报告系统,该框架允许持续改进、A/B 测试以及检测异常或性能漂移,确保智能体与目标保持一致。 Rule of thumb: Use this pattern when deploying agents in live, production environments where real-time performance and reliability are critical. Additionally, use it when needing to systematically compare different versions of an agent or its underlying models to drive improvements, and when operating in regulated or high-stakes domains requiring compliance, safety, and ethical audits. This pattern is also suitable when an agent’s performance may degrade over time due to changes in data or the environment (drift), or when evaluating complex agentic behavior, including the sequence of actions (trajectory) and the quality of subjective outputs like helpfulness. 经验法则: 在部署智能体到实时性能和可靠性至关重要的生产环境时使用此模式。此外,当需要系统地比较智能体或其底层模型的不同版本以推动改进时,以及在需要合规性、安全性和道德审计的受监管或高风险领域运营时使用它。当智能体的性能可能由于数据或环境变化(漂移)而随时间退化时,或者在评估复杂的智能体行为时(包括操作序列(轨迹)和主观输出(如有用性)的质量),此模式也适用。 Visual summary 可视化摘要
Fig.4: Evaluation and Monitoring design pattern 图 4:评估和监控设计模式

Key Takeaways

关键要点