Tool Error Handling

Last verified: 2026-05-09.

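1. Definition and Scope

Tool error handling covers how a system classifies failures, recovers from them, degrades gracefully, informs the user, and preserves auditable evidence when a tool call fails at any stage: arguments, permissions, network, business logic, external dependencies, or model selection.

It is more than a try/catch. In an Agent, a tool error shapes the model's subsequent reasoning, so errors must be structured, explainable, and replayable.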

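2. Why It Matters

Tool call failures are routine in production:

The model generates arguments with missing fields or invalid enum values.
An external API times out, rate-limits the request, or returns a 5xx.
The user lacks permission, or an approval has expired.
A tool execution only partially succeeds.
The same tool is called repeatedly, causing duplicate side effects.
A tool returns output that is too long or contains untrusted instructions.

Without explicit error semantics, the model guesses, repeats calls, or reports a success that never happened.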

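3. Core Mechanisms

3.1 Error Classification

Error type	Retry?	Returned to model?	Handling
`invalid_arguments`	No	Yes	Have the model correct the arguments or ask the user to clarify
`permission_denied`	No	Yes	State that permission is insufficient without exposing sensitive details
`approval_required`	No	No / partial	Enter the approval flow
`timeout`	Retryable	Yes	Exponential backoff; degrade once the budget is exhausted
`rate_limited`	Retryable	Yes	Wait, or switch to a fallback path
`upstream_error`	Retryable	Yes	Record the upstream status; degrade if necessary
`partial_success`	Case by case	Yes	State exactly what succeeded and what failed
`unsafe_output`	No	No	Quarantine the output and hand it to security handling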
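
One way to keep this table enforceable is to encode it as typed data, so the runtime consults a single source of truth instead of scattering retry rules across tools. The sketch below is only an illustration: the names `ErrorPolicy`, `ERROR_POLICIES`, and `isAutoRetryable` are hypothetical, and the "No / partial" entry is modeled as `"partial"` by assumption.

```typescript
// Error codes taken from the classification table.
type ToolErrorCode =
  | "invalid_arguments"
  | "permission_denied"
  | "approval_required"
  | "timeout"
  | "rate_limited"
  | "upstream_error"
  | "partial_success"
  | "unsafe_output";

// Hypothetical shape describing how the runtime treats one error class.
interface ErrorPolicy {
  retry: "never" | "backoff" | "case_by_case";
  returnToModel: "yes" | "no" | "partial";
}

const ERROR_POLICIES: Record<ToolErrorCode, ErrorPolicy> = {
  invalid_arguments: { retry: "never", returnToModel: "yes" },
  permission_denied: { retry: "never", returnToModel: "yes" },
  // The table says "No / partial"; modeled here as "partial" (assumption).
  approval_required: { retry: "never", returnToModel: "partial" },
  timeout: { retry: "backoff", returnToModel: "yes" },
  rate_limited: { retry: "backoff", returnToModel: "yes" },
  upstream_error: { retry: "backoff", returnToModel: "yes" },
  partial_success: { retry: "case_by_case", returnToModel: "yes" },
  unsafe_output: { retry: "never", returnToModel: "no" },
};

// True only for codes the table marks as retryable without further judgment.
function isAutoRetryable(code: ToolErrorCode): boolean {
  return ERROR_POLICIES[code].retry === "backoff";
}
```

Because the map is typed as `Record<ToolErrorCode, ErrorPolicy>`, adding a new error code without a policy becomes a compile-time error.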

3.2 Error Object

{
  "status": "error",
  "code": "rate_limited",
  "message_for_model": "The CRM API is rate limited. Retry after 30 seconds or ask the user to try later.",
  "message_for_user": "CRM lookups are temporarily rate limited. Please try again shortly.",
  "retry_after_ms": 30000,
  "safe_to_retry": true,
  "trace_id": "tr_123"
}
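
The error object can be given a type and split into a model-facing view and a user-facing view. This is a minimal sketch, not a prescribed schema: the field names come from the example, while `isToolError`, `toModelView`, `toUserView`, and the choice of which fields each audience sees are assumptions.

```typescript
// Shape of the error object shown in the example.
interface ToolError {
  status: "error";
  code: string;
  message_for_model: string;
  message_for_user: string;
  retry_after_ms?: number;
  safe_to_retry: boolean;
  trace_id: string;
}

// Narrow an unknown tool result to a ToolError before acting on it.
function isToolError(value: unknown): value is ToolError {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    v.status === "error" &&
    typeof v.code === "string" &&
    typeof v.message_for_model === "string" &&
    typeof v.message_for_user === "string" &&
    typeof v.safe_to_retry === "boolean" &&
    typeof v.trace_id === "string"
  );
}

// The model gets what it needs to plan its next step (assumed field selection).
function toModelView(err: ToolError) {
  return {
    code: err.code,
    message: err.message_for_model,
    retry_after_ms: err.retry_after_ms,
    safe_to_retry: err.safe_to_retry,
  };
}

// The user gets a short message plus a trace ID to quote to support.
function toUserView(err: ToolError) {
  return { message: err.message_for_user, trace_id: err.trace_id };
}
```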

4. Architecture Patterns

4.1 Retry Budget

Per Agent run:
  max_tool_calls = 12
  max_retries_per_tool = 2
  max_total_latency = 60s

Control the retry budget across the whole task rather than letting each tool retry without limit on its own.
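
A run-level budget of this kind might be tracked as follows. This is a sketch under stated assumptions: the class and method names are hypothetical, retries are assumed to count toward `maxToolCalls`, and latency is reserved up front as a worst-case estimate (for example, the tool timeout plus the backoff wait).

```typescript
interface RetryBudgetLimits {
  maxToolCalls: number;
  maxRetriesPerTool: number;
  maxTotalLatencyMs: number;
}

class RetryBudget {
  private readonly limits: RetryBudgetLimits;
  private toolCalls = 0;
  private reservedMs = 0;
  private readonly retries = new Map<string, number>();

  constructor(limits: RetryBudgetLimits) {
    this.limits = limits;
  }

  // Reserve one attempt. Returns false, changing nothing, if any limit would be exceeded.
  tryReserve(toolName: string, isRetry: boolean, plannedMs: number): boolean {
    const usedRetries = this.retries.get(toolName) ?? 0;
    if (this.toolCalls >= this.limits.maxToolCalls) return false;
    if (this.reservedMs + plannedMs > this.limits.maxTotalLatencyMs) return false;
    if (isRetry && usedRetries >= this.limits.maxRetriesPerTool) return false;
    this.toolCalls += 1;
    this.reservedMs += plannedMs;
    if (isRetry) this.retries.set(toolName, usedRetries + 1);
    return true;
  }
}

// Exponential backoff: baseMs * 2^attempt, capped at capMs (attempt starts at 0).
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 8000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

When `tryReserve` returns false, the caller would stop retrying and move to a fallback path instead of waiting longer.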

4.2 Idempotent Execution

Every write operation must carry an idempotency key:

{
  "idempotency_key": "run_123_call_456",
  "operation": "email.send",
  "arguments_hash": "sha256:..."
}

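If the same call is retried after a network failure, the tool side should return the result of the first execution instead of executing again.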
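
On the tool side, that behavior could look like the sketch below. It is illustrative only: `IdempotencyStore` is a hypothetical in-memory class, the handler is synchronous to keep the example short, and a production version would need a durable store with an atomic insert so that concurrent retries cannot both execute.

```typescript
type IdempotentOutcome<T> =
  | { status: "executed"; result: T } // first time this key was seen
  | { status: "replayed"; result: T } // same key and arguments: stored result returned
  | { status: "conflict" };           // same key, different arguments

class IdempotencyStore<T> {
  private readonly records = new Map<string, { argumentsHash: string; result: T }>();

  run(key: string, argumentsHash: string, execute: () => T): IdempotentOutcome<T> {
    const existing = this.records.get(key);
    if (existing) {
      // A reused key with different arguments is never executed.
      if (existing.argumentsHash !== argumentsHash) return { status: "conflict" };
      return { status: "replayed", result: existing.result };
    }
    const result = execute();
    this.records.set(key, { argumentsHash, result });
    return { status: "executed", result };
  }
}
```

If `execute` throws, nothing is recorded, so a later retry with the same key runs the operation again. Whether that is safe depends on whether the failure happened before or after the side effect, which this sketch does not track.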

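4.3 Fallback Paths

Primary path	Fallback path
Real-time API query	Serve from cache and label the data with its timestamp
Write to a business system	Create a draft or a ticket for manual handling
Batch tool	Run in batches and report partial failures
Remote MCP Server	Fall back to local read-only capabilities, or tell the user the connection failed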
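
The first row, falling back to cached data labeled with its timestamp, might be sketched like this. The names `Sourced` and `CachedFallback` are hypothetical, and the live fetch is synchronous only to keep the example compact; a real API call would be asynchronous.

```typescript
// A value together with where it came from and when it was fetched.
interface Sourced<T> {
  value: T;
  source: "live" | "cache";
  asOf: Date;
}

class CachedFallback<T> {
  private last?: { value: T; asOf: Date };

  // Try the live path first; on failure serve the last good value, labeled as cached.
  get(fetchLive: () => T, now: Date = new Date()): Sourced<T> {
    try {
      const value = fetchLive();
      this.last = { value, asOf: now };
      return { value, source: "live", asOf: now };
    } catch (err) {
      if (!this.last) throw err; // nothing to fall back to
      return { value: this.last.value, source: "cache", asOf: this.last.asOf };
    }
  }
}
```

Returning `source` and `asOf` alongside the value lets both the model and the user see that the answer may be stale.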

5. Implementation

5.1 Tool Wrapper

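// Runs one tool call: validate, apply policy, then execute within the run's retry budget.
async function runToolWithRecovery(call: ToolCall, ctx: RunContext) {
  const tool = registry.get(call.name);
  // Validate the model-generated arguments against the tool's input schema.
  const args = validate(tool.inputSchema, call.arguments);
  // Check permissions and approvals before any side effect can happen.
  const decision = await policy.evaluate(call, args, ctx);

  // Any decision other than "allow" is turned into an observation for the model.
  if (decision.type !== "allow") return decisionToObservation(decision);

  // Retries draw from the run-level budget rather than a per-tool counter.
  return await retry.withBudget(ctx.retryBudget, async () => {
    const result = await tool.handler(args, {
      timeoutMs: tool.runtime.timeoutMs,
      // Stable across retries of the same call, so the tool can deduplicate writes.
      idempotencyKey: `${ctx.runId}:${call.id}`
    });
    // Reject malformed results before they reach the model.
    return validateOutput(tool.outputSchema, result);
  });
}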

5.2 Errors Returned to the Model Should Be Actionable

Bad:

Error: failed

Good:

{
  "code": "missing_required_field",
  "field": "start_time",
  "recoverable": true,
  "instruction": "Ask the user for the event start time before calling calendar.create_event again."
}
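
Errors of this shape can be generated mechanically from a tool's list of required fields. The helper below is a rough sketch; `checkRequiredFields` and its treatment of empty strings as missing are assumptions, not part of any particular framework.

```typescript
interface ActionableError {
  code: "missing_required_field";
  field: string;
  recoverable: true;
  instruction: string;
}

// Return one actionable error per required field that is absent or empty.
function checkRequiredFields(
  toolName: string,
  required: string[],
  args: Record<string, unknown>,
): ActionableError[] {
  return required
    .filter((field) => args[field] === undefined || args[field] === null || args[field] === "")
    .map((field) => ({
      code: "missing_required_field" as const,
      field,
      recoverable: true as const,
      instruction: `Ask the user for "${field}" before calling ${toolName} again.`,
    }));
}
```

For example, `checkRequiredFields("calendar.create_event", ["title", "start_time"], { title: "Standup" })` yields a single error whose `field` is `"start_time"`.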

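6. Production Practices

Define a stable error code table for every tool.
Retry timeouts, rate limits, and 5xx responses with backoff.
Do not automatically retry permission failures, invalid arguments, or other 4xx responses apart from rate limiting (HTTP 429).
Keep user-visible and model-visible errors separate to avoid leaking internal details.
Have long-running tools return a task ID and poll it through a status-query tool.
Validate tool results against the output schema; do not feed a result that fails validation straight to the model.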
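
The task ID practice can be illustrated with a small in-memory registry. This is only a sketch: `TaskRegistry` and its method names are hypothetical, and a real system would persist task state and expire old entries.

```typescript
type TaskStatus<T> =
  | { state: "running" }
  | { state: "succeeded"; result: T }
  | { state: "failed"; code: string };

class TaskRegistry<T> {
  private readonly tasks = new Map<string, TaskStatus<T>>();
  private nextId = 1;

  // Called by the long-running tool: register the work and return immediately.
  start(): string {
    const taskId = `task_${this.nextId++}`;
    this.tasks.set(taskId, { state: "running" });
    return taskId;
  }

  // Called by the worker when the job finishes, successfully or not.
  complete(taskId: string, outcome: TaskStatus<T>): void {
    if (!this.tasks.has(taskId)) throw new Error(`unknown task: ${taskId}`);
    this.tasks.set(taskId, outcome);
  }

  // Backs the status-query tool that the model polls.
  status(taskId: string): TaskStatus<T> | { state: "not_found" } {
    return this.tasks.get(taskId) ?? { state: "not_found" };
  }
}
```

Returning an explicit `not_found` state, rather than throwing, gives the model a structured answer when it polls with a wrong or expired ID.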

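7. Common Anti-patterns

Anti-pattern	Consequence
Returning every error to the model as free-form natural language	The model cannot recover reliably
Retrying write operations without idempotency	Duplicate sends, duplicate orders
Letting the model claim success after a tool failure	User trust erodes
Unbounded automatic retries	Cost and latency spiral out of control
Passing internal stack traces to the model	System internals leak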

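8. Evaluation Methods

fault injection: simulate timeouts, rate limiting, and upstream 500s.
invalid argument eval: give ambiguous input and check whether the Agent asks for clarification.
idempotency eval: retry after a network failure and check for duplicate side effects.
partial success eval: when a batch task partially fails, check that the report is accurate.
unsafe output eval: have a tool return prompt-injection text and check that it is quarantined.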

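9. Security and Governance

Error messages must not contain keys, tokens, internal network addresses, or SQL statements.
Permission failures return only the necessary information and do not reveal to an attacker which resources exist.
Record an audit event for failed tool executions, especially high-risk actions that were denied or canceled during approval.
Apply a circuit breaker to tools that fail repeatedly so the Agent does not keep hitting a malfunctioning system.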
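
A circuit breaker for a repeatedly failing tool could be sketched as follows. The class name, the default threshold of 3 consecutive failures, and the 30-second cooldown are all illustrative assumptions; time is passed in explicitly so the logic is easy to test.

```typescript
class CircuitBreaker {
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private consecutiveFailures = 0;
  private openedAt: number | null = null;

  constructor(failureThreshold = 3, cooldownMs = 30000) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  // Whether a call may go through at time nowMs (milliseconds).
  allow(nowMs: number): boolean {
    if (this.openedAt === null) return true; // closed: calls flow normally
    // Open: block until the cooldown has passed, then let a probe call through.
    return nowMs - this.openedAt >= this.cooldownMs;
  }

  // A success closes the breaker and clears the failure streak.
  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.openedAt = null;
  }

  // A failure at or past the threshold opens the breaker, or re-opens it after a failed probe.
  recordFailure(nowMs: number): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.failureThreshold) this.openedAt = nowMs;
  }
}
```

While the breaker is open, the wrapper would skip the tool call entirely and return a structured error, so the Agent sees a stable signal instead of a stream of timeouts.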

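10. Error Handling Decision Table

Error code	Information for the model	Information for the user	Automatic action	Audit
`missing_required_field`	Which field is missing and how to clarify	More information is needed	Ask the user	trace
`permission_denied`	Insufficient permission; no resource details exposed	Not authorized to perform this action	Stop	audit
`approval_expired`	Re-approval is required	The approval has expired	Start a new approval	audit
`timeout`	Can retry later or degrade	The system timed out temporarily	Retry with backoff	trace
`unsafe_output`	Do not pass back the raw result	The tool returned unsafe content	Quarantine and alert	security audit
`idempotency_conflict`	Same key with different arguments	The operation's state is uncertain	Manual intervention	audit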
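
The automatic-action and audit columns lend themselves to a simple lookup. In this sketch the action and channel names are hypothetical labels for the table entries, and the fail-closed default for unknown codes (stop and audit) is an assumption rather than something the table specifies.

```typescript
type AutoAction =
  | "ask_user"
  | "stop"
  | "restart_approval"
  | "retry_with_backoff"
  | "quarantine_and_alert"
  | "manual_intervention";

type AuditChannel = "trace" | "audit" | "security_audit";

interface Decision {
  action: AutoAction;
  audit: AuditChannel;
}

// A Map avoids accidental hits on inherited object keys such as "toString".
const DECISIONS = new Map<string, Decision>([
  ["missing_required_field", { action: "ask_user", audit: "trace" }],
  ["permission_denied", { action: "stop", audit: "audit" }],
  ["approval_expired", { action: "restart_approval", audit: "audit" }],
  ["timeout", { action: "retry_with_backoff", audit: "trace" }],
  ["unsafe_output", { action: "quarantine_and_alert", audit: "security_audit" }],
  ["idempotency_conflict", { action: "manual_intervention", audit: "audit" }],
]);

// Unknown codes fail closed: stop and write an audit record (assumed default).
function decide(code: string): Decision {
  return DECISIONS.get(code) ?? { action: "stop", audit: "audit" };
}
```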

11. Tool Error Recovery Flow

12. Fault Injection Plan

{
  "suite": "tool_error_recovery_v1",
  "cases": [
    {
      "id": "crm_timeout_retry_once",
      "tool": "crm.search_customer",
      "fault": {"type": "timeout", "times": 1},
      "expected": "retry_then_success"
    },
    {
      "id": "email_send_network_after_commit",
      "tool": "email.send",
      "fault": {"type": "network_error_after_side_effect"},
      "expected": "idempotency_key_prevents_duplicate_send"
    },
    {
      "id": "unsafe_html_result",
      "tool": "web.fetch",
      "fault": {"type": "prompt_injection_in_result"},
      "expected": "unsafe_output_blocked"
    }
  ]
}
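
A case such as `crm_timeout_retry_once` could be driven by a wrapper that makes a handler fail a fixed number of times. The sketch below covers only the `timeout` fault type; `withInjectedFault`, `runCase`, and the outcome labels are hypothetical, and handlers are synchronous here for brevity, whereas real tool handlers would return promises.

```typescript
type Fault = { type: "timeout"; times: number };
type Handler<A, R> = (args: A) => R;
type Outcome = "success" | "retry_then_success" | "failed";

// Recognize the injected fault by its code rather than by error class.
function isTimeout(err: unknown): boolean {
  return typeof err === "object" && err !== null && (err as { code?: unknown }).code === "timeout";
}

// Wrap a handler so that its first `fault.times` calls fail with a timeout.
function withInjectedFault<A, R>(handler: Handler<A, R>, fault: Fault): Handler<A, R> {
  let remaining = fault.times;
  return (args: A) => {
    if (remaining > 0) {
      remaining -= 1;
      throw Object.assign(new Error("injected timeout"), { code: "timeout" });
    }
    return handler(args);
  };
}

// Run one case: call the tool, retrying up to maxRetries times on timeout.
function runCase<A, R>(
  handler: Handler<A, R>,
  args: A,
  maxRetries: number,
): { outcome: Outcome; attempts: number } {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      handler(args);
      return { outcome: attempt === 0 ? "success" : "retry_then_success", attempts: attempt + 1 };
    } catch (err) {
      if (!isTimeout(err)) throw err; // only timeouts are retried
    }
  }
  return { outcome: "failed", attempts: maxRetries + 1 };
}
```

The harness compares the returned `outcome` with the case's `expected` value. The side-effect and prompt-injection cases would need their own fault types and assertions.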

13. Authoritative References

OpenAI Function Calling guide: https://platform.openai.com/docs/guides/function-calling
Anthropic Handle tool calls: https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/overview
MCP Tools specification 2025-11-25: https://modelcontextprotocol.io/specification/2025-11-25/server/tools
MCP Transports specification 2025-11-25: https://modelcontextprotocol.io/specification/2025-11-25/basic/transports
