{
  "metadata": {
    "id": "ch25",
    "title": "Chapter 25: Debugging and Diagnostics",
    "volume": "vol7",
    "volume_title": "Agent Programming Techniques",
    "word_count": 2032,
    "difficulty": "intermediate",
    "prerequisites": [
      "ch15"
    ],
    "key_concepts": [
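      "Challenges of Agent debugging",
      "Non-determinism",
      "Debugging difficulty levels",
      "Local debugging environment setup",
      "Mock LLM",
      "Replay mechanism",
      "Interactive debugger",
      "Chain-of-thought debugging",
      "CoT visualization",
      "Reasoning step inspection",
      "Tool call debugging",
      "Tool execution tracing",
      "Parameter validation",
      "Side-effect replay",
      "Prompt debugging"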
    ],
    "learning_objectives": [],
    "estimated_tokens": 1219,
    "source_file": "vol7/ch25_调试与诊断.md"
  },
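  "overview": "Debugging a traditional program is largely deterministic: breakpoints, single-stepping, and variable inspection are enough to locate most problems. Debugging an Agent system brings its own challenges: non-deterministic LLM output, long multi-step reasoning chains, side effects from tool calls, and LLM-specific problems such as \"hallucination\". This chapter walks through the methodology and toolchain for Agent debugging, from setting up a local environment to troubleshooting production incidents, to help you locate and fix problems in Agent systems efficiently.",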
  "sections": [
    {
      "id": "25.1",
      "title": "25.1 Challenges of Agent Debugging",
      "level": 2,
      "content": "",
      "subsections": [
        {
          "id": "25.1.1",
          "title": "25.1.1 Non-determinism",
          "content": ""
        },
        {
          "id": "25.1.2",
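          "title": "25.1.2 Debugging Difficulty Levels",
          "content": "| Problem type | Difficulty | Typical symptoms | Debugging approach |\n|---------|------|---------|---------|\n| Code bug | ⭐ | Exceptions, crashes | Traditional debugger |\n| Prompt issue | ⭐⭐ | Malformed output, instructions not followed | Prompt version testing |\n| Tool call failure | ⭐⭐ | Timeouts, invalid arguments | Tool log analysis |\n| Wrong reasoning path | ⭐⭐⭐ | Wrong tool or direction chosen | Chain-of-thought visualization |\n| Context overflow | ⭐⭐⭐ | Token limit exceeded, truncation | Context monitoring |\n| Hallucination | ⭐⭐⭐⭐ | Fabricated information | Fact checking, RAG augmentation |\n| Circular reasoning | ⭐⭐⭐⭐ | Agent repeats the same action | Step counter, termination conditions |"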
        }
      ]
    },
    {
      "id": "25.2",
      "title": "25.2 Setting Up a Local Debugging Environment",
      "level": 2,
      "content": "",
      "subsections": [
        {
          "id": "25.2.1",
          "title": "25.2.1 Mock LLM",
          "content": ""
        },
        {
          "id": "25.2.2",
          "title": "25.2.2 Replay Mechanism",
          "content": ""
        },
        {
          "id": "25.2.3",
          "title": "25.2.3 Interactive Debugger",
          "content": ""
        }
      ]
    },
    {
      "id": "25.3",
      "title": "25.3 Chain-of-Thought Debugging",
      "level": 2,
      "content": "",
      "subsections": [
        {
          "id": "25.3.1",
          "title": "25.3.1 CoT Visualization",
          "content": ""
        },
        {
          "id": "25.3.2",
          "title": "25.3.2 Reasoning Step Inspection",
          "content": ""
        }
      ]
    },
    {
      "id": "25.4",
      "title": "25.4 Tool Call Debugging",
      "level": 2,
      "content": "",
      "subsections": [
        {
          "id": "25.4.1",
          "title": "25.4.1 Tool Execution Tracing",
          "content": ""
        },
        {
          "id": "25.4.2",
          "title": "25.4.2 Parameter Validation",
          "content": ""
        },
        {
          "id": "25.4.3",
          "title": "25.4.3 Side-Effect Replay",
          "content": ""
        }
      ]
    },
    {
      "id": "25.5",
      "title": "25.5 Prompt Debugging",
      "level": 2,
      "content": "",
      "subsections": [
        {
          "id": "25.5.1",
          "title": "25.5.1 A/B Testing",
          "content": ""
        },
        {
          "id": "25.5.2",
          "title": "25.5.2 Prompt Version Management",
          "content": ""
        }
      ]
    },
    {
      "id": "25.6",
      "title": "25.6 Diagnosing Common Agent Failures",
      "level": 2,
      "content": "",
      "subsections": [
        {
          "id": "25.6.1",
          "title": "25.6.1 Circular Reasoning",
          "content": ""
        },
        {
          "id": "25.6.2",
          "title": "25.6.2 Tool Call Failures",
          "content": ""
        },
        {
          "id": "25.6.3",
          "title": "25.6.3 Context Overflow",
          "content": ""
        }
      ]
    },
    {
      "id": "25.7",
      "title": "25.7 Log Analysis and Problem Localization",
      "level": 2,
      "content": "",
      "subsections": [
        {
          "id": "25.7.1",
          "title": "25.7.1 Log Correlation Analysis",
          "content": ""
        }
      ]
    },
    {
      "id": "25.8",
      "title": "25.8 The Agent Debugging Tool Ecosystem",
      "level": 2,
      "content": "",
      "subsections": [
        {
          "id": "25.8.1",
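          "title": "25.8.1 Comparison of Mainstream Tools",
          "content": "| Tool | Type | Core features | Best suited for | Open source |\n|------|------|---------|---------|------|\n| **LangSmith** | Cloud service | Tracing, evaluation, debugging | LangChain ecosystem | ❌ |\n| **Weave** | Open source | Tracing, evaluation, experiment management | General LLM applications | ✅ |\n| **PromptFoo** | Open source | Prompt testing, evaluation, red teaming | Prompt engineering | ✅ |\n| **Arize Phoenix** | Open source | Tracing, evaluation, observability | Production monitoring | ✅ |\n| **Langfuse** | Open source | Tracing, prompt management, evaluation | Team collaboration | ✅ |\n| **Helicone** | Open source | Caching, logging, cost tracking | Cost optimization | ✅ |"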
        },
        {
          "id": "25.8.2",
          "title": "25.8.2 Langfuse Integration Example",
          "content": ""
        }
      ]
    },
    {
      "id": "25.9",
      "title": "25.9 Troubleshooting in Production",
      "level": 2,
      "content": "",
      "subsections": [
        {
          "id": "25.9.1",
          "title": "25.9.1 Real-Time Monitoring Dashboard",
          "content": ""
        },
        {
          "id": "25.9.2",
          "title": "25.9.2 Fault Injection Testing",
          "content": ""
        }
      ]
    },
    {
      "id": "最佳实践",
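      "title": "Best Practices",
      "level": 2,
      "content": "1. **Structured tracing**: Record complete execution traces (LLM I/O, tool calls, decisions) from day one\n2. **Mock first**: Replace real calls with a Mock LLM during local debugging so you can iterate quickly\n3. **Version your prompts**: Save a version on every prompt change to make rollback and A/B testing easy\n4. **Set guardrails**: Cap the maximum number of steps, the timeout, and the cost so the Agent cannot run away\n5. **Fault injection**: Inject failures deliberately to verify the Agent's fault tolerance",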
      "subsections": []
    },
    {
      "id": "常见陷阱",
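      "title": "Common Pitfalls",
      "level": 2,
      "content": "1. **Debugging only in production**: Debugging directly in production is dangerous. Set up a local debugging environment\n2. **Ignoring temperature**: Forgetting to set temperature=0 while debugging makes results hard to reproduce. Note that temperature=0 reduces variance but does not guarantee identical outputs on every run\n3. **Overly verbose logs**: Logging full prompts drives log storage costs up sharply. Redact sensitive data and truncate\n4. **Not saving execution records**: When something goes wrong there is no data to replay. Always save trace records\n5. **Ignoring intermittent failures**: Investigating only when a failure is persistent. Intermittent failures require long-term monitoring",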
      "subsections": []
    },
    {
      "id": "小结",
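      "title": "Summary",
      "level": 2,
      "content": "Agent debugging takes both technique and patience. With tools and methods such as a Mock LLM, execution replay, chain-of-thought visualization, tool call tracing, and prompt version management, we can locate a wide range of problems in Agent systems effectively. The key principle is **observability first**: without logs and traces, debugging is groping in the dark.",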
      "subsections": []
    },
    {
      "id": "延伸阅读",
      "title": "Further Reading",
      "level": 2,
      "content": "1. **LangSmith documentation**: https://docs.smith.langchain.com/\n2. **Langfuse documentation**: https://langfuse.com/docs\n3. **PromptFoo**: https://promptfoo.dev/ (prompt testing framework)\n4. **Weave**: https://wandb.ai/weave (LLM tracing tool)\n5. **Paper**: \"Debugging LLM-as-a-Judge\" (debugging and validating evaluation methods)",
      "subsections": []
    }
  ],
  "code_blocks": [
    {
      "id": "code-1",
      "language": "python",
      "description": "",
      "code": "from collections import Counter\n\n\nclass NonDeterminismDebugger:\n    \"\"\"Debugging helper for non-deterministic Agent behavior.\"\"\"\n\n    def __init__(self):\n        self._execution_history: list[dict] = []\n\n    async def debug_with_replay(self, agent, input_data: str,\n                                 num_runs: int = 5) -> dict:\n        \"\"\"Run the same input several times and compare the results.\"\"\"\n        results = []\n        for i in range(num_runs):\n            result = await agent.run(input_data)\n            results.append({\n                \"run\": i + 1,\n                \"output\": result,\n                \"steps\": agent.get_last_trace(),\n            })\n        self._execution_history.extend(results)\n\n        # Consistency = share of runs that produced the most common output.\n        # str() keeps unhashable outputs (dicts, lists) countable.\n        counts = Counter(str(r[\"output\"]) for r in results)\n        top_count = counts.most_common(1)[0][1] if counts else 0\n\n        return {\n            \"consistency_rate\": top_count / num_runs if num_runs else 0.0,\n            \"unique_outputs\": len(counts),\n            \"total_runs\": num_runs,\n            \"results\": results,\n            \"diagnosis\": self._diagnose_inconsistency(results),\n        }\n\n    def _diagnose_inconsistency(self, results: list[dict]) -> str:\n        if all(r[\"output\"] == results[0][\"output\"] for r in results):\n            return \"Outputs are identical; non-determinism is not the source of the problem.\"\n\n        # Compare the sequence of tools chosen in each run\n        steps_variations = set(\n            tuple(s.get(\"tool\") for s in r[\"steps\"])\n            for r in results\n        )\n\n        if len(steps_variations) > 1:\n            return \"Reasoning paths differ: the Agent chose different tools or steps across runs.\"\n        return \"Reasoning paths match but outputs differ: likely LLM sampling variance; try lowering temperature.\"",
      "section_ref": "25.1.1",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-2",
      "language": "python",
      "description": "",
      "code": "from datetime import datetime\n\n\nclass MockLLM:\n    \"\"\"Mock LLM: stands in for real LLM calls while debugging.\"\"\"\n\n    def __init__(self):\n        self._responses: list[str] = []\n        self._call_log: list[dict] = []\n        self._mode = \"record\"  # record / replay / fixed\n        self._fixed_response = \"\"\n        self._index = 0\n\n    def set_fixed_response(self, response: str):\n        \"\"\"Return the same response for every call.\"\"\"\n        self._mode = \"fixed\"\n        self._fixed_response = response\n\n    def set_replay_responses(self, responses: list[str]):\n        \"\"\"Return the given responses in order, one per call.\"\"\"\n        self._mode = \"replay\"\n        self._responses = responses\n        self._index = 0\n\n    async def chat(self, messages: list[dict], **kwargs) -> str:\n        \"\"\"Simulate an LLM call.\"\"\"\n        # Log every call so tests can inspect what the Agent sent\n        call_info = {\n            \"messages\": messages,\n            \"model\": kwargs.get(\"model\", \"mock\"),\n            \"timestamp\": datetime.now().isoformat(),\n        }\n        self._call_log.append(call_info)\n\n        if self._mode == \"fixed\":\n            return self._fixed_response\n        elif self._mode == \"replay\":\n            if self._index < len(self._responses):\n                response = self._responses[self._index]\n                self._index += 1\n                return response\n            return \"Mock response exhausted\"\n        else:\n            return \"[MOCK] This is a mock response\"\n\n    def get_call_log(self) -> list[dict]:\n        return self._call_log\n\n# Usage example. Agent and search_tool are assumed to be defined elsewhere,\n# and the awaited call must run inside an async function or an async REPL.\nmock_llm = MockLLM()\nmock_llm.set_fixed_response(\n    'I need to call the search tool to answer this question.\\n'\n    '```json\\n{\"tool\": \"search\", \"args\": {\"query\": \"Python Agent\"}}\\n```'\n)\nagent = Agent(llm=mock_llm, tools=[search_tool])\nresult = await agent.run(\"What is a Python Agent?\")\nprint(mock_llm.get_call_log())  # Inspect how the LLM was called",
      "section_ref": "25.2.1",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-3",
      "language": "python",
      "description": "",
      "code": "import os\nimport pickle\nfrom datetime import datetime\nfrom typing import Any\n\n\nclass ExecutionReplayer:\n    \"\"\"Records and replays Agent executions.\"\"\"\n\n    def __init__(self, storage_path: str = \"replays/\"):\n        self.storage_path = storage_path\n        os.makedirs(storage_path, exist_ok=True)\n\n    async def record(self, trace_id: str,\n                     agent: Any, input_data: str,\n                     result: str):\n        \"\"\"Record a complete execution.\"\"\"\n        recording = {\n            \"trace_id\": trace_id,\n            \"input\": input_data,\n            \"result\": result,\n            \"llm_calls\": agent.llm_call_log,\n            \"tool_calls\": agent.tool_call_log,\n            \"decisions\": agent.decision_log,\n            \"timestamp\": datetime.now().isoformat(),\n        }\n\n        filepath = os.path.join(self.storage_path, f\"{trace_id}.pkl\")\n        with open(filepath, \"wb\") as f:\n            pickle.dump(recording, f)\n\n    async def replay(self, trace_id: str) -> dict:\n        \"\"\"Load a recorded execution and print a summary.\"\"\"\n        filepath = os.path.join(self.storage_path, f\"{trace_id}.pkl\")\n        # Only load recordings you created yourself: unpickling untrusted\n        # data can execute arbitrary code.\n        with open(filepath, \"rb\") as f:\n            recording = pickle.load(f)\n\n        print(f\"📋 Replaying trace: {trace_id}\")\n        print(f\"   Input: {recording['input'][:100]}...\")\n        print(f\"   Result: {recording['result'][:100]}...\")\n        print(f\"   LLM calls: {len(recording['llm_calls'])}\")\n        print(f\"   Tool calls: {len(recording['tool_calls'])}\")\n\n        return recording",
      "section_ref": "25.2.2",
      "runnable": true,
      "dependencies": [
        "pickle"
      ]
    },
    {
      "id": "code-4",
      "language": "python",
      "description": "",
      "code": "import json\n\n\nclass AgentDebugger:\n    \"\"\"Interactive debugger for an Agent.\"\"\"\n\n    def __init__(self, agent, step_mode: bool = False):\n        self.agent = agent\n        self.breakpoints: set[str] = set()\n        # In step mode every event pauses for input, not just breakpoints\n        self._step_mode = step_mode\n\n    def set_breakpoint(self, event_type: str):\n        \"\"\"Set a breakpoint on an event type (llm_call, tool_call, decision).\"\"\"\n        self.breakpoints.add(event_type)\n\n    async def debug_run(self, input_data: str) -> str:\n        \"\"\"Run the Agent with the debug hooks attached.\"\"\"\n        self.agent.on(\"llm_call\", self._on_llm_call)\n        self.agent.on(\"tool_call\", self._on_tool_call)\n        self.agent.on(\"decision\", self._on_decision)\n\n        result = await self.agent.run(input_data)\n        return result\n\n    async def _on_llm_call(self, event: dict):\n        if \"llm_call\" in self.breakpoints or self._step_mode:\n            print(f\"\\n🧠 LLM call:\")\n            print(f\"   Input: {event['input'][:200]}...\")\n            print(f\"   Model: {event.get('model', 'unknown')}\")\n\n            if self._step_mode:\n                action = input(\"   [c]ontinue [s]tep [m]odify [q]uit: \")\n                if action == 'q':\n                    raise KeyboardInterrupt()\n                elif action == 'c':\n                    self._step_mode = False  # run on until the next breakpoint\n                elif action == 'm':\n                    new_input = input(\"   New input: \")\n                    event['input'] = new_input\n\n    async def _on_tool_call(self, event: dict):\n        if \"tool_call\" in self.breakpoints or self._step_mode:\n            print(f\"\\n🔧 Tool call: {event['tool_name']}\")\n            print(f\"   Args: {event['args']}\")\n\n            if self._step_mode:\n                action = input(\"   [c]ontinue [s]kip [e]dit args [q]uit: \")\n                if action == 'q':\n                    raise KeyboardInterrupt()\n                elif action == 's':\n                    event['skip'] = True\n                elif action == 'e':\n                    new_args = input(\"   New args (JSON): \")\n                    event['args'] = json.loads(new_args)\n\n    async def _on_decision(self, event: dict):\n        if \"decision\" in self.breakpoints or self._step_mode:\n            print(f\"\\n💭 Decision: {event['reasoning'][:200]}...\")",
      "section_ref": "25.2.3",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-5",
      "language": "python",
      "description": "",
      "code": "import json\n\n\nclass ChainOfThoughtDebugger:\n    \"\"\"Chain-of-thought debugger.\"\"\"\n\n    def visualize(self, trace: list[dict]) -> str:\n        \"\"\"Render the chain of thought as plain text.\"\"\"\n        lines = [\"\\n\" + \"=\" * 70, \"🧠 Agent chain of thought\", \"=\" * 70]\n\n        for i, step in enumerate(trace):\n            step_type = step.get(\"type\", \"unknown\")\n\n            if step_type == \"reasoning\":\n                lines.append(f\"\\n📌 Step {i+1}: reasoning\")\n                lines.append(f\"   {step['content']}\")\n\n            elif step_type == \"tool_call\":\n                lines.append(f\"\\n🔧 Step {i+1}: tool call\")\n                lines.append(f\"   Tool: {step['tool_name']}\")\n                lines.append(f\"   Args: {json.dumps(step['args'], ensure_ascii=False)}\")\n                status = \"✅\" if step.get(\"success\") else \"❌\"\n                lines.append(f\"   Result: {status} {str(step.get('result', ''))[:200]}\")\n\n            elif step_type == \"observation\":\n                lines.append(f\"\\n👁️ Step {i+1}: observation\")\n                lines.append(f\"   {step['content'][:300]}\")\n\n            elif step_type == \"final_answer\":\n                lines.append(f\"\\n✅ Final answer:\")\n                lines.append(f\"   {step['content']}\")\n\n        lines.append(\"\\n\" + \"=\" * 70)\n        return \"\\n\".join(lines)\n\n    def find_anomalies(self, trace: list[dict]) -> list[dict]:\n        \"\"\"Detect anomalies in the chain of thought.\"\"\"\n        anomalies = []\n\n        for i, step in enumerate(trace):\n            # Loop detection: same tool and args within the previous 5 steps\n            if i > 0 and step.get(\"type\") == \"tool_call\":\n                for j in range(max(0, i-5), i):\n                    if (trace[j].get(\"tool_name\") == step.get(\"tool_name\") and\n                        trace[j].get(\"args\") == step.get(\"args\")):\n                        anomalies.append({\n                            \"type\": \"loop_detected\",\n                            \"step\": i,\n                            \"detail\": f\"Repeated call to {step['tool_name']} with identical args\"\n                        })\n                        break  # report each repeated step only once\n\n            # Failed tool calls\n            if step.get(\"type\") == \"tool_call\" and not step.get(\"success\"):\n                anomalies.append({\n                    \"type\": \"tool_failure\",\n                    \"step\": i,\n                    \"detail\": f\"{step['tool_name']} failed: {step.get('error', '')}\"\n                })\n\n            # Overly long reasoning\n            if step.get(\"type\") == \"reasoning\":\n                if len(step.get(\"content\", \"\")) > 2000:\n                    anomalies.append({\n                        \"type\": \"verbose_reasoning\",\n                        \"step\": i,\n                        \"detail\": \"Reasoning is very long and may waste tokens\"\n                    })\n\n        return anomalies",
      "section_ref": "25.3.1",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-6",
      "language": "python",
      "description": "",
      "code": "class StepInspector:\n    \"\"\"Reasoning step inspector.\"\"\"\n\n    def check_prompt_drift(self, trace: list[dict]) -> list[str]:\n        \"\"\"Check for prompt drift: has the Agent strayed from the original goal?\"\"\"\n        warnings = []\n\n        if not trace:\n            return warnings\n\n        # The first user input is treated as the original goal\n        original_goal = None\n        for step in trace:\n            if step.get(\"type\") == \"user_input\":\n                original_goal = step[\"content\"]\n                break\n\n        if not original_goal:\n            return warnings\n\n        # Check whether each tool call relates to the original goal\n        for i, step in enumerate(trace):\n            if step.get(\"type\") == \"tool_call\":\n                tool_name = step[\"tool_name\"]\n                if not self._is_relevant(tool_name, original_goal):\n                    warnings.append(\n                        f\"Step {i}: tool {tool_name} may be unrelated to the goal\"\n                    )\n\n        return warnings\n\n    def _is_relevant(self, tool_name: str, goal: str) -> bool:\n        \"\"\"Naive keyword-based relevance check.\"\"\"\n        goal_lower = goal.lower()\n        tool_lower = tool_name.lower()\n\n        # Keywords (Chinese and English) expected in the goal for each tool family\n        keywords_map = {\n            \"search\": [\"搜索\", \"查找\", \"查询\", \"search\", \"find\"],\n            \"calculate\": [\"计算\", \"统计\", \"分析\", \"calculate\"],\n            \"database\": [\"数据库\", \"记录\", \"数据\", \"database\"],\n            \"email\": [\"邮件\", \"发送\", \"通知\", \"email\", \"send\"],\n        }\n\n        for key, keywords in keywords_map.items():\n            if key in tool_lower:\n                # Known tool family: relevant only if the goal mentions a keyword\n                return any(kw in goal_lower for kw in keywords)\n\n        return True  # Unknown tools are treated as relevant by default",
      "section_ref": "25.3.2",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-7",
      "language": "python",
      "description": "",
      "code": "from datetime import datetime\nfrom typing import Any\n\n\nclass ToolCallTracer:\n    \"\"\"Tool call tracer.\"\"\"\n\n    def __init__(self):\n        self._traces: list[dict] = []\n\n    def trace_call(self, tool_name: str, args: dict,\n                   result: Any, duration_ms: float,\n                   success: bool, error: str | None = None):\n        \"\"\"Record one tool call with its outcome and timing.\"\"\"\n        self._traces.append({\n            \"tool_name\": tool_name,\n            \"args\": args,\n            # Keep falsy results such as 0 or \"\" instead of dropping them\n            \"result_preview\": str(result)[:200] if result is not None else None,\n            \"duration_ms\": duration_ms,\n            \"success\": success,\n            \"error\": error,\n            \"timestamp\": datetime.now().isoformat(),\n        })\n\n    def get_failed_calls(self) -> list[dict]:\n        return [t for t in self._traces if not t[\"success\"]]\n\n    def get_slow_calls(self, threshold_ms: float = 2000) -> list[dict]:\n        return [t for t in self._traces if t[\"duration_ms\"] > threshold_ms]\n\n    def get_summary(self) -> dict:\n        if not self._traces:\n            return {\"total_calls\": 0}\n\n        return {\n            \"total_calls\": len(self._traces),\n            \"success_rate\": sum(1 for t in self._traces if t[\"success\"]) / len(self._traces),\n            \"avg_duration_ms\": sum(t[\"duration_ms\"] for t in self._traces) / len(self._traces),\n            \"failed_calls\": self.get_failed_calls(),\n            \"slow_calls\": self.get_slow_calls(),\n            \"tools_called\": list(set(t[\"tool_name\"] for t in self._traces)),\n        }",
      "section_ref": "25.4.1",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-8",
      "language": "python",
      "description": "",
      "code": "from pydantic import BaseModel, ValidationError\n\nclass ToolParameterValidator:\n    \"\"\"Tool parameter validator.\"\"\"\n    \n    def __init__(self, tool_schemas: dict[str, type[BaseModel]]):\n        self.schemas = tool_schemas\n    \n    def validate(self, tool_name: str, args: dict) -> tuple[bool, str]:\n        \"\"\"Validate the arguments for a tool call.\"\"\"\n        if tool_name not in self.schemas:\n            return True, \"\"  # Tools without a registered schema skip validation\n        \n        schema = self.schemas[tool_name]\n        try:\n            schema(**args)\n            return True, \"\"\n        except ValidationError as e:\n            return False, self._format_error(e)\n    \n    def _format_error(self, error: ValidationError) -> str:\n        errors = []\n        for err in error.errors():\n            field = \".\".join(str(loc) for loc in err[\"loc\"])\n            errors.append(f\"  {field}: {err['msg']}\")\n        return \"Parameter validation failed:\\n\" + \"\\n\".join(errors)\n\n# Example: define a schema for a tool's parameters\nclass SearchArgs(BaseModel):\n    query: str\n    max_results: int = 5\n    language: str = \"zh\"\n\nvalidator = ToolParameterValidator({\"search\": SearchArgs})\nvalid, error_msg = validator.validate(\"search\", {\"query\": \"test\"})",
      "section_ref": "25.4.2",
      "runnable": true,
      "dependencies": [
        "pydantic"
      ]
    },
    {
      "id": "code-9",
      "language": "python",
      "description": "",
      "code": "class SideEffectRecorder:\n    \"\"\"Records side-effecting calls so they can be replayed without re-executing.\"\"\"\n\n    def __init__(self):\n        self._recordings: list[dict] = []\n        self._replaying = False\n        self._replay_index = 0\n\n    def start_recording(self):\n        self._recordings = []\n        self._replaying = False\n\n    def start_replaying(self):\n        self._replaying = True\n        self._replay_index = 0\n\n    async def execute_with_recording(self, func, *args, **kwargs):\n        if self._replaying:\n            # Replay mode: return the recorded result without calling func\n            if self._replay_index >= len(self._recordings):\n                raise IndexError(\"No recorded result left to replay\")\n            recording = self._recordings[self._replay_index]\n            self._replay_index += 1\n            return recording[\"result\"]\n\n        # Record mode: execute for real and store the result\n        result = await func(*args, **kwargs)\n        self._recordings.append({\n            \"func\": func.__name__,\n            \"args\": str(args),\n            \"kwargs\": str(kwargs),\n            \"result\": result,\n        })\n        return result",
      "section_ref": "25.4.3",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-10",
      "language": "python",
      "description": "",
      "code": "class PromptABTest:\n    \"\"\"Prompt A/B testing.\"\"\"\n\n    async def _run_with_prompt(self, prompt: str, input_data: str) -> str:\n        \"\"\"Hook: run the LLM or Agent with the given prompt. Override in a subclass.\"\"\"\n        raise NotImplementedError\n\n    async def run_test(self, prompt_a: str, prompt_b: str,\n                      test_cases: list[dict],\n                      evaluator) -> dict:\n        \"\"\"Run both prompts over the same test cases and compare scores.\"\"\"\n        if not test_cases:\n            raise ValueError(\"test_cases must not be empty\")\n\n        results_a = []\n        results_b = []\n\n        for case in test_cases:\n            input_data = case[\"input\"]\n            expected = case[\"expected\"]\n\n            # Evaluate prompt A\n            output_a = await self._run_with_prompt(prompt_a, input_data)\n            score_a = await evaluator.evaluate(expected, output_a)\n            results_a.append(score_a)\n\n            # Evaluate prompt B\n            output_b = await self._run_with_prompt(prompt_b, input_data)\n            score_b = await evaluator.evaluate(expected, output_b)\n            results_b.append(score_b)\n\n        total_a, total_b = sum(results_a), sum(results_b)\n        if total_a == total_b:\n            winner = \"tie\"\n        else:\n            winner = \"A\" if total_a > total_b else \"B\"\n\n        return {\n            \"prompt_a\": {\n                \"avg_score\": total_a / len(results_a),\n                \"scores\": results_a,\n            },\n            \"prompt_b\": {\n                \"avg_score\": total_b / len(results_b),\n                \"scores\": results_b,\n            },\n            \"winner\": winner,\n            # Percent change of B relative to A; None when A scored 0 overall\n            \"improvement\": (\n                (total_b - total_a) / total_a * 100 if total_a else None\n            )\n        }",
      "section_ref": "25.5.1",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-11",
      "language": "python",
      "description": "",
      "code": "import difflib\nimport glob\nimport hashlib\nimport json\nimport os\nfrom datetime import datetime\n\n\nclass PromptVersionManager:\n    \"\"\"Prompt version management.\"\"\"\n\n    def __init__(self, storage_path: str = \"prompts/\"):\n        self.storage_path = storage_path\n        os.makedirs(storage_path, exist_ok=True)\n\n    def save_version(self, prompt_name: str, version: str,\n                     content: str, description: str = \"\"):\n        \"\"\"Save one version of a prompt.\"\"\"\n        version_data = {\n            \"name\": prompt_name,\n            \"version\": version,\n            \"content\": content,\n            \"description\": description,\n            \"created_at\": datetime.now().isoformat(),\n            # MD5 serves only as a change fingerprint here, not for security\n            \"content_hash\": hashlib.md5(content.encode()).hexdigest(),\n        }\n\n        filepath = os.path.join(\n            self.storage_path, f\"{prompt_name}_v{version}.json\"\n        )\n        with open(filepath, \"w\", encoding=\"utf-8\") as f:\n            json.dump(version_data, f, ensure_ascii=False, indent=2)\n\n    def load_version(self, prompt_name: str,\n                     version: str) -> str:\n        \"\"\"Load the content of one prompt version.\"\"\"\n        filepath = os.path.join(\n            self.storage_path, f\"{prompt_name}_v{version}.json\"\n        )\n        with open(filepath, \"r\", encoding=\"utf-8\") as f:\n            data = json.load(f)\n        return data[\"content\"]\n\n    @staticmethod\n    def _version_key(version: str) -> tuple:\n        \"\"\"Sort key that orders numeric parts numerically (10 after 2).\"\"\"\n        return tuple(\n            (0, int(p), \"\") if p.isdigit() else (1, 0, p)\n            for p in version.split(\".\")\n        )\n\n    def list_versions(self, prompt_name: str) -> list[dict]:\n        \"\"\"List all saved versions of a prompt.\"\"\"\n        versions = []\n        pattern = os.path.join(\n            self.storage_path, f\"{prompt_name}_v*.json\"\n        )\n        for filepath in glob.glob(pattern):\n            with open(filepath, \"r\", encoding=\"utf-8\") as f:\n                data = json.load(f)\n            versions.append({\n                \"version\": data[\"version\"],\n                \"description\": data[\"description\"],\n                \"created_at\": data[\"created_at\"],\n                \"hash\": data[\"content_hash\"],\n            })\n        return sorted(versions, key=lambda x: self._version_key(x[\"version\"]))\n\n    def diff_versions(self, prompt_name: str,\n                      v1: str, v2: str) -> str:\n        \"\"\"Return a unified diff between two versions.\"\"\"\n        c1 = self.load_version(prompt_name, v1)\n        c2 = self.load_version(prompt_name, v2)\n\n        diff = difflib.unified_diff(\n            c1.splitlines(), c2.splitlines(),\n            fromfile=f\"v{v1}\", tofile=f\"v{v2}\", lineterm=\"\"\n        )\n        return \"\\n\".join(diff)",
      "section_ref": "25.5.2",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-12",
      "language": "python",
      "description": "",
      "code": "class LoopDetector:\n    \"\"\"Circular reasoning detector.\"\"\"\n\n    def __init__(self, max_repeated_steps: int = 3):\n        # Used both as the longest pattern length checked and as the\n        # number of consecutive repetitions that counts as a loop\n        self.max_repeated = max_repeated_steps\n\n    def detect(self, trace: list[dict]) -> dict | None:\n        \"\"\"Return loop details if the trace ends in a repeating tool-call pattern.\"\"\"\n        # Compare tool calls only: steps without a tool_name (for example\n        # reasoning) would otherwise all look identical to each other\n        calls = [s for s in trace if s.get(\"tool_name\")]\n\n        for pattern_len in range(1, self.max_repeated + 1):\n            window = pattern_len * self.max_repeated\n            if len(calls) < window:\n                break\n\n            # Split the most recent calls into max_repeated equal segments\n            recent = calls[-window:]\n            pattern = recent[:pattern_len]\n            segments = [\n                recent[i:i + pattern_len]\n                for i in range(0, window, pattern_len)\n            ]\n\n            if all(self._steps_match(pattern, seg) for seg in segments):\n                return {\n                    \"type\": \"loop\",\n                    \"pattern_length\": pattern_len,\n                    \"repetitions\": len(segments),\n                    \"pattern\": [s[\"tool_name\"] for s in pattern],\n                    \"suggestion\": \"Check whether the tool's return value changes, or add a termination condition\"\n                }\n\n        return None\n\n    def _steps_match(self, a: list[dict], b: list[dict]) -> bool:\n        if len(a) != len(b):\n            return False\n        for sa, sb in zip(a, b):\n            if sa.get(\"tool_name\") != sb.get(\"tool_name\"):\n                return False\n            if sa.get(\"args\") != sb.get(\"args\"):\n                return False\n        return True",
      "section_ref": "25.6.1",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-13",
      "language": "python",
      "description": "",
      "code": "class ToolFailureDiagnoser:\n    \"\"\"Tool failure diagnosis.\"\"\"\n    \n    # Symptoms are matched as case-insensitive substrings of the error text\n    COMMON_FAILURES = {\n        \"timeout\": {\n            \"symptoms\": [\"TimeoutError\", \"timed out\", \"连接超时\"],\n            \"solutions\": [\n                \"Increase the timeout\",\n                \"Check network connectivity\",\n                \"Add a retry mechanism\",\n                \"Use asynchronous calls\"\n            ]\n        },\n        \"auth_failure\": {\n            \"symptoms\": [\"401\", \"403\", \"Unauthorized\", \"Authentication\"],\n            \"solutions\": [\n                \"Check whether the API key has expired\",\n                \"Verify the permission scope\",\n                \"Check the token refresh logic\",\n                \"Confirm the account status\"\n            ]\n        },\n        \"rate_limit\": {\n            \"symptoms\": [\"429\", \"Rate limit\", \"Too many requests\"],\n            \"solutions\": [\n                \"Implement a rate limiter\",\n                \"Increase the interval between requests\",\n                \"Use caching to reduce calls\",\n                \"Request a higher rate limit\"\n            ]\n        },\n        \"invalid_params\": {\n            \"symptoms\": [\"400\", \"Bad Request\", \"Invalid\", \"required\"],\n            \"solutions\": [\n                \"Check parameter types and formats\",\n                \"Verify that required parameters are provided\",\n                \"Check that enum values are valid\",\n                \"Add parameter validation\"\n            ]\n        },\n    }\n    \n    def diagnose(self, error: Exception) -> dict:\n        \"\"\"Diagnose the likely cause of a tool failure.\"\"\"\n        error_str = str(error)\n        \n        for failure_type, info in self.COMMON_FAILURES.items():\n            for symptom in info[\"symptoms\"]:\n                if symptom.lower() in error_str.lower():\n                    return {\n                        \"type\": failure_type,\n                        \"solutions\": info[\"solutions\"],\n                        \"error\": error_str,\n                    }\n        \n        return {\n            \"type\": \"unknown\",\n            \"solutions\": [\"Review the detailed error logs\", \"Check the API documentation\", \"Contact the service provider\"],\n            \"error\": error_str,\n        }",
      "section_ref": "25.6.2",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-14",
      "language": "python",
      "description": "",
      "code": "class ContextOverflowHandler:\n    \"\"\"Context overflow handler.\"\"\"\n\n    def __init__(self, max_tokens: int = 128000):\n        self.max_tokens = max_tokens\n\n    def check_and_fix(self, messages: list[dict]) -> tuple[list[dict], str]:\n        \"\"\"Check for context overflow and trim the history if needed.\"\"\"\n        total = sum(self._estimate_tokens(m.get(\"content\")) for m in messages)\n\n        if total <= self.max_tokens:\n            return messages, \"ok\"\n\n        # Over budget: fall back to emergency trimming. The result can still\n        # exceed max_tokens if the kept messages are very long.\n        messages = self._emergency_trim(messages)\n        new_total = sum(self._estimate_tokens(m.get(\"content\")) for m in messages)\n\n        return messages, (\n            f\"context_overflow: {total} → {new_total} tokens \"\n            f\"(trimmed {total - new_total} tokens)\"\n        )\n\n    def _emergency_trim(self, messages: list[dict]) -> list[dict]:\n        \"\"\"Emergency trim: keep system messages plus the last 3 rounds (6 messages).\"\"\"\n        system = [m for m in messages if m[\"role\"] == \"system\"]\n        non_system = [m for m in messages if m[\"role\"] != \"system\"]\n\n        kept = non_system[-6:] if len(non_system) > 6 else non_system\n        return system + kept\n\n    def _estimate_tokens(self, text) -> int:\n        # Rough heuristic of about 3 characters per token; str() also covers\n        # None or non-string content. Use the model's tokenizer for real counts.\n        return len(str(text or \"\")) // 3",
      "section_ref": "25.6.3",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-15",
      "language": "python",
      "description": "",
      "code": "class LogCorrelator:\n    \"\"\"Correlate the logs of a single trace to suggest a root cause.\"\"\"\n\n    def find_root_cause(self, trace_id: str,\n                        logs: list[dict]) -> dict:\n        \"\"\"Find a likely root cause by correlating the logs of one trace.\"\"\"\n        trace_logs = [log for log in logs if log.get(\"trace_id\") == trace_id]\n\n        # Sort into a timeline\n        trace_logs.sort(key=lambda log: log.get(\"timestamp\", \"\"))\n\n        # Find the first ERROR or WARNING entry\n        first_error = None\n        for log in trace_logs:\n            if log.get(\"level\") in (\"ERROR\", \"WARNING\"):\n                first_error = log\n                break\n\n        # Find the slowest step\n        slowest = max(\n            (log for log in trace_logs if log.get(\"duration_ms\")),\n            key=lambda log: log.get(\"duration_ms\", 0),\n            default=None\n        )\n\n        # Find the step that consumed the most tokens\n        most_tokens = max(\n            (log for log in trace_logs if log.get(\"tokens\")),\n            key=lambda log: log.get(\"tokens\", 0),\n            default=None\n        )\n\n        return {\n            \"trace_id\": trace_id,\n            \"total_steps\": len(trace_logs),\n            \"first_error\": first_error,\n            \"slowest_step\": slowest,\n            \"most_tokens_step\": most_tokens,\n            \"root_cause_hypothesis\": self._hypothesize(\n                first_error, slowest, most_tokens\n            ),\n        }\n\n    def _hypothesize(self, error, slowest, most_tokens) -> str:\n        # most_tokens is accepted for symmetry but not used by this heuristic\n        if error:\n            return f\"First error/warning: {error.get('message', 'unknown')}\"\n        if slowest and slowest.get(\"duration_ms\", 0) > 5000:\n            return f\"Performance bottleneck: {slowest.get('type', 'unknown')} took {slowest.get('duration_ms')}ms\"\n        return \"No obvious anomaly found\"",
      "section_ref": "25.7.1",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-16",
      "language": "python",
      "description": "",
      "code": "from typing import Optional\n\n\nclass LangfuseDebugger:\n    \"\"\"Langfuse debugging integration.\n\n    Written against the low-level trace()/generation() calls of the\n    Langfuse Python SDK v2; check the SDK version you have installed,\n    since later releases changed this interface.\n    \"\"\"\n\n    def __init__(self, public_key: str, secret_key: str,\n                 host: str = \"https://cloud.langfuse.com\"):\n        from langfuse import Langfuse\n        # Keyword arguments avoid depending on positional parameter order\n        self.langfuse = Langfuse(\n            public_key=public_key, secret_key=secret_key, host=host\n        )\n\n    def trace_agent(self, trace_id: str, agent_name: str,\n                    input_data: str, output_data: str,\n                    metadata: Optional[dict] = None):\n        \"\"\"Record an agent execution trace.\"\"\"\n        self.langfuse.trace(\n            id=trace_id,\n            name=f\"{agent_name}_execution\",\n            input=input_data,\n            output=output_data,\n            metadata=metadata or {},\n        )\n\n    def log_llm_call(self, trace_id: str, span_id: str,\n                     model: str, prompt: str, completion: str,\n                     usage: dict):\n        \"\"\"Record a single LLM call under an existing trace.\"\"\"\n        self.langfuse.generation(\n            trace_id=trace_id,\n            id=span_id,\n            model=model,\n            input=prompt,\n            output=completion,\n            usage=usage,\n        )",
      "section_ref": "25.8.2",
      "runnable": true,
      "dependencies": [
        "langfuse"
      ]
    },
    {
      "id": "code-17",
      "language": "python",
      "description": "",
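      "code": "class AgentHealthDashboard:\n    \"\"\"Agent health status dashboard.\"\"\"\n\n    # Higher value means worse health\n    _SEVERITY = {\"healthy\": 0, \"degraded\": 1, \"unhealthy\": 2}\n\n    def __init__(self, metrics_collector):\n        self.metrics = metrics_collector\n\n    def get_health_status(self) -> dict:\n        \"\"\"Return the overall health status.\"\"\"\n        summary = self.metrics.get_summary()\n\n        health = \"healthy\"\n        issues = []\n\n        # Check the error rate (success_rate is a percentage, 0-100)\n        error_rate = 1 - summary[\"requests\"][\"success_rate\"] / 100\n        if error_rate > 0.1:\n            health = self._worst(health, \"unhealthy\")\n            issues.append(f\"Error rate too high: {error_rate:.1%}\")\n        elif error_rate > 0.05:\n            health = self._worst(health, \"degraded\")\n            issues.append(f\"Error rate elevated: {error_rate:.1%}\")\n\n        # Check latency; never downgrade an earlier \"unhealthy\" verdict\n        p95 = summary[\"latency_ms\"][\"e2e_p95\"]\n        if p95 > 30000:\n            health = self._worst(health, \"unhealthy\")\n            issues.append(f\"P95 latency too high: {p95:.0f}ms\")\n        elif p95 > 15000:\n            health = self._worst(health, \"degraded\")\n            issues.append(f\"P95 latency elevated: {p95:.0f}ms\")\n\n        return {\n            \"status\": health,\n            \"issues\": issues,\n            \"metrics\": summary,\n        }\n\n    def _worst(self, a: str, b: str) -> str:\n        \"\"\"Return the more severe of two status values.\"\"\"\n        return max(a, b, key=self._SEVERITY.__getitem__)",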
      "section_ref": "25.9.1",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-18",
      "language": "python",
      "description": "",
      "code": "import time\n\n\nclass FaultInjector:\n    \"\"\"Fault injector: proactively test the agent's fault tolerance.\"\"\"\n\n    def __init__(self, agent):\n        self.agent = agent\n        self._fault_config: dict = {}\n\n    def inject_delay(self, tool_name: str, delay_ms: float):\n        \"\"\"Register a delay fault for a tool.\"\"\"\n        self._fault_config[f\"delay_{tool_name}\"] = delay_ms\n\n    def inject_error(self, tool_name: str, error_msg: str):\n        \"\"\"Register an error fault for a tool.\"\"\"\n        self._fault_config[f\"error_{tool_name}\"] = error_msg\n\n    def inject_timeout(self, tool_name: str):\n        \"\"\"Register a timeout fault for a tool.\"\"\"\n        self._fault_config[f\"timeout_{tool_name}\"] = True\n\n    async def run_with_faults(self, task: str) -> dict:\n        \"\"\"Run the task under the configured fault conditions.\n\n        NOTE: this method only records the configured faults in its result.\n        Applying them is assumed to happen in the agent's tool layer,\n        which is not shown here.\n        \"\"\"\n        start = time.perf_counter()\n        try:\n            result = await self.agent.run(task)\n            return {\n                \"success\": True,\n                \"result\": result,\n                \"duration_ms\": (time.perf_counter() - start) * 1000,\n                # Copy so later inject_* calls do not mutate this report\n                \"faults_applied\": dict(self._fault_config),\n            }\n        except Exception as e:\n            return {\n                \"success\": False,\n                \"error\": str(e),\n                \"duration_ms\": (time.perf_counter() - start) * 1000,\n                \"faults_applied\": dict(self._fault_config),\n            }",
      "section_ref": "25.9.2",
      "runnable": true,
      "dependencies": []
    }
  ],
  "tables": [
    {
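      "headers": [
        "Problem type",
        "Difficulty",
        "Typical symptoms",
        "Debugging method"
      ],
      "data": [
        [
          "Code bug",
          "⭐",
          "Exceptions, crashes",
          "Traditional debugger"
        ],
        [
          "Prompt issue",
          "⭐⭐",
          "Malformed output, instructions not followed",
          "Prompt version testing"
        ],
        [
          "Tool call failure",
          "⭐⭐",
          "Timeouts, invalid parameters",
          "Tool log analysis"
        ],
        [
          "Wrong reasoning path",
          "⭐⭐⭐",
          "Picks the wrong tool or direction",
          "Chain-of-thought visualization"
        ],
        [
          "Hallucination",
          "⭐⭐⭐⭐",
          "Fabricated information",
          "Fact checking, RAG augmentation"
        ],
        [
          "Reasoning loop",
          "⭐⭐⭐⭐",
          "Agent repeats the same action",
          "Step counter, termination conditions"
        ],
        [
          "Context overflow",
          "⭐⭐⭐",
          "Token limit exceeded, truncation",
          "Context monitoring"
        ]
      ]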
    },
    {
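      "headers": [
        "Tool",
        "Type",
        "Core features",
        "Best suited for",
        "Open source"
      ],
      "data": [
        [
          "**LangSmith**",
          "Cloud service",
          "Tracing, evaluation, debugging",
          "LangChain ecosystem",
          "❌"
        ],
        [
          "**Weave**",
          "Open source",
          "Tracing, evaluation, experiment management",
          "General LLM applications",
          "✅"
        ],
        [
          "**PromptFoo**",
          "Open source",
          "Prompt testing, evaluation, red teaming",
          "Prompt engineering",
          "✅"
        ],
        [
          "**Arize Phoenix**",
          "Open source",
          "Tracing, evaluation, observability",
          "Production monitoring",
          "✅"
        ],
        [
          "**Langfuse**",
          "Open source",
          "Tracing, prompt management, evaluation",
          "Team collaboration",
          "✅"
        ],
        [
          "**Helicone**",
          "Open source",
          "Caching, logging, cost tracking",
          "Cost optimization",
          "✅"
        ]
      ]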
    }
  ],
  "key_takeaways": [],
  "common_pitfalls": [],
  "related_chapters": [
    "ch15"
  ]
}