{
  "metadata": {
    "id": "ch24",
    "title": "第24章：性能调优",
    "volume": "vol7",
    "volume_title": "Agent编程技法",
    "word_count": 1699,
    "difficulty": "intermediate",
    "prerequisites": [
      "ch04"
    ],
    "key_concepts": [
      "Agent系统性能概述",
      "延迟来源分析",
      "性能优化决策树",
      "LLM推理优化",
      "模型选择策略",
      "Prompt压缩",
      "上下文窗口管理",
      "并发与异步",
      "并发工具调用",
      "流式响应",
      "预取与预热",
      "缓存策略",
      "语义缓存",
      "Prompt缓存",
      "工具结果缓存"
    ],
    "learning_objectives": [],
    "estimated_tokens": 1019,
    "source_file": "vol7/ch24_性能调优.md"
  },
  "overview": "Agent 系统的性能瓶颈与传统软件有本质不同——它不仅涉及代码执行效率，更核心的是 LLM 推理时间、Token 消耗和多步推理的累积延迟。一个需要 5 次 LLM 调用和 3 次工具执行的 Agent 任务，即使每次调用只有 2 秒，端到端延迟也高达 16 秒。本章将系统讲解 Agent 性能优化的全链路方法，从 LLM 推理优化到并发调度，从缓存策略到基础设施优化，帮助你将 Agent 响应时间从\"分钟级\"压缩到\"秒级\"。",
  "sections": [
    {
      "id": "24.1",
      "title": "24.1 Agent系统性能概述",
      "level": 2,
      "content": "",
      "subsections": [
        {
          "id": "24.1.1",
          "title": "24.1.1 延迟来源分析",
          "content": "一次典型的 Agent 请求延迟由多个环节叠加而成：\n\n\n| 环节 | 典型延迟 | 占比 | 优化空间 |\n|------|---------|------|---------|\n| LLM推理 | 1-10秒 | 60-80% | 模型选择、Prompt优化 |\n| 工具执行 | 0.1-5秒 | 10-30% | 并发、缓存 |\n| 上下文构建 | 0.05-0.5秒 | 2-5% | 预构建、增量更新 |\n| 网络传输 | 0.01-0.5秒 | 1-5% | 连接池、CDN |\n| 后处理 | 0.01-0.1秒 | <1% | 流式输出 |"
        },
        {
          "id": "24.1.2",
          "title": "24.1.2 性能优化决策树",
          "content": ""
        }
      ]
    },
    {
      "id": "24.2",
      "title": "24.2 LLM推理优化",
      "level": 2,
      "content": "",
      "subsections": [
        {
          "id": "24.2.1",
          "title": "24.2.1 模型选择策略",
          "content": ""
        },
        {
          "id": "24.2.2",
          "title": "24.2.2 Prompt压缩",
          "content": ""
        },
        {
          "id": "24.2.3",
          "title": "24.2.3 上下文窗口管理",
          "content": ""
        }
      ]
    },
    {
      "id": "24.3",
      "title": "24.3 并发与异步",
      "level": 2,
      "content": "",
      "subsections": [
        {
          "id": "24.3.1",
          "title": "24.3.1 并发工具调用",
          "content": ""
        },
        {
          "id": "24.3.2",
          "title": "24.3.2 流式响应",
          "content": ""
        },
        {
          "id": "24.3.3",
          "title": "24.3.3 预取与预热",
          "content": ""
        }
      ]
    },
    {
      "id": "24.4",
      "title": "24.4 缓存策略",
      "level": 2,
      "content": "",
      "subsections": [
        {
          "id": "24.4.1",
          "title": "24.4.1 语义缓存",
          "content": ""
        },
        {
          "id": "24.4.2",
          "title": "24.4.2 Prompt缓存",
          "content": ""
        },
        {
          "id": "24.4.3",
          "title": "24.4.3 工具结果缓存",
          "content": ""
        }
      ]
    },
    {
      "id": "24.5",
      "title": "24.5 Token使用优化",
      "level": 2,
      "content": "",
      "subsections": [
        {
          "id": "24.5.1",
          "title": "24.5.1 上下文裁剪",
          "content": ""
        },
        {
          "id": "24.5.2",
          "title": "24.5.2 滑动窗口策略",
          "content": ""
        }
      ]
    },
    {
      "id": "24.6",
      "title": "24.6 工具调用优化",
      "level": 2,
      "content": "",
      "subsections": [
        {
          "id": "24.6.1",
          "title": "24.6.1 并行调用策略",
          "content": ""
        },
        {
          "id": "24.6.2",
          "title": "24.6.2 轻量工具替代",
          "content": ""
        }
      ]
    },
    {
      "id": "24.7",
      "title": "24.7 基础设施优化",
      "level": 2,
      "content": "",
      "subsections": [
        {
          "id": "24.7.1",
          "title": "24.7.1 连接池",
          "content": ""
        },
        {
          "id": "24.7.2",
          "title": "24.7.2 GPU资源管理",
          "content": ""
        }
      ]
    },
    {
      "id": "24.8",
      "title": "24.8 性能基准测试",
      "level": 2,
      "content": "",
      "subsections": [
        {
          "id": "24.8.1",
          "title": "24.8.1 延迟基准",
          "content": ""
        }
      ]
    },
    {
      "id": "24.9",
      "title": "24.9 性能监控与持续优化",
      "level": 2,
      "content": "",
      "subsections": [
        {
          "id": "24.9.1",
          "title": "24.9.1 性能预算",
          "content": ""
        },
        {
          "id": "24.9.2",
          "title": "24.9.2 自动降级",
          "content": ""
        }
      ]
    },
    {
      "id": "最佳实践",
      "title": "最佳实践",
      "level": 2,
      "content": "1. **测量优先**：在优化前先建立性能基线，没有数据的优化是盲目的\n2. **LLM是最大瓶颈**：60-80%的延迟来自LLM推理，优先优化模型选择和Prompt\n3. **能缓存就缓存**：语义缓存、工具结果缓存、Prompt缓存可以大幅减少重复计算\n4. **并发无依赖的工具**：分析依赖关系，最大化并行执行\n5. **设置性能预算**：明确各环节的延迟和成本预算，超出时自动降级",
      "subsections": []
    },
    {
      "id": "常见陷阱",
      "title": "常见陷阱",
      "level": 2,
      "content": "1. **过早优化**：在功能不稳定时做性能优化，浪费精力。先正确，再快速\n2. **忽略冷启动**：首次请求延迟特别高（模型加载、连接建立）。考虑预热\n3. **缓存不一致**：缓存了过期的工具结果。设置合理的TTL和失效策略\n4. **并发过多**：无限制并发可能导致API限流或资源耗尽。使用信号量控制\n5. **只优化延迟忽略成本**：用最大模型虽然快了，但成本可能不可接受",
      "subsections": []
    },
    {
      "id": "小结",
      "title": "小结",
      "level": 2,
      "content": "Agent 系统性能优化是一个系统工程，需要从 LLM 推理、工具调用、并发调度、缓存策略、Token管理和基础设施等多个层面协同优化。核心原则是：**测量→定位瓶颈→针对性优化→验证效果→持续监控**。记住，最快的代码是不需要执行的代码——缓存和预取是 Agent 性能优化的利器。",
      "subsections": []
    },
    {
      "id": "延伸阅读",
      "title": "延伸阅读",
      "level": 2,
      "content": "1. **OpenAI Prompt Engineering Guide**: https://platform.openai.com/docs/guides/prompt-engineering\n2. **论文**: \"Prompt Compression\" — Prompt压缩技术\n3. **论文**: \"GPTCache\" — 语义缓存框架\n4. **文章**: \"Optimizing LLM Applications for Production\" — 生产级LLM优化\n5. **工具**: vLLM, TGI — 高性能LLM推理引擎",
      "subsections": []
    }
  ],
  "code_blocks": [
    {
      "id": "code-1",
      "language": "mermaid",
      "description": "一次典型的 Agent 请求延迟由多个环节叠加而成：",
      "code": "graph LR\n    A[用户输入<br>~10ms] --> B[Prompt构建<br>~50ms]\n    B --> C[LLM推理<br>2000-10000ms]\n    C --> D{需要工具调用?}\n    D -->|是| E[工具执行<br>100-5000ms]\n    E --> F[结果处理<br>~50ms]\n    F --> C\n    D -->|否| G[输出格式化<br>~20ms]\n    G --> H[响应返回<br>~10ms]",
      "section_ref": "24.1.1",
      "runnable": false,
      "dependencies": []
    },
    {
      "id": "code-2",
      "language": "python",
      "description": "| 后处理 | 0.01-0.1秒 | <1% | 流式输出 |",
      "code": "def diagnose_performance_issue(metrics: dict) -> list[str]:\n    \"\"\"性能问题诊断\"\"\"\n    recommendations = []\n    \n    e2e_latency = metrics.get(\"e2e_latency_ms\", 0)\n    llm_latency = metrics.get(\"llm_latency_ms\", 0)\n    tool_latency = metrics.get(\"tool_latency_ms\", 0)\n    steps = metrics.get(\"avg_steps\", 0)\n    \n    # 诊断LLM延迟\n    if llm_latency > 5000:\n        recommendations.append(\n            \"🔍 LLM延迟过高(>5s): 考虑使用更快的模型或优化Prompt长度\"\n        )\n    if llm_latency / e2e_latency > 0.8:\n        recommendations.append(\n            \"🔍 LLM占总延迟>80%: 优先优化LLM调用\"\n        )\n    \n    # 诊断工具延迟\n    if tool_latency > 2000:\n        recommendations.append(\n            \"🔍 工具延迟过高(>2s): 检查工具实现，添加缓存\"\n        )\n    \n    # 诊断推理步数\n    if steps > 10:\n        recommendations.append(\n            \"🔍 平均推理步数过多(>10): 优化规划策略，减少不必要的循环\"\n        )\n    \n    # 诊断Token使用\n    tokens_per_request = metrics.get(\"avg_tokens\", 0)\n    if tokens_per_request > 10000:\n        recommendations.append(\n            \"🔍 Token消耗过高(>10K/请求): 压缩上下文，使用更精确的Prompt\"\n        )\n    \n    return recommendations",
      "section_ref": "24.1.2",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-3",
      "language": "python",
      "description": "",
      "code": "from dataclasses import dataclass\n\n@dataclass\nclass ModelProfile:\n    name: str\n    max_tokens: int\n    input_cost_per_1m: float  # 美元/百万Token\n    output_cost_per_1m: float\n    avg_latency_ms: float  # 典型延迟\n    quality_score: float  # 1-10\n\nMODELS = {\n    \"gpt-4o\": ModelProfile(\"gpt-4o\", 128000, 2.5, 10.0, 2500, 9.5),\n    \"gpt-4o-mini\": ModelProfile(\"gpt-4o-mini\", 128000, 0.15, 0.6, 800, 7.5),\n    \"claude-3.5-sonnet\": ModelProfile(\"claude-3.5-sonnet\", 200000, 3.0, 15.0, 3000, 9.0),\n    \"claude-3-haiku\": ModelProfile(\"claude-3-haiku\", 200000, 0.25, 1.25, 600, 7.0),\n}\n\nclass ModelRouter:\n    \"\"\"智能模型路由：根据任务复杂度选择模型\"\"\"\n    \n    def __init__(self):\n        self.complexity_keywords = {\n            \"high\": [\"分析\", \"推理\", \"规划\", \"设计\", \"优化\", \"评估\", \"对比\"],\n            \"low\": [\"翻译\", \"摘要\", \"提取\", \"分类\", \"格式化\", \"简单查询\"],\n        }\n    \n    def select_model(self, task: str, budget: float = None,\n                    max_latency_ms: float = None) -> str:\n        \"\"\"选择最优模型\"\"\"\n        # 评估任务复杂度\n        complexity = self._assess_complexity(task)\n        \n        # 筛选候选模型\n        candidates = []\n        for name, profile in MODELS.items():\n            if max_latency_ms and profile.avg_latency_ms > max_latency_ms:\n                continue\n            if budget:\n                estimated_cost = profile.input_cost_per_1m * 0.004  # ~4K tokens\n                if estimated_cost > budget:\n                    continue\n            candidates.append((name, profile))\n        \n        # 根据复杂度选择\n        if complexity == \"high\":\n            # 高复杂度：选质量最高的\n            candidates.sort(key=lambda x: x[1].quality_score, reverse=True)\n        elif complexity == \"low\":\n            # 低复杂度：选最快的\n            candidates.sort(key=lambda x: x[1].avg_latency_ms)\n        else:\n            # 中等：选性价比最高的\n            candidates.sort(\n                key=lambda x: x[1].quality_score / 
x[1].avg_latency_ms,\n                reverse=True\n            )\n        \n        return candidates[0][0] if candidates else \"gpt-4o-mini\"\n    \n    def _assess_complexity(self, task: str) -> str:\n        high_count = sum(1 for kw in self.complexity_keywords[\"high\"] \n                        if kw in task)\n        low_count = sum(1 for kw in self.complexity_keywords[\"low\"] \n                       if kw in task)\n        if high_count >= 2:\n            return \"high\"\n        elif low_count >= 2:\n            return \"low\"\n        return \"medium\"",
      "section_ref": "24.2.1",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-4",
      "language": "python",
      "description": "",
      "code": "class PromptCompressor:\n    \"\"\"Prompt压缩器\"\"\"\n    \n    def __init__(self, llm):\n        self.llm = llm\n    \n    async def compress(self, prompt: str, \n                      target_ratio: float = 0.5) -> str:\n        \"\"\"压缩Prompt，保留核心信息\"\"\"\n        compression_prompt = f\"\"\"\n        请压缩以下文本，保留所有关键信息，去除冗余描述。\n        目标：压缩到原文本的 {int(target_ratio * 100)}% 长度。\n        \n        原始文本:\n        {prompt}\n        \n        压缩后的文本:\n        \"\"\"\n        result = await self.llm.generate(compression_prompt)\n        return result\n    \n    def remove_redundant_instructions(self, system_prompt: str) -> str:\n        \"\"\"移除冗余指令\"\"\"\n        # 常见冗余模式\n        redundant_patterns = [\n            \"请务必\", \"一定要\", \"千万不要\",\n            \"你是一个专业的\", \"作为一个AI助手\",\n        ]\n        cleaned = system_prompt\n        for pattern in redundant_patterns:\n            cleaned = cleaned.replace(pattern, \"\")\n        return cleaned.strip()\n    \n    def tokenize_efficiently(self, messages: list[dict]) -> list[dict]:\n        \"\"\"Token高效的消息格式\"\"\"\n        optimized = []\n        for msg in messages:\n            content = msg[\"content\"]\n            # 移除多余空格和换行\n            content = \" \".join(content.split())\n            optimized.append({\"role\": msg[\"role\"], \"content\": content})\n        return optimized",
      "section_ref": "24.2.2",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-5",
      "language": "python",
      "description": "",
      "code": "class ContextWindowManager:\n    \"\"\"上下文窗口管理器\"\"\"\n    \n    def __init__(self, max_tokens: int = 128000,\n                 reserve_for_output: int = 4096):\n        self.max_tokens = max_tokens\n        self.reserve = reserve_for_output\n        self.available = max_tokens - reserve_for_output\n    \n    def fit_messages(self, messages: list[dict],\n                    model_context_limit: int) -> list[dict]:\n        \"\"\"确保消息在上下文窗口内\"\"\"\n        total_tokens = sum(\n            self._estimate_tokens(msg[\"content\"]) \n            for msg in messages\n        )\n        \n        if total_tokens <= self.available:\n            return messages\n        \n        # 裁剪策略：保留系统消息 + 最近的消息\n        system_msgs = [m for m in messages if m[\"role\"] == \"system\"]\n        other_msgs = [m for m in messages if m[\"role\"] != \"system\"]\n        \n        # 计算系统消息占用\n        system_tokens = sum(\n            self._estimate_tokens(m[\"content\"]) \n            for m in system_msgs\n        )\n        remaining = self.available - system_tokens\n        \n        # 从最近的消息开始保留\n        fitted = list(system_msgs)\n        for msg in reversed(other_msgs):\n            msg_tokens = self._estimate_tokens(msg[\"content\"])\n            if remaining >= msg_tokens:\n                fitted.insert(len(system_msgs), msg)\n                remaining -= msg_tokens\n            else:\n                break\n        \n        return fitted\n    \n    def _estimate_tokens(self, text: str) -> int:\n        \"\"\"粗略估算Token数\"\"\"\n        return len(text) // 3  # 中文约1字=2-3Token",
      "section_ref": "24.2.3",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-6",
      "language": "python",
      "description": "",
      "code": "import asyncio\nfrom typing import Any\n\nclass ConcurrentToolExecutor:\n    \"\"\"并发工具执行器\"\"\"\n    \n    def __init__(self, max_concurrent: int = 5):\n        self.semaphore = asyncio.Semaphore(max_concurrent)\n        self._results: dict[str, Any] = {}\n    \n    async def execute_tools(self, tool_calls: list[dict]) -> dict:\n        \"\"\"并发执行多个工具调用\"\"\"\n        tasks = []\n        for call in tool_calls:\n            task = asyncio.create_task(\n                self._execute_with_semaphore(call)\n            )\n            tasks.append(task)\n        \n        results = await asyncio.gather(*tasks, return_exceptions=True)\n        \n        output = {}\n        for call, result in zip(tool_calls, results):\n            call_id = call.get(\"id\", call[\"name\"])\n            if isinstance(result, Exception):\n                output[call_id] = {\"error\": str(result)}\n            else:\n                output[call_id] = result\n        \n        return output\n    \n    async def _execute_with_semaphore(self, call: dict) -> Any:\n        async with self.semaphore:\n            tool = self.get_tool(call[\"name\"])\n            return await tool.execute(**call[\"arguments\"])\n\n# 使用示例：Agent规划后并发执行\nasync def agent_execute_plan(agent, plan: list[dict]) -> str:\n    executor = ConcurrentToolExecutor(max_concurrent=5)\n    \n    # 分析依赖关系，找到可并行的工具调用\n    parallel_groups = analyze_dependencies(plan)\n    \n    for group in parallel_groups:\n        if len(group) > 1:\n            # 并行执行无依赖的工具\n            results = await executor.execute_tools(group)\n        else:\n            # 串行执行有依赖的工具\n            call = group[0]\n            tool = agent.get_tool(call[\"name\"])\n            results = {call[\"id\"]: await tool.execute(**call[\"arguments\"])}\n        \n        agent.update_context(results)\n    \n    return await agent.synthesize()",
      "section_ref": "24.3.1",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-7",
      "language": "python",
      "description": "",
      "code": "from openai import AsyncOpenAI\n\nclass StreamingAgent:\n    \"\"\"流式响应Agent\"\"\"\n    \n    def __init__(self, api_key: str):\n        self.client = AsyncOpenAI(api_key=api_key)\n    \n    async def stream_response(self, messages: list[dict],\n                              callback=None):\n        \"\"\"流式生成响应\"\"\"\n        stream = await self.client.chat.completions.create(\n            model=\"gpt-4o\",\n            messages=messages,\n            stream=True\n        )\n        \n        full_response = \"\"\n        async for chunk in stream:\n            delta = chunk.choices[0].delta\n            if delta.content:\n                full_response += delta.content\n                if callback:\n                    await callback(delta.content)\n        \n        return full_response\n\n# WebSocket流式推送\nasync def websocket_stream(websocket, agent, user_input: str):\n    async def send_chunk(chunk: str):\n        await websocket.send_json({\"type\": \"chunk\", \"content\": chunk})\n    \n    # 先发送思考过程（如果Agent有规划步骤）\n    plan = await agent.plan(user_input)\n    await websocket.send_json({\"type\": \"plan\", \"content\": plan})\n    \n    # 流式发送最终回答\n    result = await agent.stream_response(user_input, callback=send_chunk)\n    await websocket.send_json({\"type\": \"done\", \"content\": result})",
      "section_ref": "24.3.2",
      "runnable": true,
      "dependencies": [
        "openai"
      ]
    },
    {
      "id": "code-8",
      "language": "python",
      "description": "",
      "code": "class AgentPrefetcher:\n    \"\"\"Agent预取器\"\"\"\n    \n    def __init__(self, agent):\n        self.agent = agent\n        self._prefetch_cache: dict[str, Any] = {}\n    \n    async def prefetch_likely_tools(self, query: str):\n        \"\"\"根据查询预判可能需要的工具结果\"\"\"\n        # 使用轻量模型预测\n        prediction = await self._predict_tool_needs(query)\n        \n        for tool_name, args in prediction.get(\"likely_tools\", []):\n            key = f\"{tool_name}:{json.dumps(args, sort_keys=True)}\"\n            if key not in self._prefetch_cache:\n                try:\n                    tool = self.agent.get_tool(tool_name)\n                    result = await tool.execute(**args)\n                    self._prefetch_cache[key] = result\n                except Exception:\n                    pass  # 预取失败不影响主流程\n    \n    def get_prefetched(self, tool_name: str, args: dict) -> Any | None:\n        \"\"\"获取预取结果\"\"\"\n        key = f\"{tool_name}:{json.dumps(args, sort_keys=True)}\"\n        return self._prefetch_cache.get(key)",
      "section_ref": "24.3.3",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-9",
      "language": "python",
      "description": "",
      "code": "import numpy as np\nfrom sklearn.metrics.pairwise import cosine_similarity\n\nclass SemanticCache:\n    \"\"\"语义缓存：相似的查询复用结果\"\"\"\n    \n    def __init__(self, similarity_threshold: float = 0.95,\n                 ttl_seconds: int = 3600):\n        self.threshold = similarity_threshold\n        self.ttl = ttl_seconds\n        self._entries: list[dict] = []\n        self._embedder = get_embedder()\n    \n    async def get(self, query: str) -> dict | None:\n        \"\"\"查询缓存\"\"\"\n        query_embedding = await self._embedder.embed(query)\n        \n        for entry in self._entries:\n            # 检查TTL\n            if time.time() - entry[\"timestamp\"] > self.ttl:\n                continue\n            \n            # 计算语义相似度\n            similarity = cosine_similarity(\n                [query_embedding], [entry[\"embedding\"]]\n            )[0][0]\n            \n            if similarity >= self.threshold:\n                entry[\"hit_count\"] += 1\n                return {\n                    \"result\": entry[\"result\"],\n                    \"similarity\": similarity,\n                    \"cached\": True\n                }\n        \n        return None\n    \n    async def set(self, query: str, result: Any):\n        \"\"\"存入缓存\"\"\"\n        embedding = await self._embedder.embed(query)\n        self._entries.append({\n            \"query\": query,\n            \"embedding\": embedding,\n            \"result\": result,\n            \"timestamp\": time.time(),\n            \"hit_count\": 0,\n        })",
      "section_ref": "24.4.1",
      "runnable": true,
      "dependencies": [
        "numpy",
        "sklearn"
      ]
    },
    {
      "id": "code-10",
      "language": "python",
      "description": "",
      "code": "class PromptCache:\n    \"\"\"Prompt缓存：避免重复构建相同的Prompt\"\"\"\n    \n    def __init__(self, redis_client=None):\n        self.redis = redis_client\n        self._local_cache: dict[str, str] = {}\n    \n    def _compute_key(self, messages: list[dict], \n                     model: str) -> str:\n        content = json.dumps(messages, sort_keys=True, ensure_ascii=False)\n        return f\"prompt_cache:{model}:{hashlib.md5(content.encode()).hexdigest()}\"\n    \n    async def get(self, messages: list[dict], \n                  model: str) -> str | None:\n        key = self._compute_key(messages, model)\n        \n        # 先查本地缓存\n        if key in self._local_cache:\n            return self._local_cache[key]\n        \n        # 再查Redis\n        if self.redis:\n            result = await self.redis.get(key)\n            if result:\n                self._local_cache[key] = result\n                return result\n        \n        return None\n    \n    async def set(self, messages: list[dict], model: str,\n                  response: str, ttl: int = 300):\n        key = self._compute_key(messages, model)\n        self._local_cache[key] = response\n        \n        if self.redis:\n            await self.redis.setex(key, ttl, response)",
      "section_ref": "24.4.2",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-11",
      "language": "python",
      "description": "",
      "code": "class ToolResultCache:\n    \"\"\"工具结果缓存\"\"\"\n    \n    def __init__(self, ttl_map: dict[str, int] | None = None):\n        self._cache: dict[str, dict] = {}\n        self.ttl_map = ttl_map or {\n            \"default\": 300,         # 5分钟\n            \"search\": 1800,         # 30分钟\n            \"database\": 60,         # 1分钟\n            \"weather\": 600,         # 10分钟\n        }\n    \n    def get(self, tool_name: str, args: dict) -> Any | None:\n        key = self._make_key(tool_name, args)\n        entry = self._cache.get(key)\n        \n        if entry and time.time() - entry[\"time\"] < entry[\"ttl\"]:\n            entry[\"hits\"] += 1\n            return entry[\"data\"]\n        return None\n    \n    def set(self, tool_name: str, args: dict, result: Any):\n        key = self._make_key(tool_name, args)\n        ttl = self.ttl_map.get(tool_name, self.ttl_map[\"default\"])\n        self._cache[key] = {\n            \"data\": result,\n            \"time\": time.time(),\n            \"ttl\": ttl,\n            \"hits\": 0,\n        }\n    \n    def _make_key(self, tool_name: str, args: dict) -> str:\n        args_str = json.dumps(args, sort_keys=True)\n        return f\"{tool_name}:{hashlib.md5(args_str.encode()).hexdigest()}\"",
      "section_ref": "24.4.3",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-12",
      "language": "python",
      "description": "",
      "code": "class ContextTrimmer:\n    \"\"\"上下文裁剪器\"\"\"\n    \n    async def trim(self, messages: list[dict], \n                  max_tokens: int) -> list[dict]:\n        \"\"\"智能裁剪上下文\"\"\"\n        current_tokens = sum(\n            self._count_tokens(m[\"content\"]) for m in messages\n        )\n        \n        if current_tokens <= max_tokens:\n            return messages\n        \n        # 策略1：压缩旧的对话轮次\n        messages = await self._compress_old_turns(messages, max_tokens)\n        \n        # 策略2：如果还不够，移除最早的对话\n        while self._total_tokens(messages) > max_tokens:\n            # 保留system消息和最近3轮\n            system = [m for m in messages if m[\"role\"] == \"system\"]\n            user_assistant = [m for m in messages if m[\"role\"] != \"system\"]\n            \n            if len(user_assistant) <= 6:\n                break\n            \n            # 移除最早的一轮\n            user_assistant = user_assistant[2:]\n            messages = system + user_assistant\n        \n        return messages\n    \n    async def _compress_old_turns(self, messages: list[dict],\n                                   max_tokens: int) -> list[dict]:\n        \"\"\"压缩旧的对话轮次为摘要\"\"\"\n        system = [m for m in messages if m[\"role\"] == \"system\"]\n        turns = self._group_into_turns(messages)\n        \n        if len(turns) <= 3:\n            return messages\n        \n        # 保留最近2轮完整，压缩更早的\n        old_turns = turns[:-2]\n        recent_turns = turns[-2:]\n        \n        compressed_summary = await self._summarize_turns(old_turns)\n        \n        result = list(system)\n        result.append({\n            \"role\": \"system\",\n            \"content\": f\"以下是对话的先前摘要：\\n{compressed_summary}\"\n        })\n        result.extend(recent_turns)\n        \n        return result",
      "section_ref": "24.5.1",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-13",
      "language": "python",
      "description": "",
      "code": "class SlidingWindowContext:\n    \"\"\"滑动窗口上下文管理\"\"\"\n    \n    def __init__(self, window_size: int = 10,\n                 summary_threshold: int = 6):\n        self.window_size = window_size\n        self.summary_threshold = summary_threshold\n        self._all_messages: list[dict] = []\n        self._summary: str = \"\"\n    \n    def add(self, message: dict):\n        self._all_messages.append(message)\n    \n    def get_context(self) -> list[dict]:\n        \"\"\"获取当前上下文（滑动窗口 + 摘要）\"\"\"\n        if len(self._all_messages) <= self.window_size:\n            return list(self._all_messages)\n        \n        # 超出窗口的部分用摘要替代\n        recent = self._all_messages[-self.window_size:]\n        \n        result = []\n        if self._summary:\n            result.append({\n                \"role\": \"system\",\n                \"content\": f\"对话摘要：{self._summary}\"\n            })\n        result.extend(recent)\n        return result\n    \n    async def update_summary(self, llm):\n        \"\"\"更新摘要\"\"\"\n        if len(self._all_messages) <= self.summary_threshold:\n            return\n        \n        messages_to_summarize = self._all_messages[:-self.summary_threshold]\n        self._summary = await llm.generate(\n            f\"请简洁地总结以下对话内容：\\n\"\n            f\"{json.dumps(messages_to_summarize, ensure_ascii=False)}\"\n        )",
      "section_ref": "24.5.2",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-14",
      "language": "python",
      "description": "",
      "code": "class ToolCallOptimizer:\n    \"\"\"工具调用优化器\"\"\"\n    \n    def analyze_dependencies(self, tool_calls: list[dict]) -> list[list[dict]]:\n        \"\"\"分析工具调用依赖关系，生成并行执行组\"\"\"\n        if not tool_calls:\n            return []\n        \n        # 构建依赖图\n        dep_graph = {}\n        for call in tool_calls:\n            call_id = call[\"id\"]\n            deps = call.get(\"depends_on\", [])\n            dep_graph[call_id] = {\n                \"call\": call,\n                \"deps\": deps,\n            }\n        \n        # 拓扑排序，分层\n        groups = []\n        remaining = set(dep_graph.keys())\n        \n        while remaining:\n            # 找出无依赖的调用\n            ready = {\n                cid for cid in remaining \n                if not dep_graph[cid][\"deps\"] or\n                all(d not in remaining for d in dep_graph[cid][\"deps\"])\n            }\n            \n            if not ready:\n                # 有循环依赖，强制执行\n                ready = {next(iter(remaining))}\n            \n            group = [dep_graph[cid][\"call\"] for cid in ready]\n            groups.append(group)\n            remaining -= ready\n        \n        return groups",
      "section_ref": "24.6.1",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-15",
      "language": "python",
      "description": "",
      "code": "class LightweightToolRegistry:\n    \"\"\"轻量工具注册：为常用操作提供快速替代\"\"\"\n    \n    def __init__(self):\n        self._fast_tools: dict[str, Callable] = {}\n    \n    def register_fast(self, tool_name: str, \n                      fast_impl: Callable,\n                      condition: Callable = None):\n        \"\"\"注册快速替代实现\"\"\"\n        self._fast_tools[tool_name] = {\n            \"impl\": fast_impl,\n            \"condition\": condition or (lambda args: True),\n        }\n    \n    def get_tool(self, tool_name: str, \n                 args: dict) -> Callable | None:\n        \"\"\"判断是否可以使用轻量替代\"\"\"\n        entry = self._fast_tools.get(tool_name)\n        if entry and entry[\"condition\"](args):\n            return entry[\"impl\"]\n        return None\n\n# 示例：日期查询用本地函数替代API\nregistry = LightweightToolRegistry()\nregistry.register_fast(\n    \"get_current_date\",\n    lambda args: {\"date\": datetime.now().strftime(\"%Y-%m-%d\")},\n    condition=lambda args: args.get(\"timezone\") is None\n)",
      "section_ref": "24.6.2",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-16",
      "language": "python",
      "description": "",
      "code": "import httpx\n\nclass AgentConnectionPool:\n    \"\"\"Agent HTTP连接池\"\"\"\n    \n    def __init__(self, max_connections: int = 100,\n                 max_keepalive: int = 20):\n        self.client = httpx.AsyncClient(\n            limits=httpx.Limits(\n                max_connections=max_connections,\n                max_keepalive_connections=max_keepalive,\n                keepalive_expiry=30\n            ),\n            timeout=httpx.Timeout(30.0, connect=5.0),\n            http2=True  # 启用HTTP/2\n        )\n    \n    async def close(self):\n        await self.client.aclose()",
      "section_ref": "24.7.1",
      "runnable": true,
      "dependencies": [
        "httpx"
      ]
    },
    {
      "id": "code-17",
      "language": "python",
      "description": "",
      "code": "class GPULocalInferenceManager:\n    \"\"\"本地GPU推理管理\"\"\"\n    \n    def __init__(self, model_name: str, device: str = \"auto\"):\n        import torch\n        self.device = device if device != \"auto\" else (\n            \"cuda\" if torch.cuda.is_available() else \"cpu\"\n        )\n        self.model = None\n        self.model_name = model_name\n    \n    async def load_model(self):\n        \"\"\"懒加载模型\"\"\"\n        if self.model is None:\n            from transformers import AutoModelForCausalLM, AutoTokenizer\n            self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)\n            self.model = AutoModelForCausalLM.from_pretrained(\n                self.model_name,\n                torch_dtype=\"auto\",\n                device_map=self.device\n            )\n    \n    def is_available(self) -> bool:\n        \"\"\"检查GPU是否可用\"\"\"\n        import torch\n        return torch.cuda.is_available()\n    \n    def memory_usage(self) -> dict:\n        \"\"\"GPU内存使用情况\"\"\"\n        import torch\n        if torch.cuda.is_available():\n            return {\n                \"allocated_gb\": torch.cuda.memory_allocated() / 1e9,\n                \"reserved_gb\": torch.cuda.memory_reserved() / 1e9,\n                \"total_gb\": torch.cuda.get_device_properties(0).total_mem / 1e9,\n            }\n        return {\"status\": \"no_gpu\"}",
      "section_ref": "24.7.2",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-18",
      "language": "python",
      "description": "",
      "code": "import time\nimport statistics\n\nclass AgentBenchmark:\n    \"\"\"Agent性能基准测试\"\"\"\n    \n    async def run_latency_test(self, agent, test_cases: list[dict],\n                                warmup: int = 3) -> dict:\n        \"\"\"延迟基准测试\"\"\"\n        results = {}\n        \n        for case_name, case_input in test_cases.items():\n            latencies = []\n            \n            # 预热\n            for _ in range(warmup):\n                await agent.run(case_input)\n            \n            # 正式测试\n            for _ in range(10):\n                start = time.perf_counter()\n                await agent.run(case_input)\n                latency = (time.perf_counter() - start) * 1000\n                latencies.append(latency)\n            \n            results[case_name] = {\n                \"mean_ms\": statistics.mean(latencies),\n                \"median_ms\": statistics.median(latencies),\n                \"p95_ms\": sorted(latencies)[int(len(latencies) * 0.95)],\n                \"min_ms\": min(latencies),\n                \"max_ms\": max(latencies),\n                \"std_ms\": statistics.stdev(latencies),\n            }\n        \n        return results",
      "section_ref": "24.8.1",
      "runnable": true,
      "dependencies": [
        "statistics"
      ]
    },
    {
      "id": "code-19",
      "language": "python",
      "description": "",
      "code": "from dataclasses import dataclass\n\n@dataclass\nclass PerformanceBudget:\n    \"\"\"Performance budget.\"\"\"\n    e2e_latency_p95_ms: float = 15000\n    llm_latency_p95_ms: float = 8000\n    tool_latency_p95_ms: float = 3000\n    max_tokens_per_request: int = 50000\n    max_cost_per_request_usd: float = 0.50\n    max_steps_per_request: int = 15\n\nclass BudgetEnforcer:\n    \"\"\"Checks measured metrics against a PerformanceBudget.\"\"\"\n    \n    def __init__(self, budget: PerformanceBudget):\n        self.budget = budget\n    \n    def check(self, metrics: dict) -> list[str]:\n        \"\"\"Return one message per exceeded budget item (empty list if none).\"\"\"\n        violations = []\n        \n        # Missing metrics default to 0, so they never count as violations\n        if metrics.get(\"e2e_p95\", 0) > self.budget.e2e_latency_p95_ms:\n            violations.append(\n                f\"E2E latency P95 ({metrics['e2e_p95']:.0f}ms) \"\n                f\"exceeds budget ({self.budget.e2e_latency_p95_ms}ms)\"\n            )\n        if metrics.get(\"cost_per_request\", 0) > self.budget.max_cost_per_request_usd:\n            violations.append(\n                f\"Cost per request (${metrics['cost_per_request']:.3f}) \"\n                f\"exceeds budget (${self.budget.max_cost_per_request_usd})\"\n            )\n        \n        return violations",
      "section_ref": "24.9.1",
      "runnable": true,
      "dependencies": []
    },
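    {
      "id": "code-20",
      "language": "python",
      "description": "",
      "code": "import asyncio\nimport zlib\n\nclass GracefulDegradation:\n    \"\"\"Graceful degradation.\"\"\"\n    \n    def __init__(self, agent):\n        self.agent = agent\n    \n    async def run_with_degradation(self, task: str,\n                                    timeout_ms: float = 30000) -> str:\n        \"\"\"Run the task, degrading step by step on timeout or error.\"\"\"\n        try:\n            # Try the full run first\n            result = await asyncio.wait_for(\n                self.agent.run(task),\n                timeout=timeout_ms / 1000\n            )\n            return result\n        except asyncio.TimeoutError:\n            # Fallback 1: retry with a faster model\n            # Note: the agent stays on this model for later calls unless it is reset\n            try:\n                self.agent.set_model(\"gpt-4o-mini\")\n                result = await asyncio.wait_for(\n                    self.agent.run(task),\n                    timeout=timeout_ms / 1000\n                )\n                return result\n            except Exception:\n                # Fallback 2: return the preset reply (Exception also covers asyncio.TimeoutError)\n                return self._fallback_response(task)\n        except Exception:\n            # Any other failure: return the preset reply\n            return self._fallback_response(task)\n    \n    def _fallback_response(self, task: str) -> str:\n        # crc32 is stable across processes, unlike the built-in hash() on str\n        task_id = zlib.crc32(task.encode(\"utf-8\")) % 100000\n        return f\"Sorry, something went wrong while processing your request. Please try again later.\\nTask ID: {task_id:05d}\"",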
      "section_ref": "24.9.2",
      "runnable": true,
      "dependencies": []
    }
  ],
  "tables": [
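    {
      "headers": [
        "Stage",
        "Typical latency",
        "Share of total",
        "Optimization levers"
      ],
      "data": [
        [
          "LLM inference",
          "1-10 s",
          "60-80%",
          "Model selection, prompt optimization"
        ],
        [
          "Tool execution",
          "0.1-5 s",
          "10-30%",
          "Concurrency, caching"
        ],
        [
          "Context construction",
          "0.05-0.5 s",
          "2-5%",
          "Pre-building, incremental updates"
        ],
        [
          "Network transfer",
          "0.01-0.5 s",
          "1-5%",
          "Connection pooling, CDN"
        ],
        [
          "Post-processing",
          "0.01-0.1 s",
          "<1%",
          "Streaming output"
        ]
      ]
    }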
  ],
  "key_takeaways": [],
  "common_pitfalls": [],
  "related_chapters": [
    "ch04"
  ]
}