{
  "metadata": {
    "id": "appendixC",
    "title": "附录C：Agent部署指南",
    "volume": "vol6",
    "volume_title": "附录",
    "word_count": 2844,
    "difficulty": "beginner",
    "prerequisites": [],
    "key_concepts": [
      "C.1 部署概览",
      "C.1.1 部署模式对比",
      "C.1.2 部署前检查清单",
      "C.2 Docker容器化部署",
      "C.2.1 Dockerfile最佳实践",
      "C.2.3 镜像优化技巧",
      "C.3 Kubernetes编排",
      "C.3.1 基础部署配置",
      "C.3.2 Service与Ingress",
      "C.3.3 HPA自动伸缩",
      "C.3.4 GPU调度（本地模型场景）",
      "C.3.5 密钥管理",
      "C.4 Serverless部署",
      "C.4.2 Vercel部署（Python）",
      "C.5 边缘部署与离线推理"
    ],
    "learning_objectives": [],
    "estimated_tokens": 1706,
    "source_file": "vol6/appendixC_Agent部署指南.md"
  },
  "overview": "",
  "sections": [
    {
      "id": "C.1 部署概览",
      "title": "C.1 部署概览",
      "level": 2,
      "content": "Agent应用的部署与传统Web应用有显著差异。LLM调用的不确定性、工具执行的安全风险、长时间运行的工作流、以及不可预测的资源消耗，都对部署架构提出了独特的要求。",
      "subsections": [
        {
          "id": "C.1.1 部署模式对比",
          "title": "C.1.1 部署模式对比",
          "content": "| 模式 | 适用规模 | 延迟 | 可用性 | 运维复杂度 | 成本结构 |\n|------|---------|------|--------|-----------|---------|\n| **单容器** | < 1K QPS | < 100ms | 99% | 低 | 固定月费 |\n| **K8s集群** | 1K-100K QPS | < 50ms | 99.9% | 高 | 弹性+固定 |\n| **Serverless** | 弹性负载 | 100-500ms | 99.5% | 低 | 按调用计费 |\n| **边缘部署** | 离线/低延迟 | < 10ms | 99%+ | 中 | 一次性投入 |"
        },
        {
          "id": "C.1.2 部署前检查清单",
          "title": "C.1.2 部署前检查清单",
          "content": "在将Agent推向生产环境之前，请确认以下事项：\n\n- [ ] **API密钥安全**：所有密钥通过环境变量或密钥管理服务注入，不得硬编码\n- [ ] **错误处理**：所有LLM调用、工具执行、数据库操作都有超时和重试机制\n- [ ] **日志记录**：结构化日志，包含请求ID、耗时、token用量等关键指标\n- [ ] **速率限制**：防止API滥用和成本失控\n- [ ] **内容过滤**：输入输出内容安全过滤\n- [ ] **健康检查**：`/health` 端点返回服务状态\n- [ ] **优雅关闭**：处理完进行中的请求后再关闭\n- [ ] **配置外部化**：环境区分（dev/staging/prod）通过配置切换\n\n---"
        }
      ]
    },
    {
      "id": "C.2 Docker容器化部署",
      "title": "C.2 Docker容器化部署",
      "level": 2,
      "content": "",
      "subsections": [
        {
          "id": "C.2.1 Dockerfile最佳实践",
          "title": "C.2.1 Dockerfile最佳实践",
          "content": "**多阶段构建 — Python Agent应用**\n\n\n**多阶段构建 — Rust Agent应用（如edict）**"
        },
        {
          "id": "C.2.2 Docker Compose",
          "title": "C.2.2 Docker Compose — 完整开发/部署环境",
          "content": ""
        },
        {
          "id": "C.2.3 镜像优化技巧",
          "title": "C.2.3 镜像优化技巧",
          "content": "| 技巧 | 效果 | 示例 |\n|------|------|------|\n| 多阶段构建 | 镜像体积减少50-80% | 如上Dockerfile |\n| `.dockerignore` | 减少构建上下文 | 忽略`.git`, `venv`, `__pycache__` |\n| Alpine基础镜像 | 减少基础层大小 | `python:3.12-alpine` |\n| 层缓存优化 | 加速重复构建 | 先COPY依赖文件 |\n| 合并RUN指令 | 减少层数 | `RUN apt-get update && apt-get install -y ...` |\n| `--no-cache-dir` | 避免pip缓存 | `pip install --no-cache-dir` |\n\n\n---"
        }
      ]
    },
    {
      "id": "C.3 Kubernetes编排",
      "title": "C.3 Kubernetes编排",
      "level": 2,
      "content": "",
      "subsections": [
        {
          "id": "C.3.1 基础部署配置",
          "title": "C.3.1 基础部署配置",
          "content": ""
        },
        {
          "id": "C.3.2 Service与Ingres",
          "title": "C.3.2 Service与Ingress",
          "content": ""
        },
        {
          "id": "C.3.3 HPA自动伸缩",
          "title": "C.3.3 HPA自动伸缩",
          "content": "Agent应用的负载通常波动较大（取决于用户使用频率），HPA是必不可少的：\n\n\n💡 **Agent专属伸缩策略**：\n\n- **LLM Token队列**是比QPS更好的伸缩指标——因为每个请求的token消耗差异很大\n- **扩容要快**（60秒稳定窗口），**缩容要慢**（300秒），避免频繁伸缩\n- **预留缓冲**：最小副本数至少为2，避免单点故障"
        },
        {
          "id": "C.3.4 GPU调度（本地模型场景）",
          "title": "C.3.4 GPU调度（本地模型场景）",
          "content": ""
        },
        {
          "id": "C.3.5 密钥管理",
          "title": "C.3.5 密钥管理",
          "content": "---"
        }
      ]
    },
    {
      "id": "C.4 Serverless部署",
      "title": "C.4 Serverless部署",
      "level": 2,
      "content": "",
      "subsections": [
        {
          "id": "C.4.1 AWS Lambda + A",
          "title": "C.4.1 AWS Lambda + API Gateway",
          "content": "⚠️ **限制**：Lambda有15分钟超时、10GB内存限制，不适合长时间运行的Agent工作流。"
        },
        {
          "id": "C.4.2 Vercel部署（Pytho",
          "title": "C.4.2 Vercel部署（Python）",
          "content": ""
        },
        {
          "id": "C.4.3 Cloudflare Wor",
          "title": "C.4.3 Cloudflare Workers（边缘计算）",
          "content": "---"
        }
      ]
    },
    {
      "id": "C.5 边缘部署与离线推理",
      "title": "C.5 边缘部署与离线推理",
      "level": 2,
      "content": "",
      "subsections": [
        {
          "id": "C.5.1 本地模型部署架构",
          "title": "C.5.1 本地模型部署架构",
          "content": ""
        },
        {
          "id": "C.5.2 模型量化部署",
          "title": "C.5.2 模型量化部署",
          "content": "**GGUF格式 — llama.cpp**\n\n\n**量化等级对比**\n\n| 量化等级 | 模型大小（7B参数） | 显存需求 | 质量损失 | 速度 |\n|---------|------------------|---------|---------|------|\n| FP16（无量化） | ~14GB | 16GB | 0% | 1x |\n| Q8_0 | ~7.5GB | 8GB | <1% | 1.5x |\n| Q5_K_M | ~5.2GB | 6GB | 1-2% | 1.8x |\n| Q4_K_M | ~4.4GB | 5GB | 2-3% | 2.0x |\n| Q3_K_M | ~3.5GB | 4GB | 3-5% | 2.2x |\n| Q2_K | ~2.9GB | 3.5GB | 5-10% | 2.5x |\n\n🎯 **推荐**：Q4_K_M 是质量和速度的最佳平衡点，7B模型约需5GB显存。"
        },
        {
          "id": "C.5.3 ONNX Runtime —",
          "title": "C.5.3 ONNX Runtime — 嵌入模型加速",
          "content": "---"
        }
      ]
    },
    {
      "id": "C.6 性能调优",
      "title": "C.6 性能调优",
      "level": 2,
      "content": "",
      "subsections": [
        {
          "id": "C.6.1 LLM调用优化",
          "title": "C.6.1 LLM调用优化",
          "content": "**请求批处理**\n\n\n**连接池配置**"
        },
        {
          "id": "C.6.2 缓存策略",
          "title": "C.6.2 缓存策略",
          "content": "**语义缓存 — 避免重复LLM调用**\n\n\n**Redis缓存 — 高性能分布式缓存**"
        },
        {
          "id": "C.6.3 异步并发模式",
          "title": "C.6.3 异步并发模式",
          "content": ""
        },
        {
          "id": "C.6.4 流式输出优化",
          "title": "C.6.4 流式输出优化",
          "content": "---"
        }
      ]
    },
    {
      "id": "C.7 成本优化",
      "title": "C.7 成本优化",
      "level": 2,
      "content": "",
      "subsections": [
        {
          "id": "C.7.1 模型路由策略",
          "title": "C.7.1 模型路由策略",
          "content": "**预期节省**：合理使用模型路由可以节省 **40-60%** 的API成本。"
        },
        {
          "id": "C.7.2 Token用量控制",
          "title": "C.7.2 Token用量控制",
          "content": ""
        },
        {
          "id": "C.7.3 按需伸缩成本模型",
          "title": "C.7.3 按需伸缩成本模型",
          "content": "| 策略 | 固定月费 | 弹性费用 | 月总成本（估算） | 节省 |\n|------|---------|---------|----------------|------|\n| 恒定3副本 | $300 | $500 | **$800** | 基准 |\n| HPA 2-10副本 | $200 | $300 | **$500** | 37% |\n| Serverless | $0 | $400 | **$400** | 50% |\n| 混合（本地+云端） | $100（本地硬件） | $200 | **$300** | 62% |\n\n💡 **推荐**：大多数场景下，K8s HPA + 模型路由是最优选择，兼顾性能和成本。\n\n---"
        }
      ]
    },
    {
      "id": "C.8 监控与告警",
      "title": "C.8 监控与告警",
      "level": 2,
      "content": "",
      "subsections": [
        {
          "id": "C.8.1 核心监控指标",
          "title": "C.8.1 核心监控指标",
          "content": "| 指标 | 说明 | 告警阈值 |\n|------|------|---------|\n| **请求延迟 (P50/P99)** | 端到端延迟 | P99 > 10s |\n| **LLM API延迟** | 外部LLM调用延迟 | > 30s |\n| **Token用量/小时** | LLM Token消耗速率 | > 预算的120% |\n| **错误率** | 5xx错误占比 | > 1% |\n| **工具调用成功率** | 工具执行成功率 | < 95% |\n| **队列深度** | 待处理请求队列 | > 1000 |\n| **缓存命中率** | 语义缓存命中 | < 30%（过低） |"
        },
        {
          "id": "C.8.2 Prometheus指标埋点",
          "title": "C.8.2 Prometheus指标埋点",
          "content": ""
        },
        {
          "id": "C.8.3 Grafana Dashbo",
          "title": "C.8.3 Grafana Dashboard配置要点",
          "content": "推荐的Dashboard面板：\n\n1. **概览面板**：QPS、错误率、P50/P99延迟\n2. **LLM成本面板**：Token用量趋势、费用估算、模型分布\n3. **工具调用面板**：各工具调用次数、成功率、延迟\n4. **资源面板**：CPU/内存/GPU利用率、副本数\n5. **业务面板**：用户满意度、对话轮数、任务完成率\n\n---\n\n*附录C完*"
        }
      ]
    }
  ],
  "code_blocks": [
    {
      "id": "code-1",
      "language": "dockerfile",
      "description": "多阶段构建 — Python Agent应用",
      "code": "# ==================== 阶段1：依赖安装 ====================\nFROM python:3.12-slim AS builder\n\nWORKDIR /build\n\n# 先复制依赖文件，利用Docker层缓存\nCOPY requirements.txt .\nRUN pip install --no-cache-dir --prefix=/install -r requirements.txt\n\n# ==================== 阶段2：运行时 ====================\nFROM python:3.12-slim AS runtime\n\n# 安全：创建非root用户\nRUN groupadd -r agent && useradd -r -g agent -d /app agent\n\nWORKDIR /app\n\n# 从builder阶段复制已安装的依赖\nCOPY --from=builder /install /usr/local\n\n# 复制应用代码\nCOPY --chown=agent:agent . .\n\n# 安全：设置环境变量\nENV PYTHONUNBUFFERED=1 \\\n    PYTHONDONTWRITEBYTECODE=1 \\\n    PATH=\"/app:${PATH}\"\n\n# 切换到非root用户\nUSER agent\n\n# 健康检查\nHEALTHCHECK --interval=30s --timeout=10s --retries=3 \\\n    CMD python -c \"import urllib.request; urllib.request.urlopen('http://localhost:8000/health')\"\n\n# 暴露端口\nEXPOSE 8000\n\n# 启动命令\nCMD [\"uvicorn\", \"main:app\", \"--host\", \"0.0.0.0\", \"--port\", \"8000\", \"--workers\", \"4\"]",
      "section_ref": "C.2.1 Dockerfile最佳实践",
      "runnable": false,
      "dependencies": []
    },
    {
      "id": "code-2",
      "language": "dockerfile",
      "description": "多阶段构建 — Rust Agent应用（如edict）",
      "code": "# ==================== 阶段1：编译 ====================\nFROM rust:1.83-slim AS builder\n\nWORKDIR /build\n\n# 依赖缓存\nCOPY Cargo.toml Cargo.lock ./\nRUN mkdir src && echo \"fn main(){}\" > src/main.rs\nRUN cargo build --release && rm -rf src\n\n# 真正编译\nCOPY . .\nRUN cargo build --release\n\n# ==================== 阶段2：运行时 ====================\nFROM debian:bookworm-slim AS runtime\n\nRUN apt-get update && \\\n    apt-get install -y --no-install-recommends ca-certificates libssl3 && \\\n    rm -rf /var/lib/apt/lists/*\n\nWORKDIR /app\n\nCOPY --from=builder /build/target/release/agent-server /app/agent-server\nCOPY --from=builder /build/config /app/config\n\nRUN groupadd -r agent && useradd -r -g agent agent\nUSER agent\n\nHEALTHCHECK --interval=30s --timeout=5s --retries=3 \\\n    CMD [\"/app/agent-server\", \"--health-check\"]\n\nEXPOSE 8080\n\nCMD [\"/app/agent-server\", \"--config\", \"/app/config/production.toml\"]",
      "section_ref": "C.2.1 Dockerfile最佳实践",
      "runnable": false,
      "dependencies": []
    },
    {
      "id": "code-3",
      "language": "yaml",
      "description": "",
      "code": "# docker-compose.yml\nversion: \"3.8\"\n\nservices:\n  # Agent应用\n  agent:\n    build:\n      context: .\n      dockerfile: Dockerfile\n    ports:\n      - \"8000:8000\"\n    environment:\n      - DATABASE_URL=postgresql://agent:password@postgres:5432/agent_db\n      - REDIS_URL=redis://redis:6379/0\n      - OPENAI_API_KEY=${OPENAI_API_KEY}\n      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}\n      - LOG_LEVEL=info\n    depends_on:\n      postgres:\n        condition: service_healthy\n      redis:\n        condition: service_healthy\n    restart: unless-stopped\n    deploy:\n      resources:\n        limits:\n          memory: 2G\n          cpus: \"2.0\"\n        reservations:\n          memory: 512M\n\n  # PostgreSQL - 持久化存储\n  postgres:\n    image: postgres:16-alpine\n    environment:\n      POSTGRES_DB: agent_db\n      POSTGRES_USER: agent\n      POSTGRES_PASSWORD: password\n    volumes:\n      - postgres_data:/var/lib/postgresql/data\n    healthcheck:\n      test: [\"CMD-SHELL\", \"pg_isready -U agent\"]\n      interval: 10s\n      timeout: 5s\n      retries: 5\n    restart: unless-stopped\n\n  # Redis - 缓存和会话\n  redis:\n    image: redis:7-alpine\n    command: redis-server --maxmemory 256mb --maxmemory-policy allkeys-lru\n    volumes:\n      - redis_data:/data\n    healthcheck:\n      test: [\"CMD\", \"redis-cli\", \"ping\"]\n      interval: 10s\n      timeout: 5s\n      retries: 5\n    restart: unless-stopped\n\n  # 向量数据库（Chroma嵌入式）或独立Milvus\n  # 如果用Chroma嵌入式，无需单独容器\n\n  # Nginx反向代理\n  nginx:\n    image: nginx:alpine\n    ports:\n      - \"80:80\"\n      - \"443:443\"\n    volumes:\n      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro\n      - ./nginx/ssl:/etc/nginx/ssl:ro\n    depends_on:\n      - agent\n    restart: unless-stopped\n\nvolumes:\n  postgres_data:\n  redis_data:",
      "section_ref": "C.2.2 Docker Compose",
      "runnable": false,
      "dependencies": []
    },
    {
      "id": "code-4",
      "language": "dockerfile",
      "description": "| --no-cache-dir | 避免pip缓存 | pip install --no-cache-dir |",
      "code": "# .dockerignore\n.git\n.github\n__pycache__\n*.pyc\n.env\n.venv\nvenv\nnode_modules\n*.md\ntests/\ndocs/\n.mypy_cache\n.pytest_cache",
      "section_ref": "C.2.3 镜像优化技巧",
      "runnable": false,
      "dependencies": []
    },
    {
      "id": "code-5",
      "language": "yaml",
      "description": "",
      "code": "# deployment.yaml\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: agent-server\n  labels:\n    app: agent-server\n    version: v1.2.0\nspec:\n  replicas: 3\n  strategy:\n    type: RollingUpdate\n    rollingUpdate:\n      maxSurge: 1\n      maxUnavailable: 0  # 零停机\n  selector:\n    matchLabels:\n      app: agent-server\n  template:\n    metadata:\n      labels:\n        app: agent-server\n    spec:\n      # 安全：非root用户\n      securityContext:\n        runAsNonRoot: true\n        runAsUser: 1000\n        fsGroup: 1000\n      \n      containers:\n      - name: agent-server\n        image: registry.example.com/agent-server:v1.2.0\n        ports:\n        - containerPort: 8000\n          name: http\n        \n        # 环境变量\n        env:\n        - name: DATABASE_URL\n          valueFrom:\n            secretKeyRef:\n              name: agent-secrets\n              key: database-url\n        - name: OPENAI_API_KEY\n          valueFrom:\n            secretKeyRef:\n              name: agent-secrets\n              key: openai-api-key\n        - name: LOG_LEVEL\n          value: \"info\"\n        \n        # 资源限制\n        resources:\n          requests:\n            cpu: \"500m\"\n            memory: \"512Mi\"\n          limits:\n            cpu: \"2000m\"\n            memory: \"2Gi\"\n        \n        # 健康检查\n        livenessProbe:\n          httpGet:\n            path: /health\n            port: 8000\n          initialDelaySeconds: 30\n          periodSeconds: 30\n          timeoutSeconds: 5\n          failureThreshold: 3\n        \n        readinessProbe:\n          httpGet:\n            path: /ready\n            port: 8000\n          initialDelaySeconds: 10\n          periodSeconds: 10\n          timeoutSeconds: 3\n          failureThreshold: 3\n        \n        # 优雅关闭\n        lifecycle:\n          preStop:\n            exec:\n              command: [\"/bin/sh\", \"-c\", \"sleep 10\"]\n      \n      # 终止宽限期（与优雅关闭配合）\n      
terminationGracePeriodSeconds: 30",
      "section_ref": "C.3.1 基础部署配置",
      "runnable": false,
      "dependencies": []
    },
    {
      "id": "code-6",
      "language": "yaml",
      "description": "",
      "code": "# service.yaml\napiVersion: v1\nkind: Service\nmetadata:\n  name: agent-service\nspec:\n  selector:\n    app: agent-server\n  ports:\n  - port: 80\n    targetPort: 8000\n  type: ClusterIP\n\n---\n# ingress.yaml\napiVersion: networking.k8s.io/v1\nkind: Ingress\nmetadata:\n  name: agent-ingress\n  annotations:\n    nginx.ingress.kubernetes.io/rate-limit: \"100\"\n    nginx.ingress.kubernetes.io/proxy-body-size: \"10m\"\n    nginx.ingress.kubernetes.io/proxy-read-timeout: \"300\"\n    cert-manager.io/cluster-issuer: \"letsencrypt-prod\"\nspec:\n  tls:\n  - hosts:\n    - agent.example.com\n    secretName: agent-tls\n  rules:\n  - host: agent.example.com\n    http:\n      paths:\n      - path: /\n        pathType: Prefix\n        backend:\n          service:\n            name: agent-service\n            port:\n              number: 80",
      "section_ref": "C.3.2 Service与Ingres",
      "runnable": false,
      "dependencies": []
    },
    {
      "id": "code-7",
      "language": "yaml",
      "description": "Agent应用的负载通常波动较大（取决于用户使用频率），HPA是必不可少的：",
      "code": "# hpa.yaml - 基于CPU和自定义指标的伸缩\napiVersion: autoscaling/v2\nkind: HorizontalPodAutoscaler\nmetadata:\n  name: agent-hpa\nspec:\n  scaleTargetRef:\n    apiVersion: apps/v1\n    kind: Deployment\n    name: agent-server\n  minReplicas: 2\n  maxReplicas: 20\n  metrics:\n  # CPU使用率\n  - type: Resource\n    resource:\n      name: cpu\n      target:\n        type: Utilization\n        averageUtilization: 60\n  # 自定义指标：每秒请求数\n  - type: Pods\n    pods:\n      metric:\n        name: http_requests_per_second\n      target:\n        type: AverageValue\n        averageValue: \"50\"\n  # 自定义指标：LLM Token队列长度\n  - type: Pods\n    pods:\n      metric:\n        name: llm_token_queue_depth\n      target:\n        type: AverageValue\n        averageValue: \"10000\"\n  behavior:\n    scaleUp:\n      stabilizationWindowSeconds: 60\n      policies:\n      - type: Percent\n        value: 100  # 每次最多翻倍\n        periodSeconds: 60\n    scaleDown:\n      stabilizationWindowSeconds: 300  # 缩容更保守\n      policies:\n      - type: Percent\n        value: 25  # 每次最多缩减25%\n        periodSeconds: 60",
      "section_ref": "C.3.3 HPA自动伸缩",
      "runnable": false,
      "dependencies": []
    },
    {
      "id": "code-8",
      "language": "yaml",
      "description": "- 预留缓冲：最小副本数至少为2，避免单点故障",
      "code": "# GPU节点配置\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: embedding-server\nspec:\n  template:\n    spec:\n      containers:\n      - name: embedding-server\n        image: registry.example.com/embedding:v1.0.0\n        resources:\n          limits:\n            nvidia.com/gpu: 1  # 请求1块GPU\n            memory: \"8Gi\"\n      # GPU节点选择器\n      nodeSelector:\n        gpu-type: nvidia-a10g\n      tolerations:\n      - key: nvidia.com/gpu\n        operator: Exists\n        effect: NoSchedule",
      "section_ref": "C.3.4 GPU调度（本地模型场景）",
      "runnable": false,
      "dependencies": []
    },
    {
      "id": "code-9",
      "language": "yaml",
      "description": "",
      "code": "# Secret创建（生产环境请使用External Secrets Operator）\napiVersion: v1\nkind: Secret\nmetadata:\n  name: agent-secrets\ntype: Opaque\nstringData:\n  database-url: postgresql://agent:xxx@postgres:5432/agent_db\n  openai-api-key: sk-...\n  anthropic-api-key: sk-ant-...\n  jwt-secret: $(openssl rand -hex 32)",
      "section_ref": "C.3.5 密钥管理",
      "runnable": false,
      "dependencies": []
    },
    {
      "id": "code-10",
      "language": "bash",
      "description": "jwt-secret: $(openssl rand -hex 32)",
      "code": "# 使用kubectl创建\nkubectl create secret generic agent-secrets \\\n  --from-literal=database-url=\"postgresql://...\" \\\n  --from-literal=openai-api-key=\"sk-...\" \\\n  --dry-run=client -o yaml | kubectl apply -f -",
      "section_ref": "C.3.5 密钥管理",
      "runnable": false,
      "dependencies": []
    },
    {
      "id": "code-11",
      "language": "python",
      "description": "⚠️ 限制：Lambda有15分钟超时、10GB内存限制，不适合长时间运行的Agent工作流。",
      "code": "# lambda_handler.py\nimport json\nimport httpx\nimport os\n\n# 在Lambda外初始化客户端（冷启动优化）\nOPENAI_API_KEY = os.environ[\"OPENAI_API_KEY\"]\nBASE_URL = \"https://api.openai.com/v1\"\n\n# HTTP客户端复用（避免每次请求创建）\nhttp_client = httpx.Client(timeout=60.0)\n\ndef handler(event, context):\n    \"\"\"Lambda入口函数\"\"\"\n    try:\n        # 解析请求\n        body = json.loads(event.get(\"body\", \"{}\"))\n        message = body.get(\"message\", \"\")\n        \n        if not message:\n            return {\n                \"statusCode\": 400,\n                \"body\": json.dumps({\"error\": \"message is required\"})\n            }\n        \n        # 调用LLM\n        response = http_client.post(\n            f\"{BASE_URL}/chat/completions\",\n            headers={\"Authorization\": f\"Bearer {OPENAI_API_KEY}\"},\n            json={\n                \"model\": \"gpt-4o-mini\",\n                \"messages\": [{\"role\": \"user\", \"content\": message}],\n                \"max_tokens\": 500,\n            },\n            timeout=30.0,\n        )\n        result = response.json()\n        \n        return {\n            \"statusCode\": 200,\n            \"headers\": {\"Content-Type\": \"application/json\"},\n            \"body\": json.dumps({\n                \"response\": result[\"choices\"][0][\"message\"][\"content\"],\n                \"tokens\": result[\"usage\"]\n            })\n        }\n        \n    except httpx.TimeoutException:\n        return {\"statusCode\": 504, \"body\": json.dumps({\"error\": \"LLM timeout\"})}\n    except Exception as e:\n        return {\"statusCode\": 500, \"body\": json.dumps({\"error\": str(e)})}",
      "section_ref": "C.4.1 AWS Lambda + A",
      "runnable": true,
      "dependencies": [
        "httpx"
      ]
    },
    {
      "id": "code-12",
      "language": "yaml",
      "description": "return {\"statusCode\": 500, \"body\": json.dumps({\"error\": str(e)})}",
      "code": "# serverless.yml\nservice: agent-api\n\nframeworkVersion: \"3\"\n\nprovider:\n  name: aws\n  runtime: python3.12\n  region: ap-east-1\n  timeout: 60  # 秒（最大900）\n  memorySize: 512  # MB（最大10240）\n  environment:\n    OPENAI_API_KEY: ${param:openai_api_key}\n\nfunctions:\n  chat:\n    handler: lambda_handler.handler\n    events:\n      - http:\n          path: chat\n          method: post\n          cors: true\n    provisionedConcurrency: 5  # 预留并发，减少冷启动\n\nplugins:\n  - serverless-python-requirements\n\npackage:\n  individually: false\n  patterns:\n    - \"!tests/**\"\n    - \"!docs/**\"",
      "section_ref": "C.4.1 AWS Lambda + A",
      "runnable": false,
      "dependencies": []
    },
    {
      "id": "code-13",
      "language": "python",
      "description": "",
      "code": "# api/chat.py\nfrom fastapi import FastAPI, Request\nfrom fastapi.responses import StreamingResponse\nimport httpx\nimport os\n\napp = FastAPI()\n\n@app.post(\"/api/chat\")\nasync def chat(request: Request):\n    body = await request.json()\n    message = body.get(\"message\", \"\")\n    \n    async with httpx.AsyncClient() as client:\n        async with client.stream(\n            \"POST\",\n            \"https://api.openai.com/v1/chat/completions\",\n            headers={\"Authorization\": f\"Bearer {os.environ['OPENAI_API_KEY']}\"},\n            json={\n                \"model\": \"gpt-4o-mini\",\n                \"messages\": [{\"role\": \"user\", \"content\": message}],\n                \"stream\": True,\n            },\n            timeout=60.0,\n        ) as response:\n            async def generate():\n                async for line in response.aiter_lines():\n                    if line.startswith(\"data: \") and line != \"data: [DONE]\":\n                        yield line + \"\\n\\n\"\n            \n            return StreamingResponse(\n                generate(),\n                media_type=\"text/event-stream\"\n            )",
      "section_ref": "C.4.2 Vercel部署（Pytho",
      "runnable": true,
      "dependencies": [
        "fastapi",
        "httpx"
      ]
    },
    {
      "id": "code-14",
      "language": "json",
      "description": ")",
      "code": "// vercel.json\n{\n  \"builds\": [\n    {\n      \"src\": \"api/**/*.py\",\n      \"use\": \"@vercel/python\"\n    }\n  ],\n  \"routes\": [\n    {\n      \"src\": \"/api/(.*)\",\n      \"dest\": \"/api/$1\"\n    }\n  ]\n}",
      "section_ref": "C.4.2 Vercel部署（Pytho",
      "runnable": false,
      "dependencies": []
    },
    {
      "id": "code-15",
      "language": "javascript",
      "description": "",
      "code": "// wrangler.toml\nname = \"agent-edge\"\nmain = \"src/index.js\"\ncompatibility_date = \"2025-01-01\"\n\n[vars]\nENVIRONMENT = \"production\"\n\n# 密钥通过 wrangler secret put OPENAI_API_KEY 设置",
      "section_ref": "C.4.3 Cloudflare Wor",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-16",
      "language": "javascript",
      "description": "",
      "code": "// src/index.js\nexport default {\n  async fetch(request, env) {\n    const { pathname } = new URL(request.url);\n    \n    if (pathname === \"/api/chat\" && request.method === \"POST\") {\n      const { message } = await request.json();\n      \n      const response = await fetch(\"https://api.openai.com/v1/chat/completions\", {\n        method: \"POST\",\n        headers: {\n          \"Authorization\": `Bearer ${env.OPENAI_API_KEY}`,\n          \"Content-Type\": \"application/json\",\n        },\n        body: JSON.stringify({\n          model: \"gpt-4o-mini\",\n          messages: [{ role: \"user\", content: message }],\n          max_tokens: 500,\n        }),\n      });\n      \n      return new Response(response.body, {\n        headers: { \"Content-Type\": \"application/json\" },\n      });\n    }\n    \n    return new Response(\"Agent Edge API\", { status: 404 });\n  },\n};",
      "section_ref": "C.4.3 Cloudflare Wor",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-17",
      "language": "text",
      "description": "",
      "code": "┌──────────────────────────────────────────────────┐\n│                   边缘设备                        │\n│  ┌────────────┐  ┌────────────┐  ┌────────────┐  │\n│  │ Agent      │  │ 嵌入模型   │  │ LLM推理    │  │\n│  │ 应用层     │→ │ (ONNX)    │→ │ (llama.cpp) │  │\n│  └────────────┘  └────────────┘  └────────────┘  │\n│         ↓                                      │\n│  ┌────────────────────────────────────────────┐  │\n│  │ 本地向量数据库 (Chroma / FAISS)            │  │\n│  └────────────────────────────────────────────┘  │\n│         ↓                                      │\n│  ┌────────────────────────────────────────────┐  │\n│  │ GPU / NPU 加速层                           │  │\n│  │ (CUDA / Metal / OpenVINO)                  │  │\n│  └────────────────────────────────────────────┘  │\n└──────────────────────────────────────────────────┘\n         ↕ (可选：云端同步)\n┌──────────────────────────────────────────────────┐\n│  云端：模型更新、知识库同步、日志上报              │\n└──────────────────────────────────────────────────┘",
      "section_ref": "C.5.1 本地模型部署架构",
      "runnable": false,
      "dependencies": []
    },
    {
      "id": "code-18",
      "language": "bash",
      "description": "GGUF格式 — llama.cpp",
      "code": "# 下载量化模型\nollama pull qwen2.5:7b-q4_K_M  # 4-bit量化\n\n# 使用llama.cpp直接部署\n./server -m qwen2.5-7b-q4_k_m.gguf \\\n    --port 8080 \\\n    --host 0.0.0.0 \\\n    --n-gpu-layers 99 \\  # 全部层放GPU\n    --ctx-size 4096 \\     # 上下文长度\n    --threads 8           # CPU线程数",
      "section_ref": "C.5.2 模型量化部署",
      "runnable": false,
      "dependencies": []
    },
    {
      "id": "code-19",
      "language": "python",
      "description": "🎯 推荐：Q4KM 是质量和速度的最佳平衡点，7B模型约需5GB显存。",
      "code": "import onnxruntime as ort\nimport numpy as np\n\n# 创建推理会话\nsession = ort.InferenceSession(\n    \"bge-large-zh-v1.5.onnx\",\n    providers=[\"CUDAExecutionProvider\", \"CPUExecutionProvider\"]  # GPU优先\n)\n\ndef embed(texts: list[str]) -> np.ndarray:\n    \"\"\"批量嵌入\"\"\"\n    # Tokenization（简化示例）\n    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors=\"np\")\n    \n    # ONNX推理\n    outputs = session.run(\n        None,\n        {\"input_ids\": inputs[\"input_ids\"], \"attention_mask\": inputs[\"attention_mask\"]}\n    )\n    \n    return outputs[0]  # (batch_size, hidden_dim)",
      "section_ref": "C.5.3 ONNX Runtime —",
      "runnable": true,
      "dependencies": [
        "onnxruntime",
        "numpy"
      ]
    },
    {
      "id": "code-20",
      "language": "python",
      "description": "请求批处理",
      "code": "import asyncio\nfrom openai import AsyncOpenAI\n\nclient = AsyncOpenAI()\n\nasync def process_batch(messages_list: list[list[dict]]) -> list[str]:\n    \"\"\"并发处理多个请求，带信号量控制\"\"\"\n    semaphore = asyncio.Semaphore(10)  # 最多10个并发\n    \n    async def single_call(messages):\n        async with semaphore:\n            response = await client.chat.completions.create(\n                model=\"gpt-4o\",\n                messages=messages,\n                max_tokens=500,\n            )\n            return response.choices[0].message.content\n    \n    results = await asyncio.gather(\n        *[single_call(msgs) for msgs in messages_list],\n        return_exceptions=True\n    )\n    return results",
      "section_ref": "C.6.1 LLM调用优化",
      "runnable": true,
      "dependencies": [
        "openai"
      ]
    },
    {
      "id": "code-21",
      "language": "python",
      "description": "连接池配置",
      "code": "import httpx\n\n# 复用HTTP连接，避免每次请求的TCP握手开销\nclient = httpx.AsyncClient(\n    timeout=60.0,\n    limits=httpx.Limits(\n        max_connections=100,        # 最大连接数\n        max_keepalive_connections=20,  # 最大保持连接\n        keepalive_expiry=300,       # 保持连接超时（秒）\n    ),\n    http2=True,  # 启用HTTP/2（如果服务端支持）\n)",
      "section_ref": "C.6.1 LLM调用优化",
      "runnable": true,
      "dependencies": [
        "httpx"
      ]
    },
    {
      "id": "code-22",
      "language": "python",
      "description": "语义缓存 — 避免重复LLM调用",
      "code": "import hashlib\nimport json\nfrom typing import Optional\n\nclass SemanticCache:\n    \"\"\"基于向量相似度的LLM响应缓存\"\"\"\n    \n    def __init__(self, similarity_threshold: float = 0.95):\n        self.cache = {}  # {embedding: response}\n        self.threshold = similarity_threshold\n    \n    def _get_cache_key(self, prompt: str, model: str, params: dict) -> str:\n        \"\"\"生成缓存键\"\"\"\n        data = {\n            \"prompt\": prompt,\n            \"model\": model,\n            \"temperature\": params.get(\"temperature\", 0),\n            \"max_tokens\": params.get(\"max_tokens\", 1000),\n        }\n        return hashlib.sha256(json.dumps(data, sort_keys=True).encode()).hexdigest()\n    \n    def get(self, prompt: str, model: str, params: dict) -> Optional[str]:\n        \"\"\"查询缓存\"\"\"\n        key = self._get_cache_key(prompt, model, params)\n        return self.cache.get(key)\n    \n    def set(self, prompt: str, model: str, params: dict, response: str):\n        \"\"\"写入缓存\"\"\"\n        key = self._get_cache_key(prompt, model, params)\n        self.cache[key] = response",
      "section_ref": "C.6.2 缓存策略",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-23",
      "language": "python",
      "description": "Redis缓存 — 高性能分布式缓存",
      "code": "import redis\nimport json\n\nredis_client = redis.Redis(host=\"localhost\", port=6379, db=0)\n\ndef cache_llm_response(\n    prompt: str,\n    response: str,\n    ttl: int = 3600  # 1小时\n):\n    \"\"\"缓存LLM响应\"\"\"\n    key = f\"llm:cache:{hashlib.md5(prompt.encode()).hexdigest()}\"\n    redis_client.setex(key, ttl, json.dumps(response))\n\ndef get_cached_response(prompt: str) -> Optional[str]:\n    \"\"\"获取缓存的LLM响应\"\"\"\n    key = f\"llm:cache:{hashlib.md5(prompt.encode()).hexdigest()}\"\n    cached = redis_client.get(key)\n    if cached:\n        return json.loads(cached)\n    return None",
      "section_ref": "C.6.2 缓存策略",
      "runnable": true,
      "dependencies": [
        "redis"
      ]
    },
    {
      "id": "code-24",
      "language": "python",
      "description": "",
      "code": "# 工具调用并发执行\nimport asyncio\n\nasync def execute_tools_concurrently(tools: list[dict]) -> list[dict]:\n    \"\"\"并发执行多个无依赖的工具调用\"\"\"\n    \n    async def run_tool(tool_call: dict) -> dict:\n        try:\n            result = await call_tool(\n                tool_call[\"name\"],\n                tool_call[\"arguments\"]\n            )\n            return {\"tool_call_id\": tool_call[\"id\"], \"result\": result}\n        except Exception as e:\n            return {\"tool_call_id\": tool_call[\"id\"], \"error\": str(e)}\n    \n    # 并发执行所有工具（假设工具之间无依赖）\n    results = await asyncio.gather(*[run_tool(t) for t in tools])\n    return list(results)",
      "section_ref": "C.6.3 异步并发模式",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-25",
      "language": "python",
      "description": "",
      "code": "import json\nimport os\n\nimport httpx\nfrom fastapi import FastAPI, Request\nfrom fastapi.responses import StreamingResponse\n\napp = FastAPI()\n# Read the key from the environment; the variable name OPENAI_API_KEY is an assumption\napi_key = os.environ[\"OPENAI_API_KEY\"]\n\nasync def stream_chat(message: str):\n    \"\"\"Stream the output to reduce time to first token.\"\"\"\n    async with httpx.AsyncClient() as client:\n        async with client.stream(\n            \"POST\",\n            \"https://api.openai.com/v1/chat/completions\",\n            json={\n                \"model\": \"gpt-4o\",\n                \"messages\": [{\"role\": \"user\", \"content\": message}],\n                \"stream\": True,\n            },\n            headers={\"Authorization\": f\"Bearer {api_key}\"},\n            timeout=60.0,\n        ) as response:\n            async for line in response.aiter_lines():\n                # Each SSE data line carries one JSON chunk; [DONE] marks the end\n                if line.startswith(\"data: \") and line != \"data: [DONE]\":\n                    data = json.loads(line[6:])\n                    delta = data[\"choices\"][0].get(\"delta\", {})\n                    content = delta.get(\"content\", \"\")\n                    if content:\n                        yield f\"data: {json.dumps({'content': content})}\\n\\n\"\n            yield \"data: [DONE]\\n\\n\"\n\n# FastAPI route\n@app.post(\"/chat/stream\")\nasync def chat_stream(request: Request):\n    body = await request.json()\n    return StreamingResponse(\n        stream_chat(body[\"message\"]),\n        media_type=\"text/event-stream\",\n        headers={\n            \"Cache-Control\": \"no-cache\",\n            \"X-Accel-Buffering\": \"no\",  # disable Nginx buffering\n        }\n    )",
      "section_ref": "C.6.4 流式输出优化",
      "runnable": true,
      "dependencies": [
        "fastapi",
        "httpx"
      ]
    },
    {
      "id": "code-26",
      "language": "python",
      "description": "",
      "code": "class ModelRouter:\n    \"\"\"Pick a model automatically based on task complexity.\"\"\"\n    \n    def __init__(self):\n        self.routes = {\n            \"simple\": {\"model\": \"gpt-4o-mini\", \"max_tokens\": 500},\n            \"medium\": {\"model\": \"gpt-4o\", \"max_tokens\": 1000},\n            \"complex\": {\"model\": \"gpt-4o\", \"max_tokens\": 4000},\n            \"reasoning\": {\"model\": \"o3-mini\", \"max_tokens\": 4000},\n        }\n    \n    def classify_complexity(self, prompt: str, conversation_length: int) -> str:\n        \"\"\"Heuristic classification; checks run in priority order.\"\"\"\n        if conversation_length > 10:\n            return \"complex\"\n        # Chinese keywords: analyze, compare, derive, prove\n        if any(kw in prompt for kw in [\"分析\", \"比较\", \"推导\", \"证明\"]):\n            return \"reasoning\"\n        if len(prompt) > 500:\n            return \"medium\"\n        return \"simple\"\n    \n    def get_model_config(self, prompt: str, history: list | None = None) -> dict:\n        \"\"\"Return the model config for the classified complexity.\"\"\"\n        length = len(history) if history else 0\n        complexity = self.classify_complexity(prompt, length)\n        return self.routes[complexity]",
      "section_ref": "C.7.1 模型路由策略",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-27",
      "language": "python",
      "description": "Expected savings: sensible use of model routing can cut API costs by 40-60%.",
      "code": "# Context window management\nclass ContextManager:\n    def __init__(self, max_tokens: int = 120000, reserve_for_response: int = 4000):\n        self.max_tokens = max_tokens\n        self.reserve = reserve_for_response\n        self.available = max_tokens - reserve_for_response\n    \n    def trim_messages(self, messages: list[dict]) -> list[dict]:\n        \"\"\"Trim the message history to fit the context window.\"\"\"\n        token_count = sum(estimate_tokens(msg[\"content\"]) for msg in messages)\n        \n        if token_count <= self.available:\n            return messages\n        \n        # Always keep the system prompt\n        system = [m for m in messages if m[\"role\"] == \"system\"]\n        non_system = [m for m in messages if m[\"role\"] != \"system\"]\n        \n        # Walk from newest to oldest and keep messages until the budget runs out,\n        # so the oldest messages are the ones dropped\n        trimmed = []\n        used = sum(estimate_tokens(m[\"content\"]) for m in system)\n        \n        for msg in reversed(non_system):\n            msg_tokens = estimate_tokens(msg[\"content\"])\n            if used + msg_tokens > self.available:\n                break  # stop here so the kept history stays contiguous\n            trimmed.insert(0, msg)\n            used += msg_tokens\n        \n        return system + trimmed\n\ndef estimate_tokens(text: str) -> int:\n    \"\"\"Rough estimate: ~1.5 tokens per Chinese character, ~0.25 tokens per other character.\"\"\"\n    chinese_chars = sum(1 for c in text if '\\u4e00' <= c <= '\\u9fff')\n    other_chars = len(text) - chinese_chars\n    return int(chinese_chars * 1.5 + other_chars * 0.25)",
      "section_ref": "C.7.2 Token用量控制",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-28",
      "language": "python",
      "description": "",
      "code": "from fastapi import FastAPI, Request\nfrom prometheus_client import Counter, Histogram, Gauge\n\napp = FastAPI()\n\n# Metric definitions\nREQUEST_COUNT = Counter(\n    \"agent_requests_total\",\n    \"Total agent requests\",\n    [\"model\", \"status\"]\n)\n\nREQUEST_LATENCY = Histogram(\n    \"agent_request_duration_seconds\",\n    \"Request latency\",\n    [\"model\"],\n    buckets=[0.5, 1, 2, 5, 10, 30, 60]\n)\n\nTOKEN_USAGE = Counter(\n    \"agent_tokens_total\",\n    \"Token usage\",\n    [\"model\", \"type\"]  # \"type\" label values: input / output\n)\n\nACTIVE_WORKFLOWS = Gauge(\n    \"agent_active_workflows\",\n    \"Currently active workflow count\"\n)\n\n# Usage example\n@app.post(\"/chat\")\nasync def chat(request: Request):\n    model = \"gpt-4o\"\n    ACTIVE_WORKFLOWS.inc()\n    \n    with REQUEST_LATENCY.labels(model=model).time():\n        try:\n            # call_llm is a placeholder for the application's own LLM call\n            result = await call_llm(...)\n            # Assumes the result carries an OpenAI-style usage block\n            usage = result[\"usage\"]\n            # Count each request once, by its final status, so totals are not doubled\n            REQUEST_COUNT.labels(model=model, status=\"success\").inc()\n            TOKEN_USAGE.labels(model=model, type=\"input\").inc(usage[\"prompt_tokens\"])\n            TOKEN_USAGE.labels(model=model, type=\"output\").inc(usage[\"completion_tokens\"])\n            return result\n        except Exception:\n            REQUEST_COUNT.labels(model=model, status=\"error\").inc()\n            raise\n        finally:\n            ACTIVE_WORKFLOWS.dec()",
      "section_ref": "C.8.2 Prometheus指标埋点",
      "runnable": true,
      "dependencies": [
        "fastapi",
        "prometheus_client"
      ]
    }
  ],
  "tables": [
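    {
      "headers": [
        "Mode",
        "Suitable scale",
        "Latency",
        "Availability",
        "Operational complexity",
        "Cost structure"
      ],
      "data": [
        [
          "**Single container**",
          "< 1K QPS",
          "< 100ms",
          "99%",
          "Low",
          "Fixed monthly fee"
        ],
        [
          "**K8s cluster**",
          "1K-100K QPS",
          "< 50ms",
          "99.9%",
          "High",
          "Elastic + fixed"
        ],
        [
          "**Serverless**",
          "Elastic load",
          "100-500ms",
          "99.5%",
          "Low",
          "Pay per invocation"
        ],
        [
          "**Edge deployment**",
          "Offline / low latency",
          "< 10ms",
          "99%+",
          "Medium",
          "One-time investment"
        ]
      ]
    },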
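    {
      "headers": [
        "Technique",
        "Effect",
        "Example"
      ],
      "data": [
        [
          "Multi-stage build",
          "Image size reduced by 50-80%",
          "See the Dockerfile above"
        ],
        [
          "`.dockerignore`",
          "Smaller build context",
          "Ignore `.git`, `venv`, `__pycache__`"
        ],
        [
          "Alpine base image",
          "Smaller base layer",
          "`python:3.12-alpine`"
        ],
        [
          "Layer cache optimization",
          "Faster repeated builds",
          "COPY dependency files first"
        ],
        [
          "Merged RUN instructions",
          "Fewer layers",
          "`RUN apt-get update && apt-get install -y ...`"
        ],
        [
          "`--no-cache-dir`",
          "Keeps the pip cache out of the image",
          "`pip install --no-cache-dir`"
        ]
      ]
    },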
    {
      "headers": [
        "Quantization level",
        "Model size (7B parameters)",
        "VRAM required",
        "Quality loss",
        "Speed"
      ],
      "data": [
        [
          "FP16 (unquantized)",
          "~14GB",
          "16GB",
          "0%",
          "1x"
        ],
        [
          "Q8_0",
          "~7.5GB",
          "8GB",
          "<1%",
          "1.5x"
        ],
        [
          "Q5_K_M",
          "~5.2GB",
          "6GB",
          "1-2%",
          "1.8x"
        ],
        [
          "Q4_K_M",
          "~4.4GB",
          "5GB",
          "2-3%",
          "2.0x"
        ],
        [
          "Q3_K_M",
          "~3.5GB",
          "4GB",
          "3-5%",
          "2.2x"
        ],
        [
          "Q2_K",
          "~2.9GB",
          "3.5GB",
          "5-10%",
          "2.5x"
        ]
      ]
    },
    {
      "headers": [
        "Strategy",
        "Fixed monthly cost",
        "Elastic cost",
        "Total monthly cost (estimate)",
        "Savings"
      ],
      "data": [
        [
          "Constant 3 replicas",
          "$300",
          "$500",
          "**$800**",
          "Baseline"
        ],
        [
          "HPA 2-10 replicas",
          "$200",
          "$300",
          "**$500**",
          "37%"
        ],
        [
          "Serverless",
          "$0",
          "$400",
          "**$400**",
          "50%"
        ],
        [
          "Hybrid (local + cloud)",
          "$100 (local hardware)",
          "$200",
          "**$300**",
          "62%"
        ]
      ]
    },
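    {
      "headers": [
        "Metric",
        "Description",
        "Alert threshold"
      ],
      "data": [
        [
          "**Request latency (P50/P99)**",
          "End-to-end latency",
          "P99 > 10s"
        ],
        [
          "**LLM API latency**",
          "Latency of external LLM calls",
          "> 30s"
        ],
        [
          "**Token usage per hour**",
          "LLM token consumption rate",
          "> 120% of budget"
        ],
        [
          "**Error rate**",
          "Share of 5xx errors",
          "> 1%"
        ],
        [
          "**Tool call success rate**",
          "Tool execution success rate",
          "< 95%"
        ],
        [
          "**Queue depth**",
          "Pending request queue",
          "> 1000"
        ],
        [
          "**Cache hit rate**",
          "Semantic cache hits",
          "< 30% (too low)"
        ]
      ]
    }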
  ],
  "key_takeaways": [],
  "common_pitfalls": [],
  "related_chapters": []
}