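{ "metadata": { "id": "appendixC", "title": "Appendix C: Agent Deployment Guide", "volume": "vol6", "volume_title": "Appendices", "word_count": 2844, "difficulty": "beginner", "prerequisites": [], "key_concepts": [ "C.1 Deployment Overview", "C.1.1 Deployment Mode Comparison", "C.1.2 Pre-Deployment Checklist", "C.2 Docker Containerized Deployment", "C.2.1 Dockerfile Best Practices", "C.2.3 Image Optimization Techniques", "C.3 Kubernetes Orchestration", "C.3.1 Basic Deployment Configuration", "C.3.2 Service and Ingress", "C.3.3 HPA Autoscaling", "C.3.4 GPU Scheduling (Local Model Scenarios)", "C.3.5 Secret Management", "C.4 Serverless Deployment", "C.4.2 Vercel Deployment (Python)", "C.5 Edge Deployment and Offline Inference" ], "learning_objectives": [], "estimated_tokens": 1706, "source_file": "vol6/appendixC_Agent部署指南.md" }, "overview": "", "sections": [ { "id": "C.1 部署概览", "title": "C.1 Deployment Overview", "level": 2, "content": "Deploying an Agent application differs significantly from deploying a traditional web application. Non-deterministic LLM calls, the security risks of tool execution, long-running workflows, and unpredictable resource consumption each place their own demands on the deployment architecture.", "subsections": [ { "id": "C.1.1 部署模式对比", "title": "C.1.1 Deployment Mode Comparison", "content": "| Mode | Suitable scale | Latency | Availability | Ops complexity | Cost structure |\n|------|---------|------|--------|-----------|---------|\n| **Single container** | < 1K QPS | < 100ms | 99% | Low | Fixed monthly fee |\n| **K8s cluster** | 1K-100K QPS | < 50ms | 99.9% | High | Elastic + fixed |\n| **Serverless** | Elastic load | 100-500ms | 99.5% | Low | Pay per invocation |\n| **Edge deployment** | Offline / low latency | < 10ms | 99%+ | Medium | One-time investment |" }, { "id": "C.1.2 部署前检查清单", "title": "C.1.2 Pre-Deployment Checklist", "content": "Before pushing an Agent to production, confirm the following:\n\n- [ ] **API key security**: all keys are injected via environment variables or a secret management service and are never hard-coded\n- [ ] **Error handling**: every LLM call, tool execution, and database operation has a timeout and a retry policy\n- [ ] **Logging**: structured logs that include key fields such as request ID, latency, and token usage\n- [ ] **Rate limiting**: guards against API abuse and runaway cost\n- [ ] **Content filtering**: safety filtering on both input and output\n- [ ] **Health check**: a `/health` endpoint returns the service status\n- [ ] **Graceful shutdown**: in-flight requests finish before the process exits\n- [ ] **Externalized configuration**: environments (dev/staging/prod) are switched through configuration\n\n---" } ] }, { "id": "C.2 Docker容器化部署", "title": "C.2 Docker Containerized Deployment", "level": 2, "content": "", "subsections": [ { "id": "C.2.1 Dockerfile最佳实践", "title": "C.2.1 Dockerfile Best Practices", "content": "**Multi-stage build: Python Agent application**\n\n\n**Multi-stage build: Rust Agent application (e.g. edict)**" }, { "id": "C.2.2 Docker Compose", "title": "C.2.2 Docker Compose: Complete Development/Deployment Environment", "content": "" }, { "id": "C.2.3 镜像优化技巧", "title": "C.2.3 Image Optimization Techniques", "content": "| Technique | Effect | Example |\n|------|------|------|\n| 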
Multi-stage build | Image size reduced by 50-80% | See the Dockerfiles in C.2.1 |\n| `.dockerignore` | Smaller build context | Ignore `.git`, `venv`, `__pycache__` |\n| Alpine base image | Smaller base layer | `python:3.12-alpine` |\n| Layer cache optimization | Faster repeated builds | COPY dependency files first |\n| Merged RUN instructions | Fewer layers | `RUN apt-get update && apt-get install -y ...` |\n| `--no-cache-dir` | Avoids the pip cache | `pip install --no-cache-dir` |\n\n\n---" } ] }, { "id": "C.3 Kubernetes编排", "title": "C.3 Kubernetes Orchestration", "level": 2, "content": "", "subsections": [ { "id": "C.3.1 基础部署配置", "title": "C.3.1 Basic Deployment Configuration", "content": "" }, { "id": "C.3.2 Service与Ingres", "title": "C.3.2 Service and Ingress", "content": "" }, { "id": "C.3.3 HPA自动伸缩", "title": "C.3.3 HPA Autoscaling", "content": "Agent application load usually fluctuates widely (it depends on how often users are active), so an HPA is essential:\n\n\n💡 **Agent-specific scaling strategy**:\n\n- The **LLM token queue** is a better scaling metric than QPS, because token consumption varies widely from request to request\n- **Scale up fast** (60-second stabilization window) and **scale down slowly** (300 seconds) to avoid flapping\n- **Keep a buffer**: use at least 2 minimum replicas to avoid a single point of failure" }, { "id": "C.3.4 GPU调度（本地模型场景）", "title": "C.3.4 GPU Scheduling (Local Model Scenarios)", "content": "" }, { "id": "C.3.5 密钥管理", "title": "C.3.5 Secret Management", "content": "---" } ] }, { "id": "C.4 Serverless部署", "title": "C.4 Serverless Deployment", "level": 2, "content": "", "subsections": [ { "id": "C.4.1 AWS Lambda + A", "title": "C.4.1 AWS Lambda + API Gateway", "content": "⚠️ **Limits**: Lambda has a 15-minute timeout and a 10 GB memory cap, so it is not suited to long-running Agent workflows." }, { "id": "C.4.2 Vercel部署（Pytho", "title": "C.4.2 Vercel Deployment (Python)", "content": "" }, { "id": "C.4.3 Cloudflare Wor", "title": "C.4.3 Cloudflare Workers (Edge Computing)", "content": "---" } ] }, { "id": "C.5 边缘部署与离线推理", "title": "C.5 Edge Deployment and Offline Inference", "level": 2, "content": "", "subsections": [ { "id": "C.5.1 本地模型部署架构", "title": "C.5.1 Local Model Deployment Architecture", "content": "" }, { "id": "C.5.2 模型量化部署", "title": "C.5.2 Quantized Model Deployment", "content": "**GGUF format: llama.cpp**\n\n\n**Quantization level comparison**\n\n| Quantization level | Model size (7B parameters) | VRAM required | Quality loss | Speed |\n|---------|------------------|---------|---------|------|\n| FP16 (unquantized) | ~14GB | 16GB | 0% | 1x |\n| Q8_0 | ~7.5GB | 8GB | <1% | 1.5x |\n| Q5_K_M | ~5.2GB | 6GB | 1-2% | 1.8x |\n| Q4_K_M | ~4.4GB | 5GB | 2-3% | 2.0x |\n| Q3_K_M | ~3.5GB | 4GB | 3-5% | 2.2x |\n| Q2_K | ~2.9GB 
| 3.5GB | 5-10% | 2.5x |\n\n🎯 **Recommendation**: Q4_K_M is the best balance of quality and speed; a 7B model needs roughly 5GB of VRAM." }, { "id": "C.5.3 ONNX Runtime —", "title": "C.5.3 ONNX Runtime: Embedding Model Acceleration", "content": "---" } ] }, { "id": "C.6 性能调优", "title": "C.6 Performance Tuning", "level": 2, "content": "", "subsections": [ { "id": "C.6.1 LLM调用优化", "title": "C.6.1 LLM Call Optimization", "content": "**Request batching**\n\n\n**Connection pool configuration**" }, { "id": "C.6.2 缓存策略", "title": "C.6.2 Caching Strategy", "content": "**Semantic cache: avoid repeated LLM calls**\n\n\n**Redis cache: high-performance distributed caching**" }, { "id": "C.6.3 异步并发模式", "title": "C.6.3 Async Concurrency Patterns", "content": "" }, { "id": "C.6.4 流式输出优化", "title": "C.6.4 Streaming Output Optimization", "content": "---" } ] }, { "id": "C.7 成本优化", "title": "C.7 Cost Optimization", "level": 2, "content": "", "subsections": [ { "id": "C.7.1 模型路由策略", "title": "C.7.1 Model Routing Strategy", "content": "**Expected savings**: sensible model routing can cut API costs by **40-60%**." }, { "id": "C.7.2 Token用量控制", "title": "C.7.2 Token Usage Control", "content": "" }, { "id": "C.7.3 按需伸缩成本模型", "title": "C.7.3 On-Demand Scaling Cost Model", "content": "| Strategy | Fixed monthly fee | Elastic cost | Total monthly cost (estimate) | Savings |\n|------|---------|---------|----------------|------|\n| Constant 3 replicas | $300 | $500 | **$800** | Baseline |\n| HPA 2-10 replicas | $200 | $300 | **$500** | 37% |\n| Serverless | $0 | $400 | **$400** | 50% |\n| Hybrid (local + cloud) | $100 (local hardware) | $200 | **$300** | 62% |\n\n💡 **Recommendation**: for most scenarios, K8s HPA plus model routing is the best choice, balancing performance and cost.\n\n---" } ] }, { "id": "C.8 监控与告警", "title": "C.8 Monitoring and Alerting", "level": 2, "content": "", "subsections": [ { "id": "C.8.1 核心监控指标", "title": "C.8.1 Core Monitoring Metrics", "content": "| Metric | Description | Alert threshold |\n|------|------|---------|\n| **Request latency (P50/P99)** | End-to-end latency | P99 > 10s |\n| **LLM API latency** | Latency of external LLM calls | > 30s |\n| **Token usage per hour** | LLM token consumption rate | > 120% of budget |\n| **Error rate** | Share of 5xx errors | > 1% |\n| **Tool call success rate** | Success rate of tool executions | < 95% |\n| **Queue depth** | Pending request queue | > 1000 |\n| **Cache hit rate** | Semantic cache hits | < 30% (too low) |" }, { "id": "C.8.2 Prometheus指标埋点", "title": "C.8.2 Prometheus Metric Instrumentation", "content": "" }, { "id": "C.8.3 Grafana Dashbo", "title": "C.8.3 Grafana Dashboard Configuration Essentials", "content": "Recommended dashboard panels:\n\n1. **Overview panel**: QPS, error rate, P50/P99 latency\n2. **LLM cost panel**: token usage trend, cost estimate, model distribution\n3. 
**Tool call panel**: per-tool call count, success rate, latency\n4. **Resource panel**: CPU/memory/GPU utilization, replica count\n5. **Business panel**: user satisfaction, conversation turns, task completion rate\n\n---\n\n*End of Appendix C*" } ] } ], "code_blocks": [ { "id": "code-1", "language": "dockerfile", "description": "Multi-stage build: Python Agent application", "code": "# ==================== Stage 1: install dependencies ====================\nFROM python:3.12-slim AS builder\n\nWORKDIR /build\n\n# Copy the dependency file first to take advantage of Docker layer caching\nCOPY requirements.txt .\nRUN pip install --no-cache-dir --prefix=/install -r requirements.txt\n\n# ==================== Stage 2: runtime ====================\nFROM python:3.12-slim AS runtime\n\n# Security: create a non-root user\nRUN groupadd -r agent && useradd -r -g agent -d /app agent\n\nWORKDIR /app\n\n# Copy the installed dependencies from the builder stage\nCOPY --from=builder /install /usr/local\n\n# Copy the application code\nCOPY --chown=agent:agent . .\n\n# Set runtime environment variables\nENV PYTHONUNBUFFERED=1 \\\n    PYTHONDONTWRITEBYTECODE=1 \\\n    PATH=\"/app:${PATH}\"\n\n# Switch to the non-root user\nUSER agent\n\n# Health check\nHEALTHCHECK --interval=30s --timeout=10s --retries=3 \\\n    CMD python -c \"import urllib.request; urllib.request.urlopen('http://localhost:8000/health')\"\n\n# Expose the port\nEXPOSE 8000\n\n# Start command\nCMD [\"uvicorn\", \"main:app\", \"--host\", \"0.0.0.0\", \"--port\", \"8000\", \"--workers\", \"4\"]", "section_ref": "C.2.1 Dockerfile最佳实践", "runnable": false, "dependencies": [] }, { "id": "code-2", "language": "dockerfile", "description": "Multi-stage build: Rust Agent application (e.g. edict)", "code": "# ==================== Stage 1: compile ====================\nFROM rust:1.83-slim AS builder\n\nWORKDIR /build\n\n# Cache dependencies by building a placeholder main.rs first\nCOPY Cargo.toml Cargo.lock ./\nRUN mkdir src && echo \"fn main(){}\" > src/main.rs\nRUN cargo build --release && rm -rf src\n\n# The real build\nCOPY . 
.\n# Refresh the mtime so cargo rebuilds the real sources instead of reusing the placeholder binary\nRUN touch src/main.rs && cargo build --release\n\n# ==================== Stage 2: runtime ====================\nFROM debian:bookworm-slim AS runtime\n\nRUN apt-get update && \\\n    apt-get install -y --no-install-recommends ca-certificates libssl3 && \\\n    rm -rf /var/lib/apt/lists/*\n\nWORKDIR /app\n\nCOPY --from=builder /build/target/release/agent-server /app/agent-server\nCOPY --from=builder /build/config /app/config\n\nRUN groupadd -r agent && useradd -r -g agent agent\nUSER agent\n\nHEALTHCHECK --interval=30s --timeout=5s --retries=3 \\\n    CMD [\"/app/agent-server\", \"--health-check\"]\n\nEXPOSE 8080\n\nCMD [\"/app/agent-server\", \"--config\", \"/app/config/production.toml\"]", "section_ref": "C.2.1 Dockerfile最佳实践", "runnable": false, "dependencies": [] }, { "id": "code-3", "language": "yaml", "description": "", "code": "# docker-compose.yml\nversion: \"3.8\"\n\nservices:\n  # Agent application\n  agent:\n    build:\n      context: .\n      dockerfile: Dockerfile\n    ports:\n      - \"8000:8000\"\n    environment:\n      # The database password here is for local development only; inject real credentials in production\n      - DATABASE_URL=postgresql://agent:password@postgres:5432/agent_db\n      - REDIS_URL=redis://redis:6379/0\n      - OPENAI_API_KEY=${OPENAI_API_KEY}\n      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}\n      - LOG_LEVEL=info\n    depends_on:\n      postgres:\n        condition: service_healthy\n      redis:\n        condition: service_healthy\n    restart: unless-stopped\n    deploy:\n      resources:\n        limits:\n          memory: 2G\n          cpus: \"2.0\"\n        reservations:\n          memory: 512M\n\n  # PostgreSQL: persistent storage\n  postgres:\n    image: postgres:16-alpine\n    environment:\n      POSTGRES_DB: agent_db\n      POSTGRES_USER: agent\n      POSTGRES_PASSWORD: password\n    volumes:\n      - postgres_data:/var/lib/postgresql/data\n    healthcheck:\n      test: [\"CMD-SHELL\", \"pg_isready -U agent\"]\n      interval: 10s\n      timeout: 5s\n      retries: 5\n    restart: unless-stopped\n\n  # Redis: cache and sessions\n  redis:\n    image: redis:7-alpine\n    command: redis-server --maxmemory 256mb --maxmemory-policy allkeys-lru\n    volumes:\n      - redis_data:/data\n    healthcheck:\n      test: [\"CMD\", \"redis-cli\", \"ping\"]\n      interval: 10s\n      timeout: 5s\n      retries: 5\n    restart: 
unless-stopped\n\n  # Vector database: embedded Chroma or standalone Milvus\n  # Embedded Chroma needs no separate container\n\n  # Nginx reverse proxy\n  nginx:\n    image: nginx:alpine\n    ports:\n      - \"80:80\"\n      - \"443:443\"\n    volumes:\n      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro\n      - ./nginx/ssl:/etc/nginx/ssl:ro\n    depends_on:\n      - agent\n    restart: unless-stopped\n\nvolumes:\n  postgres_data:\n  redis_data:", "section_ref": "C.2.2 Docker Compose", "runnable": false, "dependencies": [] }, { "id": "code-4", "language": "dockerfile", "description": ".dockerignore example", "code": "# .dockerignore\n.git\n.github\n__pycache__\n*.pyc\n.env\n.venv\nvenv\nnode_modules\n*.md\ntests/\ndocs/\n.mypy_cache\n.pytest_cache", "section_ref": "C.2.3 镜像优化技巧", "runnable": false, "dependencies": [] }, { "id": "code-5", "language": "yaml", "description": "", "code": "# deployment.yaml\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: agent-server\n  labels:\n    app: agent-server\n    version: v1.2.0\nspec:\n  replicas: 3\n  strategy:\n    type: RollingUpdate\n    rollingUpdate:\n      maxSurge: 1\n      maxUnavailable: 0  # zero downtime\n  selector:\n    matchLabels:\n      app: agent-server\n  template:\n    metadata:\n      labels:\n        app: agent-server\n    spec:\n      # Security: run as a non-root user\n      securityContext:\n        runAsNonRoot: true\n        runAsUser: 1000\n        fsGroup: 1000\n\n      containers:\n        - name: agent-server\n          image: registry.example.com/agent-server:v1.2.0\n          ports:\n            - containerPort: 8000\n              name: http\n\n          # Environment variables\n          env:\n            - name: DATABASE_URL\n              valueFrom:\n                secretKeyRef:\n                  name: agent-secrets\n                  key: database-url\n            - name: OPENAI_API_KEY\n              valueFrom:\n                secretKeyRef:\n                  name: agent-secrets\n                  key: openai-api-key\n            - name: LOG_LEVEL\n              value: \"info\"\n\n          # Resource limits\n          resources:\n            requests:\n              cpu: \"500m\"\n              memory: \"512Mi\"\n            limits:\n              cpu: \"2000m\"\n              memory: \"2Gi\"\n\n          # Health checks\n          livenessProbe:\n            httpGet:\n              path: /health\n              port: 8000\n            initialDelaySeconds: 30\n            periodSeconds: 30\n            timeoutSeconds: 5\n            failureThreshold: 3\n\n          readinessProbe:\n            httpGet:\n              path: /ready\n              port: 8000\n            
initialDelaySeconds: 10\n            periodSeconds: 10\n            timeoutSeconds: 3\n            failureThreshold: 3\n\n          # Graceful shutdown\n          lifecycle:\n            preStop:\n              exec:\n                command: [\"/bin/sh\", \"-c\", \"sleep 10\"]\n\n      # Termination grace period (works together with graceful shutdown)\n      terminationGracePeriodSeconds: 30", "section_ref": "C.3.1 基础部署配置", "runnable": false, "dependencies": [] }, { "id": "code-6", "language": "yaml", "description": "", "code": "# service.yaml\napiVersion: v1\nkind: Service\nmetadata:\n  name: agent-service\nspec:\n  selector:\n    app: agent-server\n  ports:\n    - port: 80\n      targetPort: 8000\n  type: ClusterIP\n\n---\n# ingress.yaml\napiVersion: networking.k8s.io/v1\nkind: Ingress\nmetadata:\n  name: agent-ingress\n  annotations:\n    # ingress-nginx rate limit: requests per second per client IP\n    nginx.ingress.kubernetes.io/limit-rps: \"100\"\n    nginx.ingress.kubernetes.io/proxy-body-size: \"10m\"\n    nginx.ingress.kubernetes.io/proxy-read-timeout: \"300\"\n    cert-manager.io/cluster-issuer: \"letsencrypt-prod\"\nspec:\n  tls:\n    - hosts:\n        - agent.example.com\n      secretName: agent-tls\n  rules:\n    - host: agent.example.com\n      http:\n        paths:\n          - path: /\n            pathType: Prefix\n            backend:\n              service:\n                name: agent-service\n                port:\n                  number: 80", "section_ref": "C.3.2 Service与Ingres", "runnable": false, "dependencies": [] }, { "id": "code-7", "language": "yaml", "description": "HPA scaling on CPU and custom metrics", "code": "# hpa.yaml - scaling on CPU and custom metrics\napiVersion: autoscaling/v2\nkind: HorizontalPodAutoscaler\nmetadata:\n  name: agent-hpa\nspec:\n  scaleTargetRef:\n    apiVersion: apps/v1\n    kind: Deployment\n    name: agent-server\n  minReplicas: 2\n  maxReplicas: 20\n  metrics:\n    # CPU utilization\n    - type: Resource\n      resource:\n        name: cpu\n        target:\n          type: Utilization\n          averageUtilization: 60\n    # Custom metric: requests per second\n    # (Pods metrics need a custom metrics adapter, e.g. prometheus-adapter)\n    - type: Pods\n      pods:\n        metric:\n          name: http_requests_per_second\n        target:\n          type: AverageValue\n          averageValue: \"50\"\n    # Custom metric: LLM token queue depth\n    - type: Pods\n      pods:\n        metric:\n          name: llm_token_queue_depth\n        target:\n          type: AverageValue\n          averageValue: \"10000\"\n  behavior:\n    scaleUp:\n      stabilizationWindowSeconds: 60\n      
policies:\n        - type: Percent\n          value: 100  # at most double per step\n          periodSeconds: 60\n    scaleDown:\n      stabilizationWindowSeconds: 300  # scale down more conservatively\n      policies:\n        - type: Percent\n          value: 25  # shrink by at most 25% per step\n          periodSeconds: 60", "section_ref": "C.3.3 HPA自动伸缩", "runnable": false, "dependencies": [] }, { "id": "code-8", "language": "yaml", "description": "GPU node scheduling", "code": "# GPU node configuration\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: embedding-server\nspec:\n  # selector and pod labels are required for a valid Deployment\n  selector:\n    matchLabels:\n      app: embedding-server\n  template:\n    metadata:\n      labels:\n        app: embedding-server\n    spec:\n      containers:\n        - name: embedding-server\n          image: registry.example.com/embedding:v1.0.0\n          resources:\n            limits:\n              nvidia.com/gpu: 1  # request 1 GPU\n              memory: \"8Gi\"\n      # GPU node selector\n      nodeSelector:\n        gpu-type: nvidia-a10g\n      tolerations:\n        - key: nvidia.com/gpu\n          operator: Exists\n          effect: NoSchedule", "section_ref": "C.3.4 GPU调度（本地模型场景）", "runnable": false, "dependencies": [] }, { "id": "code-9", "language": "yaml", "description": "", "code": "# Secret creation (use External Secrets Operator in production)\napiVersion: v1\nkind: Secret\nmetadata:\n  name: agent-secrets\ntype: Opaque\nstringData:\n  database-url: postgresql://agent:xxx@postgres:5432/agent_db\n  openai-api-key: sk-...\n  anthropic-api-key: sk-ant-...\n  # YAML does not evaluate shell substitutions: run `openssl rand -hex 32`\n  # and paste its output here\n  jwt-secret: replace-with-generated-value", "section_ref": "C.3.5 密钥管理", "runnable": false, "dependencies": [] }, { "id": "code-10", "language": "bash", "description": "Create the Secret with kubectl", "code": "# Create with kubectl\nkubectl create secret generic agent-secrets \\\n  --from-literal=database-url=\"postgresql://...\" \\\n  --from-literal=openai-api-key=\"sk-...\" \\\n  --dry-run=client -o yaml | kubectl apply -f -", "section_ref": "C.3.5 密钥管理", "runnable": false, "dependencies": [] }, { "id": "code-11", "language": "python", "description": "Lambda handler that calls the LLM API", "code": "# lambda_handler.py\nimport json\nimport httpx\nimport os\n\n# Initialize outside the handler (cold-start optimization)\nOPENAI_API_KEY = os.environ[\"OPENAI_API_KEY\"]\nBASE_URL = \"https://api.openai.com/v1\"\n\n# Reuse the HTTP client (avoid creating one per request)\nhttp_client = 
httpx.Client(timeout=60.0)\n\ndef handler(event, context):\n    \"\"\"Lambda entry point\"\"\"\n    try:\n        # Parse the request (API Gateway sends a null body when none is given)\n        body = json.loads(event.get(\"body\") or \"{}\")\n        message = body.get(\"message\", \"\")\n\n        if not message:\n            return {\n                \"statusCode\": 400,\n                \"body\": json.dumps({\"error\": \"message is required\"})\n            }\n\n        # Call the LLM\n        response = http_client.post(\n            f\"{BASE_URL}/chat/completions\",\n            headers={\"Authorization\": f\"Bearer {OPENAI_API_KEY}\"},\n            json={\n                \"model\": \"gpt-4o-mini\",\n                \"messages\": [{\"role\": \"user\", \"content\": message}],\n                \"max_tokens\": 500,\n            },\n            timeout=30.0,\n        )\n        response.raise_for_status()  # treat non-2xx API responses as errors\n        result = response.json()\n\n        return {\n            \"statusCode\": 200,\n            \"headers\": {\"Content-Type\": \"application/json\"},\n            \"body\": json.dumps({\n                \"response\": result[\"choices\"][0][\"message\"][\"content\"],\n                \"tokens\": result[\"usage\"]\n            })\n        }\n\n    except httpx.TimeoutException:\n        return {\"statusCode\": 504, \"body\": json.dumps({\"error\": \"LLM timeout\"})}\n    except Exception as e:\n        return {\"statusCode\": 500, \"body\": json.dumps({\"error\": str(e)})}", "section_ref": "C.4.1 AWS Lambda + A", "runnable": true, "dependencies": [ "httpx" ] }, { "id": "code-12", "language": "yaml", "description": "Serverless Framework configuration", "code": "# serverless.yml\nservice: agent-api\n\nframeworkVersion: \"3\"\n\nprovider:\n  name: aws\n  runtime: python3.12\n  region: ap-east-1\n  timeout: 60  # seconds (max 900)\n  memorySize: 512  # MB (max 10240)\n  environment:\n    OPENAI_API_KEY: ${param:openai_api_key}\n\nfunctions:\n  chat:\n    handler: lambda_handler.handler\n    events:\n      - http:\n          path: chat\n          method: post\n          cors: true\n    provisionedConcurrency: 5  # provisioned concurrency to reduce cold starts\n\nplugins:\n  - serverless-python-requirements\n\npackage:\n  individually: false\n  patterns:\n    - \"!tests/**\"\n    - \"!docs/**\"", "section_ref": "C.4.1 AWS Lambda + A", "runnable": false, "dependencies": [] }, { "id": "code-13", "language": "python", "description": "", "code": "# api/chat.py\nfrom 
fastapi import FastAPI, Request\nfrom fastapi.responses import StreamingResponse\nimport httpx\nimport os\n\napp = FastAPI()\n\n@app.post(\"/api/chat\")\nasync def chat(request: Request):\n    body = await request.json()\n    message = body.get(\"message\", \"\")\n\n    async def generate():\n        # Open the upstream stream inside the generator so the client and\n        # response stay open while the body is being sent\n        async with httpx.AsyncClient() as client:\n            async with client.stream(\n                \"POST\",\n                \"https://api.openai.com/v1/chat/completions\",\n                headers={\"Authorization\": f\"Bearer {os.environ['OPENAI_API_KEY']}\"},\n                json={\n                    \"model\": \"gpt-4o-mini\",\n                    \"messages\": [{\"role\": \"user\", \"content\": message}],\n                    \"stream\": True,\n                },\n                timeout=60.0,\n            ) as response:\n                async for line in response.aiter_lines():\n                    if line.startswith(\"data: \") and line != \"data: [DONE]\":\n                        yield line + \"\\n\\n\"\n\n    return StreamingResponse(\n        generate(),\n        media_type=\"text/event-stream\"\n    )", "section_ref": "C.4.2 Vercel部署（Pytho", "runnable": true, "dependencies": [ "fastapi", "httpx" ] }, { "id": "code-14", "language": "json", "description": "vercel.json", "code": "{\n  \"builds\": [\n    {\n      \"src\": \"api/**/*.py\",\n      \"use\": \"@vercel/python\"\n    }\n  ],\n  \"routes\": [\n    {\n      \"src\": \"/api/(.*)\",\n      \"dest\": \"/api/$1\"\n    }\n  ]\n}", "section_ref": "C.4.2 Vercel部署（Pytho", "runnable": false, "dependencies": [] }, { "id": "code-15", "language": "toml", "description": "", "code": "# wrangler.toml\nname = \"agent-edge\"\nmain = \"src/index.js\"\ncompatibility_date = \"2025-01-01\"\n\n[vars]\nENVIRONMENT = \"production\"\n\n# Set the secret with: wrangler secret put OPENAI_API_KEY", "section_ref": "C.4.3 Cloudflare Wor", "runnable": false, "dependencies": [] }, { "id": "code-16", "language": "javascript", "description": "", "code": "// src/index.js\nexport default {\n  async fetch(request, env) {\n    const { pathname } = new URL(request.url);\n\n    if (pathname === \"/api/chat\" && request.method === \"POST\") {\n      const { message } = await request.json();\n\n      const response = await 
fetch(\"https://api.openai.com/v1/chat/completions\", {\n        method: \"POST\",\n        headers: {\n          \"Authorization\": `Bearer ${env.OPENAI_API_KEY}`,\n          \"Content-Type\": \"application/json\",\n        },\n        body: JSON.stringify({\n          model: \"gpt-4o-mini\",\n          messages: [{ role: \"user\", content: message }],\n          max_tokens: 500,\n        }),\n      });\n\n      return new Response(response.body, {\n        status: response.status, // pass the upstream status through\n        headers: { \"Content-Type\": \"application/json\" },\n      });\n    }\n\n    return new Response(\"Agent Edge API\", { status: 404 });\n  },\n};", "section_ref": "C.4.3 Cloudflare Wor", "runnable": true, "dependencies": [] }, { "id": "code-17", "language": "text", "description": "", "code": "┌──────────────────────────────────────────────────┐\n│                   Edge device                    │\n│  ┌────────────┐  ┌────────────┐  ┌────────────┐  │\n│  │   Agent    │  │  Embedding │  │    LLM     │  │\n│  │ app layer  │→ │   model    │→ │ inference  │  │\n│  │            │  │   (ONNX)   │  │ (llama.cpp)│  │\n│  └────────────┘  └────────────┘  └────────────┘  │\n│                        ↓                         │\n│  ┌────────────────────────────────────────────┐  │\n│  │   Local vector database (Chroma / FAISS)   │  │\n│  └────────────────────────────────────────────┘  │\n│                        ↓                         │\n│  ┌────────────────────────────────────────────┐  │\n│  │        GPU / NPU acceleration layer        │  │\n│  │         (CUDA / Metal / OpenVINO)          │  │\n│  └────────────────────────────────────────────┘  │\n└──────────────────────────────────────────────────┘\n              ↕ (optional: cloud sync)\n┌──────────────────────────────────────────────────┐\n│    Cloud: model updates, knowledge base sync,    │\n│                  log reporting                   │\n└──────────────────────────────────────────────────┘", "section_ref": "C.5.1 本地模型部署架构", "runnable": false, "dependencies": [] }, { "id": "code-18", "language": "bash", "description": "GGUF format: llama.cpp", "code": "# Download a quantized model\nollama pull qwen2.5:7b-q4_K_M  # 4-bit quantization\n\n# Serve directly with llama.cpp\n# (comments cannot follow a line-continuation backslash, so the flags are explained here)\n#   --n-gpu-layers 99   put all layers on the GPU\n#   --ctx-size 4096     context length\n#   --threads 8         CPU threads\n./server -m qwen2.5-7b-q4_k_m.gguf \\\n  --port 8080 \\\n  --host 0.0.0.0 \\\n  --n-gpu-layers 99 \\\n  --ctx-size 4096 \\\n  --threads 8", "section_ref": "C.5.2 模型量化部署", "runnable": false, "dependencies": [] }, { "id": "code-19", "language": "python", "description": "ONNX Runtime embedding inference", "code": "import onnxruntime as 
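ort\nimport numpy as np\nfrom transformers import AutoTokenizer\n\n# Tokenizer for the exported model. Assumption: the original listing left\n# `tokenizer` undefined; here it is loaded from the matching Hugging Face model id\ntokenizer = AutoTokenizer.from_pretrained(\"BAAI/bge-large-zh-v1.5\")\n\n# Create the inference session\nsession = ort.InferenceSession(\n    \"bge-large-zh-v1.5.onnx\",\n    providers=[\"CUDAExecutionProvider\", \"CPUExecutionProvider\"]  # prefer GPU\n)\n\ndef embed(texts: list[str]) -> np.ndarray:\n    \"\"\"Batch embedding\"\"\"\n    # Tokenization (simplified example)\n    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors=\"np\")\n\n    # ONNX inference\n    outputs = session.run(\n        None,\n        {\"input_ids\": inputs[\"input_ids\"], \"attention_mask\": inputs[\"attention_mask\"]}\n    )\n\n    # Shape depends on the export: (batch_size, hidden_dim) if pooling is part of\n    # the graph, otherwise (batch_size, seq_len, hidden_dim), which still needs pooling\n    return outputs[0]", "section_ref": "C.5.3 ONNX Runtime —", "runnable": true, "dependencies": [ "onnxruntime", "numpy", "transformers" ] }, { "id": "code-20", "language": "python", "description": "Request batching", "code": "import asyncio\nfrom openai import AsyncOpenAI\n\nclient = AsyncOpenAI()\n\nasync def process_batch(messages_list: list[list[dict]]) -> list[str | BaseException]:\n    \"\"\"Process multiple requests concurrently, bounded by a semaphore\"\"\"\n    semaphore = asyncio.Semaphore(10)  # at most 10 concurrent calls\n\n    async def single_call(messages):\n        async with semaphore:\n            response = await client.chat.completions.create(\n                model=\"gpt-4o\",\n                messages=messages,\n                max_tokens=500,\n            )\n            return response.choices[0].message.content\n\n    # return_exceptions=True: failed calls come back as exception objects\n    results = await asyncio.gather(\n        *[single_call(msgs) for msgs in messages_list],\n        return_exceptions=True\n    )\n    return results", "section_ref": "C.6.1 LLM调用优化", "runnable": true, "dependencies": [ "openai" ] }, { "id": "code-21", "language": "python", "description": "Connection pool configuration", "code": "import httpx\n\n# Reuse HTTP connections to avoid a TCP handshake on every request\nclient = httpx.AsyncClient(\n    timeout=60.0,\n    limits=httpx.Limits(\n        max_connections=100,  # maximum connections\n        max_keepalive_connections=20,  # maximum idle keep-alive connections\n        keepalive_expiry=300,  # keep-alive expiry (seconds)\n    ),\n    http2=True,  # enable HTTP/2 if the server supports it (needs `pip install httpx[http2]`)\n)", "section_ref": "C.6.1 LLM调用优化", "runnable": true, "dependencies": [ "httpx" ] }, { "id": "code-22", "language": "python", "description": "Semantic cache: avoid repeated LLM calls", "code": "import hashlib\nimport json\nfrom typing import Optional\n\nclass SemanticCache:\n    \"\"\"LLM response cache. This simplified version matches on an exact hash of the prompt and parameters; a full semantic cache would compare embeddings against similarity_threshold.\"\"\"\n\n    def 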
__init__(self, similarity_threshold: float = 0.95):\n        self.cache = {}  # {cache_key: response}\n        self.threshold = similarity_threshold  # reserved for embedding-based matching; unused by the exact-match lookup\n\n    def _get_cache_key(self, prompt: str, model: str, params: dict) -> str:\n        \"\"\"Build the cache key\"\"\"\n        data = {\n            \"prompt\": prompt,\n            \"model\": model,\n            \"temperature\": params.get(\"temperature\", 0),\n            \"max_tokens\": params.get(\"max_tokens\", 1000),\n        }\n        return hashlib.sha256(json.dumps(data, sort_keys=True).encode()).hexdigest()\n\n    def get(self, prompt: str, model: str, params: dict) -> Optional[str]:\n        \"\"\"Look up the cache\"\"\"\n        key = self._get_cache_key(prompt, model, params)\n        return self.cache.get(key)\n\n    def set(self, prompt: str, model: str, params: dict, response: str):\n        \"\"\"Write to the cache\"\"\"\n        key = self._get_cache_key(prompt, model, params)\n        self.cache[key] = response", "section_ref": "C.6.2 缓存策略", "runnable": true, "dependencies": [] }, { "id": "code-23", "language": "python", "description": "Redis cache: high-performance distributed caching", "code": "import hashlib\nimport json\nfrom typing import Optional\n\nimport redis\n\nredis_client = redis.Redis(host=\"localhost\", port=6379, db=0)\n\ndef cache_llm_response(\n    prompt: str,\n    response: str,\n    ttl: int = 3600  # 1 hour\n):\n    \"\"\"Cache an LLM response\"\"\"\n    key = f\"llm:cache:{hashlib.md5(prompt.encode()).hexdigest()}\"\n    redis_client.setex(key, ttl, json.dumps(response))\n\ndef get_cached_response(prompt: str) -> Optional[str]:\n    \"\"\"Fetch a cached LLM response\"\"\"\n    key = f\"llm:cache:{hashlib.md5(prompt.encode()).hexdigest()}\"\n    cached = redis_client.get(key)\n    if cached:\n        return json.loads(cached)\n    return None", "section_ref": "C.6.2 缓存策略", "runnable": true, "dependencies": [ "redis" ] }, { "id": "code-24", "language": "python", "description": "", "code": "# Concurrent tool execution\nimport asyncio\n\nasync def execute_tools_concurrently(tools: list[dict]) -> list[dict]:\n    \"\"\"Run multiple independent tool calls concurrently\"\"\"\n\n    async def run_tool(tool_call: dict) -> dict:\n        try:\n            # call_tool is the application's own tool dispatcher (defined elsewhere)\n            result = await call_tool(\n                tool_call[\"name\"],\n                tool_call[\"arguments\"]\n            )\n            return {\"tool_call_id\": tool_call[\"id\"], 
\"result\": result}\n        except Exception as e:\n            return {\"tool_call_id\": tool_call[\"id\"], \"error\": str(e)}\n\n    # Run all tools concurrently (assumes no dependencies between them)\n    results = await asyncio.gather(*[run_tool(t) for t in tools])\n    return list(results)", "section_ref": "C.6.3 异步并发模式", "runnable": true, "dependencies": [] }, { "id": "code-25", "language": "python", "description": "", "code": "import json\nimport os\n\nimport httpx\nfrom fastapi import FastAPI, Request\nfrom fastapi.responses import StreamingResponse\n\napp = FastAPI()\napi_key = os.environ[\"OPENAI_API_KEY\"]  # assumption: the key is supplied via an environment variable\n\nasync def stream_chat(message: str):\n    \"\"\"Stream the output to reduce time to first token\"\"\"\n    async with httpx.AsyncClient() as client:\n        async with client.stream(\n            \"POST\",\n            \"https://api.openai.com/v1/chat/completions\",\n            json={\n                \"model\": \"gpt-4o\",\n                \"messages\": [{\"role\": \"user\", \"content\": message}],\n                \"stream\": True,\n            },\n            headers={\"Authorization\": f\"Bearer {api_key}\"},\n            timeout=60.0,\n        ) as response:\n            async for line in response.aiter_lines():\n                if line.startswith(\"data: \") and line != \"data: [DONE]\":\n                    data = json.loads(line[6:])\n                    delta = data[\"choices\"][0].get(\"delta\", {})\n                    content = delta.get(\"content\", \"\")\n                    if content:\n                        yield f\"data: {json.dumps({'content': content})}\\n\\n\"\n    yield \"data: [DONE]\\n\\n\"\n\n# FastAPI route\n@app.post(\"/chat/stream\")\nasync def chat_stream(request: Request):\n    body = await request.json()\n    return StreamingResponse(\n        stream_chat(body[\"message\"]),\n        media_type=\"text/event-stream\",\n        headers={\n            \"Cache-Control\": \"no-cache\",\n            \"X-Accel-Buffering\": \"no\",  # disable Nginx buffering\n        }\n    )", "section_ref": "C.6.4 流式输出优化", "runnable": true, "dependencies": [ "fastapi", "httpx" ] }, { "id": "code-26", "language": "python", "description": "", "code": "class ModelRouter:\n    \"\"\"Pick a model automatically based on task complexity\"\"\"\n\n    def __init__(self):\n        self.routes = {\n            \"simple\": {\"model\": \"gpt-4o-mini\", \"max_tokens\": 500},\n            \"medium\": {\"model\": \"gpt-4o\", \"max_tokens\": 1000},\n            \"complex\": {\"model\": \"gpt-4o\", \"max_tokens\": 4000},\n            \"reasoning\": {\"model\": \"o3-mini\", \"max_tokens\": 4000},\n        }\n\n    def 
classify_complexity(self, prompt: str, conversation_length: int) -> str:\n        \"\"\"Heuristic classification\"\"\"\n        if conversation_length > 10:\n            return \"complex\"\n        # Chinese keywords: analyze, compare, derive, prove\n        if any(kw in prompt for kw in [\"分析\", \"比较\", \"推导\", \"证明\"]):\n            return \"reasoning\"\n        if len(prompt) > 500:\n            return \"medium\"\n        return \"simple\"\n\n    def get_model_config(self, prompt: str, history: list | None = None) -> dict:\n        \"\"\"Return the model configuration for this request\"\"\"\n        length = len(history) if history else 0\n        complexity = self.classify_complexity(prompt, length)\n        return self.routes[complexity]", "section_ref": "C.7.1 模型路由策略", "runnable": true, "dependencies": [] }, { "id": "code-27", "language": "python", "description": "Context window management", "code": "# Context window management\nclass ContextManager:\n    def __init__(self, max_tokens: int = 120000, reserve_for_response: int = 4000):\n        self.max_tokens = max_tokens\n        self.reserve = reserve_for_response\n        self.available = max_tokens - reserve_for_response\n\n    def trim_messages(self, messages: list[dict]) -> list[dict]:\n        \"\"\"Trim the message history to fit the context window\"\"\"\n        token_count = sum(estimate_tokens(msg[\"content\"]) for msg in messages)\n\n        if token_count <= self.available:\n            return messages\n\n        # Always keep the system prompt\n        system = [m for m in messages if m[\"role\"] == \"system\"]\n        non_system = [m for m in messages if m[\"role\"] != \"system\"]\n\n        # Walk from newest to oldest and keep messages while they fit,\n        # so the oldest messages are the ones dropped\n        trimmed = []\n        used = sum(estimate_tokens(m[\"content\"]) for m in system)\n\n        for msg in reversed(non_system):\n            msg_tokens = estimate_tokens(msg[\"content\"])\n            if used + msg_tokens > self.available:\n                break  # stop here so the kept history stays contiguous\n            trimmed.insert(0, msg)\n            used += msg_tokens\n\n        return system + trimmed\n\ndef estimate_tokens(text: str) -> int:\n    \"\"\"Rough token estimate: about 1.5 tokens per Chinese character and 0.25 tokens per other character\"\"\"\n    chinese_chars = sum(1 for c in text if '\\u4e00' <= c <= '\\u9fff')\n    other_chars = len(text) - chinese_chars\n    return int(chinese_chars * 1.5 + other_chars * 0.25)", "section_ref": "C.7.2 Token用量控制", "runnable": true, "dependencies": [] }, { "id": "code-28", "language": "python", 
"description": "| 缓存命中率 | 语义缓存命中 | < 30%（过低） |", "code": "from prometheus_client import Counter, Histogram, Gauge\n\n# 指标定义\nREQUEST_COUNT = Counter(\n \"agent_requests_total\",\n \"Total agent requests\",\n [\"model\", \"status\"]\n)\n\nREQUEST_LATENCY = Histogram(\n \"agent_request_duration_seconds\",\n \"Request latency\",\n [\"model\"],\n buckets=[0.5, 1, 2, 5, 10, 30, 60]\n)\n\nTOKEN_USAGE = Counter(\n \"agent_tokens_total\",\n \"Token usage\",\n [\"model\", \"type\"] # type: input/output\n)\n\nACTIVE_WORKFLOWS = Gauge(\n \"agent_active_workflows\",\n \"Currently active workflow count\"\n)\n\n# 使用示例\n@app.post(\"/chat\")\nasync def chat(request: Request):\n model = \"gpt-4o\"\n REQUEST_COUNT.labels(model=model, status=\"started\").inc()\n ACTIVE_WORKFLOWS.inc()\n \n with REQUEST_LATENCY.labels(model=model).time():\n try:\n result = await call_llm(...)\n REQUEST_COUNT.labels(model=model, status=\"success\").inc()\n TOKEN_USAGE.labels(model=model, type=\"input\").inc(input_tokens)\n TOKEN_USAGE.labels(model=model, type=\"output\").inc(output_tokens)\n return result\n except Exception:\n REQUEST_COUNT.labels(model=model, status=\"error\").inc()\n raise\n finally:\n ACTIVE_WORKFLOWS.dec()", "section_ref": "C.8.2 Prometheus指标埋点", "runnable": true, "dependencies": [ "prometheus_client" ] } ], "tables": [ { "headers": [ "模式", "适用规模", "延迟", "可用性", "运维复杂度", "成本结构" ], "data": [ [ "**单容器**", "< 1K QPS", "< 100ms", "99%", "低", "固定月费" ], [ "**K8s集群**", "1K-100K QPS", "< 50ms", "99.9%", "高", "弹性+固定" ], [ "**Serverless**", "弹性负载", "100-500ms", "99.5%", "低", "按调用计费" ], [ "**边缘部署**", "离线/低延迟", "< 10ms", "99%+", "中", "一次性投入" ] ] }, { "headers": [ "技巧", "效果", "示例" ], "data": [ [ "多阶段构建", "镜像体积减少50-80%", "如上Dockerfile" ], [ "`.dockerignore`", "减少构建上下文", "忽略`.git`, `venv`, `__pycache__`" ], [ "Alpine基础镜像", "减少基础层大小", "`python:3.12-alpine`" ], [ "层缓存优化", "加速重复构建", "先COPY依赖文件" ], [ "合并RUN指令", "减少层数", "`RUN apt-get update && apt-get install -y ...`" ], [ "`--no-cache-dir`", 
"避免pip缓存", "`pip install --no-cache-dir`" ] ] }, { "headers": [ "量化等级", "模型大小（7B参数）", "显存需求", "质量损失", "速度" ], "data": [ [ "FP16（无量化）", "~14GB", "16GB", "0%", "1x" ], [ "Q8_0", "~7.5GB", "8GB", "<1%", "1.5x" ], [ "Q5_K_M", "~5.2GB", "6GB", "1-2%", "1.8x" ], [ "Q4_K_M", "~4.4GB", "5GB", "2-3%", "2.0x" ], [ "Q3_K_M", "~3.5GB", "4GB", "3-5%", "2.2x" ], [ "Q2_K", "~2.9GB", "3.5GB", "5-10%", "2.5x" ] ] }, { "headers": [ "策略", "固定月费", "弹性费用", "月总成本（估算）", "节省" ], "data": [ [ "恒定3副本", "$300", "$500", "**$800**", "基准" ], [ "HPA 2-10副本", "$200", "$300", "**$500**", "37%" ], [ "Serverless", "$0", "$400", "**$400**", "50%" ], [ "混合（本地+云端）", "$100（本地硬件）", "$200", "**$300**", "62%" ] ] }, { "headers": [ "指标", "说明", "告警阈值" ], "data": [ [ "**请求延迟 (P50/P99)**", "端到端延迟", "P99 > 10s" ], [ "**LLM API延迟**", "外部LLM调用延迟", "> 30s" ], [ "**Token用量/小时**", "LLM Token消耗速率", "> 预算的120%" ], [ "**错误率**", "5xx错误占比", "> 1%" ], [ "**工具调用成功率**", "工具执行成功率", "< 95%" ], [ "**队列深度**", "待处理请求队列", "> 1000" ], [ "**缓存命中率**", "语义缓存命中", "< 30%（过低）" ] ] } ], "key_takeaways": [], "common_pitfalls": [], "related_chapters": [] }