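{
  "metadata": {
    "id": "ch38",
    "title": "Chapter 38: Monitoring and Alerting",
    "volume": "vol10",
    "volume_title": "Production-Grade Agent Platforms",
    "word_count": 2174,
    "difficulty": "advanced",
    "prerequisites": [
      "ch15",
      "ch36"
    ],
    "key_concepts": [
      "Overview: The Three Pillars of Observability",
      "Metrics System Design",
      "Metric Categories",
      "Core Agent Metric Definitions",
      "Metrics Collection Middleware",
      "Prometheus Configuration",
      "Distributed Tracing",
      "OpenTelemetry Integration",
      "Agent Trace Span Design",
      "Trace Sampling Strategies",
      "Log Aggregation",
      "Structured Logging",
      "Log Level Conventions",
      "ELK Log Aggregation Configuration",
      "Alert Rule Design"
    ],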
    "learning_objectives": [],
    "estimated_tokens": 1304,
    "source_file": "vol10/ch38_监控与告警体系.md"
  },
  "overview": "",
  "sections": [
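    {
      "id": "38.1",
      "title": "38.1 Overview: The Three Pillars of Observability",
      "level": 2,
      "content": "Monitoring is the \"eyes and ears\" of a production environment. For an Agent platform, it must cover not only traditional system metrics (CPU, memory, latency) but also Agent-specific metrics (token consumption, model call latency, prompt success rate, tool execution success rate).\n\nObservability rests on three pillars:\n\n\n| Pillar | Question it answers | Tools | Data format |\n|------|-----------|------|----------|\n| Logs | What happened? | ELK / Loki | Unstructured or structured text |\n| Metrics | How much, and how fast? | Prometheus / VictoriaMetrics | Time series |\n| Traces | Where did it happen? | Jaeger / Zipkin / Tempo | Span tree |",
      "subsections": []
    },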
    {
      "id": "38.2",
      "title": "38.2 Metrics System Design",
      "level": 2,
      "content": "",
      "subsections": [
        {
          "id": "38.2.1",
          "title": "38.2.1 Metric Categories",
          "content": "Organize the Agent platform's metrics with the USE method (Utilization, Saturation, Errors) for resources and the RED method (Rate, Errors, Duration) for services, then add a third group for Agent business metrics:"
        },
        {
          "id": "38.2.2",
          "title": "38.2.2 Core Agent Metric Definitions",
          "content": ""
        },
        {
          "id": "38.2.3",
          "title": "38.2.3 Metrics Collection Middleware",
          "content": ""
        },
        {
          "id": "38.2.4",
          "title": "38.2.4 Prometheus Configuration",
          "content": ""
        }
      ]
    },
    {
      "id": "38.3",
      "title": "38.3 Distributed Tracing",
      "level": 2,
      "content": "",
      "subsections": [
        {
          "id": "38.3.1",
          "title": "38.3.1 OpenTelemetry Integration",
          "content": "A single request to the Agent platform may pass through several services, so distributed tracing is key to locating performance bottlenecks and failures."
        },
        {
          "id": "38.3.2",
          "title": "38.3.2 Agent Trace Span Design",
          "content": ""
        },
        {
          "id": "38.3.3",
          "title": "38.3.3 Trace Sampling Strategies",
          "content": ""
        }
      ]
    },
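    {
      "id": "38.4",
      "title": "38.4 Log Aggregation",
      "level": 2,
      "content": "",
      "subsections": [
        {
          "id": "38.4.1",
          "title": "38.4.1 Structured Logging",
          "content": ""
        },
        {
          "id": "38.4.2",
          "title": "38.4.2 Log Level Conventions",
          "content": "| Level | Purpose | Examples |\n|------|------|------|\n| ERROR | Failures that break user-facing functionality | Repeated LLM call failures, database write failures |\n| WARN | Anomalies that may degrade the user experience | LLM response timeout (retry succeeded), high cache miss rate |\n| INFO | Key business events | Session created, message sent, tool invoked |\n| DEBUG | Debugging information | Request parameters, internal decision steps |\n| TRACE | Fine-grained tracing | Intermediate results of every step (development only) |"
        },
        {
          "id": "38.4.3",
          "title": "38.4.3 ELK Log Aggregation Configuration",
          "content": ""
        }
      ]
    },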
    {
      "id": "38.5",
      "title": "38.5 Alert Rule Design",
      "level": 2,
      "content": "",
      "subsections": [
        {
          "id": "38.5.1",
          "title": "38.5.1 Alert Severity Levels",
          "content": "| Level | Response time | Notification channels | Example |\n|------|----------|----------|------|\n| P0 - Critical | Within 5 minutes | Phone + SMS + IM | Service completely unavailable |\n| P1 - Severe | Within 15 minutes | SMS + IM | LLM provider-wide outage |\n| P2 - Warning | Within 1 hour | IM + email | Elevated error rate without an outage |\n| P3 - Notice | During working hours | Email | Disk usage > 80% |"
        },
        {
          "id": "38.5.2",
          "title": "38.5.2 Alert Rules (Prometheus Alertmanager)",
          "content": ""
        },
        {
          "id": "38.5.3",
          "title": "38.5.3 Alert Grouping and Inhibition",
          "content": ""
        }
      ]
    },
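    {
      "id": "38.6",
      "title": "38.6 Dashboards",
      "level": 2,
      "content": "",
      "subsections": [
        {
          "id": "38.6.1",
          "title": "38.6.1 Grafana Dashboard Design",
          "content": "The Agent platform needs the following core dashboards:\n\n**1. System Overview Dashboard**\n\n\n**2. LLM Cost Dashboard**\n\nKey panels:\n- Token consumption trend (by provider / model / tenant)\n- LLM API call latency distribution\n- Daily and monthly cost trend\n- Error rate and retry count\n- Cost forecast (based on the current trend)\n\n**3. Agent Quality Dashboard**\n\nKey panels:\n- Prompt success rate (by Agent type)\n- Tool call success rate (by tool name)\n- RAG retrieval relevance score\n- User feedback statistics (👍/👎 ratio)\n- Average number of conversation turns"
        },
        {
          "id": "38.6.2",
          "title": "38.6.2 SLO Dashboard",
          "content": ""
        }
      ]
    },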
    {
      "id": "38.7",
      "title": "38.7 SLO/SLI/SLA Management",
      "level": 2,
      "content": "",
      "subsections": [
        {
          "id": "38.7.1",
          "title": "38.7.1 Definitions",
          "content": "| Concept | Definition | Example |\n|------|------|------|\n| SLI (Service Level Indicator) | A quantifiable measure of service quality | Request success rate, P95 latency |\n| SLO (Service Level Objective) | A target value set on an SLI | API availability ≥ 99.95% |\n| SLA (Service Level Agreement) | A formal agreement with customers | Monthly availability SLA with compensation for breaches |\n| Error Budget | The amount of failure the SLO tolerates | About 21.6 minutes of downtime per 30 days at 99.95% |"
        },
        {
          "id": "38.7.2",
          "title": "38.7.2 Error Budget Policy",
          "content": ""
        },
        {
          "id": "38.7.3",
          "title": "38.7.3 SLO Attainment Report",
          "content": ""
        }
      ]
    },
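    {
      "id": "38.8",
      "title": "38.8 Chapter Summary",
      "level": 2,
      "content": "This chapter covered the Agent platform's monitoring and alerting system:\n\n1. **Metrics**: a complete metric set spanning system, service, and Agent business metrics\n2. **Distributed tracing**: end-to-end tracing built on OpenTelemetry to pinpoint performance bottlenecks\n3. **Log aggregation**: structured logs plus ELK/Loki for centralized log management\n4. **Alert rules**: tiered alerts (P0-P3) with alert grouping and inhibition\n5. **Dashboards**: Grafana dashboards tailored to different roles\n6. **SLO management**: a reliability framework based on error budgets\n\nThe goal of monitoring is not to chase numbers but to build a deep understanding of system behavior and so make better engineering decisions. The next chapter covers security and access control.",
      "subsections": []
    }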
  ],
  "code_blocks": [
    {
      "id": "code-1",
      "language": "mermaid",
      "description": "Observability rests on three pillars:",
      "code": "graph TB\n    Obs[Observability]\n    Obs --> Logs[\"Logging<br/>What happened?\"]\n    Obs --> Metrics[\"Metrics<br/>How much?\"]\n    Obs --> Traces[\"Tracing<br/>Where did it happen?\"]\n\n    Logs --> ELK[\"ELK Stack<br/>Elasticsearch +<br/>Logstash + Kibana\"]\n    Metrics --> Prometheus[\"Prometheus +<br/>Grafana\"]\n    Traces --> Jaeger[\"Jaeger /<br/>OpenTelemetry\"]",
      "section_ref": "38.1",
      "runnable": false,
      "dependencies": []
    },
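    {
      "id": "code-2",
      "language": "text",
      "description": "Organize the Agent platform's metrics with the USE method (Utilization, Saturation, Errors) and the RED method (Rate, Errors, Duration):",
      "code": "Metrics\n├── System metrics (USE)\n│   ├── CPU utilization\n│   ├── Memory utilization\n│   ├── Disk I/O\n│   └── Network bandwidth\n│\n├── Service metrics (RED)\n│   ├── Request rate (QPS)\n│   ├── Error rate (%)\n│   ├── Latency distribution (P50/P95/P99)\n│   └── Concurrent connections\n│\n└── Agent business metrics\n    ├── Token consumption (by model/user/tenant)\n    ├── LLM call latency (by provider/model)\n    ├── Prompt success rate\n    ├── Tool call success rate\n    ├── RAG retrieval recall\n    ├── Session completion rate\n    └── User satisfaction score",
      "section_ref": "38.2.1",
      "runnable": false,
      "dependencies": []
    },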
    {
      "id": "code-3",
      "language": "python",
      "description": "Prometheus metric definitions for the Agent platform",
      "code": "# metrics_definitions.py\nfrom prometheus_client import CollectorRegistry, Counter, Gauge, Histogram\n\nregistry = CollectorRegistry()\n\n# === Request metrics (RED) ===\nREQUEST_COUNT = Counter(\n    'agent_http_requests_total',\n    'Total HTTP requests',\n    ['method', 'endpoint', 'status_code', 'service'],\n    registry=registry\n)\n\nREQUEST_DURATION = Histogram(\n    'agent_http_request_duration_seconds',\n    'HTTP request duration in seconds',\n    ['method', 'endpoint', 'service'],\n    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0],\n    registry=registry\n)\n\nACTIVE_REQUESTS = Gauge(\n    'agent_http_requests_in_flight',\n    'Current number of HTTP requests being processed',\n    ['service'],\n    registry=registry\n)\n\n# === LLM call metrics ===\nLLM_REQUESTS = Counter(\n    'agent_llm_requests_total',\n    'Total LLM API requests',\n    ['provider', 'model', 'status'],\n    registry=registry\n)\n\nLLM_TOKENS = Counter(\n    'agent_llm_tokens_total',\n    'Total tokens consumed',\n    ['provider', 'model', 'token_type'],  # token_type: prompt/completion\n    registry=registry\n)\n\nLLM_LATENCY = Histogram(\n    'agent_llm_request_duration_seconds',\n    'LLM API request duration',\n    ['provider', 'model'],\n    buckets=[0.5, 1.0, 2.0, 5.0, 10.0, 20.0, 30.0, 60.0, 120.0],\n    registry=registry\n)\n\nLLM_COST = Counter(\n    'agent_llm_cost_usd_total',\n    'Total LLM cost in USD',\n    ['provider', 'model', 'user_tier', 'tenant_id'],\n    registry=registry\n)\n\n# === Agent business metrics ===\nAGENT_SESSIONS = Counter(\n    'agent_sessions_total',\n    'Total sessions created',\n    ['agent_type', 'tenant_id'],\n    registry=registry\n)\n\nAGENT_MESSAGES = Counter(\n    'agent_messages_total',\n    'Total messages processed',\n    ['role', 'agent_type', 'tenant_id'],\n    registry=registry\n)\n\nTOOL_CALLS = Counter(\n    'agent_tool_calls_total',\n    'Total tool invocations',\n    ['tool_name', 'status'],\n    registry=registry\n)\n\nTOOL_DURATION = Histogram(\n    'agent_tool_call_duration_seconds',\n    'Tool call duration',\n    ['tool_name'],\n    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0],\n    registry=registry\n)\n\nRAG_RETRIEVAL = Histogram(\n    'agent_rag_retrieval_duration_seconds',\n    'RAG retrieval duration',\n    ['collection_name'],\n    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.0],\n    registry=registry\n)\n\nRAG_TOP_K = Histogram(\n    'agent_rag_top_k_documents',\n    'Number of documents retrieved',\n    ['collection_name'],\n    buckets=[1, 3, 5, 10, 20, 50],\n    registry=registry\n)\n\n# === Resource metrics ===\n# NOTE: session_id is a high-cardinality label. Drop the series when a session\n# ends, e.g. CONTEXT_WINDOW_USAGE.remove(session_id, model).\nCONTEXT_WINDOW_USAGE = Gauge(\n    'agent_context_window_usage_ratio',\n    'Context window usage ratio (used/total)',\n    ['session_id', 'model'],\n    registry=registry\n)\n\nEMBEDDING_CACHE_HIT_RATE = Gauge(\n    'agent_embedding_cache_hit_rate',\n    'Embedding cache hit rate',\n    registry=registry\n)\n\n# === Quality metrics ===\nPROMPT_SUCCESS_RATE = Counter(\n    'agent_prompt_success_total',\n    'Prompt execution success/failure count',\n    ['agent_type', 'failure_reason'],\n    registry=registry\n)\n\n# Per-session and per-message IDs would create one series per message, so they\n# belong in logs or traces rather than in metric labels.\nUSER_FEEDBACK = Counter(\n    'agent_user_feedback_total',\n    'User feedback (thumbs up/down)',\n    ['agent_type', 'feedback_type'],\n    registry=registry\n)",
      "section_ref": "38.2.2",
      "runnable": true,
      "dependencies": [
        "prometheus_client"
      ]
    },
    {
      "id": "code-4",
      "language": "python",
      "description": "HTTP metrics middleware and LLM call metrics decorator",
      "code": "# metrics_middleware.py\nimport time\nfrom functools import wraps\n\nfrom starlette.middleware.base import BaseHTTPMiddleware\nfrom starlette.requests import Request\n\nfrom metrics_definitions import (\n    ACTIVE_REQUESTS, LLM_COST, LLM_LATENCY, LLM_REQUESTS, LLM_TOKENS,\n    REQUEST_COUNT, REQUEST_DURATION,\n)\n\nSERVICE_NAME = 'agent-service'\n\n# Prices per 1K tokens as (prompt, completion) in USD, keyed by model.\n# Left empty on purpose: fill in from your provider's current price list.\nPRICE_PER_1K_TOKENS: dict[str, tuple[float, float]] = {}\n\n\ndef calculate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:\n    \"\"\"Return the call cost in USD, or 0.0 if the model has no configured price.\"\"\"\n    prompt_price, completion_price = PRICE_PER_1K_TOKENS.get(model, (0.0, 0.0))\n    return (prompt_tokens * prompt_price + completion_tokens * completion_price) / 1000\n\n\nclass MetricsMiddleware(BaseHTTPMiddleware):\n    \"\"\"HTTP metrics collection middleware.\"\"\"\n\n    async def dispatch(self, request: Request, call_next):\n        start = time.perf_counter()\n        # NOTE: raw paths that embed IDs inflate label cardinality;\n        # prefer the route template where it is available.\n        labels = dict(\n            method=request.method,\n            endpoint=request.url.path,\n            service=SERVICE_NAME,\n        )\n        status_code = '500'  # reported if the handler raises\n\n        # Track in-flight requests\n        ACTIVE_REQUESTS.labels(service=SERVICE_NAME).inc()\n        try:\n            response = await call_next(request)\n            status_code = str(response.status_code)\n            return response\n        finally:\n            # Record count and latency for both successful and failed requests\n            REQUEST_COUNT.labels(status_code=status_code, **labels).inc()\n            REQUEST_DURATION.labels(**labels).observe(time.perf_counter() - start)\n            ACTIVE_REQUESTS.labels(service=SERVICE_NAME).dec()\n\n\nclass LLMMetricsDecorator:\n    \"\"\"Decorator that records LLM call metrics.\"\"\"\n\n    @staticmethod\n    def track(provider: str, model: str):\n        def decorator(func):\n            @wraps(func)\n            async def wrapper(*args, **kwargs):\n                start = time.perf_counter()\n                status = 'success'\n                prompt_tokens = 0\n                completion_tokens = 0\n                cost = 0.0\n\n                try:\n                    result = await func(*args, **kwargs)\n\n                    # Extract token usage if the response carries it\n                    usage = getattr(result, 'usage', None)\n                    if usage:\n                        prompt_tokens = usage.prompt_tokens\n                        completion_tokens = usage.completion_tokens\n                        cost = calculate_cost(model, prompt_tokens, completion_tokens)\n\n                    return result\n                except Exception:\n                    status = 'error'\n                    raise\n                finally:\n                    LLM_REQUESTS.labels(\n                        provider=provider, model=model, status=status\n                    ).inc()\n\n                    # Use the public inc() API; the private _value attribute\n                    # is not a number and cannot be incremented with +=.\n                    LLM_TOKENS.labels(\n                        provider=provider, model=model, token_type='prompt'\n                    ).inc(prompt_tokens)\n\n                    LLM_TOKENS.labels(\n                        provider=provider, model=model, token_type='completion'\n                    ).inc(completion_tokens)\n\n                    LLM_LATENCY.labels(\n                        provider=provider, model=model\n                    ).observe(time.perf_counter() - start)\n\n                    if cost > 0:\n                        LLM_COST.labels(\n                            provider=provider, model=model,\n                            user_tier=kwargs.get('user_tier', 'unknown'),\n                            tenant_id=kwargs.get('tenant_id', 'default')\n                        ).inc(cost)\n\n            return wrapper\n        return decorator",
      "section_ref": "38.2.3",
      "runnable": true,
      "dependencies": [
        "prometheus_client",
        "starlette"
      ]
    },
    {
      "id": "code-5",
      "language": "yaml",
      "description": "Prometheus scrape configuration",
      "code": "# prometheus.yml\nglobal:\n  scrape_interval: 15s\n  evaluation_interval: 15s\n\n  external_labels:\n    cluster: 'agent-platform-prod'\n    region: 'east-china'\n\nrule_files:\n  - 'alerts/*.yml'\n  - 'recording_rules/*.yml'\n\nscrape_configs:\n  - job_name: 'agent-services'\n    kubernetes_sd_configs:\n      - role: pod\n        namespaces:\n          names: ['agent-platform']\n    relabel_configs:\n      # Scrape only pods annotated with prometheus.io/scrape: \"true\"\n      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]\n        action: keep\n        regex: true\n      # Rewrite the target to <pod IP>:<port from the prometheus.io/port annotation>\n      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]\n        action: replace\n        regex: '([^:]+)(?::\\d+)?;(\\d+)'\n        replacement: '$1:$2'\n        target_label: __address__\n      - source_labels: [__meta_kubernetes_pod_label_app]\n        action: replace\n        target_label: job\n\n  - job_name: 'redis'\n    static_configs:\n      - targets: ['redis-exporter:9121']\n\n  - job_name: 'postgresql'\n    static_configs:\n      - targets: ['postgres-exporter:9187']\n\n  - job_name: 'node-exporter'\n    kubernetes_sd_configs:\n      - role: node\n    relabel_configs:\n      # Node discovery returns the kubelet address (port 10250 by default);\n      # point the scrape at node-exporter on 9100 instead.\n      - source_labels: [__address__]\n        regex: '(.*):10250'\n        replacement: '${1}:9100'\n        target_label: __address__",
      "section_ref": "38.2.4",
      "runnable": false,
      "dependencies": []
    },
    {
      "id": "code-6",
      "language": "python",
      "description": "A single request to the Agent platform may pass through several services, so distributed tracing is key to locating performance bottlenecks and failures.",
      "code": "# tracing_setup.py\nfrom opentelemetry import trace\nfrom opentelemetry.sdk.trace import TracerProvider\nfrom opentelemetry.sdk.trace.export import BatchSpanProcessor\nfrom opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter\nfrom opentelemetry.sdk.resources import Resource\n\ndef setup_tracing(service_name: str, otlp_endpoint: str):\n    \"\"\"Initialize OpenTelemetry tracing.\"\"\"\n    resource = Resource.create({\n        \"service.name\": service_name,\n        \"service.version\": \"2.1.0\",\n        \"deployment.environment\": \"production\"\n    })\n\n    provider = TracerProvider(resource=resource)\n\n    # OTLP exporter (sends spans to Jaeger/Tempo or an OpenTelemetry Collector)\n    otlp_exporter = OTLPSpanExporter(\n        endpoint=otlp_endpoint,\n        insecure=True  # plaintext gRPC; enable TLS outside a trusted network\n    )\n\n    # Export spans in batches on a background thread\n    provider.add_span_processor(\n        BatchSpanProcessor(otlp_exporter, max_queue_size=2048)\n    )\n\n    trace.set_tracer_provider(provider)\n    return trace.get_tracer(service_name)",
      "section_ref": "38.3.1",
      "runnable": true,
      "dependencies": [
        "opentelemetry"
      ]
    },
    {
      "id": "code-7",
      "language": "python",
      "description": "Span layout for a single chat request",
      "code": "# agent_tracing.py\nimport time\nfrom opentelemetry import trace\n\ntracer = trace.get_tracer(\"agent-service\")\n\n# load_session_history, rag_search, build_prompt, estimate_tokens, call_llm,\n# post_process, and save_message are application helpers defined elsewhere.\n\nasync def process_chat_request(session_id, user_message, context):\n    \"\"\"Handle a chat request with a complete trace.\"\"\"\n    request_start = time.perf_counter()\n\n    with tracer.start_as_current_span(\n        \"chat.request\",\n        attributes={\n            \"session.id\": session_id,\n            \"message.length\": len(user_message),\n        }\n    ) as parent_span:\n\n        # 1. Load session history\n        with tracer.start_as_current_span(\"session.load_history\"):\n            history = await load_session_history(session_id)\n            parent_span.set_attribute(\n                \"session.history_length\", len(history)\n            )\n\n        # 2. RAG retrieval (if enabled)\n        documents = []  # stays empty when RAG is disabled\n        if context.get(\"enable_rag\"):\n            with tracer.start_as_current_span(\n                \"rag.retrieval\",\n                attributes={\"collection\": context.get(\"collection\", \"default\")}\n            ) as rag_span:\n                start = time.perf_counter()\n                documents = await rag_search(user_message, top_k=5)\n                rag_span.set_attribute(\n                    \"rag.documents_retrieved\", len(documents)\n                )\n                rag_span.set_attribute(\"rag.duration_ms\",\n                    (time.perf_counter() - start) * 1000)\n\n        # 3. Build the prompt\n        with tracer.start_as_current_span(\"prompt.build\") as prompt_span:\n            prompt = build_prompt(history, user_message, documents)\n            prompt_span.set_attribute(\"prompt.length\", len(prompt))\n            prompt_span.set_attribute(\"prompt.token_count\",\n                estimate_tokens(prompt))\n\n        # 4. Call the LLM\n        with tracer.start_as_current_span(\n            \"llm.inference\",\n            attributes={\n                \"llm.provider\": context.get(\"provider\", \"openai\"),\n                \"llm.model\": context.get(\"model\", \"gpt-4o\"),\n            }\n        ) as llm_span:\n            start = time.perf_counter()\n            response = await call_llm(prompt, context)\n            latency = time.perf_counter() - start\n\n            llm_span.set_attribute(\"llm.duration_ms\", latency * 1000)\n            llm_span.set_attribute(\"llm.prompt_tokens\",\n                response.usage.prompt_tokens)\n            llm_span.set_attribute(\"llm.completion_tokens\",\n                response.usage.completion_tokens)\n            llm_span.set_attribute(\"llm.total_tokens\",\n                response.usage.total_tokens)\n\n        # 5. Post-process (tool calls, formatting)\n        with tracer.start_as_current_span(\"response.post_process\"):\n            final_response = post_process(response, context)\n\n        # 6. Save the message\n        with tracer.start_as_current_span(\"message.save\"):\n            await save_message(session_id, user_message, final_response)\n\n        # Record the end-to-end duration on the root span\n        parent_span.set_attribute(\"chat.total_duration_ms\",\n            (time.perf_counter() - request_start) * 1000)\n\n        return final_response",
      "section_ref": "38.3.2",
      "runnable": true,
      "dependencies": [
        "opentelemetry"
      ]
    },
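    {
      "id": "code-8",
      "language": "yaml",
      "description": "Sampling processors for the OpenTelemetry Collector",
      "code": "# otel-collector-config.yaml\n# Head (probabilistic) and tail sampling are alternatives: enable one per pipeline.\nprocessors:\n  # Probabilistic sampling: keep 10% of traces in production\n  probabilistic_sampler:\n    hashing_seed: 22\n    sampling_percentage: 10\n\n  # Tail sampling: keep error traces, slow traces, and calls to selected models.\n  # The upstream Collector has no adaptive-sampling processor, so the final\n  # rate_limiting policy stands in for it by keeping a rate-limited baseline of\n  # the remaining traffic (the limit is in spans, not traces, per second).\n  tail_sampling:\n    decision_wait: 10s\n    num_traces: 100000\n    policies:\n      [\n        {\n          name: errors-policy,\n          type: status_code,\n          status_code: { status_codes: [ERROR] }\n        },\n        {\n          name: slow-requests-policy,\n          type: latency,\n          latency: { threshold_ms: 5000 }\n        },\n        {\n          name: llm-calls-policy,\n          type: string_attribute,\n          string_attribute:\n            { key: \"llm.model\", values: [\"gpt-4o\", \"claude-3-opus\"] }\n        },\n        {\n          name: baseline-rate-limit-policy,\n          type: rate_limiting,\n          rate_limiting: { spans_per_second: 100 }\n        }\n      ]",
      "section_ref": "38.3.3",
      "runnable": false,
      "dependencies": []
    },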
    {
      "id": "code-9",
      "language": "python",
      "description": "JSON structured logging with request context",
      "code": "# structured_logging.py\nimport json\nimport logging\nimport sys\nfrom datetime import datetime, timezone\nfrom contextvars import ContextVar\n\n# ContextVars carry the request context into every log record\nrequest_id_var: ContextVar[str] = ContextVar('request_id', default='')\nsession_id_var: ContextVar[str] = ContextVar('session_id', default='')\ntenant_id_var: ContextVar[str] = ContextVar('tenant_id', default='')\n\nclass JSONFormatter(logging.Formatter):\n    \"\"\"Formats log records as JSON.\"\"\"\n\n    def format(self, record):\n        log_entry = {\n            # Use the time the record was created, not the time it is formatted\n            \"timestamp\": datetime.fromtimestamp(\n                record.created, tz=timezone.utc\n            ).isoformat(),\n            \"level\": record.levelname,\n            \"logger\": record.name,\n            \"message\": record.getMessage(),\n            \"service\": \"agent-service\",\n            \"version\": \"2.1.0\",\n            # Inject the request context\n            \"request_id\": request_id_var.get(),\n            \"session_id\": session_id_var.get(),\n            \"tenant_id\": tenant_id_var.get(),\n        }\n\n        # Attach exception details\n        if record.exc_info:\n            log_entry[\"exception\"] = {\n                \"type\": record.exc_info[0].__name__,\n                \"message\": str(record.exc_info[1]),\n                \"traceback\": self.formatException(record.exc_info)\n            }\n\n        # Merge extra structured fields (these override the defaults above)\n        if hasattr(record, 'extra_fields'):\n            log_entry.update(record.extra_fields)\n\n        # default=str keeps non-JSON-serializable values from raising\n        return json.dumps(log_entry, ensure_ascii=False, default=str)\n\ndef setup_logging(level=logging.INFO):\n    \"\"\"Initialize the logging system.\"\"\"\n    handler = logging.StreamHandler(sys.stdout)\n    handler.setFormatter(JSONFormatter())\n\n    logger = logging.getLogger()\n    logger.handlers.clear()\n    logger.addHandler(handler)\n    logger.setLevel(level)\n\n    return logger\n\n# Usage example\nlogger = setup_logging()\n\n# Plain log entry\nlogger.info(\"Session created\", extra={\n    \"extra_fields\": {\n        \"session_id\": \"sess_abc123\",\n        \"agent_type\": \"rag_agent\",\n        \"user_tier\": \"pro\"\n    }\n})\n\n# Log entry with token usage\nlogger.info(\"LLM call completed\", extra={\n    \"extra_fields\": {\n        \"provider\": \"openai\",\n        \"model\": \"gpt-4o\",\n        \"prompt_tokens\": 1523,\n        \"completion_tokens\": 847,\n        \"total_tokens\": 2370,\n        \"cost_usd\": 0.047,\n        \"duration_ms\": 2340\n    }\n})",
      "section_ref": "38.4.1",
      "runnable": true,
      "dependencies": [
        "contextvars"
      ]
    },
    {
      "id": "code-10",
      "language": "yaml",
      "description": "Filebeat configuration for shipping container logs to Elasticsearch",
      "code": "# filebeat.yml\nfilebeat.inputs:\n  - type: container\n    paths:\n      - /var/log/containers/agent-*_*.log\n    processors:\n      - decode_json_fields:\n          fields: [\"message\"]\n          target: \"\"\n          overwrite_keys: true\n      - add_kubernetes_metadata:\n          host: ${NODE_NAME}\n          matchers:\n            - logs_path:\n                logs_path: \"/var/log/containers/\"\n\noutput.elasticsearch:\n  hosts: [\"elasticsearch:9200\"]\n  indices:\n    - index: \"agent-platform-%{[agent.version]}-%{+yyyy.MM.dd}\"\n\n# ILM (index lifecycle management)\nsetup.ilm.enabled: true\nsetup.ilm.rollover_alias: \"agent-platform\"\nsetup.ilm.pattern: \"{now/d}-000001\"\n\n# Log retention policy\nsetup.ilm.policy_name: \"agent-platform-policy\"",
      "section_ref": "38.4.3",
      "runnable": false,
      "dependencies": []
    },
    {
      "id": "code-11",
      "language": "yaml",
      "description": "Prometheus alert rules by severity level",
      "code": "# alerts/agent_platform.yml\ngroups:\n  - name: agent_platform_critical\n    rules:\n      # P0: service completely unavailable\n      - alert: ServiceDown\n        expr: up{job=~\"agent-.*\"} == 0\n        for: 1m\n        labels:\n          severity: P0\n          team: platform\n        annotations:\n          summary: \"Service {{ $labels.job }} is down\"\n          description: \"{{ $labels.instance }} has been unreachable for 1 minute\"\n          runbook: \"https://wiki.internal/runbooks/service-down\"\n\n      # P0: error rate spike\n      - alert: HighErrorRate\n        expr: |\n          (\n            sum(rate(agent_http_requests_total{status_code=~\"5..\"}[5m]))\n            / sum(rate(agent_http_requests_total[5m]))\n          ) > 0.1\n        for: 2m\n        labels:\n          severity: P0\n          team: platform\n        annotations:\n          summary: \"HTTP 5xx error rate above 10%\"\n          description: \"Current error rate: {{ $value | humanizePercentage }}\"\n\n  - name: agent_platform_llm\n    rules:\n      # P1: LLM provider failure\n      - alert: LLMProviderDown\n        expr: |\n          sum(rate(agent_llm_requests_total{status=\"error\"}[5m]))\n          by (provider) > 10\n        for: 3m\n        labels:\n          severity: P1\n          team: ai\n        annotations:\n          summary: \"LLM provider {{ $labels.provider }} is failing heavily\"\n          description: \"{{ $value }} failed requests per second\"\n          runbook: \"https://wiki.internal/runbooks/llm-provider-down\"\n\n      # P1: abnormal LLM latency\n      - alert: LLMLatencyHigh\n        expr: |\n          histogram_quantile(0.95,\n            sum(rate(agent_llm_request_duration_seconds_bucket[5m]))\n            by (le, provider, model)\n          ) > 30\n        for: 5m\n        labels:\n          severity: P1\n          team: ai\n        annotations:\n          summary: \"LLM P95 latency above 30 seconds\"\n          description: \"{{ $labels.provider }}/{{ $labels.model }} P95: {{ $value }}s\"\n\n      # P2: token consumption spike\n      - alert: TokenUsageSpike\n        expr: |\n          sum(rate(agent_llm_tokens_total[1h]))\n          > 3 * sum(rate(agent_llm_tokens_total[1h] offset 1h))\n        for: 30m\n        labels:\n          severity: P2\n          team: billing\n        annotations:\n          summary: \"Token consumption over the past hour is more than 3x the previous hour\"\n          description: \"Current consumption rate: {{ $value }} tokens/s\"\n\n  - name: agent_platform_resources\n    rules:\n      # P2: context window usage too high\n      - alert: ContextWindowNearFull\n        expr: agent_context_window_usage_ratio > 0.9\n        for: 10m\n        labels:\n          severity: P2\n          team: ai\n        annotations:\n          summary: \"Session context window is nearly full\"\n          description: \"Context usage: {{ $value | humanizePercentage }}\"\n\n      # P3: cache hit rate too low\n      - alert: LowCacheHitRate\n        expr: agent_embedding_cache_hit_rate < 0.5\n        for: 30m\n        labels:\n          severity: P3\n          team: platform\n        annotations:\n          summary: \"Embedding cache hit rate below 50%\"",
      "section_ref": "38.5.2",
      "runnable": false,
      "dependencies": []
    },
    {
      "id": "code-12",
      "language": "yaml",
      "description": "Alertmanager routing, grouping, and inhibition",
      "code": "# alertmanager.yml\nglobal:\n  resolve_timeout: 5m\n  smtp_smarthost: 'smtp.internal:587'\n  smtp_from: 'alerts@agent-platform.internal'\n\nroute:\n  receiver: 'default'\n  group_by: ['alertname', 'severity', 'team']\n  group_wait: 30s        # wait 30s to batch alerts of the same group\n  group_interval: 5m     # wait 5m before notifying about new alerts in a group\n  repeat_interval: 4h    # re-send a still-firing alert every 4h\n\n  routes:\n    # P0 alerts: notify immediately\n    - match:\n        severity: P0\n      receiver: 'emergency'\n      group_wait: 10s\n      repeat_interval: 5m\n\n    # P1 alerts: urgent notification\n    - match:\n        severity: P1\n      receiver: 'urgent'\n      group_wait: 30s\n      repeat_interval: 30m\n\n    # AI team alerts (P2/P3 only; P0/P1 are matched above)\n    - match:\n        team: ai\n      receiver: 'ai-team'\n\n    # Platform team alerts (P2/P3 only)\n    - match:\n        team: platform\n      receiver: 'platform-team'\n\n# Inhibition: while a P0 alert fires, mute P2/P3 alerts of the same service.\n# Matching on 'job' (the service label) rather than 'alertname', because alerts\n# that share an alertname also share a severity and would never be inhibited.\ninhibit_rules:\n  - source_match:\n      severity: P0\n    target_match_re:\n      severity: 'P2|P3'\n    equal: ['job']\n\nreceivers:\n  - name: 'default'\n    webhook_configs:\n      - url: 'http://alert-gateway:8080/webhook/default'\n\n  - name: 'emergency'\n    webhook_configs:\n      - url: 'http://alert-gateway:8080/webhook/emergency'\n    # Phone alert integration\n    # pushover_configs:\n    #   - user_key: 'xxx'\n    #     token: 'xxx'\n    #     priority: 2  # emergency priority\n\n  - name: 'urgent'\n    webhook_configs:\n      - url: 'http://alert-gateway:8080/webhook/urgent'\n\n  - name: 'ai-team'\n    webhook_configs:\n      - url: 'http://alert-gateway:8080/webhook/ai-team'\n\n  - name: 'platform-team'\n    webhook_configs:\n      - url: 'http://alert-gateway:8080/webhook/platform-team'",
      "section_ref": "38.5.3",
      "runnable": false,
      "dependencies": []
    },
    {
      "id": "code-13",
      "language": "json",
      "description": "1. System Overview Dashboard",
      "code": "{\n  \"dashboard\": {\n    \"title\": \"Agent Platform Overview\",\n    \"panels\": [\n      {\n        \"title\": \"Request QPS\",\n        \"type\": \"timeseries\",\n        \"targets\": [\n          {\n            \"expr\": \"sum(rate(agent_http_requests_total[5m]))\",\n            \"legendFormat\": \"Total QPS\"\n          }\n        ]\n      },\n      {\n        \"title\": \"P95 Latency\",\n        \"type\": \"timeseries\",\n        \"targets\": [\n          {\n            \"expr\": \"histogram_quantile(0.95, sum(rate(agent_http_request_duration_seconds_bucket[5m])) by (le))\",\n            \"legendFormat\": \"P95\"\n          }\n        ]\n      },\n      {\n        \"title\": \"Error Rate\",\n        \"type\": \"stat\",\n        \"targets\": [\n          {\n            \"expr\": \"sum(rate(agent_http_requests_total{status_code=~\\\"5..\\\"}[5m])) / sum(rate(agent_http_requests_total[5m]))\",\n            \"legendFormat\": \"Error Rate\"\n          }\n        ],\n        \"thresholds\": {\n          \"steps\": [\n            {\"color\": \"green\", \"value\": 0},\n            {\"color\": \"yellow\", \"value\": 0.01},\n            {\"color\": \"red\", \"value\": 0.05}\n          ]\n        }\n      },\n      {\n        \"title\": \"In-flight Requests\",\n        \"type\": \"stat\",\n        \"targets\": [\n          {\n            \"expr\": \"sum(agent_http_requests_in_flight)\",\n            \"legendFormat\": \"In-flight Requests\"\n          }\n        ]\n      }\n    ]\n  }\n}",
      "section_ref": "38.6.1",
      "runnable": false,
      "dependencies": []
    },
    {
      "id": "code-14",
      "language": "yaml",
      "description": "SLO definitions",
      "code": "# SLO definitions\nslos:\n  - name: \"API availability\"\n    target: 99.95\n    description: \"Success rate of Agent API requests\"\n    indicator: SLI\n    sli:\n      metric: |\n        sum(rate(agent_http_requests_total{status_code!~\"5..\"}[30d]))\n        / sum(rate(agent_http_requests_total[30d]))\n    error_budget:\n      total: 0.0005  # 0.05% allowed failure rate\n      minutes_per_30d: 21.6  # 30 d x 24 h x 60 min x 0.0005 = 21.6 minutes per 30-day window\n\n  - name: \"LLM call latency\"\n    target: 95\n    description: \"Percentage of LLM calls that complete within 10 seconds\"\n    indicator: SLI\n    sli:\n      # The Python client exposes this bucket boundary as le=\"10.0\"\n      metric: |\n        sum(rate(agent_llm_request_duration_seconds_bucket{le=\"10.0\"}[30d]))\n        / sum(rate(agent_llm_request_duration_seconds_count[30d]))\n    window: 30d\n\n  - name: \"Conversation completion rate\"\n    target: 99.0\n    description: \"Conversations completed successfully as a percentage of conversations started\"\n    indicator: SLI\n    sli:\n      # agent_sessions_completed_total must be instrumented separately;\n      # it is not among the metrics defined in metrics_definitions.py\n      metric: |\n        sum(rate(agent_sessions_completed_total[30d]))\n        / sum(rate(agent_sessions_total[30d]))",
      "section_ref": "38.6.2",
      "runnable": false,
      "dependencies": []
    },
    {
      "id": "code-15",
      "language": "mermaid",
      "description": "",
      "code": "graph LR\n    SLA[SLA<br/>Service Level Agreement<br/>External commitment] --> SLO[SLO<br/>Service Level Objective<br/>Internal target]\n    SLO --> SLI[SLI<br/>Service Level Indicator<br/>Measurable data]\n\n    SLA -.->|Legally binding| Customer[Customer]\n    SLO -.->|Engineering target| Team[Engineering team]\n    SLI -.->|Monitoring data| Dashboard[Grafana]",
      "section_ref": "38.7.1",
      "runnable": false,
      "dependencies": []
    },
    {
      "id": "code-16",
      "language": "python",
      "description": "Error budget calculator: derives the allowed error rate and downtime from an SLO target and reports how much of the budget remains",
      "code": "# error_budget.py\nfrom dataclasses import dataclass\n\n@dataclass\nclass ErrorBudget:\n    \"\"\"Error budget management\"\"\"\n    slo_target: float        # e.g., 0.9995 (99.95%)\n    window_days: int = 30\n    name: str = \"\"\n\n    @property\n    def error_rate_budget(self) -> float:\n        \"\"\"Allowed error rate\"\"\"\n        return 1 - self.slo_target\n\n    @property\n    def error_budget_minutes(self) -> float:\n        \"\"\"Allowed minutes of downtime within the window\"\"\"\n        return self.window_days * 24 * 60 * (1 - self.slo_target)\n\n    def remaining_budget(self, total_requests: int, failed_requests: int) -> dict:\n        \"\"\"Compute the remaining error budget\"\"\"\n        actual_error_rate = failed_requests / total_requests if total_requests > 0 else 0\n        # Fraction of the budget already spent (1.0 means fully consumed)\n        budget_consumed = actual_error_rate / self.error_rate_budget\n        budget_remaining = max(0, 1 - budget_consumed)\n\n        return {\n            \"slo_name\": self.name,\n            \"slo_target\": f\"{self.slo_target * 100:.2f}%\",\n            \"error_budget_total_minutes\": round(self.error_budget_minutes, 1),\n            \"budget_consumed_pct\": round(budget_consumed * 100, 2),\n            \"budget_remaining_pct\": round(budget_remaining * 100, 2),\n            \"actual_error_rate\": f\"{actual_error_rate * 100:.4f}%\",\n            \"recommendation\": self._get_recommendation(budget_remaining)\n        }\n\n    def _get_recommendation(self, remaining: float) -> str:\n        \"\"\"Recommend an action based on the remaining budget fraction\"\"\"\n        if remaining > 0.5:\n            return \"🟢 Budget healthy; new feature releases can proceed as normal\"\n        elif remaining > 0.2:\n            return \"🟡 More than half the budget is consumed; prioritize reliability fixes\"\n        elif remaining > 0:\n            return \"🔴 Budget nearly exhausted; pause feature releases and focus on reliability\"\n        else:\n            return \"⛔ Budget exhausted; only emergency fixes are allowed\"\n\n# Usage example\nbudget = ErrorBudget(slo_target=0.9995, window_days=30, name=\"API availability\")\nresult = budget.remaining_budget(total_requests=10_000_000, failed_requests=800)\n# 800 / 10,000,000 = 0.008% errors against a 0.05% budget: about 16% consumed,\n# 84% remaining, so the green (budget healthy) recommendation is returned",
      "section_ref": "38.7.2",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-17",
      "language": "python",
      "description": "",
      "code": "# slo_report.py\n\"\"\"Monthly SLO report generator\"\"\"\n\n# Keys of sli_data that carry report metadata (lists), not SLO entries\nRESERVED_KEYS = (\"incidents\", \"actions\")\n\ndef generate_monthly_slo_report(month: str, sli_data: dict) -> str:\n    \"\"\"Generate the monthly SLO report\"\"\"\n    report = f\"\"\"\n# Monthly SLO Report - {month}\n\n## 1. Overview\n\n| SLO | Target | Actual | Status | Error budget remaining |\n|-----|--------|--------|--------|------------------------|\n\"\"\"\n    for slo_name, data in sli_data.items():\n        if slo_name in RESERVED_KEYS:\n            continue  # skip metadata so data[\"target\"] is never looked up on a list\n        target = data[\"target\"]\n        actual = data[\"actual\"]\n        met = actual >= target\n        status = \"✅ Met\" if met else \"❌ Missed\"\n        budget_remaining = data.get(\"budget_remaining\", 0)\n\n        report += (\n            f\"| {slo_name} | {target}% | {actual}% | \"\n            f\"{status} | {budget_remaining}% |\\n\"\n        )\n\n    report += \"\"\"\n## 2. Key Incidents\n\n\"\"\"\n    for event in sli_data.get(\"incidents\", []):\n        report += f\"- **{event['date']}** {event['description']} \"\n        report += f\"(impact duration: {event['duration']}, root cause: {event['root_cause']})\\n\"\n\n    report += \"\"\"\n## 3. Improvement Actions\n\n\"\"\"\n    for action in sli_data.get(\"actions\", []):\n        report += f\"- [ ] {action}\\n\"\n\n    return report",
      "section_ref": "38.7.3",
      "runnable": true,
      "dependencies": []
    }
  ],
  "tables": [
    {
      "headers": [
        "Pillar",
        "Question answered",
        "Tools",
        "Data format"
      ],
      "data": [
        [
          "Logs",
          "What happened?",
          "ELK / Loki",
          "Unstructured/structured text"
        ],
        [
          "Metrics",
          "How much / how fast?",
          "Prometheus / VictoriaMetrics",
          "Time series data"
        ],
        [
          "Traces",
          "Where did it happen?",
          "Jaeger / Zipkin / Tempo",
          "Span tree"
        ]
      ]
    },
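    {
      "headers": [
        "Level",
        "Purpose",
        "Example"
      ],
      "data": [
        [
          "ERROR",
          "Functional failures that affect users",
          "Consecutive LLM call failures, database write failures"
        ],
        [
          "WARN",
          "Anomalies that may degrade the user experience",
          "LLM response timeout (retry succeeded), high cache miss rate"
        ],
        [
          "INFO",
          "Key business events",
          "Session created, message sent, tool invoked"
        ],
        [
          "DEBUG",
          "Debugging information",
          "Request parameters, internal decision process"
        ],
        [
          "TRACE",
          "Detailed tracing",
          "Intermediate result of every step (development environments only)"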
        ]
      ]
    },
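    {
      "headers": [
        "Level",
        "Response time",
        "Notification channels",
        "Example"
      ],
      "data": [
        [
          "P0 - Emergency",
          "Within 5 minutes",
          "Phone call + SMS + IM",
          "Service completely unavailable"
        ],
        [
          "P1 - Critical",
          "Within 15 minutes",
          "SMS + IM",
          "Complete LLM provider outage"
        ],
        [
          "P2 - Warning",
          "Within 1 hour",
          "IM + email",
          "Error rate elevated but service not interrupted"
        ],
        [
          "P3 - Notice",
          "During business hours",
          "Email",
          "Disk usage > 80%"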
        ]
      ]
    },
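    {
      "headers": [
        "Concept",
        "Definition",
        "Example"
      ],
      "data": [
        [
          "SLI (Service Level Indicator)",
          "Quantifiable measure of service quality",
          "Request success rate, P95 latency"
        ],
        [
          "SLO (Service Level Objective)",
          "Target value set on an SLI",
          "API availability ≥ 99.95%"
        ],
        [
          "SLA (Service Level Agreement)",
          "Formal agreement with the customer",
          "Monthly availability SLA with compensation for breaches"
        ],
        [
          "Error Budget",
          "Amount of failure the SLO allows",
          "21.6 minutes of downtime allowed per 30 days (at a 99.95% SLO)"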
        ]
      ]
    }
  ],
  "key_takeaways": [],
  "common_pitfalls": [],
  "related_chapters": [
    "ch15",
    "ch36"
  ]
}