{
  "metadata": {
    "id": "ch37",
    "title": "Chapter 37: Scalability and High Availability",
    "volume": "vol10",
    "volume_title": "Production-Grade Agent Platforms",
    "word_count": 2636,
    "difficulty": "advanced",
    "prerequisites": [
      "ch36"
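    ],
    "key_concepts": [
      "Overview: why high availability matters especially for Agent systems",
      "Defining availability targets",
      "Horizontal scaling strategies",
      "Stateless design principles",
      "Special handling for stateful services",
      "Database scaling",
      "Load balancing",
      "Multi-layer load balancing",
      "Choosing a load-balancing algorithm",
      "Health check configuration",
      "Autoscaling",
      "HPA configuration",
      "Predictive scaling",
      "Scaling policy matrix",
      "Service degradation and circuit breaking"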
    ],
    "learning_objectives": [],
    "estimated_tokens": 1582,
    "source_file": "vol10/ch37_可扩展性与高可用.md"
  },
  "overview": "",
  "sections": [
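    {
      "id": "37.1",
      "title": "37.1 Overview: Why High Availability Matters Especially for Agent Systems",
      "level": 2,
      "content": "Agent systems differ from traditional web applications in one fundamental way: they depend on external AI model services (such as OpenAI or Anthropic). Your system therefore has to withstand not only failures in its own infrastructure but also the unpredictability of upstream providers. A typical Agent request may touch 3-5 internal services plus 1-2 external LLM calls, and a failure at any one of these hops degrades the user experience.\n\nThis chapter works through how to build a scalable, highly available Agent platform.",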
      "subsections": [
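        {
          "id": "37.1.1",
          "title": "37.1.1 Defining Availability Targets",
          "content": "| Tier | Availability | Annual downtime | Typical use |\n|------|--------------|-----------------|-------------|\n| Basic | 99% | 3.65 days | Internal tools, development environments |\n| Standard | 99.9% | 8.76 hours | General business systems |\n| High availability | 99.95% | 4.38 hours | Core business systems |\n| Very high availability | 99.99% | 52.56 minutes | Financial-grade Agent platforms |\n\n**Recommended target for an Agent platform: 99.95% (under 22 minutes of downtime per month)**"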
        }
      ]
    },
    {
      "id": "37.2",
      "title": "37.2 Horizontal Scaling Strategies",
      "level": 2,
      "content": "",
      "subsections": [
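        {
          "id": "37.2.1",
          "title": "37.2.1 Stateless Design Principles",
          "content": "Horizontal scaling presupposes that services are stateless. All session state and user context should live in external stores (Redis, a database), not in the memory of the service process."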
        },
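        {
          "id": "37.2.2",
          "title": "37.2.2 Special Handling for Stateful Services",
          "content": "Some services are inherently stateful and need their own scaling strategies:\n\n| Service | State held | Scaling strategy |\n|----------|----------|----------|\n| Redis | Cached data | Redis Cluster (sharding + replicas) |\n| PostgreSQL | Persistent data | Read/write splitting + database and table sharding |\n| Vector database | Embeddings | Milvus distributed cluster |\n| WebSocket | Connection state | Sticky sessions + Pub/Sub |"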
        },
        {
          "id": "37.2.3",
          "title": "37.2.3 Database Scaling",
          "content": "**Read/write splitting configuration:**\n\n\n**Read/write splitting implementation (Python):**"
        }
      ]
    },
    {
      "id": "37.3",
      "title": "37.3 Load Balancing",
      "level": 2,
      "content": "",
      "subsections": [
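        {
          "id": "37.3.1",
          "title": "37.3.1 Multi-Layer Load Balancing",
          "content": "| Layer | Technology | Responsibility |\n|------|------|------|\n| L4 | Nginx / AWS NLB | TCP connection distribution, TLS termination |\n| L7 | Nginx / AWS ALB | HTTP routing, gRPC proxying, rate limiting |\n| Intra-cluster | Envoy (L7) / K8s Service (L4) | Load balancing between services inside the cluster |"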
        },
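        {
          "id": "37.3.2",
          "title": "37.3.2 Choosing a Load-Balancing Algorithm",
          "content": "| Algorithm | Best for | Strength | Weakness |\n|------|----------|------|------|\n| Round Robin | Homogeneous services | Simple and fair | Ignores load differences between instances |\n| Weighted Round Robin | Heterogeneous instances | Accounts for instance capacity | Weights must be tuned by hand |\n| Least Connections | Long-lived connections | Balances dynamically | Requires tracking connection counts |\n| Consistent Hashing | Stateful services | Session affinity | Load can be uneven when keys are skewed |\n| Adaptive | Mixed workloads | Adapts to real-time load | Complex to implement |\n\n**Recommendation**: use Round Robin at L4, and Least Connections with health checks at L7."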
        },
        {
          "id": "37.3.3",
          "title": "37.3.3 Health Check Configuration",
          "content": ""
        }
      ]
    },
    {
      "id": "37.4",
      "title": "37.4 Autoscaling",
      "level": 2,
      "content": "",
      "subsections": [
        {
          "id": "37.4.1",
          "title": "37.4.1 HPA Configuration",
          "content": "The Kubernetes Horizontal Pod Autoscaler (HPA) is the most common autoscaling mechanism:"
        },
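        {
          "id": "37.4.2",
          "title": "37.4.2 Predictive Scaling",
          "content": "Simple threshold triggers only react after load has already risen. Predictive scaling based on historical data can add capacity ahead of a traffic peak:"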
        },
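        {
          "id": "37.4.3",
          "title": "37.4.3 Scaling Policy Matrix",
          "content": "| Metric | Scale-up trigger | Scale-down trigger | Cooldown |\n|------|----------|----------|--------|\n| CPU > 70% | Immediately | CPU < 30% for 5 minutes | Scale-up 60s / scale-down 300s |\n| Concurrent requests > 100/pod | Immediately | < 50/pod for 5 minutes | Same as above |\n| Queue backlog > 50 | Immediately | < 10 for 10 minutes | Scale-up 30s / scale-down 600s |\n| Predicted traffic for the next hour > 80% of capacity | 30 minutes ahead | - | - |"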
        }
      ]
    },
    {
      "id": "37.5",
      "title": "37.5 Service Degradation and Circuit Breaking",
      "level": 2,
      "content": "",
      "subsections": [
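        {
          "id": "37.5.1",
          "title": "37.5.1 Designing a Degradation Strategy",
          "content": "When the system faces abnormal traffic or a dependency fails, degradation is the last line of defense for keeping core functionality available.\n\n\n**Degradation levels:**\n\n| Level | Trigger | Degraded behavior | User experience |\n|------|----------|----------|----------|\n| L0 | Normal | Full functionality | ★★★★★ |\n| L1 | LLM response timeout | Switch to a backup model | ★★★★☆ |\n| L2 | All LLMs unavailable | Return cached responses | ★★★☆☆ |\n| L3 | RAG service unavailable | Skip retrieval, plain conversation only | ★★★☆☆ |\n| L4 | All dependencies failed | Static maintenance page | ★★☆☆☆ |"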
        },
        {
          "id": "37.5.2",
          "title": "37.5.2 Circuit Breaker Implementation",
          "content": ""
        },
        {
          "id": "37.5.3",
          "title": "37.5.3 Timeout and Retry Policies",
          "content": ""
        }
      ]
    },
    {
      "id": "37.6",
      "title": "37.6 Multi-Region Deployment",
      "level": 2,
      "content": "",
      "subsections": [
        {
          "id": "37.6.1",
          "title": "37.6.1 Deployment Architecture",
          "content": ""
        },
        {
          "id": "37.6.2",
          "title": "37.6.2 Cross-Region Data Synchronization",
          "content": ""
        }
      ]
    },
    {
      "id": "37.7",
      "title": "37.7 Disaster Recovery",
      "level": 2,
      "content": "",
      "subsections": [
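        {
          "id": "37.7.1",
          "title": "37.7.1 Disaster Recovery Tiers",
          "content": "| Tier | RPO (data loss) | RTO (recovery time) | Cost | Approach |\n|------|-----------------|-----------------|------|------|\n| Cold standby | 24 hours | 4-24 hours | Low | Periodic backups + off-site storage |\n| Warm standby | 1 hour | 1-4 hours | Medium | Asynchronous replication + standby instances |\n| Hot standby | Minutes | Minutes | High | Synchronous replication + automatic failover |\n| Multi-active | Near 0 | Near 0 (switchover in seconds) | Very high | Multi-region active-active |\n\n**Recommendation**: an Agent platform should adopt the warm standby tier, with RPO < 1 hour for core data and an RTO at the low end of that tier's range (about 1 hour)."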
        },
        {
          "id": "37.7.2",
          "title": "37.7.2 Backup Strategy",
          "content": ""
        },
        {
          "id": "37.7.3",
          "title": "37.7.3 Failover Drills",
          "content": ""
        }
      ]
    },
    {
      "id": "37.8",
      "title": "37.8 Chaos Engineering",
      "level": 2,
      "content": "",
      "subsections": [
        {
          "id": "37.8.1",
          "title": "37.8.1 Chaos Monkey in Practice",
          "content": ""
        }
      ]
    },
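    {
      "id": "37.9",
      "title": "37.9 Chapter Summary",
      "level": 2,
      "content": "This chapter walked through the scalability and high-availability strategies for an Agent platform:\n\n1. **Stateless design**: the foundation of horizontal scaling; all state is externalized\n2. **Load balancing**: a multi-layer LB architecture with layered health checks\n3. **Autoscaling**: HPA combined with predictive scaling\n4. **Degradation and circuit breaking**: tiered degradation levels plus the circuit breaker pattern\n5. **Multi-region deployment**: cross-region traffic routing and data synchronization\n6. **Disaster recovery**: backup strategy, failover drills, and chaos engineering\n\nHigh availability is not a one-time goal but a continuous process: a closed loop of design, implementation, testing, and drills. The next chapter covers monitoring and alerting, because availability without monitoring is only blind confidence.",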
      "subsections": []
    }
  ],
  "code_blocks": [
    {
      "id": "code-1",
      "language": "python",
      "description": "Horizontal scaling presupposes that services are stateless. All session state and user context should live in external stores (Redis, a database), not in the memory of the service process.",
      "code": "import json\n\n# ❌ Wrong: stateful design\nclass ChatService:\n    def __init__(self):\n        self.sessions = {}  # state lives in process memory\n\n    def chat(self, session_id, message):\n        if session_id not in self.sessions:\n            self.sessions[session_id] = []\n        self.sessions[session_id].append(message)\n        # Problem: state is not shared across instances\n\n# ✅ Right: stateless design\nclass ChatService:\n    def __init__(self, redis_client, db):\n        self.redis = redis_client  # an async Redis client is assumed\n        self.db = db\n\n    async def chat(self, session_id, message):\n        # State lives in Redis, so every instance can read it\n        context = await self.redis.get(f\"ctx:{session_id}\")\n        context = json.loads(context) if context else []\n        context.append(message)\n        # Note: this read-modify-write is not atomic across concurrent requests\n        await self.redis.setex(\n            f\"ctx:{session_id}\", 3600, json.dumps(context)\n        )\n        return context",
      "section_ref": "37.2.1",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-2",
      "language": "yaml",
      "description": "Read/write splitting configuration:",
      "code": "# database_config.yaml\nprimary:\n  host: pg-primary.internal\n  port: 5432\n  pool_size: 20\n\nreplicas:\n  - host: pg-replica-1.internal\n    port: 5432\n    pool_size: 30\n    weight: 1  # read weight\n  - host: pg-replica-2.internal\n    port: 5432\n    pool_size: 30\n    weight: 1\n\nrouting_rules:\n  # Writes go to the primary\n  write: primary\n  # Reads go to the replicas (load balanced)\n  read: replicas\n  # Strongly consistent reads (a read immediately after a write)\n  read_after_write: primary\n  # Every statement inside a transaction goes to the primary\n  transaction: primary",
      "section_ref": "37.2.3",
      "runnable": false,
      "dependencies": []
    },
    {
      "id": "code-3",
      "language": "python",
      "description": "Read/write splitting implementation (Python):",
      "code": "# db_router.py\nimport threading\nfrom contextvars import ContextVar\n\nfrom psycopg_pool import AsyncConnectionPool\n\n# ContextVar tracks the routing decision of the current thread/coroutine\n_route_context: ContextVar[str] = ContextVar('db_route', default='read')\n\nclass DatabaseRouter:\n    \"\"\"Read/write splitting router\"\"\"\n\n    def __init__(self, primary_config, replica_configs):\n        # Each config is assumed to be a dict of AsyncConnectionPool keyword\n        # arguments, e.g. {\"conninfo\": \"postgresql://...\"}\n        self.primary_pool = AsyncConnectionPool(\n            **primary_config, min_size=5, max_size=20, open=False\n        )\n        self.replica_pools = [\n            AsyncConnectionPool(\n                **cfg, min_size=5, max_size=15, open=False\n            ) for cfg in replica_configs\n        ]\n        self._replica_index = 0\n        self._lock = threading.Lock()\n\n    async def open(self):\n        \"\"\"Open every pool; call once at startup\"\"\"\n        for pool in [self.primary_pool, *self.replica_pools]:\n            await pool.open()\n\n    def get_connection(self):\n        \"\"\"Return a connection context manager chosen by the routing rules\"\"\"\n        route = _route_context.get()\n\n        # Writes, transactions and read-after-write all use the primary\n        # (a read right after a write must see the written data)\n        if route in ('write', 'transaction', 'read_after_write'):\n            return self.primary_pool.connection()\n\n        # Ordinary reads rotate through the replicas\n        with self._lock:\n            idx = self._replica_index\n            self._replica_index = (\n                (idx + 1) % len(self.replica_pools)\n            )\n        return self.replica_pools[idx].connection()\n\n    async def execute_write(self, query, params=None):\n        \"\"\"Run a write statement and return the affected row count\"\"\"\n        token = _route_context.set('write')\n        try:\n            async with self.get_connection() as conn:\n                cur = await conn.execute(query, params)\n                return cur.rowcount\n        finally:\n            _route_context.reset(token)\n\n    async def execute_read(self, query, params=None):\n        \"\"\"Run a read query and return all rows\"\"\"\n        token = _route_context.set('read')\n        try:\n            async with self.get_connection() as conn:\n                cur = await conn.execute(query, params)\n                return await cur.fetchall()\n        finally:\n            _route_context.reset(token)\n\n# Usage example (the configs, session_id and message come from the caller)\nasync def main(primary_config, replica_configs, session_id, message):\n    db = DatabaseRouter(primary_config, replica_configs)\n    await db.open()\n\n    # Writes go to the primary automatically\n    await db.execute_write(\n        \"INSERT INTO messages (session_id, role, content) VALUES (%s, %s, %s)\",\n        (session_id, 'user', message)\n    )\n\n    # Reads go to a replica automatically\n    rows = await db.execute_read(\n        \"SELECT * FROM messages WHERE session_id = %s ORDER BY created_at\",\n        (session_id,)\n    )\n    return rows",
      "section_ref": "37.2.3",
      "runnable": true,
      "dependencies": [
        "threading",
        "contextvars",
        "psycopg",
        "psycopg_pool"
      ]
    },
    {
      "id": "code-4",
      "language": "mermaid",
      "description": "",
      "code": "graph TB\n    Client[Client] --> L4[L4 Load Balancer<br/>TCP/UDP]\n    L4 --> L7[L7 Load Balancer<br/>HTTP/gRPC]\n    L7 --> SvcA[Service A<br/>Pod 1-3]\n    L7 --> SvcB[Service B<br/>Pod 1-5]\n    L7 --> SvcC[Service C<br/>Pod 1-2]\n\n    style L4 fill:#e3f2fd\n    style L7 fill:#fff3e0",
      "section_ref": "37.3.1",
      "runnable": false,
      "dependencies": []
    },
    {
      "id": "code-5",
      "language": "yaml",
      "description": "Health check configuration for the load balancer and the service endpoints:",
      "code": "# health_check_config.yaml\nhealth_checks:\n  # L7 health checks\n  agent_service:\n    http:\n      path: /health\n      interval: 5s\n      timeout: 3s\n      healthy_threshold: 2\n      unhealthy_threshold: 3\n\n    # Deep health check (includes dependency checks)\n    deep:\n      path: /health/deep\n      interval: 30s\n      timeout: 10s\n      # A failed deep check does not affect traffic (alert only)\n      affect_routing: false\n\n# Server-side health check endpoints\nhealth_check_endpoints:\n  /health:\n    # Shallow check: process status only\n    checks:\n      - type: process\n        description: \"Process is alive\"\n\n  /health/ready:\n    # Readiness check: are core dependencies reachable\n    checks:\n      - type: database\n        description: \"Database connection\"\n        timeout: 2s\n      - type: redis\n        description: \"Redis connection\"\n        timeout: 1s\n\n  /health/deep:\n    # Deep check: every dependency\n    checks:\n      - type: database\n        description: \"Database connection\"\n        query: \"SELECT 1\"\n        timeout: 2s\n      - type: redis\n        description: \"Redis connection\"\n        timeout: 1s\n      - type: vector_db\n        description: \"Vector database connection\"\n        timeout: 3s\n      - type: llm_provider\n        description: \"LLM provider reachability\"\n        timeout: 5s",
      "section_ref": "37.3.3",
      "runnable": false,
      "dependencies": []
    },
    {
      "id": "code-6",
      "language": "yaml",
      "description": "The Kubernetes Horizontal Pod Autoscaler (HPA) is the most common autoscaling mechanism:",
      "code": "# hpa-config.yaml\napiVersion: autoscaling/v2\nkind: HorizontalPodAutoscaler\nmetadata:\n  name: agent-service-hpa\n  namespace: agent-platform\nspec:\n  scaleTargetRef:\n    apiVersion: apps/v1\n    kind: Deployment\n    name: agent-service\n  minReplicas: 3\n  maxReplicas: 50\n  behavior:\n    scaleUp:\n      stabilizationWindowSeconds: 60   # scale-up stabilization window\n      policies:\n      - type: Pods\n        value: 4                       # allow up to 4 more Pods per period\n        periodSeconds: 60\n      - type: Percent\n        value: 100                     # or up to 100% of the current replica count\n        periodSeconds: 60\n      selectPolicy: Max                # apply whichever policy allows the larger change\n    scaleDown:\n      stabilizationWindowSeconds: 300  # scale-down stabilization window (5 minutes)\n      policies:\n      - type: Pods\n        value: 2                       # remove at most 2 Pods per period\n        periodSeconds: 120\n  metrics:\n  # CPU utilization\n  - type: Resource\n    resource:\n      name: cpu\n      target:\n        type: Utilization\n        averageUtilization: 70\n\n  # Custom metric: in-flight requests per Pod\n  - type: Pods\n    pods:\n      metric:\n        name: http_requests_in_flight\n      target:\n        type: AverageValue\n        averageValue: \"100\"\n\n  # External metric: message queue backlog\n  - type: External\n    external:\n      metric:\n        name: rabbitmq_queue_messages_ready\n        selector:\n          matchLabels:\n            queue: agent.tasks\n      target:\n        type: AverageValue\n        averageValue: \"50\"",
      "section_ref": "37.4.1",
      "runnable": false,
      "dependencies": []
    },
    {
      "id": "code-7",
      "language": "python",
      "description": "Simple threshold triggers only react after load has already risen. Predictive scaling based on historical data can add capacity ahead of a traffic peak:",
      "code": "# predictive_scaler.py\nimport numpy as np\nfrom datetime import datetime, timedelta\nfrom typing import List, Tuple\n\nclass PredictiveScaler:\n    \"\"\"Predictive autoscaling based on time-series forecasting\"\"\"\n\n    def __init__(self, history_days: int = 14):\n        self.history_days = history_days\n\n    def predict_load(self, historical_data: List[Tuple[datetime, float]]\n                     ) -> List[Tuple[datetime, float]]:\n        \"\"\"Forecast load for the next 24 hours\"\"\"\n        # Blend a linear trend with an hour-of-day seasonal average.\n        # Historical timestamps are assumed to be naive UTC datetimes,\n        # matching datetime.utcnow() below.\n        timestamps = np.array([d.timestamp() for d, _ in historical_data])\n        values = np.array([v for _, v in historical_data])\n\n        predictions = []\n        now = datetime.utcnow()\n\n        for hour in range(24):\n            future_time = now + timedelta(hours=hour)\n\n            # 1. Trend component (linear regression)\n            trend = self._compute_trend(timestamps, values, future_time)\n\n            # 2. Seasonal component (historical average for the same hour)\n            cycle = self._compute_cycle(historical_data, future_time)\n\n            # 3. Forecast = weighted blend of trend (0.4) and cycle (0.6)\n            predicted = trend * 0.4 + cycle * 0.6\n            predictions.append((future_time, max(predicted, 0)))\n\n        return predictions\n\n    def recommend_replicas(\n        self,\n        predictions: List[Tuple[datetime, float]],\n        target_qps_per_pod: float = 500,\n        safety_margin: float = 1.3\n    ) -> int:\n        \"\"\"Recommend a replica count from the forecast\"\"\"\n        if not predictions:\n            return 3  # default minimum\n\n        peak_predicted = max(v for _, v in predictions)\n        recommended = int(np.ceil(\n            peak_predicted * safety_margin / target_qps_per_pod\n        ))\n        return max(3, recommended)  # never fewer than 3 replicas\n\n    def _compute_trend(self, timestamps, values, future_time):\n        \"\"\"Linear trend extrapolation\"\"\"\n        if len(timestamps) < 2:\n            return np.mean(values)\n\n        coeffs = np.polyfit(timestamps, values, 1)\n        future_ts = future_time.timestamp()\n        return coeffs[0] * future_ts + coeffs[1]\n\n    def _compute_cycle(self, historical_data, future_time):\n        \"\"\"Seasonal component (weighted average of same-hour history)\"\"\"\n        future_hour = future_time.hour\n\n        # Keep only history from the same hour of day\n        matching = []\n        for dt, val in historical_data:\n            if dt.hour == future_hour:\n                # More recent data gets a higher weight\n                age_days = (datetime.utcnow() - dt).days\n                weight = 1.0 / (1 + age_days)\n                matching.append((val, weight))\n\n        if not matching:\n            return np.mean([v for _, v in historical_data])\n\n        values = np.array([v for v, w in matching])\n        weights = np.array([w for v, w in matching])\n        return np.average(values, weights=weights)",
      "section_ref": "37.4.2",
      "runnable": true,
      "dependencies": [
        "numpy"
      ]
    },
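    {
      "id": "code-8",
      "language": "mermaid",
      "description": "When the system faces abnormal traffic or a dependency fails, degradation is the last line of defense for keeping core functionality available.",
      "code": "graph TD\n    Request[User request] --> Check{Dependency check}\n    Check -->|All healthy| FullService[Full functionality]\n    Check -->|LLM unavailable| FallbackLLM[\"Degrade: backup model\"]\n    Check -->|RAG unavailable| FallbackRAG[\"Degrade: plain conversation mode\"]\n    Check -->|Tool service unavailable| FallbackTool[\"Degrade: conversation only\"]\n    Check -->|All dependencies unavailable| StaticResponse[\"Static response + friendly notice\"]",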
      "section_ref": "37.5.1",
      "runnable": false,
      "dependencies": []
    },
    {
      "id": "code-9",
      "language": "python",
      "description": "Circuit breaker implementation:",
      "code": "# circuit_breaker.py\nimport asyncio\nimport time\nimport threading\nfrom enum import Enum\nfrom typing import Callable, Any, Optional\nfrom dataclasses import dataclass\nfrom collections import deque\n\nclass CircuitState(Enum):\n    CLOSED = \"closed\"       # normal operation\n    OPEN = \"open\"           # tripped (requests rejected)\n    HALF_OPEN = \"half_open\" # probing for recovery\n\n@dataclass\nclass CircuitConfig:\n    failure_threshold: int = 5       # consecutive failures that trip the breaker\n    recovery_timeout: float = 30.0   # seconds to stay open before probing\n    half_open_max_calls: int = 3     # max probe requests while half-open\n    success_threshold: int = 2       # half-open successes needed to close\n    timeout: float = 10.0            # per-call timeout (enforced by call_async)\n\nclass CircuitOpenError(Exception):\n    \"\"\"Raised when the breaker is open\"\"\"\n\nclass CircuitBreaker:\n    \"\"\"Circuit breaker\"\"\"\n\n    def __init__(self, name: str, config: Optional[CircuitConfig] = None):\n        self.name = name\n        self.config = config or CircuitConfig()  # avoid a shared default instance\n        self._state = CircuitState.CLOSED\n        self._failure_count = 0\n        self._success_count = 0\n        self._last_failure_time = 0\n        self._half_open_calls = 0\n        self._lock = threading.Lock()\n        # Sliding window of recent call results\n        self._window = deque(maxlen=100)\n\n    def _open_error(self) -> CircuitOpenError:\n        return CircuitOpenError(\n            f\"Circuit '{self.name}' is open. \"\n            f\"Retry after {self._retry_after():.1f}s\"\n        )\n\n    def call(self, func: Callable, *args, **kwargs) -> Any:\n        \"\"\"Run a synchronous function through the breaker\"\"\"\n        if not self.allow():\n            raise self._open_error()\n\n        start = time.time()\n        try:\n            result = func(*args, **kwargs)\n        except Exception as e:\n            self._on_failure(e, time.time() - start)\n            raise\n        self._on_success(time.time() - start)\n        return result\n\n    async def call_async(self, func: Callable, *args, **kwargs) -> Any:\n        \"\"\"Run a coroutine function through the breaker, enforcing config.timeout\"\"\"\n        if not self.allow():\n            raise self._open_error()\n\n        start = time.time()\n        try:\n            result = await asyncio.wait_for(\n                func(*args, **kwargs), timeout=self.config.timeout\n            )\n        except Exception as e:\n            self._on_failure(e, time.time() - start)\n            raise\n        self._on_success(time.time() - start)\n        return result\n\n    def allow(self) -> bool:\n        \"\"\"Check whether a request may pass\"\"\"\n        with self._lock:\n            if self._state == CircuitState.CLOSED:\n                return True\n\n            if self._state == CircuitState.OPEN:\n                if time.time() - self._last_failure_time > self.config.recovery_timeout:\n                    self._state = CircuitState.HALF_OPEN\n                    self._half_open_calls = 1  # this request is the first probe\n                    self._success_count = 0\n                    return True\n                return False\n\n            if self._state == CircuitState.HALF_OPEN:\n                if self._half_open_calls < self.config.half_open_max_calls:\n                    self._half_open_calls += 1\n                    return True\n                return False\n\n            return False\n\n    def _on_success(self, latency: float):\n        with self._lock:\n            self._window.append(('success', latency, time.time()))\n            self._failure_count = 0\n\n            if self._state == CircuitState.HALF_OPEN:\n                self._success_count += 1\n                if self._success_count >= self.config.success_threshold:\n                    self._state = CircuitState.CLOSED\n                    self._success_count = 0\n\n    def _on_failure(self, error: Exception, latency: float):\n        with self._lock:\n            self._window.append(('failure', latency, time.time()))\n            self._failure_count += 1\n            self._last_failure_time = time.time()\n\n            if self._state == CircuitState.HALF_OPEN:\n                # A failure while half-open trips the breaker again immediately\n                self._state = CircuitState.OPEN\n                self._success_count = 0\n            elif self._failure_count >= self.config.failure_threshold:\n                self._state = CircuitState.OPEN\n\n    def _retry_after(self) -> float:\n        elapsed = time.time() - self._last_failure_time\n        return max(0, self.config.recovery_timeout - elapsed)\n\n    def get_stats(self) -> dict:\n        \"\"\"Return breaker statistics\"\"\"\n        with self._lock:\n            window = list(self._window)  # snapshot under the lock\n        total = len(window)\n        failures = sum(1 for s, _, _ in window if s == 'failure')\n        latencies = [l for _, l, _ in window]\n\n        return {\n            \"name\": self.name,\n            \"state\": self._state.value,\n            \"failure_count\": self._failure_count,\n            \"total_calls\": total,\n            \"failure_rate\": failures / total if total else 0,\n            \"avg_latency\": sum(latencies) / len(latencies) if latencies else 0,\n            \"retry_after\": self._retry_after() if self._state == CircuitState.OPEN else 0\n        }\n\n# Usage example\nllm_breaker = CircuitBreaker(\"llm_provider\", CircuitConfig(\n    failure_threshold=5,\n    recovery_timeout=30.0,\n    timeout=15.0\n))\n\nasync def call_llm_with_protection(prompt: str, model: str):\n    # llm_client and get_cached_response are assumed to be defined elsewhere\n    try:\n        return await llm_breaker.call_async(\n            llm_client.chat_completion,\n            model=model, messages=[{\"role\": \"user\", \"content\": prompt}]\n        )\n    except CircuitOpenError:\n        # Degrade: fall back to a cached response or a canned message\n        return get_cached_response(prompt) or \"The system is busy. Please try again later.\"",
      "section_ref": "37.5.2",
      "runnable": true,
      "dependencies": [
        "threading"
      ]
    },
    {
      "id": "code-10",
      "language": "python",
      "description": "",
      "code": "# retry_policy.py\nimport asyncio\nimport logging\nimport random\nimport time\nfrom typing import Callable, List, Optional, Type, TypeVar\n\nT = TypeVar('T')\n\nclass RetryPolicy:\n    \"\"\"Retry policy with exponential backoff and jitter\"\"\"\n\n    def __init__(\n        self,\n        max_retries: int = 3,\n        base_delay: float = 1.0,\n        max_delay: float = 30.0,\n        exponential_base: float = 2.0,\n        jitter: bool = True,\n        retryable_exceptions: Optional[List[Type[Exception]]] = None,\n        retryable_status_codes: Optional[List[int]] = None,\n    ):\n        self.max_retries = max_retries\n        self.base_delay = base_delay\n        self.max_delay = max_delay\n        self.exponential_base = exponential_base\n        self.jitter = jitter\n        self.retryable_exceptions = retryable_exceptions or [\n            ConnectionError, TimeoutError\n        ]\n        # Stored for callers that map HTTP status codes to exceptions;\n        # execute() and execute_async() only inspect exception types\n        self.retryable_status_codes = retryable_status_codes or [429, 502, 503, 504]\n\n    def execute(self, func: Callable[..., T], *args, **kwargs) -> T:\n        \"\"\"Run a function, retrying on retryable exceptions\"\"\"\n        for attempt in range(self.max_retries + 1):\n            try:\n                return func(*args, **kwargs)\n            except tuple(self.retryable_exceptions) as e:\n                if attempt >= self.max_retries:\n                    raise  # retries exhausted\n                delay = self._calculate_delay(attempt)\n                self._log_retry(func.__name__, attempt, delay, e)\n                time.sleep(delay)\n            # Non-retryable exceptions propagate immediately\n\n    async def execute_async(self, func: Callable, *args, **kwargs) -> T:\n        \"\"\"Async version of execute\"\"\"\n        for attempt in range(self.max_retries + 1):\n            try:\n                return await func(*args, **kwargs)\n            except tuple(self.retryable_exceptions) as e:\n                if attempt >= self.max_retries:\n                    raise  # retries exhausted\n                delay = self._calculate_delay(attempt)\n                self._log_retry(func.__name__, attempt, delay, e)\n                await asyncio.sleep(delay)\n\n    def _calculate_delay(self, attempt: int) -> float:\n        \"\"\"Exponential backoff plus jitter\"\"\"\n        delay = self.base_delay * (self.exponential_base ** attempt)\n        delay = min(delay, self.max_delay)\n\n        if self.jitter:\n            # Scale by a random factor in [0.5, 1.0)\n            delay = delay * (0.5 + random.random() * 0.5)\n\n        return delay\n\n    def _log_retry(self, func_name, attempt, delay, error):\n        logging.warning(\n            f\"Retry {attempt + 1}/{self.max_retries} for {func_name} \"\n            f\"after {delay:.1f}s: {error}\"\n        )\n\n# Retry policy tuned for LLM APIs\nllm_retry = RetryPolicy(\n    max_retries=3,\n    base_delay=1.0,\n    max_delay=30.0,\n    retryable_exceptions=[ConnectionError, TimeoutError],\n    retryable_status_codes=[429, 500, 502, 503],\n)",
      "section_ref": "37.5.3",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-11",
      "language": "mermaid",
      "description": "",
      "code": "graph TB\n    DNS[Global DNS<br/>Anycast/GSLB]\n\n    subgraph RegionA[\"East China (Shanghai)\"]\n        GW_A[Gateway]\n        Core_A[Agent Core]\n        DB_A[(DB Primary)]\n        Redis_A[(Redis)]\n    end\n\n    subgraph RegionB[\"North China (Beijing)\"]\n        GW_B[Gateway]\n        Core_B[Agent Core]\n        DB_B[(DB Replica)]\n        Redis_B[(Redis)]\n    end\n\n    subgraph RegionC[\"Overseas (Singapore)\"]\n        GW_C[Gateway]\n        Core_C[Agent Core]\n        DB_C[(DB Replica)]\n        Redis_C[(Redis)]\n    end\n\n    DNS -->|Nearest-region routing| GW_A & GW_B & GW_C\n    Core_A --> Core_B & Core_C\n    DB_A -->|Async replication| DB_B & DB_C",
      "section_ref": "37.6.1",
      "runnable": false,
      "dependencies": []
    },
    {
      "id": "code-12",
      "language": "python",
      "description": "",
      "code": "# cross_region_sync.py\nfrom enum import Enum\nimport asyncio\n\nclass SyncMode(Enum):\n    SYNC = \"sync\"         # synchronous replication (strong consistency)\n    ASYNC = \"async\"       # asynchronous replication (eventual consistency)\n    SEMI_SYNC = \"semi\"    # semi-synchronous\n\nclass CrossRegionReplicator:\n    \"\"\"Cross-region data replicator\"\"\"\n    # The helpers _connect, _local_write, _remote_write, _queue_replication\n    # and _get_nearest_region are deployment-specific and not shown here\n\n    def __init__(self, local_region: str, remote_regions: list):\n        self.local_region = local_region\n        self.remotes = {}\n        for region in remote_regions:\n            self.remotes[region] = self._connect(region)\n\n    async def replicate_write(\n        self,\n        table: str,\n        operation: str,  # INSERT, UPDATE, DELETE\n        data: dict,\n        mode: SyncMode = SyncMode.ASYNC\n    ) -> bool:\n        \"\"\"Replicate a write across regions\"\"\"\n        # 1. Write locally first (must succeed before replicating)\n        await self._local_write(table, operation, data)\n\n        if mode == SyncMode.SYNC:\n            # Synchronous: wait for every region to confirm\n            tasks = [\n                self._remote_write(region, table, operation, data)\n                for region in self.remotes\n            ]\n            results = await asyncio.gather(*tasks, return_exceptions=True)\n            return all(not isinstance(r, Exception) for r in results)\n\n        elif mode == SyncMode.SEMI_SYNC:\n            # Semi-synchronous: succeed if at least one remote region confirms\n            # (this sketch still waits for every region to respond)\n            tasks = [\n                self._remote_write(region, table, operation, data)\n                for region in self.remotes\n            ]\n            results = await asyncio.gather(*tasks, return_exceptions=True)\n            success_count = sum(\n                1 for r in results if not isinstance(r, Exception)\n            )\n            return success_count >= 1\n\n        else:\n            # Asynchronous: fire-and-forget via a message queue\n            await self._queue_replication(table, operation, data)\n            return True\n\n    def route_read(self, consistency: str = \"eventual\") -> str:\n        \"\"\"Route a read according to its consistency requirement\"\"\"\n        if consistency == \"strong\":\n            # Strongly consistent reads stay in the local region,\n            # which is assumed to be the primary (write) region\n            return self.local_region\n\n        # Eventually consistent reads go to the nearest region\n        return self._get_nearest_region()",
      "section_ref": "37.6.2",
      "runnable": true,
      "dependencies": []
    },
    {
      "id": "code-13",
      "language": "bash",
      "description": "Automated backup script for the Agent platform:",
      "code": "#!/bin/bash\n# backup_agent_platform.sh\n# Automated backup script for the Agent platform\n\nset -euo pipefail\n\nBACKUP_ROOT=\"/data/backups\"\nBACKUP_NAME=\"$(date +%Y-%m-%d_%H%M%S)\"\nBACKUP_DIR=\"${BACKUP_ROOT}/${BACKUP_NAME}\"\nRETENTION_DAYS=30\n\nlog() { echo \"[$(date '+%Y-%m-%d %H:%M:%S')] $*\"; }\n\n# 1. Full PostgreSQL backup\n# The password is read from ~/.pgpass or PGPASSWORD, not the command line\nlog \"Starting PostgreSQL backup...\"\nmkdir -p \"${BACKUP_DIR}/postgres\"\npg_dump -Fc -f \"${BACKUP_DIR}/postgres/agent_platform.dump\" \\\n    --no-owner --no-privileges \\\n    postgresql://agent_app@pg-primary:5432/agent_platform\n\n# 2. Redis RDB backup\nlog \"Starting Redis backup...\"\nmkdir -p \"${BACKUP_DIR}/redis\"\nredis-cli -h redis-primary --rdb \"${BACKUP_DIR}/redis/dump.rdb\"\n\n# 3. Vector database backup\n# pymilvus offers no collection dump call, so this step assumes the separate\n# milvus-backup CLI is installed and configured; it stores the backup in the\n# object storage named in its own config file\nlog \"Starting vector DB backup...\"\nmilvus-backup create -n \"agent_platform_$(date +%Y%m%d_%H%M%S)\"\n\n# 4. Configuration file backup\nlog \"Starting config backup...\"\nmkdir -p \"${BACKUP_DIR}/configs\"\ncp -r /etc/agent-platform/* \"${BACKUP_DIR}/configs/\"\n\n# 5. Verify backup integrity before uploading\nlog \"Verifying backup integrity...\"\nif [ -f \"${BACKUP_DIR}/postgres/agent_platform.dump\" ]; then\n    pg_restore --list \"${BACKUP_DIR}/postgres/agent_platform.dump\" > /dev/null\n    log \"PostgreSQL backup verified ✓\"\nelse\n    log \"ERROR: PostgreSQL backup failed!\"\n    exit 1\nfi\n\n# 6. Upload to object storage (cross-region), one prefix per backup run\nlog \"Uploading to remote storage...\"\naws s3 sync \"${BACKUP_DIR}\" \\\n    \"s3://agent-platform-backups/$(date +%Y-%m)/${BACKUP_NAME}/\" \\\n    --storage-class STANDARD_IA\n\n# 7. Remove old local backups (top-level backup directories only)\nlog \"Cleaning old backups...\"\nfind \"${BACKUP_ROOT}\" -mindepth 1 -maxdepth 1 -type d \\\n    -mtime +${RETENTION_DAYS} -exec rm -rf {} +\n\nlog \"Backup completed successfully. Size: $(du -sh \"${BACKUP_DIR}\" | cut -f1)\"",
      "section_ref": "37.7.2",
      "runnable": false,
      "dependencies": []
    },
    {
      "id": "code-14",
      "language": "yaml",
      "description": "",
      "code": "# disaster_recovery_playbook.yaml\n# Disaster recovery drill playbook\n\nname: Agent Platform DR Drill\nversion: \"1.0\"\nlast_drill: \"2026-03-15\"\nnext_drill: \"2026-04-15\"\n\nscenarios:\n  - name: \"Region-level outage\"\n    description: \"The East China region is completely unavailable\"\n    steps:\n      1:\n        action: \"Update DNS to shift traffic to the North China region\"\n        owner: SRE\n        timeout: 5m\n      2:\n        action: \"Verify that services in the North China region are healthy\"\n        owner: QA\n        timeout: 10m\n      3:\n        action: \"Confirm that replication lag is within the acceptable range\"\n        owner: DBA\n        timeout: 15m\n      4:\n        action: \"Notify customers and update the status page\"\n        owner: Support\n        timeout: 30m\n    recovery_time_target: 30min\n\n  - name: \"Database primary failure\"\n    description: \"The PostgreSQL primary node is down\"\n    steps:\n      1:\n        action: \"Promote a read replica to primary\"\n        owner: DBA\n        timeout: 5m\n      2:\n        action: \"Update the application's database connection settings\"\n        owner: SRE\n        timeout: 5m\n      3:\n        action: \"Verify that reads and writes work\"\n        owner: QA\n        timeout: 10m\n    recovery_time_target: 20min\n\n  - name: \"Total LLM provider outage\"\n    description: \"All LLM services are unavailable\"\n    steps:\n      1:\n        action: \"Circuit breaker opens automatically\"\n        owner: System\n        timeout: 0m\n      2:\n        action: \"Enable the backup LLM provider\"\n        owner: AI Team\n        timeout: 10m\n      3:\n        action: \"Degrade to cached-response mode\"\n        owner: System\n        timeout: 1m\n    recovery_time_target: 10min",
      "section_ref": "37.7.3",
      "runnable": false,
      "dependencies": []
    },
    {
      "id": "code-15",
      "language": "python",
      "description": "",
      "code": "# chaos_experiment.py\n\"\"\"\nChaos engineering experiments for the Agent platform.\nUse with caution in production! Validate thoroughly in staging first.\n\"\"\"\nimport random\nimport time\nimport logging\nfrom typing import Callable, List\n\nlogger = logging.getLogger(\"chaos\")\n\n\nclass SimulatedFailure(Exception):\n    \"\"\"Raised when a chaos experiment simulates a service failure.\"\"\"\n\n\ndef get_current_error_rate() -> float:\n    \"\"\"Placeholder (assumption): replace with a query against your metrics system.\"\"\"\n    return 0.0\n\n\nclass ChaosExperiment:\n    \"\"\"Chaos engineering experiment framework.\"\"\"\n\n    def __init__(self, target_services: List[str]):\n        self.targets = target_services\n        self.safety_hooks: List[Callable[[], bool]] = []\n\n    def add_safety_hook(self, hook: Callable[[], bool]):\n        \"\"\"Register a safety hook (a precondition checked before an experiment).\"\"\"\n        self.safety_hooks.append(hook)\n\n    def check_safety(self) -> bool:\n        \"\"\"Run every safety hook; return False as soon as one fails.\"\"\"\n        for hook in self.safety_hooks:\n            if not hook():\n                logger.warning(f\"Safety check failed: {hook.__name__}\")\n                return False\n        return True\n\n    def inject_latency(self, service: str, duration_ms: int,\n                       probability: float = 0.1):\n        \"\"\"Inject latency with the given probability.\"\"\"\n        if random.random() > probability:\n            return\n\n        logger.info(\n            f\"[CHAOS] Injecting {duration_ms}ms latency to {service}\"\n        )\n        time.sleep(duration_ms / 1000)\n\n    def simulate_failure(self, service: str, probability: float = 0.05):\n        \"\"\"Simulate a service failure with the given probability.\"\"\"\n        if random.random() > probability:\n            return\n\n        logger.warning(f\"[CHAOS] Simulating failure of {service}\")\n        raise SimulatedFailure(\n            f\"Chaos experiment: {service} is unavailable\"\n        )\n\n\n# Example safety hooks\ndef check_not_peak_hours() -> bool:\n    \"\"\"Ensure the experiment does not run during peak hours.\"\"\"\n    hour = time.localtime().tm_hour\n    return hour < 9 or hour > 22  # off-peak: before 09:00 or from 23:00\n\n\ndef check_error_rate_below_threshold() -> bool:\n    \"\"\"Ensure the current error rate is below the threshold.\"\"\"\n    current_error_rate = get_current_error_rate()\n    return current_error_rate < 0.01  # error rate below 1%",
      "section_ref": "37.8.1",
      "runnable": true,
      "dependencies": []
    }
  ],
  "tables": [
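    {
      "headers": [
        "Tier",
        "Availability",
        "Annual downtime",
        "Typical use"
      ],
      "data": [
        [
          "Basic",
          "99%",
          "3.65 days",
          "Internal tools, development environments"
        ],
        [
          "Standard",
          "99.9%",
          "8.76 hours",
          "General business systems"
        ],
        [
          "High availability",
          "99.95%",
          "4.38 hours",
          "Core business systems"
        ],
        [
          "Very high availability",
          "99.99%",
          "52.56 minutes",
          "Financial-grade Agent platforms"
        ]
      ]
    },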
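    {
      "headers": [
        "Service type",
        "State type",
        "Scaling strategy"
      ],
      "data": [
        [
          "Redis",
          "Cached data",
          "Redis Cluster (sharding + replicas)"
        ],
        [
          "PostgreSQL",
          "Persistent data",
          "Read/write splitting + database and table sharding"
        ],
        [
          "Vector database",
          "Embeddings",
          "Milvus distributed cluster"
        ],
        [
          "WebSocket",
          "Connection state",
          "Sticky Session + Pub/Sub"
        ]
      ]
    },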
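    {
      "headers": [
        "Layer",
        "Technology",
        "Responsibility"
      ],
      "data": [
        [
          "L4",
          "Nginx / AWS NLB",
          "TCP connection distribution, TLS termination"
        ],
        [
          "L7",
          "Nginx / AWS ALB",
          "HTTP routing, gRPC proxying, rate limiting"
        ],
        [
          "L7 Intra",
          "Envoy / K8s Service",
          "Load balancing between services inside the cluster"
        ]
      ]
    },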
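    {
      "headers": [
        "Algorithm",
        "Best suited for",
        "Advantage",
        "Disadvantage"
      ],
      "data": [
        [
          "Round Robin",
          "Homogeneous services",
          "Simple and fair",
          "Ignores load differences between instances"
        ],
        [
          "Weighted Round Robin",
          "Heterogeneous instances",
          "Accounts for instance capacity",
          "Weights must be tuned manually"
        ],
        [
          "Least Connections",
          "Long-lived connection services",
          "Balances dynamically",
          "Requires connection-count tracking"
        ],
        [
          "Consistent Hashing",
          "Stateful services",
          "Affinity (the same key routes to the same instance)",
          "Load may be distributed unevenly"
        ],
        [
          "Adaptive",
          "Mixed workloads",
          "Best balance under changing load",
          "Complex to implement"
        ]
      ]
    },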
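    {
      "headers": [
        "Metric",
        "Scale-out trigger",
        "Scale-in trigger",
        "Cooldown"
      ],
      "data": [
        [
          "CPU > 70%",
          "Immediately",
          "CPU < 30% sustained for 5 minutes",
          "Scale-out 60s / scale-in 300s"
        ],
        [
          "Concurrent requests > 100/pod",
          "Immediately",
          "< 50/pod sustained for 5 minutes",
          "Same as above"
        ],
        [
          "Queue backlog > 50",
          "Immediately",
          "< 10 sustained for 10 minutes",
          "Scale-out 30s / scale-in 600s"
        ],
        [
          "Predicted traffic for the next 1h > 80% of capacity",
          "30 minutes in advance",
          "-",
          "-"
        ]
      ]
    },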
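    {
      "headers": [
        "Level",
        "Trigger condition",
        "Degradation behavior",
        "User experience"
      ],
      "data": [
        [
          "L0",
          "Normal operation",
          "Full functionality",
          "★★★★★"
        ],
        [
          "L1",
          "LLM response timeout",
          "Switch to a backup model",
          "★★★★☆"
        ],
        [
          "L2",
          "All LLMs unavailable",
          "Return cached responses",
          "★★★☆☆"
        ],
        [
          "L3",
          "RAG service unavailable",
          "Skip retrieval; plain conversation only",
          "★★★☆☆"
        ],
        [
          "L4",
          "All dependencies failed",
          "Static maintenance page",
          "★★☆☆☆"
        ]
      ]
    },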
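    {
      "headers": [
        "Tier",
        "RPO (data loss)",
        "RTO (recovery time)",
        "Cost",
        "Approach"
      ],
      "data": [
        [
          "Cold standby",
          "24 hours",
          "4-24 hours",
          "Low",
          "Periodic backups + off-site storage"
        ],
        [
          "Warm standby",
          "1 hour",
          "1-4 hours",
          "Medium",
          "Asynchronous replication + hot standby"
        ],
        [
          "Hot standby",
          "Minutes",
          "Minutes",
          "High",
          "Synchronous replication + automatic failover"
        ],
        [
          "Multi-active",
          "0",
          "Near 0 (switchover within seconds)",
          "Very high",
          "Multi-region active-active"
        ]
      ]
    }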
  ],
  "key_takeaways": [],
  "common_pitfalls": [],
  "related_chapters": [
    "ch36"
  ]
}