Enterprise Multi-Agent System Deployment and Observability in Practice

A complete rollout and troubleshooting checklist for Docker Compose + FastAPI + Prometheus + Grafana + Nginx.

TL;DR

  • Metrics moved to port 9100; 8000 is reserved for the API.
  • Redis/Postgres scraped via exporters; Prometheus targets corrected.
  • New FastAPI API (/chat, /tasks, /analysis, /health, /metrics).
  • Tasks persisted to Postgres, processed asynchronously in the background, queryable in real time.
  • LLM provider auto-selected among OpenAI/DeepSeek/Anthropic, with fallback on failure.
  • UTF-8 throughout on Windows/PowerShell; the server returns application/json; charset=utf-8.
  • Base image parameterized to AWS Public ECR to work around Docker Hub/apt access issues.

1. Background and Goals

  • Three keys to taking a multi-agent AI system to production: stable operation, observability, and fast troubleshooting.
  • This article covers Compose orchestration, turning the system into an API service, metrics collection and visualization, LLM integration, and a self-service checklist for common failures.

2. Architecture and Service Roles

  • multi-agent-system: the main process (tracing, message bus, workflows, agents).
  • api: FastAPI service (/chat, /tasks, /analysis, /health, /metrics).
  • redis / postgres: cache and persistence; exporters expose HTTP metrics.
  • prometheus / grafana: metrics collection and visualization.
  • nginx: reverse proxy for /api/ and /streamlit/.
  • streamlit: placeholder UI; jupyter: development environment (optional).

3. Monitoring and Scrape Fixes

  • Main-application metrics: exposed on 9100; the Prometheus target is multi-agent-system:9100.
  • Exporters: redis-exporter:9121 for Redis, postgres-exporter:9187 for Postgres.
  • Avoid scraping 8501 (Streamlit) when it is not running.

4. API Service and Persistence

  • Decouple /metrics from the business API: /metrics on 9100; /chat, /tasks, etc. on 8000 (the api service).
  • Tasks are written to Postgres as "processing" and handled by a background coroutine; GET /tasks/{id} returns the current result in real time.
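
The create-then-poll flow above can be sketched without the web and database layers; the in-memory `tasks` dict, function names, and the sleep durations below are illustrative stand-ins, not the project's actual code.

```python
import asyncio
import uuid

# In-memory stand-in for the Postgres-backed task table (illustrative only).
tasks: dict = {}

async def process_task(task_id: str) -> None:
    """Background coroutine: simulate work, then persist the result."""
    await asyncio.sleep(0.1)  # stands in for the real research/analysis step
    tasks[task_id]["status"] = "completed"
    tasks[task_id]["result"] = "placeholder result"

async def create_task(query: str) -> str:
    """POST /tasks equivalent: insert as 'processing', schedule background work."""
    task_id = uuid.uuid4().hex
    tasks[task_id] = {"query": query, "status": "processing", "result": None}
    asyncio.create_task(process_task(task_id))
    return task_id

async def main():
    task_id = await create_task("AI in healthcare")
    before = tasks[task_id]["status"]   # what GET /tasks/{id} would see right away
    await asyncio.sleep(0.2)
    after = tasks[task_id]["status"]    # what it sees once the coroutine finishes
    return before, after

statuses = asyncio.run(main())
print(statuses)  # ('processing', 'completed')
```

In the real service the dict writes become INSERT/UPDATE statements against Postgres, which is what lets the result survive restarts and be queried by any API replica.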

5. LLM Integration and Fallback Strategy

  • Provider auto-selected by API-key priority: OPENAI > DEEPSEEK > ANTHROPIC.
  • With openai, any base_url is ignored; deepseek requires LLM_BASE_URL=https://api.deepseek.com/v1.
  • On failure, log a clear error and fall back to a placeholder result so the system always returns stable output.
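
A minimal sketch of the key-priority selection described above; the function name `pick_provider` and the default models are assumptions for illustration, not the project's exact code.

```python
def pick_provider(env: dict) -> dict:
    """Pick the LLM provider by key priority: OPENAI > DEEPSEEK > ANTHROPIC.

    An explicit LLM_PROVIDER overrides the key-based auto-detection.
    """
    provider = env.get("LLM_PROVIDER")
    if not provider:
        for name, key in (("openai", "OPENAI_API_KEY"),
                          ("deepseek", "DEEPSEEK_API_KEY"),
                          ("anthropic", "ANTHROPIC_API_KEY")):
            if env.get(key):
                provider = name
                break
    if provider == "openai":
        # OpenAI uses the SDK's default endpoint; any LLM_BASE_URL is ignored.
        return {"provider": "openai", "base_url": None,
                "model": env.get("LLM_MODEL", "gpt-3.5-turbo")}
    if provider == "deepseek":
        return {"provider": "deepseek",
                "base_url": env.get("LLM_BASE_URL", "https://api.deepseek.com/v1"),
                "model": env.get("LLM_MODEL", "deepseek-chat")}
    if provider == "anthropic":
        return {"provider": "anthropic", "base_url": None,
                "model": env.get("LLM_MODEL", "claude-3-5-sonnet-20240620")}
    # No key at all: caller falls back to the placeholder result path.
    return {"provider": None, "base_url": None, "model": None}
```

With both an OpenAI and a DeepSeek key set, `pick_provider` returns the OpenAI configuration; with only `DEEPSEEK_API_KEY` set, it fills in the DeepSeek base URL automatically.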

6. Chinese-Text Best Practices on Windows/PowerShell

  • Server side: always respond with application/json; charset=utf-8.
  • Client side: [Console]::OutputEncoding = [System.Text.Encoding]::UTF8 plus Invoke-RestMethod.
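
The server-side rule boils down to two things: serialize without ASCII-escaping and declare the charset in the Content-Type. A minimal illustration (the payload text is hypothetical):

```python
import json

payload = {"answer": "人工智能在医疗领域的应用前景广阔"}

# Serialize without ASCII-escaping and encode as UTF-8 bytes; the declared
# charset tells PowerShell and browsers how to decode the Chinese text.
body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
content_type = "application/json; charset=utf-8"

# Round trip: the client decodes the bytes using the declared charset.
decoded = json.loads(body.decode("utf-8"))
print(decoded["answer"])
```

Without `ensure_ascii=False` the Chinese would arrive as `\uXXXX` escapes (correct but unreadable in logs); without the charset declaration, some clients guess the encoding and render "????".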

7. Common Failures at a Glance

  • connection refused/EOF: fix the scrape targets; use the Redis/Postgres exporters; move main metrics to 9100.
  • 405 Method Not Allowed: do not serve /metrics on 8000; 8000 is API-only.
  • model_not_found: provider/model mismatch (OpenAI has no deepseek-chat).
  • Docker Hub token errors / apt exit code 100: parameterize FROM with an ECR image, or pre-pull and retag.
  • Chinese rendered as "????" in PowerShell: server-side charset plus client-side UTF-8.

8. One-Command Verification Checklist

docker compose up -d --build api multi-agent-system prometheus grafana nginx redis postgres redis-exporter postgres-exporter
open http://localhost:9090/targets
open http://localhost:3000
open http://localhost:8000/health
open http://localhost:8000/docs

PowerShell example:

[Console]::OutputEncoding = [System.Text.Encoding]::UTF8
$task = @{ query="人工智能在医疗领域的应用前景"; priority="high"; agent_type="research" } | ConvertTo-Json
$c = Invoke-RestMethod -Method POST -Uri "http://localhost:8000/tasks" -ContentType "application/json; charset=utf-8" -Body $task
Start-Sleep -Seconds 3
Invoke-RestMethod -Method GET -Uri ("http://localhost:8000/tasks/{0}" -f $c.task_id)

9. Wrap-Up and Next Steps

  • Switch the background handler to the real research_agent pipeline (retrieve, analyze, report).
  • Add Grafana panels for task throughput, latency, error rate, and workflow metrics.
  • Add /debug/llm and /agents endpoints; migrate on_event to lifespan to remove the deprecation warning.

Appendix A: prometheus.yml (recommended)

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Main application (Prometheus Python client listening on 9100)
  - job_name: 'multi-agent-system-metrics'
    metrics_path: /metrics
    static_configs:
      - targets: ['multi-agent-system:9100']

  # Redis metrics via the exporter
  - job_name: 'redis'
    metrics_path: /metrics
    static_configs:
      - targets: ['redis-exporter:9121']

  # Postgres metrics via the exporter
  - job_name: 'postgres'
    metrics_path: /metrics
    static_configs:
      - targets: ['postgres-exporter:9187']

  # Grafana exposes its own /metrics
  - job_name: 'grafana'
    metrics_path: /metrics
    static_configs:
      - targets: ['grafana:3000']

Note: do not scrape redis:6379 or postgres:5432; they are not HTTP metrics endpoints. Always scrape their respective exporters.

Appendix B: docker-compose.yml key fragments (excerpt)

services:
  multi-agent-system:
    build:
      context: .
      dockerfile: Dockerfile
      args:
        PY_BASE_IMAGE: public.ecr.aws/docker/library/python:3.11-slim
    container_name: multi-agent-system
    environment:
      - PYTHONPATH=/app
      - REDIS_HOST=redis
      - POSTGRES_HOST=postgres
      - OPENAI_API_KEY=${OPENAI_API_KEY:-}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}
      - DEEPSEEK_API_KEY=${DEEPSEEK_API_KEY:-}
      - LLM_PROVIDER=${LLM_PROVIDER:-}
      - LLM_MODEL=${LLM_MODEL:-}
      - LLM_BASE_URL=${LLM_BASE_URL:-}
    volumes:
      - ./logs:/app/logs
      - ./data:/app/data
    networks:
      - multi-agent-network
    restart: unless-stopped
    # Health check now probes 9100/metrics
    healthcheck:
      test: ["CMD-SHELL", "curl -fsS http://localhost:9100/metrics > /dev/null || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

  api:
    build:
      context: .
      dockerfile: Dockerfile
      args:
        PY_BASE_IMAGE: public.ecr.aws/docker/library/python:3.11-slim
    container_name: multi-agent-api
    command: ["python", "api/server.py"]
    ports:
      - "8000:8000"
    environment:
      - PYTHONPATH=/app
      - POSTGRES_HOST=postgres
      - POSTGRES_DB=multi_agent_system
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
      - OPENAI_API_KEY=${OPENAI_API_KEY:-}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}
      - DEEPSEEK_API_KEY=${DEEPSEEK_API_KEY:-}
      - LLM_PROVIDER=${LLM_PROVIDER:-}
      - LLM_MODEL=${LLM_MODEL:-}
      - LLM_BASE_URL=${LLM_BASE_URL:-}
    depends_on:
      - redis
      - postgres
    networks:
      - multi-agent-network
    restart: unless-stopped

  streamlit:
    build:
      context: .
      dockerfile: Dockerfile
      args:
        PY_BASE_IMAGE: public.ecr.aws/docker/library/python:3.11-slim
    container_name: multi-agent-streamlit
    command: ["streamlit", "run", "ui/streamlit_app.py", "--server.port", "8501", "--server.address", "0.0.0.0"]
    ports:
      - "8501:8501"
    networks:
      - multi-agent-network

  redis:
    image: redis:7-alpine
    container_name: multi-agent-redis
    ports:
      - "6379:6379"
    networks: [multi-agent-network]

  postgres:
    image: postgres:15-alpine
    container_name: multi-agent-postgres
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_DB=multi_agent_system
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./scripts/init.sql:/docker-entrypoint-initdb.d/init.sql
    networks: [multi-agent-network]

  redis-exporter:
    image: oliver006/redis_exporter:v1.63.0
    container_name: multi-agent-redis-exporter
    environment:
      - REDIS_ADDR=redis://redis:6379
    ports:
      - "9121:9121"
    depends_on: [redis]
    networks: [multi-agent-network]

  postgres-exporter:
    image: quay.io/prometheuscommunity/postgres-exporter:v0.15.0
    container_name: multi-agent-postgres-exporter
    environment:
      - DATA_SOURCE_NAME=postgresql://postgres:postgres@postgres:5432/multi_agent_system?sslmode=disable
    ports:
      - "9187:9187"
    depends_on: [postgres]
    networks: [multi-agent-network]

  prometheus:
    image: prom/prometheus:latest
    container_name: multi-agent-prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    networks: [multi-agent-network]

  grafana:
    image: grafana/grafana:latest
    container_name: multi-agent-grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_METRICS_ENABLED=true
    volumes:
      - grafana_data:/var/lib/grafana
      - ./monitoring/grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./monitoring/grafana/datasources:/etc/grafana/provisioning/datasources
    networks: [multi-agent-network]

  nginx:
    image: nginx:alpine
    container_name: multi-agent-nginx
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - multi-agent-system
      - streamlit
      - api
    networks: [multi-agent-network]

networks:
  multi-agent-network:
    driver: bridge

volumes:
  postgres_data:
  prometheus_data:
  grafana_data:

Appendix C: Nginx reverse proxy (nginx/nginx.conf excerpt)

http {
  server {
    listen 80;
    server_name localhost;

    # API -> api:8000
    location /api/ {
      proxy_pass http://api:8000/;
      proxy_set_header Host $host;
      proxy_set_header X-Real-IP $remote_addr;
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header X-Forwarded-Proto $scheme;
    }

    # Streamlit -> streamlit:8501
    location /streamlit/ {
      proxy_pass http://streamlit:8501/;
      proxy_set_header Host $host;
      proxy_set_header X-Real-IP $remote_addr;
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header X-Forwarded-Proto $scheme;
    }
  }
}
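
Streamlit keeps a WebSocket connection open to the browser, so behind a reverse proxy the /streamlit/ location usually also needs the HTTP upgrade headers; the directives below follow standard Nginx WebSocket proxying and should be verified against your Streamlit version:

```nginx
location /streamlit/ {
  proxy_pass http://streamlit:8501/;
  proxy_http_version 1.1;
  # WebSocket upgrade headers required by Streamlit's live connection
  proxy_set_header Upgrade $http_upgrade;
  proxy_set_header Connection "upgrade";
  proxy_set_header Host $host;
}
```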

Appendix D: main-application metrics port (Python excerpt)

from prometheus_client import start_http_server

# Expose metrics on 9100
start_http_server(9100)

Appendix E: .env examples (pick one LLM)

OpenAI:

OPENAI_API_KEY=sk-your-key
LLM_PROVIDER=openai
LLM_MODEL=gpt-3.5-turbo

DeepSeek:

DEEPSEEK_API_KEY=ds-your-key
LLM_PROVIDER=deepseek
LLM_BASE_URL=https://api.deepseek.com/v1
LLM_MODEL=deepseek-chat

Anthropic:

ANTHROPIC_API_KEY=anth-your-key
LLM_PROVIDER=anthropic
LLM_MODEL=claude-3-5-sonnet-20240620

Appendix F: Grafana datasource and dashboard automation

  • Datasource (already mounted by Compose): monitoring/grafana/datasources/datasource.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
  • Dashboard provider: monitoring/grafana/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: 'Multi-Agent System'
    orgId: 1
    folder: 'Multi-Agent'
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /etc/grafana/provisioning/dashboards
  • Sample dashboard JSON: monitoring/grafana/dashboards/multi_agent_overview.json
{
  "title": "Multi-Agent System Overview",
  "schemaVersion": 38,
  "refresh": "10s",
  "panels": [
    {"type":"stat","title":"Targets UP","gridPos":{"x":0,"y":0,"w":6,"h":4},"datasource":"Prometheus","targets":[{"expr":"sum(up)"}]},
    {"type":"graph","title":"Prometheus Scrape Duration","gridPos":{"x":6,"y":0,"w":12,"h":8},"datasource":"Prometheus","targets":[{"expr":"scrape_duration_seconds{job=~\"multi-agent-system-metrics|redis|postgres|grafana\"}"}]},
    {"type":"table","title":"Jobs Status","gridPos":{"x":0,"y":4,"w":18,"h":8},"datasource":"Prometheus","targets":[{"expr":"up"}]}
  ]
}

Open Grafana (default admin/admin); the dashboard appears in the "Multi-Agent" folder.

Appendix G: /debug/llm endpoint example (FastAPI)

Added in api/server.py:

@app.get("/debug/llm")
async def debug_llm():
    """Return the detected LLM configuration and run a minimal connectivity self-test."""
    # Returns env/config/detected plus the probe result (reachable or not, error details, etc.)

Usage example:

curl -s http://localhost:8000/debug/llm | jq .

You should see detected.provider/model/base_url and probe.ok=true; on failure the endpoint returns the concrete error, which makes it easy to correct course quickly (e.g. provider/model mismatch, network unreachable).
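
The shape of that response can be sketched in plain Python; the masking helper and field names below are illustrative assumptions about what a safe diagnostic payload looks like, not the endpoint's actual implementation:

```python
from typing import Optional

def mask(secret: Optional[str]) -> Optional[str]:
    """Show only a key's prefix so the debug endpoint never leaks credentials."""
    if not secret:
        return None
    return secret[:6] + "***"

def debug_llm_payload(env: dict) -> dict:
    """Build the diagnostic payload: which keys are set, what was detected."""
    detected = env.get("LLM_PROVIDER") or (
        "openai" if env.get("OPENAI_API_KEY")
        else "deepseek" if env.get("DEEPSEEK_API_KEY")
        else "anthropic" if env.get("ANTHROPIC_API_KEY")
        else None
    )
    return {
        "env": {k: mask(env.get(k)) for k in
                ("OPENAI_API_KEY", "DEEPSEEK_API_KEY", "ANTHROPIC_API_KEY")},
        "detected": {"provider": detected,
                     "model": env.get("LLM_MODEL"),
                     "base_url": env.get("LLM_BASE_URL")},
        # probe is a live connectivity check in the real endpoint; omitted here
        "probe": {"ok": None},
    }

print(debug_llm_payload({"DEEPSEEK_API_KEY": "ds-abcdef123"}))
```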