Enterprise Multi-Agent System Deployment and Observability in Practice

A complete rollout and troubleshooting checklist for Docker Compose + FastAPI + Prometheus + Grafana + Nginx.

TL;DR

  • Metrics moved to port 9100; 8000 is reserved for the API.
  • Redis/Postgres scraped via exporters; Prometheus targets corrected.
  • New FastAPI API (/chat, /tasks, /analysis, /health, /metrics).
  • Tasks persisted to Postgres, processed asynchronously in the background, queryable in real time.
  • LLM provider auto-selected among OpenAI/DeepSeek/Anthropic, with fallback on failure.
  • UTF-8 throughout on Windows/PowerShell; the server returns application/json; charset=utf-8.
  • Base image parameterized to AWS Public ECR to work around Docker Hub/apt access issues.

1. Background and Goals

  • Three keys to taking a multi-agent AI system to production: stable operation, observability, and fast troubleshooting.
  • This article covers Compose orchestration, turning the system into an API service, metrics collection and visualization, LLM integration, and a self-service checklist for common failures.

2. Architecture and Service Roles

  • multi-agent-system: the main process (tracing, message bus, workflows, agents).
  • api: FastAPI service (/chat, /tasks, /analysis, /health, /metrics).
  • redis / postgres: cache and persistence; exporters expose HTTP metrics.
  • prometheus / grafana: metrics collection and visualization.
  • nginx: reverse proxy for /api/ and /streamlit/.
  • streamlit: placeholder UI; jupyter: development environment (optional).

3. Monitoring and Scrape Fixes

  • Main-application metrics: exposed on 9100; the Prometheus target is multi-agent-system:9100.
  • Exporters: redis-exporter:9121 for Redis, postgres-exporter:9187 for Postgres.
  • Avoid scraping 8501 (Streamlit) when it is not running.

4. API Service and Persistence

  • Decouple /metrics from the business API: /metrics on 9100; /chat, /tasks, etc. on 8000 (the api service).
  • Tasks are written to Postgres as "processing" and handled by a background coroutine; GET /tasks/{id} returns the current result in real time.
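
The create-then-poll flow above can be sketched without the web and database layers; the in-memory `tasks` dict, function names, and the sleep durations below are illustrative stand-ins, not the project's actual code.

```python
import asyncio
import uuid

# In-memory stand-in for the Postgres-backed task table (illustrative only).
tasks: dict = {}

async def process_task(task_id: str) -> None:
    """Background coroutine: simulate work, then persist the result."""
    await asyncio.sleep(0.1)  # stands in for the real research/analysis step
    tasks[task_id]["status"] = "completed"
    tasks[task_id]["result"] = "placeholder result"

async def create_task(query: str) -> str:
    """POST /tasks equivalent: insert as 'processing', schedule background work."""
    task_id = uuid.uuid4().hex
    tasks[task_id] = {"query": query, "status": "processing", "result": None}
    asyncio.create_task(process_task(task_id))
    return task_id

async def main():
    task_id = await create_task("AI in healthcare")
    before = tasks[task_id]["status"]   # what GET /tasks/{id} would see right away
    await asyncio.sleep(0.2)
    after = tasks[task_id]["status"]    # what it sees once the coroutine finishes
    return before, after

statuses = asyncio.run(main())
print(statuses)  # ('processing', 'completed')
```

In the real service the dict writes become INSERT/UPDATE statements against Postgres, which is what lets the result survive restarts and be queried by any API replica.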

5. LLM Integration and Fallback Strategy

  • Provider auto-selected by API-key priority: OPENAI > DEEPSEEK > ANTHROPIC.
  • With openai, any base_url is ignored; deepseek requires LLM_BASE_URL=https://api.deepseek.com/v1.
  • On failure, log a clear error and fall back to a placeholder result so the system always returns stable output.
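
A minimal sketch of the key-priority selection described above; the function name `pick_provider` and the default models are assumptions for illustration, not the project's exact code.

```python
def pick_provider(env: dict) -> dict:
    """Pick the LLM provider by key priority: OPENAI > DEEPSEEK > ANTHROPIC.

    An explicit LLM_PROVIDER overrides the key-based auto-detection.
    """
    provider = env.get("LLM_PROVIDER")
    if not provider:
        for name, key in (("openai", "OPENAI_API_KEY"),
                          ("deepseek", "DEEPSEEK_API_KEY"),
                          ("anthropic", "ANTHROPIC_API_KEY")):
            if env.get(key):
                provider = name
                break
    if provider == "openai":
        # OpenAI uses the SDK's default endpoint; any LLM_BASE_URL is ignored.
        return {"provider": "openai", "base_url": None,
                "model": env.get("LLM_MODEL", "gpt-3.5-turbo")}
    if provider == "deepseek":
        return {"provider": "deepseek",
                "base_url": env.get("LLM_BASE_URL", "https://api.deepseek.com/v1"),
                "model": env.get("LLM_MODEL", "deepseek-chat")}
    if provider == "anthropic":
        return {"provider": "anthropic", "base_url": None,
                "model": env.get("LLM_MODEL", "claude-3-5-sonnet-20240620")}
    # No key at all: caller falls back to the placeholder result path.
    return {"provider": None, "base_url": None, "model": None}
```

With both an OpenAI and a DeepSeek key set, `pick_provider` returns the OpenAI configuration; with only `DEEPSEEK_API_KEY` set, it fills in the DeepSeek base URL automatically.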

6. Chinese-Text Best Practices on Windows/PowerShell

  • Server side: always respond with application/json; charset=utf-8.
  • Client side: [Console]::OutputEncoding = [System.Text.Encoding]::UTF8 plus Invoke-RestMethod.
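
The server-side rule boils down to two things: serialize without ASCII-escaping and declare the charset in the Content-Type. A minimal illustration (the payload text is hypothetical):

```python
import json

payload = {"answer": "人工智能在医疗领域的应用前景广阔"}

# Serialize without ASCII-escaping and encode as UTF-8 bytes; the declared
# charset tells PowerShell and browsers how to decode the Chinese text.
body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
content_type = "application/json; charset=utf-8"

# Round trip: the client decodes the bytes using the declared charset.
decoded = json.loads(body.decode("utf-8"))
print(decoded["answer"])
```

Without `ensure_ascii=False` the Chinese would arrive as `\uXXXX` escapes (correct but unreadable in logs); without the charset declaration, some clients guess the encoding and render "????".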

7. Common Failures at a Glance

  • connection refused/EOF: fix the scrape targets; use the Redis/Postgres exporters; move main metrics to 9100.
  • 405 Method Not Allowed: do not serve /metrics on 8000; 8000 is API-only.
  • model_not_found: provider/model mismatch (OpenAI has no deepseek-chat).
  • Docker Hub token errors / apt exit code 100: parameterize FROM with an ECR image, or pre-pull and retag.
  • Chinese rendered as "????" in PowerShell: server-side charset plus client-side UTF-8.

8. One-Command Verification Checklist

docker compose up -d --build api multi-agent-system prometheus grafana nginx redis postgres redis-exporter postgres-exporter
open http://localhost:9090/targets
open http://localhost:3000
open http://localhost:8000/health
open http://localhost:8000/docs

PowerShell example:

[Console]::OutputEncoding = [System.Text.Encoding]::UTF8
$task = @{ query="人工智能在医疗领域的应用前景"; priority="high"; agent_type="research" } | ConvertTo-Json
$c = Invoke-RestMethod -Method POST -Uri "http://localhost:8000/tasks" -ContentType "application/json; charset=utf-8" -Body $task
Start-Sleep -Seconds 3
Invoke-RestMethod -Method GET -Uri ("http://localhost:8000/tasks/{0}" -f $c.task_id)

9. Wrap-Up and Next Steps

  • Switch the background handler to the real research_agent pipeline (retrieve, analyze, report).
  • Add Grafana panels for task throughput, latency, error rate, and workflow metrics.
  • Add /debug/llm and /agents endpoints; migrate on_event to lifespan to remove the deprecation warning.

Appendix A: prometheus.yml (recommended)

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Main application (Prometheus Python client listening on 9100)
  - job_name: 'multi-agent-system-metrics'
    metrics_path: /metrics
    static_configs:
      - targets: ['multi-agent-system:9100']

  # Redis metrics via the exporter
  - job_name: 'redis'
    metrics_path: /metrics
    static_configs:
      - targets: ['redis-exporter:9121']

  # Postgres metrics via the exporter
  - job_name: 'postgres'
    metrics_path: /metrics
    static_configs:
      - targets: ['postgres-exporter:9187']

  # Grafana exposes its own /metrics
  - job_name: 'grafana'
    metrics_path: /metrics
    static_configs:
      - targets: ['grafana:3000']

Note: do not scrape redis:6379 or postgres:5432; they are not HTTP metrics endpoints. Always scrape their respective exporters.

Appendix B: docker-compose.yml key fragments (excerpt)

services:
  multi-agent-system:
    build:
      context: .
      dockerfile: Dockerfile
      args:
        PY_BASE_IMAGE: public.ecr.aws/docker/library/python:3.11-slim
    container_name: multi-agent-system
    environment:
      - PYTHONPATH=/app
      - REDIS_HOST=redis
      - POSTGRES_HOST=postgres
      - OPENAI_API_KEY=${OPENAI_API_KEY:-}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}
      - DEEPSEEK_API_KEY=${DEEPSEEK_API_KEY:-}
      - LLM_PROVIDER=${LLM_PROVIDER:-}
      - LLM_MODEL=${LLM_MODEL:-}
      - LLM_BASE_URL=${LLM_BASE_URL:-}
    volumes:
      - ./logs:/app/logs
      - ./data:/app/data
    networks:
      - multi-agent-network
    restart: unless-stopped
    # Health check now probes 9100/metrics
    healthcheck:
      test: ["CMD-SHELL", "curl -fsS http://localhost:9100/metrics > /dev/null || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

  api:
    build:
      context: .
      dockerfile: Dockerfile
      args:
        PY_BASE_IMAGE: public.ecr.aws/docker/library/python:3.11-slim
    container_name: multi-agent-api
    command: ["python", "api/server.py"]
    ports:
      - "8000:8000"
    environment:
      - PYTHONPATH=/app
      - POSTGRES_HOST=postgres
      - POSTGRES_DB=multi_agent_system
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
      - OPENAI_API_KEY=${OPENAI_API_KEY:-}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}
      - DEEPSEEK_API_KEY=${DEEPSEEK_API_KEY:-}
      - LLM_PROVIDER=${LLM_PROVIDER:-}
      - LLM_MODEL=${LLM_MODEL:-}
      - LLM_BASE_URL=${LLM_BASE_URL:-}
    depends_on:
      - redis
      - postgres
    networks:
      - multi-agent-network
    restart: unless-stopped

  streamlit:
    build:
      context: .
      dockerfile: Dockerfile
      args:
        PY_BASE_IMAGE: public.ecr.aws/docker/library/python:3.11-slim
    container_name: multi-agent-streamlit
    command: ["streamlit", "run", "ui/streamlit_app.py", "--server.port", "8501", "--server.address", "0.0.0.0"]
    ports:
      - "8501:8501"
    networks:
      - multi-agent-network

  redis:
    image: redis:7-alpine
    container_name: multi-agent-redis
    ports:
      - "6379:6379"
    networks: [multi-agent-network]

  postgres:
    image: postgres:15-alpine
    container_name: multi-agent-postgres
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_DB=multi_agent_system
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./scripts/init.sql:/docker-entrypoint-initdb.d/init.sql
    networks: [multi-agent-network]

  redis-exporter:
    image: oliver006/redis_exporter:v1.63.0
    container_name: multi-agent-redis-exporter
    environment:
      - REDIS_ADDR=redis://redis:6379
    ports:
      - "9121:9121"
    depends_on: [redis]
    networks: [multi-agent-network]

  postgres-exporter:
    image: quay.io/prometheuscommunity/postgres-exporter:v0.15.0
    container_name: multi-agent-postgres-exporter
    environment:
      - DATA_SOURCE_NAME=postgresql://postgres:postgres@postgres:5432/multi_agent_system?sslmode=disable
    ports:
      - "9187:9187"
    depends_on: [postgres]
    networks: [multi-agent-network]

  prometheus:
    image: prom/prometheus:latest
    container_name: multi-agent-prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    networks: [multi-agent-network]

  grafana:
    image: grafana/grafana:latest
    container_name: multi-agent-grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_METRICS_ENABLED=true
    volumes:
      - grafana_data:/var/lib/grafana
      - ./monitoring/grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./monitoring/grafana/datasources:/etc/grafana/provisioning/datasources
    networks: [multi-agent-network]

  nginx:
    image: nginx:alpine
    container_name: multi-agent-nginx
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - multi-agent-system
      - streamlit
      - api
    networks: [multi-agent-network]

networks:
  multi-agent-network:
    driver: bridge

volumes:
  postgres_data:
  prometheus_data:
  grafana_data:

Appendix C: Nginx reverse proxy (nginx/nginx.conf excerpt)

http {
  server {
    listen 80;
    server_name localhost;

    # API -> api:8000
    location /api/ {
      proxy_pass http://api:8000/;
      proxy_set_header Host $host;
      proxy_set_header X-Real-IP $remote_addr;
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header X-Forwarded-Proto $scheme;
    }

    # Streamlit -> streamlit:8501
    location /streamlit/ {
      proxy_pass http://streamlit:8501/;
      proxy_set_header Host $host;
      proxy_set_header X-Real-IP $remote_addr;
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header X-Forwarded-Proto $scheme;
    }
  }
}
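
Streamlit keeps a WebSocket connection open to the browser, so behind a reverse proxy the /streamlit/ location usually also needs the HTTP upgrade headers; the directives below follow standard Nginx WebSocket proxying and should be verified against your Streamlit version:

```nginx
location /streamlit/ {
  proxy_pass http://streamlit:8501/;
  proxy_http_version 1.1;
  # WebSocket upgrade headers required by Streamlit's live connection
  proxy_set_header Upgrade $http_upgrade;
  proxy_set_header Connection "upgrade";
  proxy_set_header Host $host;
}
```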

Appendix D: main-application metrics port (Python excerpt)

from prometheus_client import start_http_server

# Expose metrics on 9100
start_http_server(9100)

Appendix E: .env examples (pick one LLM)

OpenAI:

OPENAI_API_KEY=sk-your-key
LLM_PROVIDER=openai
LLM_MODEL=gpt-3.5-turbo

DeepSeek:

DEEPSEEK_API_KEY=ds-your-key
LLM_PROVIDER=deepseek
LLM_BASE_URL=https://api.deepseek.com/v1
LLM_MODEL=deepseek-chat

Anthropic:

ANTHROPIC_API_KEY=anth-your-key
LLM_PROVIDER=anthropic
LLM_MODEL=claude-3-5-sonnet-20240620

Appendix F: Grafana datasource and dashboard automation

  • Datasource (already mounted by Compose): monitoring/grafana/datasources/datasource.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
  • Dashboard provider: monitoring/grafana/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: 'Multi-Agent System'
    orgId: 1
    folder: 'Multi-Agent'
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /etc/grafana/provisioning/dashboards
  • Sample dashboard JSON: monitoring/grafana/dashboards/multi_agent_overview.json
{
  "title": "Multi-Agent System Overview",
  "schemaVersion": 38,
  "refresh": "10s",
  "panels": [
    {"type":"stat","title":"Targets UP","gridPos":{"x":0,"y":0,"w":6,"h":4},"datasource":"Prometheus","targets":[{"expr":"sum(up)"}]},
    {"type":"graph","title":"Prometheus Scrape Duration","gridPos":{"x":6,"y":0,"w":12,"h":8},"datasource":"Prometheus","targets":[{"expr":"scrape_duration_seconds{job=~\"multi-agent-system-metrics|redis|postgres|grafana\"}"}]},
    {"type":"table","title":"Jobs Status","gridPos":{"x":0,"y":4,"w":18,"h":8},"datasource":"Prometheus","targets":[{"expr":"up"}]}
  ]
}

Open Grafana (default admin/admin); the dashboard appears in the "Multi-Agent" folder.

Appendix G: /debug/llm endpoint example (FastAPI)

Added in api/server.py:

@app.get("/debug/llm")
async def debug_llm():
    """Return the detected LLM configuration and run a minimal connectivity self-test."""
    # Returns env/config/detected plus the probe result (reachable or not, error details, etc.)

Usage example:

curl -s http://localhost:8000/debug/llm | jq .

You should see detected.provider/model/base_url and probe.ok=true; on failure the endpoint returns the concrete error, which makes it easy to correct course quickly (e.g. provider/model mismatch, network unreachable).
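
The shape of that response can be sketched in plain Python; the masking helper and field names below are illustrative assumptions about what a safe diagnostic payload looks like, not the endpoint's actual implementation:

```python
from typing import Optional

def mask(secret: Optional[str]) -> Optional[str]:
    """Show only a key's prefix so the debug endpoint never leaks credentials."""
    if not secret:
        return None
    return secret[:6] + "***"

def debug_llm_payload(env: dict) -> dict:
    """Build the diagnostic payload: which keys are set, what was detected."""
    detected = env.get("LLM_PROVIDER") or (
        "openai" if env.get("OPENAI_API_KEY")
        else "deepseek" if env.get("DEEPSEEK_API_KEY")
        else "anthropic" if env.get("ANTHROPIC_API_KEY")
        else None
    )
    return {
        "env": {k: mask(env.get(k)) for k in
                ("OPENAI_API_KEY", "DEEPSEEK_API_KEY", "ANTHROPIC_API_KEY")},
        "detected": {"provider": detected,
                     "model": env.get("LLM_MODEL"),
                     "base_url": env.get("LLM_BASE_URL")},
        # probe is a live connectivity check in the real endpoint; omitted here
        "probe": {"ok": None},
    }

print(debug_llm_payload({"DEEPSEEK_API_KEY": "ds-abcdef123"}))
```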