Enterprise Multi-Agent System Deployment and Observability in Practice
A complete rollout and troubleshooting checklist for Docker Compose + FastAPI + Prometheus + Grafana + Nginx.
TL;DR
- Metrics moved to port 9100; the API has 8000 to itself.
- Redis/Postgres scraped via exporters; Prometheus targets corrected.
- New FastAPI API (/chat, /tasks, /analysis, /health, /metrics).
- Tasks persisted to Postgres, processed asynchronously in the background, queryable in real time.
- LLM provider auto-selected among OpenAI/DeepSeek/Anthropic, with fallback on failure.
- Windows/PowerShell unified on UTF-8; the server responds with application/json; charset=utf-8.
- Base image parameterized to AWS Public ECR to work around Docker Hub/apt access problems.
1. Background and Goals
- Three keys to running a multi-agent AI system in production: stable operation, observability, and fast troubleshooting.
- This article covers Compose orchestration, turning the app into an API service, metrics collection and visualization, LLM integration, and a self-service checklist for common failures.
2. Architecture and Service Roles
- multi-agent-system: main process (tracing, message bus, workflows, agents).
- api: FastAPI service (/chat, /tasks, /analysis, /health, /metrics).
- redis / postgres: cache and persistence; exporters expose HTTP metrics for them.
- prometheus / grafana: metrics collection and visualization.
- nginx: reverse proxy for /api/ and /streamlit/.
- streamlit: placeholder UI; jupyter: development environment (optional).
3. Monitoring and Scrape Fixes
- Main-application metrics: exposed on 9100; Prometheus target is multi-agent-system:9100.
- Exporters: redis-exporter:9121 for Redis, postgres-exporter:9187 for Postgres.
- Avoid scraping 8501 (Streamlit) when that service is not running.
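Once the targets are corrected, a quick way to confirm the HTTP metrics endpoints actually answer is a small probe script. This is an optional convenience, not part of the project; the localhost ports are assumptions based on the port mappings in Appendix B (the main app's 9100 is not published to the host, so check it from inside the Compose network).

```python
import urllib.request

# Host-side published ports (assumed from the Compose port mappings)
ENDPOINTS = [
    "http://localhost:9121/metrics",  # redis-exporter
    "http://localhost:9187/metrics",  # postgres-exporter
    "http://localhost:3000/metrics",  # grafana
]

def check(url: str) -> bool:
    """Return True if the endpoint answers HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

if __name__ == "__main__":
    for url in ENDPOINTS:
        print(f"{url}: {'UP' if check(url) else 'DOWN'}")
```

Anything reported DOWN here will also show as a failing target on http://localhost:9090/targets.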
4. API Service and Persistence
- Decouple /metrics from the business API: /metrics lives on 9100; /chat, /tasks, etc. live on 8000 (the api service).
- Tasks are written to Postgres with status processing and handled by a background coroutine; GET /tasks/{id} returns the current result in real time.
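The flow above can be sketched as follows. To keep the sketch self-contained, sqlite3 stands in for Postgres and the handlers are plain coroutines; in the real api/server.py they would be registered with @app.post("/tasks") and @app.get("/tasks/{task_id}") against the actual database, and the tasks table/column names here are illustrative.

```python
import asyncio
import sqlite3
import uuid

# sqlite3 stands in for Postgres so the sketch runs anywhere; the real
# service talks to the postgres container. Schema names are illustrative.
db = sqlite3.connect(":memory:", check_same_thread=False)
db.execute("CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT, result TEXT)")
_background = set()  # keep references so pending tasks are not garbage-collected

async def process_task(task_id: str, query: str) -> None:
    """Background worker: run the (here simulated) agent pipeline, then
    flip the row from 'processing' to 'completed'."""
    await asyncio.sleep(0)  # real work: retrieval, analysis, LLM calls ...
    db.execute("UPDATE tasks SET status = 'completed', result = ? WHERE id = ?",
               (f"processed: {query}", task_id))

async def create_task(payload: dict) -> dict:
    # POST /tasks: persist immediately with status 'processing', then hand
    # off to a background coroutine so the request returns right away.
    task_id = str(uuid.uuid4())
    db.execute("INSERT INTO tasks VALUES (?, 'processing', NULL)", (task_id,))
    task = asyncio.create_task(process_task(task_id, payload.get("query", "")))
    _background.add(task)
    task.add_done_callback(_background.discard)
    return {"task_id": task_id, "status": "processing"}

async def get_task(task_id: str) -> dict:
    # GET /tasks/{task_id}: read the current state straight from the database
    row = db.execute("SELECT status, result FROM tasks WHERE id = ?",
                     (task_id,)).fetchone()
    if row is None:
        return {"error": "not found"}
    return {"task_id": task_id, "status": row[0], "result": row[1]}
```

GET /tasks/{id} therefore reflects whatever the background coroutine has written so far, which is exactly the real-time query behaviour described above.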
5. LLM Integration and Fallback Strategy
- Provider auto-selected by key priority: OPENAI > DEEPSEEK > ANTHROPIC.
- With openai, base_url is ignored; deepseek requires LLM_BASE_URL=https://api.deepseek.com/v1.
- On failure, log a clear message and fall back to a placeholder result, so the system always returns stable output.
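A minimal sketch of that selection logic, assuming the environment variables from Appendices B/E; detect_provider and the PROVIDERS table are illustrative names, not the project's actual code.

```python
import os

# Key priority from the article: OPENAI > DEEPSEEK > ANTHROPIC.
# The third field is a default base_url (only deepseek needs one).
PROVIDERS = [
    ("openai", "OPENAI_API_KEY", None),
    ("deepseek", "DEEPSEEK_API_KEY", "https://api.deepseek.com/v1"),
    ("anthropic", "ANTHROPIC_API_KEY", None),
]

def detect_provider() -> dict:
    """Pick the first provider whose API key is set; an explicit
    LLM_PROVIDER value restricts the choice to that provider."""
    forced = os.environ.get("LLM_PROVIDER", "").strip().lower()
    for name, key_var, default_base_url in PROVIDERS:
        if not os.environ.get(key_var):
            continue
        if forced and name != forced:
            continue
        return {
            "provider": name,
            # openai ignores base_url entirely; others may override via env
            "base_url": None if name == "openai"
            else os.environ.get("LLM_BASE_URL") or default_base_url,
        }
    # No usable key: the caller logs this and falls back to placeholder output
    return {"provider": None, "base_url": None}
```

This is also the shape of mismatch the model_not_found entry in section 7 refers to: a provider chosen here must be paired with one of its own models.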
6. Windows/PowerShell Best Practices for Chinese Text
- Server side: respond uniformly with application/json; charset=utf-8.
- Client side: set [Console]::OutputEncoding = [System.Text.Encoding]::UTF8 before calling Invoke-RestMethod.
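On the server side the essential part is serializing without ASCII-escaping and declaring the charset explicitly. A stdlib sketch of both steps (in FastAPI the same effect can be had by returning a JSONResponse whose media_type is set to "application/json; charset=utf-8"):

```python
import json

def utf8_json_response(payload: dict) -> tuple[bytes, dict]:
    """Build a JSON body that keeps Chinese characters readable and a
    Content-Type header that tells the client how to decode it."""
    # ensure_ascii=False keeps 中文 as UTF-8 bytes instead of \uXXXX escapes
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    headers = {"Content-Type": "application/json; charset=utf-8"}
    return body, headers
```

Without the explicit charset, PowerShell may fall back to its default encoding and render the body as "????".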
7. Common Failures Quick Reference
- connection refused/EOF: fix the scrape targets; use the Redis/Postgres exporters; move the main metrics to 9100.
- 405 Method Not Allowed: do not serve /metrics on 8000; 8000 is for the API only.
- model_not_found: provider/model mismatch (OpenAI has no deepseek-chat).
- Docker Hub token errors / apt exit code 100: parameterize FROM to an ECR image, or pre-pull and re-tag.
- Chinese rendered as "????" in PowerShell: set the server charset plus client UTF-8.
8. One-Shot Verification Checklist
docker compose up -d --build api multi-agent-system prometheus grafana nginx redis postgres redis-exporter postgres-exporter
open http://localhost:9090/targets
open http://localhost:3000
open http://localhost:8000/health
open http://localhost:8000/docs
PowerShell example:
[Console]::OutputEncoding = [System.Text.Encoding]::UTF8
$task = @{ query="人工智能在医疗领域的应用前景"; priority="high"; agent_type="research" } | ConvertTo-Json
$c = Invoke-RestMethod -Method POST -Uri "http://localhost:8000/tasks" -ContentType "application/json; charset=utf-8" -Body $task
Start-Sleep -Seconds 3
Invoke-RestMethod -Method GET -Uri ("http://localhost:8000/tasks/{0}" -f $c.task_id)
9. Conclusion and Next Steps
- Switch the background processing to the real research_agent pipeline (retrieve, analyze, report).
- Add Grafana panels for task throughput, latency, error rate, and workflow metrics.
- Add /debug/llm and /agents endpoints; migrate on_event to lifespan to remove the deprecation warning.
Appendix A: prometheus.yml (recommended)
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Main application (Prometheus Python client exposes metrics on 9100)
  - job_name: 'multi-agent-system-metrics'
    metrics_path: /metrics
    static_configs:
      - targets: ['multi-agent-system:9100']

  # Redis metrics via the exporter
  - job_name: 'redis'
    metrics_path: /metrics
    static_configs:
      - targets: ['redis-exporter:9121']

  # Postgres metrics via the exporter
  - job_name: 'postgres'
    metrics_path: /metrics
    static_configs:
      - targets: ['postgres-exporter:9187']

  # Grafana ships its own /metrics
  - job_name: 'grafana'
    metrics_path: /metrics
    static_configs:
      - targets: ['grafana:3000']
Note: do not scrape redis:6379 or postgres:5432 directly; they are not HTTP metrics endpoints. Always scrape their exporters instead.
Appendix B: docker-compose.yml key sections (excerpt)
services:
  multi-agent-system:
    build:
      context: .
      dockerfile: Dockerfile
      args:
        PY_BASE_IMAGE: public.ecr.aws/docker/library/python:3.11-slim
    container_name: multi-agent-system
    environment:
      - PYTHONPATH=/app
      - REDIS_HOST=redis
      - POSTGRES_HOST=postgres
      - OPENAI_API_KEY=${OPENAI_API_KEY:-}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}
      - DEEPSEEK_API_KEY=${DEEPSEEK_API_KEY:-}
      - LLM_PROVIDER=${LLM_PROVIDER:-}
      - LLM_MODEL=${LLM_MODEL:-}
      - LLM_BASE_URL=${LLM_BASE_URL:-}
    volumes:
      - ./logs:/app/logs
      - ./data:/app/data
    networks:
      - multi-agent-network
    restart: unless-stopped
    # health check now probes 9100/metrics
    healthcheck:
      test: ["CMD-SHELL", "curl -fsS http://localhost:9100/metrics > /dev/null || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

  api:
    build:
      context: .
      dockerfile: Dockerfile
      args:
        PY_BASE_IMAGE: public.ecr.aws/docker/library/python:3.11-slim
    container_name: multi-agent-api
    command: ["python", "api/server.py"]
    ports:
      - "8000:8000"
    environment:
      - PYTHONPATH=/app
      - POSTGRES_HOST=postgres
      - POSTGRES_DB=multi_agent_system
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
      - OPENAI_API_KEY=${OPENAI_API_KEY:-}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}
      - DEEPSEEK_API_KEY=${DEEPSEEK_API_KEY:-}
      - LLM_PROVIDER=${LLM_PROVIDER:-}
      - LLM_MODEL=${LLM_MODEL:-}
      - LLM_BASE_URL=${LLM_BASE_URL:-}
    depends_on:
      - redis
      - postgres
    networks:
      - multi-agent-network
    restart: unless-stopped

  streamlit:
    build:
      context: .
      dockerfile: Dockerfile
      args:
        PY_BASE_IMAGE: public.ecr.aws/docker/library/python:3.11-slim
    container_name: multi-agent-streamlit
    command: ["streamlit", "run", "ui/streamlit_app.py", "--server.port", "8501", "--server.address", "0.0.0.0"]
    ports:
      - "8501:8501"
    networks:
      - multi-agent-network

  redis:
    image: redis:7-alpine
    container_name: multi-agent-redis
    ports:
      - "6379:6379"
    networks: [multi-agent-network]

  postgres:
    image: postgres:15-alpine
    container_name: multi-agent-postgres
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_DB=multi_agent_system
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./scripts/init.sql:/docker-entrypoint-initdb.d/init.sql
    networks: [multi-agent-network]

  redis-exporter:
    image: oliver006/redis_exporter:v1.63.0
    container_name: multi-agent-redis-exporter
    environment:
      - REDIS_ADDR=redis://redis:6379
    ports:
      - "9121:9121"
    depends_on: [redis]
    networks: [multi-agent-network]

  postgres-exporter:
    image: quay.io/prometheuscommunity/postgres-exporter:v0.15.0
    container_name: multi-agent-postgres-exporter
    environment:
      - DATA_SOURCE_NAME=postgresql://postgres:postgres@postgres:5432/multi_agent_system?sslmode=disable
    ports:
      - "9187:9187"
    depends_on: [postgres]
    networks: [multi-agent-network]

  prometheus:
    image: prom/prometheus:latest
    container_name: multi-agent-prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    networks: [multi-agent-network]

  grafana:
    image: grafana/grafana:latest
    container_name: multi-agent-grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_METRICS_ENABLED=true
    volumes:
      - grafana_data:/var/lib/grafana
      - ./monitoring/grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./monitoring/grafana/datasources:/etc/grafana/provisioning/datasources
    networks: [multi-agent-network]

  nginx:
    image: nginx:alpine
    container_name: multi-agent-nginx
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - multi-agent-system
      - streamlit
      - api
    networks: [multi-agent-network]

networks:
  multi-agent-network:
    driver: bridge

volumes:
  postgres_data:
  prometheus_data:
  grafana_data:
Appendix C: Nginx reverse proxy (nginx/nginx.conf excerpt)
http {
    server {
        listen 80;
        server_name localhost;

        # API -> api:8000
        location /api/ {
            proxy_pass http://api:8000/;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }

        # Streamlit -> streamlit:8501
        location /streamlit/ {
            proxy_pass http://streamlit:8501/;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }
    }
}
Appendix D: main-application metrics port (Python excerpt)
from prometheus_client import start_http_server

# Expose metrics on 9100 so they stay off the API's port 8000
start_http_server(9100)
Appendix E: .env examples (pick one of the three LLMs)
OpenAI:
OPENAI_API_KEY=sk-your-key
LLM_PROVIDER=openai
LLM_MODEL=gpt-3.5-turbo

DeepSeek:
DEEPSEEK_API_KEY=ds-your-key
LLM_PROVIDER=deepseek
LLM_BASE_URL=https://api.deepseek.com/v1
LLM_MODEL=deepseek-chat

Anthropic:
ANTHROPIC_API_KEY=anth-your-key
LLM_PROVIDER=anthropic
LLM_MODEL=claude-3-5-sonnet-20240620
Appendix F: Grafana datasource and dashboard provisioning
- Datasource (already mounted by Compose): monitoring/grafana/datasources/datasource.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
- Dashboard provider: monitoring/grafana/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: 'Multi-Agent System'
    orgId: 1
    folder: 'Multi-Agent'
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /etc/grafana/provisioning/dashboards
- Example dashboard JSON: monitoring/grafana/dashboards/multi_agent_overview.json
{
  "title": "Multi-Agent System Overview",
  "schemaVersion": 38,
  "refresh": "10s",
  "panels": [
    {"type": "stat", "title": "Targets UP", "gridPos": {"x": 0, "y": 0, "w": 6, "h": 4}, "datasource": "Prometheus", "targets": [{"expr": "sum(up)"}]},
    {"type": "graph", "title": "Prometheus Scrape Duration", "gridPos": {"x": 6, "y": 0, "w": 12, "h": 8}, "datasource": "Prometheus", "targets": [{"expr": "scrape_duration_seconds{job=~\"multi-agent-system-metrics|redis|postgres|grafana\"}"}]},
    {"type": "table", "title": "Jobs Status", "gridPos": {"x": 0, "y": 4, "w": 18, "h": 8}, "datasource": "Prometheus", "targets": [{"expr": "up"}]}
  ]
}
Open Grafana (default admin/admin); the dashboard appears in the "Multi-Agent" folder.
Appendix G: /debug/llm endpoint example (FastAPI)
Already added in api/server.py:
@app.get("/debug/llm")
async def debug_llm():
    """Return the detected LLM configuration and run a minimal connectivity self-test."""
    # Returns env/config/detected plus the probe result
    # (reachable or not, error details, etc.)
Usage:
curl -s http://localhost:8000/debug/llm | jq .
Expect detected.provider/model/base_url and probe.ok=true; on failure the response carries the concrete error (e.g. provider/model mismatch or network unreachable), which makes correction quick.

