# Enterprise Multi-Agent System Deployment and Observability: A Practical Guide
> A complete implementation and troubleshooting checklist covering Docker Compose, FastAPI, Prometheus, Grafana, and Nginx.
## Executive Summary
- Moved the application metrics port to 9100; the API service exclusively uses port 8000.
- Added exporters for Redis and Postgres and corrected the Prometheus scrape targets.
- Added new FastAPI endpoints (`/chat`, `/tasks`, `/analysis`, `/health`, `/metrics`).
- Tasks now persist to Postgres, with asynchronous background processing and real-time status queries.
- Automatic LLM provider selection (OpenAI/DeepSeek/Anthropic) with fallback on failure.
- Unified UTF-8 handling for Windows/PowerShell; the server responds with `application/json; charset=utf-8`.
- Parameterized base images to use AWS Public ECR, resolving Docker Hub and apt access issues.
## 1. Background and Objectives
- Deploying production multi-agent AI systems hinges on three critical factors: stable operation, observability, and rapid troubleshooting.
- This guide covers Compose orchestration, API service design, metrics collection and visualization, LLM integration, and diagnosis of common faults.
## 2. Architecture and Service Roles
- `multi-agent-system`: main process (handles tracing, message bus, workflows, agents).
- `api`: FastAPI service (serves `/chat`, `/tasks`, `/analysis`, `/health`, `/metrics`).
- `redis` / `postgres`: caching and persistence; exporters provide their HTTP metrics.
- `prometheus` / `grafana`: metrics collection and visualization.
- `nginx`: reverse proxy for `/api/` and `/streamlit/`.
- `streamlit`: placeholder UI; `jupyter`: optional development environment.
## 3. Monitoring and Scraping Fixes
- Main application metrics: exposed on port 9100; the Prometheus target is `multi-agent-system:9100`.
- Exporters: Redis uses `redis-exporter:9121`, Postgres uses `postgres-exporter:9187` (see Appendix A for the full `prometheus.yml`).
- Avoid scraping port 8501 (Streamlit) when that service is not running.
## 4. API Service Design and Persistence
- `/metrics` is decoupled from the business APIs: `/metrics` uses port 9100, while `/chat`, `/tasks`, etc. use port 8000 (the API service).
- Tasks are written to Postgres with status `processing`, background coroutines process them, and `GET /tasks/{id}` returns results in real time; see the sketch after this list.
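A minimal sketch of this pattern using FastAPI's `BackgroundTasks` and `asyncpg`. It assumes a `tasks` table with `id`, `status`, `query`, and `result` columns (the real `api/server.py` and `scripts/init.sql` may differ):

```python
import os
import uuid
from contextlib import asynccontextmanager

import asyncpg
from fastapi import BackgroundTasks, FastAPI
from pydantic import BaseModel

@asynccontextmanager
async def lifespan(app: FastAPI):
    # One shared connection pool for the app's lifetime.
    app.state.pool = await asyncpg.create_pool(
        host=os.getenv("POSTGRES_HOST", "postgres"),
        database=os.getenv("POSTGRES_DB", "multi_agent_system"),
        user=os.getenv("POSTGRES_USER", "postgres"),
        password=os.getenv("POSTGRES_PASSWORD", "postgres"),
    )
    yield
    await app.state.pool.close()

app = FastAPI(lifespan=lifespan)

class TaskRequest(BaseModel):
    query: str
    priority: str = "normal"
    agent_type: str = "research"

async def process_task(task_id: str, req: TaskRequest) -> None:
    # Placeholder for the real research_agent workflow (retrieve-analyze-report).
    result = f"analysis of: {req.query}"
    await app.state.pool.execute(
        "UPDATE tasks SET status = 'completed', result = $1 WHERE id = $2",
        result, task_id,
    )

@app.post("/tasks")
async def create_task(req: TaskRequest, background: BackgroundTasks):
    task_id = str(uuid.uuid4())
    await app.state.pool.execute(
        "INSERT INTO tasks (id, status, query) VALUES ($1, 'processing', $2)",
        task_id, req.query,
    )
    background.add_task(process_task, task_id, req)  # runs after the response is sent
    return {"task_id": task_id, "status": "processing"}

@app.get("/tasks/{task_id}")
async def get_task(task_id: str):
    row = await app.state.pool.fetchrow(
        "SELECT status, result FROM tasks WHERE id = $1", task_id
    )
    return {"task_id": task_id, "status": row["status"], "result": row["result"]}
```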
## 5. LLM Integration and Fallback Strategy
- The provider is selected automatically by key priority: OPENAI > DEEPSEEK > ANTHROPIC.
- For OpenAI, ignore `base_url`; for DeepSeek, set `LLM_BASE_URL=https://api.deepseek.com/v1`.
- On failure, log a clear message and return a placeholder result so the system's output stays stable; a selection sketch follows this list.
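A sketch of the priority-based selection, using the environment variables from Appendix B and the defaults from Appendix E (the real detection logic may differ):

```python
import os

def detect_llm_provider() -> dict:
    """Pick a provider by key priority: OPENAI > DEEPSEEK > ANTHROPIC."""
    if os.getenv("LLM_PROVIDER"):  # an explicit override wins
        provider = os.environ["LLM_PROVIDER"]
    elif os.getenv("OPENAI_API_KEY"):
        provider = "openai"
    elif os.getenv("DEEPSEEK_API_KEY"):
        provider = "deepseek"
    elif os.getenv("ANTHROPIC_API_KEY"):
        provider = "anthropic"
    else:
        return {"provider": None}  # caller falls back to placeholder output

    defaults = {
        "openai": {"model": "gpt-3.5-turbo", "base_url": None},  # ignore base_url for OpenAI
        "deepseek": {"model": "deepseek-chat", "base_url": "https://api.deepseek.com/v1"},
        "anthropic": {"model": "claude-3-5-sonnet-20240620", "base_url": None},
    }
    return {
        "provider": provider,
        "model": os.getenv("LLM_MODEL") or defaults[provider]["model"],
        "base_url": os.getenv("LLM_BASE_URL") or defaults[provider]["base_url"],
    }
```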
## 6. Best Practices for Windows/PowerShell and Chinese Text
- Server side: uniformly use `application/json; charset=utf-8` (one way to enforce this is sketched below).
- Client side (PowerShell): `[Console]::OutputEncoding = [System.Text.Encoding]::UTF8` combined with `Invoke-RestMethod`.
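One way to enforce the charset on every FastAPI response is a custom default response class; a minimal sketch (the real server may set headers differently):

```python
from fastapi import FastAPI
from fastapi.responses import JSONResponse

class UTF8JSONResponse(JSONResponse):
    # Declare an explicit charset so PowerShell decodes Chinese text correctly.
    media_type = "application/json; charset=utf-8"

app = FastAPI(default_response_class=UTF8JSONResponse)

@app.get("/health")
async def health():
    return {"status": "ok", "message": "服务正常"}
```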
## 7. Common Issues Quick Reference
- Connection refused / EOF: correct the scrape targets, use the Redis/Postgres exporters, and move the main metrics port to 9100.
- 405 Method Not Allowed: do not host `/metrics` on port 8000; reserve 8000 exclusively for API endpoints.
- Model not found: provider and model mismatch (e.g., OpenAI has no `deepseek-chat` model).
- Docker Hub token / apt error 100: parameterize `FROM` to use ECR images (see the Dockerfile sketch after this list), or pre-pull and retag the images.
- PowerShell shows Chinese text as "????": ensure the server sends a `charset` and the client uses UTF-8 encoding.
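A sketch of the `FROM` parameterization that pairs with the `PY_BASE_IMAGE` build arg in Appendix B; the real Dockerfile likely has additional steps:

```dockerfile
# Default to AWS Public ECR's Python mirror; override via --build-arg or Compose args.
ARG PY_BASE_IMAGE=public.ecr.aws/docker/library/python:3.11-slim
FROM ${PY_BASE_IMAGE}

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
```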
## 8. One-Click Validation Checklist
```bash
docker compose up -d --build api multi-agent-system prometheus grafana nginx redis postgres redis-exporter postgres-exporter
open http://localhost:9090/targets
open http://localhost:3000
open http://localhost:8000/health
open http://localhost:8000/docs
```

PowerShell example:

```powershell
[Console]::OutputEncoding = [System.Text.Encoding]::UTF8
$task = @{ query="The application prospects of artificial intelligence in the medical field"; priority="high"; agent_type="research" } | ConvertTo-Json
$c = Invoke-RestMethod -Method POST -Uri "http://localhost:8000/tasks" -ContentType "application/json; charset=utf-8" -Body $task
Start-Sleep -Seconds 3
Invoke-RestMethod -Method GET -Uri ("http://localhost:8000/tasks/{0}" -f $c.task_id)
```
## 9. Conclusion and Next Steps
- Replace the background processing with the actual `research_agent` workflow (retrieve-analyze-report).
- Add Grafana dashboards for task throughput, latency, error rates, and workflow metrics.
- Add `/debug/llm` and `/agents` endpoints; replace `on_event` with `lifespan` to eliminate deprecation warnings (see the sketch after this list).
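A minimal sketch of the `on_event`-to-`lifespan` migration; the startup and shutdown bodies are placeholders:

```python
from contextlib import asynccontextmanager
from fastapi import FastAPI

# Before (deprecated, emits a warning):
#   @app.on_event("startup")
#   async def startup(): ...

@asynccontextmanager
async def lifespan(app: FastAPI):
    # startup logic (e.g., open the Postgres pool) goes here
    yield
    # shutdown logic (e.g., close the pool) goes here

app = FastAPI(lifespan=lifespan)
```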
## Appendix A: Recommended prometheus.yml
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Main application (Prometheus Python client exposes on 9100)
  - job_name: 'multi-agent-system-metrics'
    metrics_path: /metrics
    static_configs:
      - targets: ['multi-agent-system:9100']

  # Redis metrics via exporter
  - job_name: 'redis'
    metrics_path: /metrics
    static_configs:
      - targets: ['redis-exporter:9121']

  # Postgres metrics via exporter
  - job_name: 'postgres'
    metrics_path: /metrics
    static_configs:
      - targets: ['postgres-exporter:9187']

  # Grafana's built-in /metrics
  - job_name: 'grafana'
    metrics_path: /metrics
    static_configs:
      - targets: ['grafana:3000']
```

> Note: Do not scrape `redis:6379` and `postgres:5432` directly; they are not HTTP metrics endpoints. You must scrape their respective exporters.
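Optionally, the file can be validated before reloading; `promtool` ships inside the `prom/prometheus` image, so one way (assuming the container is up) is:

```bash
docker compose exec prometheus promtool check config /etc/prometheus/prometheus.yml
```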
## Appendix B: Key docker-compose.yml Excerpts
```yaml
services:
  multi-agent-system:
    build:
      context: .
      dockerfile: Dockerfile
      args:
        PY_BASE_IMAGE: public.ecr.aws/docker/library/python:3.11-slim
    container_name: multi-agent-system
    environment:
      - PYTHONPATH=/app
      - REDIS_HOST=redis
      - POSTGRES_HOST=postgres
      - OPENAI_API_KEY=${OPENAI_API_KEY:-}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}
      - DEEPSEEK_API_KEY=${DEEPSEEK_API_KEY:-}
      - LLM_PROVIDER=${LLM_PROVIDER:-}
      - LLM_MODEL=${LLM_MODEL:-}
      - LLM_BASE_URL=${LLM_BASE_URL:-}
    volumes:
      - ./logs:/app/logs
      - ./data:/app/data
    networks:
      - multi-agent-network
    restart: unless-stopped
    # Health check updated to probe 9100/metrics
    healthcheck:
      test: ["CMD-SHELL", "curl -fsS http://localhost:9100/metrics > /dev/null || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

  api:
    build:
      context: .
      dockerfile: Dockerfile
      args:
        PY_BASE_IMAGE: public.ecr.aws/docker/library/python:3.11-slim
    container_name: multi-agent-api
    command: ["python", "api/server.py"]
    ports:
      - "8000:8000"
    environment:
      - PYTHONPATH=/app
      - POSTGRES_HOST=postgres
      - POSTGRES_DB=multi_agent_system
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
      - OPENAI_API_KEY=${OPENAI_API_KEY:-}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}
      - DEEPSEEK_API_KEY=${DEEPSEEK_API_KEY:-}
      - LLM_PROVIDER=${LLM_PROVIDER:-}
      - LLM_MODEL=${LLM_MODEL:-}
      - LLM_BASE_URL=${LLM_BASE_URL:-}
    depends_on:
      - redis
      - postgres
    networks:
      - multi-agent-network
    restart: unless-stopped

  streamlit:
    build:
      context: .
      dockerfile: Dockerfile
      args:
        PY_BASE_IMAGE: public.ecr.aws/docker/library/python:3.11-slim
    container_name: multi-agent-streamlit
    command: ["streamlit", "run", "ui/streamlit_app.py", "--server.port", "8501", "--server.address", "0.0.0.0"]
    ports:
      - "8501:8501"
    networks:
      - multi-agent-network

  redis:
    image: redis:7-alpine
    container_name: multi-agent-redis
    ports:
      - "6379:6379"
    networks: [multi-agent-network]

  postgres:
    image: postgres:15-alpine
    container_name: multi-agent-postgres
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_DB=multi_agent_system
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./scripts/init.sql:/docker-entrypoint-initdb.d/init.sql
    networks: [multi-agent-network]

  redis-exporter:
    image: oliver006/redis_exporter:v1.63.0
    container_name: multi-agent-redis-exporter
    environment:
      - REDIS_ADDR=redis://redis:6379
    ports:
      - "9121:9121"
    depends_on: [redis]
    networks: [multi-agent-network]

  postgres-exporter:
    image: quay.io/prometheuscommunity/postgres-exporter:v0.15.0
    container_name: multi-agent-postgres-exporter
    environment:
      - DATA_SOURCE_NAME=postgresql://postgres:postgres@postgres:5432/multi_agent_system?sslmode=disable
    ports:
      - "9187:9187"
    depends_on: [postgres]
    networks: [multi-agent-network]

  prometheus:
    image: prom/prometheus:latest
    container_name: multi-agent-prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    networks: [multi-agent-network]

  grafana:
    image: grafana/grafana:latest
    container_name: multi-agent-grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_METRICS_ENABLED=true
    volumes:
      - grafana_data:/var/lib/grafana
      - ./monitoring/grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./monitoring/grafana/datasources:/etc/grafana/provisioning/datasources
    networks: [multi-agent-network]

  nginx:
    image: nginx:alpine
    container_name: multi-agent-nginx
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - multi-agent-system
      - streamlit
      - api
    networks: [multi-agent-network]

networks:
  multi-agent-network:
    driver: bridge

volumes:
  postgres_data:
  prometheus_data:
  grafana_data:
```
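After edits, the merged configuration can be checked without starting anything:

```bash
docker compose config --quiet && echo "compose file OK"
```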
## Appendix C: Nginx Reverse Proxy (nginx/nginx.conf Excerpt)
```nginx
http {
    server {
        listen 80;
        server_name localhost;

        # API -> api:8000
        location /api/ {
            proxy_pass http://api:8000/;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }

        # Streamlit -> streamlit:8501
        location /streamlit/ {
            proxy_pass http://streamlit:8501/;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }
    }
}
```
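Streamlit relies on a WebSocket connection; if the proxied UI loads but stays blank, the `/streamlit/` location likely also needs the upgrade headers. A sketch, not verified against this deployment:

```nginx
location /streamlit/ {
    proxy_pass http://streamlit:8501/;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
}
```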
## Appendix D: Main Application Metrics Port (Python Excerpt)
```python
from prometheus_client import start_http_server

# Expose metrics on port 9100
start_http_server(9100)
```
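For application-level metrics beyond the client's defaults, counters and histograms can be registered alongside it. A minimal sketch; the metric names here are illustrative, not the project's actual ones:

```python
from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names; align them with your Grafana dashboards.
TASKS_TOTAL = Counter("agent_tasks_total", "Tasks processed", ["agent_type", "status"])
TASK_LATENCY = Histogram("agent_task_duration_seconds", "Task processing time")

start_http_server(9100)  # serves /metrics from a background thread

with TASK_LATENCY.time():
    ...  # run a task here
TASKS_TOTAL.labels(agent_type="research", status="completed").inc()
```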
## Appendix E: .env Example (Choose One LLM)
OpenAI:

```env
OPENAI_API_KEY=sk-your-key-here
LLM_PROVIDER=openai
LLM_MODEL=gpt-3.5-turbo
```

DeepSeek:

```env
DEEPSEEK_API_KEY=ds-your-key-here
LLM_PROVIDER=deepseek
LLM_BASE_URL=https://api.deepseek.com/v1
LLM_MODEL=deepseek-chat
```

Anthropic:

```env
ANTHROPIC_API_KEY=anth-your-key-here
LLM_PROVIDER=anthropic
LLM_MODEL=claude-3-5-sonnet-20240620
```
## Appendix F: Grafana Data Source and Dashboard Automation
- Data source (mounted via Compose): `monitoring/grafana/datasources/datasource.yml`

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
```

- Dashboard provider: `monitoring/grafana/dashboards/dashboards.yml`

```yaml
apiVersion: 1
providers:
  - name: 'Multi-Agent System'
    orgId: 1
    folder: 'Multi-Agent'
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /etc/grafana/provisioning/dashboards
```

- Example dashboard JSON: `monitoring/grafana/dashboards/multi_agent_overview.json`

```json
{
  "title": "Multi-Agent System Overview",
  "schemaVersion": 38,
  "refresh": "10s",
  "panels": [
    {"type": "stat", "title": "Targets UP", "gridPos": {"x": 0, "y": 0, "w": 6, "h": 4}, "datasource": "Prometheus", "targets": [{"expr": "sum(up)"}]},
    {"type": "graph", "title": "Prometheus Scrape Duration", "gridPos": {"x": 6, "y": 0, "w": 12, "h": 8}, "datasource": "Prometheus", "targets": [{"expr": "scrape_duration_seconds{job=~\"multi-agent-system-metrics|redis|postgres|grafana\"}"}]},
    {"type": "table", "title": "Jobs Status", "gridPos": {"x": 0, "y": 4, "w": 18, "h": 8}, "datasource": "Prometheus", "targets": [{"expr": "up"}]}
  ]
}
```

> Open Grafana (default `admin`/`admin`) to see the dashboard in the "Multi-Agent" folder.
## Appendix G: /debug/llm Endpoint Example (FastAPI)
Added to `api/server.py` (the helpers here are placeholders; `detect_llm_provider` matches the selection sketch in Section 5, and `probe_llm` stands in for whatever minimal connectivity check the server performs):

```python
@app.get("/debug/llm")
async def debug_llm():
    """Return the detected LLM configuration and run a minimal connectivity self-test."""
    detected = detect_llm_provider()   # e.g., the selection helper sketched in Section 5
    probe = await probe_llm(detected)  # hypothetical helper returning {"ok": bool, "error": ...}
    # Expose env/config/detected values plus probe results (connectivity status, error messages, etc.)
    return {"detected": detected, "probe": probe}
```
Usage example:

```bash
curl -s http://localhost:8000/debug/llm | jq .
```

> Expected output includes `detected.provider/model/base_url` and `probe.ok=true`. On failure, it returns a specific error message for quick debugging (e.g., provider-model mismatch, network unreachable).
