# Enterprise Multi-Agent System Deployment and Observability: A Practical Guide
> A complete implementation and troubleshooting checklist covering Docker Compose, FastAPI, Prometheus, Grafana, and Nginx.
## Executive Summary
- Moved the application metrics port to 9100; the API service exclusively uses port 8000.
- Added exporters for Redis and Postgres and corrected the Prometheus scrape targets.
- Added new FastAPI endpoints (`/chat`, `/tasks`, `/analysis`, `/health`, `/metrics`).
- Tasks now persist to Postgres, with asynchronous background processing and real-time status queries.
- Automatic LLM provider selection (OpenAI/DeepSeek/Anthropic) with fallback on failure.
- Unified UTF-8 handling for Windows/PowerShell; the server responds with `application/json; charset=utf-8`.
- Parameterized base images to use AWS Public ECR, resolving Docker Hub and apt access issues.
## 1. Background and Objectives
- Deploying production multi-agent AI systems hinges on three critical factors: stable operation, observability, and rapid troubleshooting.
- This guide covers Compose orchestration, API service design, metrics collection and visualization, LLM integration, and diagnosis of common faults.
## 2. Architecture and Service Roles
- `multi-agent-system`: main process (handles tracing, message bus, workflows, agents).
- `api`: FastAPI service (serves `/chat`, `/tasks`, `/analysis`, `/health`, `/metrics`).
- `redis` / `postgres`: caching and persistence; exporters provide their HTTP metrics.
- `prometheus` / `grafana`: metrics collection and visualization.
- `nginx`: reverse proxy for `/api/` and `/streamlit/`.
- `streamlit`: placeholder UI; `jupyter`: optional development environment.
## 3. Monitoring and Scraping Fixes
- Main application metrics: exposed on port 9100; the Prometheus target is `multi-agent-system:9100`.
- Exporters: Redis uses `redis-exporter:9121`, Postgres uses `postgres-exporter:9187` (see Appendix A for the full `prometheus.yml`).
- Avoid scraping port 8501 (Streamlit) when that service is not running.
## 4. API Service Design and Persistence
- `/metrics` is decoupled from the business APIs: `/metrics` uses port 9100, while `/chat`, `/tasks`, etc. use port 8000 (the API service).
- Tasks are written to Postgres with status `processing`, background coroutines process them, and `GET /tasks/{id}` returns results in real time; see the sketch after this list.
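A minimal sketch of this pattern using FastAPI's `BackgroundTasks` and `asyncpg`. It assumes a `tasks` table with `id`, `status`, `query`, and `result` columns (the real `api/server.py` and `scripts/init.sql` may differ):

```python
import os
import uuid
from contextlib import asynccontextmanager

import asyncpg
from fastapi import BackgroundTasks, FastAPI
from pydantic import BaseModel

@asynccontextmanager
async def lifespan(app: FastAPI):
    # One shared connection pool for the app's lifetime.
    app.state.pool = await asyncpg.create_pool(
        host=os.getenv("POSTGRES_HOST", "postgres"),
        database=os.getenv("POSTGRES_DB", "multi_agent_system"),
        user=os.getenv("POSTGRES_USER", "postgres"),
        password=os.getenv("POSTGRES_PASSWORD", "postgres"),
    )
    yield
    await app.state.pool.close()

app = FastAPI(lifespan=lifespan)

class TaskRequest(BaseModel):
    query: str
    priority: str = "normal"
    agent_type: str = "research"

async def process_task(task_id: str, req: TaskRequest) -> None:
    # Placeholder for the real research_agent workflow (retrieve-analyze-report).
    result = f"analysis of: {req.query}"
    await app.state.pool.execute(
        "UPDATE tasks SET status = 'completed', result = $1 WHERE id = $2",
        result, task_id,
    )

@app.post("/tasks")
async def create_task(req: TaskRequest, background: BackgroundTasks):
    task_id = str(uuid.uuid4())
    await app.state.pool.execute(
        "INSERT INTO tasks (id, status, query) VALUES ($1, 'processing', $2)",
        task_id, req.query,
    )
    background.add_task(process_task, task_id, req)  # runs after the response is sent
    return {"task_id": task_id, "status": "processing"}

@app.get("/tasks/{task_id}")
async def get_task(task_id: str):
    row = await app.state.pool.fetchrow(
        "SELECT status, result FROM tasks WHERE id = $1", task_id
    )
    return {"task_id": task_id, "status": row["status"], "result": row["result"]}
```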
## 5. LLM Integration and Fallback Strategy
- The provider is selected automatically by key priority: OPENAI > DEEPSEEK > ANTHROPIC.
- For OpenAI, ignore `base_url`; for DeepSeek, set `LLM_BASE_URL=https://api.deepseek.com/v1`.
- On failure, log a clear message and return a placeholder result so the system's output stays stable; a selection sketch follows this list.
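A sketch of the priority-based selection, using the environment variables from Appendix B and the defaults from Appendix E (the real detection logic may differ):

```python
import os

def detect_llm_provider() -> dict:
    """Pick a provider by key priority: OPENAI > DEEPSEEK > ANTHROPIC."""
    if os.getenv("LLM_PROVIDER"):  # an explicit override wins
        provider = os.environ["LLM_PROVIDER"]
    elif os.getenv("OPENAI_API_KEY"):
        provider = "openai"
    elif os.getenv("DEEPSEEK_API_KEY"):
        provider = "deepseek"
    elif os.getenv("ANTHROPIC_API_KEY"):
        provider = "anthropic"
    else:
        return {"provider": None}  # caller falls back to placeholder output

    defaults = {
        "openai": {"model": "gpt-3.5-turbo", "base_url": None},  # ignore base_url for OpenAI
        "deepseek": {"model": "deepseek-chat", "base_url": "https://api.deepseek.com/v1"},
        "anthropic": {"model": "claude-3-5-sonnet-20240620", "base_url": None},
    }
    return {
        "provider": provider,
        "model": os.getenv("LLM_MODEL") or defaults[provider]["model"],
        "base_url": os.getenv("LLM_BASE_URL") or defaults[provider]["base_url"],
    }
```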
## 6. Best Practices for Windows/PowerShell and Chinese Text
- Server side: uniformly use `application/json; charset=utf-8` (one way to enforce this is sketched below).
- Client side (PowerShell): `[Console]::OutputEncoding = [System.Text.Encoding]::UTF8` combined with `Invoke-RestMethod`.
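One way to enforce the charset on every FastAPI response is a custom default response class; a minimal sketch (the real server may set headers differently):

```python
from fastapi import FastAPI
from fastapi.responses import JSONResponse

class UTF8JSONResponse(JSONResponse):
    # Declare an explicit charset so PowerShell decodes Chinese text correctly.
    media_type = "application/json; charset=utf-8"

app = FastAPI(default_response_class=UTF8JSONResponse)

@app.get("/health")
async def health():
    return {"status": "ok", "message": "服务正常"}
```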
## 7. Common Issues Quick Reference
- Connection refused / EOF: correct the scrape targets, use the Redis/Postgres exporters, and move the main metrics port to 9100.
- 405 Method Not Allowed: do not host `/metrics` on port 8000; reserve 8000 exclusively for API endpoints.
- Model not found: provider and model mismatch (e.g., OpenAI has no `deepseek-chat` model).
- Docker Hub token / apt error 100: parameterize `FROM` to use ECR images (see the Dockerfile sketch after this list), or pre-pull and retag the images.
- PowerShell shows Chinese text as "????": ensure the server sends a `charset` and the client uses UTF-8 encoding.
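A sketch of the `FROM` parameterization that pairs with the `PY_BASE_IMAGE` build arg in Appendix B; the real Dockerfile likely has additional steps:

```dockerfile
# Default to AWS Public ECR's Python mirror; override via --build-arg or Compose args.
ARG PY_BASE_IMAGE=public.ecr.aws/docker/library/python:3.11-slim
FROM ${PY_BASE_IMAGE}

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
```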
## 8. One-Click Validation Checklist
```bash
docker compose up -d --build api multi-agent-system prometheus grafana nginx redis postgres redis-exporter postgres-exporter
open http://localhost:9090/targets
open http://localhost:3000
open http://localhost:8000/health
open http://localhost:8000/docs
```

PowerShell example:

```powershell
[Console]::OutputEncoding = [System.Text.Encoding]::UTF8
$task = @{ query="The application prospects of artificial intelligence in the medical field"; priority="high"; agent_type="research" } | ConvertTo-Json
$c = Invoke-RestMethod -Method POST -Uri "http://localhost:8000/tasks" -ContentType "application/json; charset=utf-8" -Body $task
Start-Sleep -Seconds 3
Invoke-RestMethod -Method GET -Uri ("http://localhost:8000/tasks/{0}" -f $c.task_id)
```
## 9. Conclusion and Next Steps
- Replace the background processing with the actual `research_agent` workflow (retrieve-analyze-report).
- Add Grafana dashboards for task throughput, latency, error rates, and workflow metrics.
- Add `/debug/llm` and `/agents` endpoints; replace `on_event` with `lifespan` to eliminate deprecation warnings (see the sketch after this list).
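A minimal sketch of the `on_event`-to-`lifespan` migration; the startup and shutdown bodies are placeholders:

```python
from contextlib import asynccontextmanager
from fastapi import FastAPI

# Before (deprecated, emits a warning):
#   @app.on_event("startup")
#   async def startup(): ...

@asynccontextmanager
async def lifespan(app: FastAPI):
    # startup logic (e.g., open the Postgres pool) goes here
    yield
    # shutdown logic (e.g., close the pool) goes here

app = FastAPI(lifespan=lifespan)
```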
## Appendix A: Recommended prometheus.yml
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Main application (Prometheus Python client exposes on 9100)
  - job_name: 'multi-agent-system-metrics'
    metrics_path: /metrics
    static_configs:
      - targets: ['multi-agent-system:9100']

  # Redis metrics via exporter
  - job_name: 'redis'
    metrics_path: /metrics
    static_configs:
      - targets: ['redis-exporter:9121']

  # Postgres metrics via exporter
  - job_name: 'postgres'
    metrics_path: /metrics
    static_configs:
      - targets: ['postgres-exporter:9187']

  # Grafana's built-in /metrics
  - job_name: 'grafana'
    metrics_path: /metrics
    static_configs:
      - targets: ['grafana:3000']
```

> Note: Do not scrape `redis:6379` and `postgres:5432` directly; they are not HTTP metrics endpoints. You must scrape their respective exporters.
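Optionally, the file can be validated before reloading; `promtool` ships inside the `prom/prometheus` image, so one way (assuming the container is up) is:

```bash
docker compose exec prometheus promtool check config /etc/prometheus/prometheus.yml
```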
## Appendix B: Key docker-compose.yml Excerpts
```yaml
services:
  multi-agent-system:
    build:
      context: .
      dockerfile: Dockerfile
      args:
        PY_BASE_IMAGE: public.ecr.aws/docker/library/python:3.11-slim
    container_name: multi-agent-system
    environment:
      - PYTHONPATH=/app
      - REDIS_HOST=redis
      - POSTGRES_HOST=postgres
      - OPENAI_API_KEY=${OPENAI_API_KEY:-}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}
      - DEEPSEEK_API_KEY=${DEEPSEEK_API_KEY:-}
      - LLM_PROVIDER=${LLM_PROVIDER:-}
      - LLM_MODEL=${LLM_MODEL:-}
      - LLM_BASE_URL=${LLM_BASE_URL:-}
    volumes:
      - ./logs:/app/logs
      - ./data:/app/data
    networks:
      - multi-agent-network
    restart: unless-stopped
    # Health check updated to probe 9100/metrics
    healthcheck:
      test: ["CMD-SHELL", "curl -fsS http://localhost:9100/metrics > /dev/null || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

  api:
    build:
      context: .
      dockerfile: Dockerfile
      args:
        PY_BASE_IMAGE: public.ecr.aws/docker/library/python:3.11-slim
    container_name: multi-agent-api
    command: ["python", "api/server.py"]
    ports:
      - "8000:8000"
    environment:
      - PYTHONPATH=/app
      - POSTGRES_HOST=postgres
      - POSTGRES_DB=multi_agent_system
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
      - OPENAI_API_KEY=${OPENAI_API_KEY:-}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}
      - DEEPSEEK_API_KEY=${DEEPSEEK_API_KEY:-}
      - LLM_PROVIDER=${LLM_PROVIDER:-}
      - LLM_MODEL=${LLM_MODEL:-}
      - LLM_BASE_URL=${LLM_BASE_URL:-}
    depends_on:
      - redis
      - postgres
    networks:
      - multi-agent-network
    restart: unless-stopped

  streamlit:
    build:
      context: .
      dockerfile: Dockerfile
      args:
        PY_BASE_IMAGE: public.ecr.aws/docker/library/python:3.11-slim
    container_name: multi-agent-streamlit
    command: ["streamlit", "run", "ui/streamlit_app.py", "--server.port", "8501", "--server.address", "0.0.0.0"]
    ports:
      - "8501:8501"
    networks:
      - multi-agent-network

  redis:
    image: redis:7-alpine
    container_name: multi-agent-redis
    ports:
      - "6379:6379"
    networks: [multi-agent-network]

  postgres:
    image: postgres:15-alpine
    container_name: multi-agent-postgres
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_DB=multi_agent_system
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./scripts/init.sql:/docker-entrypoint-initdb.d/init.sql
    networks: [multi-agent-network]

  redis-exporter:
    image: oliver006/redis_exporter:v1.63.0
    container_name: multi-agent-redis-exporter
    environment:
      - REDIS_ADDR=redis://redis:6379
    ports:
      - "9121:9121"
    depends_on: [redis]
    networks: [multi-agent-network]

  postgres-exporter:
    image: quay.io/prometheuscommunity/postgres-exporter:v0.15.0
    container_name: multi-agent-postgres-exporter
    environment:
      - DATA_SOURCE_NAME=postgresql://postgres:postgres@postgres:5432/multi_agent_system?sslmode=disable
    ports:
      - "9187:9187"
    depends_on: [postgres]
    networks: [multi-agent-network]

  prometheus:
    image: prom/prometheus:latest
    container_name: multi-agent-prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    networks: [multi-agent-network]

  grafana:
    image: grafana/grafana:latest
    container_name: multi-agent-grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_METRICS_ENABLED=true
    volumes:
      - grafana_data:/var/lib/grafana
      - ./monitoring/grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./monitoring/grafana/datasources:/etc/grafana/provisioning/datasources
    networks: [multi-agent-network]

  nginx:
    image: nginx:alpine
    container_name: multi-agent-nginx
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - multi-agent-system
      - streamlit
      - api
    networks: [multi-agent-network]

networks:
  multi-agent-network:
    driver: bridge

volumes:
  postgres_data:
  prometheus_data:
  grafana_data:
```
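After edits, the merged configuration can be checked without starting anything:

```bash
docker compose config --quiet && echo "compose file OK"
```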
## Appendix C: Nginx Reverse Proxy (nginx/nginx.conf Excerpt)
```nginx
http {
    server {
        listen 80;
        server_name localhost;

        # API -> api:8000
        location /api/ {
            proxy_pass http://api:8000/;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }

        # Streamlit -> streamlit:8501
        location /streamlit/ {
            proxy_pass http://streamlit:8501/;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }
    }
}
```
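Streamlit relies on a WebSocket connection; if the proxied UI loads but stays blank, the `/streamlit/` location likely also needs the upgrade headers. A sketch, not verified against this deployment:

```nginx
location /streamlit/ {
    proxy_pass http://streamlit:8501/;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
}
```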
## Appendix D: Main Application Metrics Port (Python Excerpt)
```python
from prometheus_client import start_http_server

# Expose metrics on port 9100
start_http_server(9100)
```
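For application-level metrics beyond the client's defaults, counters and histograms can be registered alongside it. A minimal sketch; the metric names here are illustrative, not the project's actual ones:

```python
from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names; align them with your Grafana dashboards.
TASKS_TOTAL = Counter("agent_tasks_total", "Tasks processed", ["agent_type", "status"])
TASK_LATENCY = Histogram("agent_task_duration_seconds", "Task processing time")

start_http_server(9100)  # serves /metrics from a background thread

with TASK_LATENCY.time():
    ...  # run a task here
TASKS_TOTAL.labels(agent_type="research", status="completed").inc()
```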
## Appendix E: .env Example (Choose One LLM)
OpenAI:

```env
OPENAI_API_KEY=sk-your-key-here
LLM_PROVIDER=openai
LLM_MODEL=gpt-3.5-turbo
```

DeepSeek:

```env
DEEPSEEK_API_KEY=ds-your-key-here
LLM_PROVIDER=deepseek
LLM_BASE_URL=https://api.deepseek.com/v1
LLM_MODEL=deepseek-chat
```

Anthropic:

```env
ANTHROPIC_API_KEY=anth-your-key-here
LLM_PROVIDER=anthropic
LLM_MODEL=claude-3-5-sonnet-20240620
```
## Appendix F: Grafana Data Source and Dashboard Automation
- Data source (mounted via Compose): `monitoring/grafana/datasources/datasource.yml`

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
```

- Dashboard provider: `monitoring/grafana/dashboards/dashboards.yml`

```yaml
apiVersion: 1
providers:
  - name: 'Multi-Agent System'
    orgId: 1
    folder: 'Multi-Agent'
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /etc/grafana/provisioning/dashboards
```

- Example dashboard JSON: `monitoring/grafana/dashboards/multi_agent_overview.json`

```json
{
  "title": "Multi-Agent System Overview",
  "schemaVersion": 38,
  "refresh": "10s",
  "panels": [
    {"type": "stat", "title": "Targets UP", "gridPos": {"x": 0, "y": 0, "w": 6, "h": 4}, "datasource": "Prometheus", "targets": [{"expr": "sum(up)"}]},
    {"type": "graph", "title": "Prometheus Scrape Duration", "gridPos": {"x": 6, "y": 0, "w": 12, "h": 8}, "datasource": "Prometheus", "targets": [{"expr": "scrape_duration_seconds{job=~\"multi-agent-system-metrics|redis|postgres|grafana\"}"}]},
    {"type": "table", "title": "Jobs Status", "gridPos": {"x": 0, "y": 4, "w": 18, "h": 8}, "datasource": "Prometheus", "targets": [{"expr": "up"}]}
  ]
}
```

> Open Grafana (default `admin`/`admin`) to see the dashboard in the "Multi-Agent" folder.
## Appendix G: /debug/llm Endpoint Example (FastAPI)
Added to `api/server.py` (the helpers here are placeholders; `detect_llm_provider` matches the selection sketch in Section 5, and `probe_llm` stands in for whatever minimal connectivity check the server performs):

```python
@app.get("/debug/llm")
async def debug_llm():
    """Return the detected LLM configuration and run a minimal connectivity self-test."""
    detected = detect_llm_provider()   # e.g., the selection helper sketched in Section 5
    probe = await probe_llm(detected)  # hypothetical helper returning {"ok": bool, "error": ...}
    # Expose env/config/detected values plus probe results (connectivity status, error messages, etc.)
    return {"detected": detected, "probe": probe}
```
Usage example:

```bash
curl -s http://localhost:8000/debug/llm | jq .
```

> Expected output includes `detected.provider/model/base_url` and `probe.ok=true`. On failure, it returns a specific error message for quick debugging (e.g., provider-model mismatch, network unreachable).
