

Building a High-Availability Multi-Container AI System: Complete Guide from Docker Compose to Monitoring and Visualization

Summary

This article provides a comprehensive guide to deploying a multi-container AI system using Docker Compose, including core services, Prometheus monitoring, Fluentd log collection, Grafana visualization, and a Streamlit frontend, with full configuration examples and troubleshooting steps.


Table of Contents

  1. System Overview and Design Goals

  2. Docker Compose Architecture

  3. Core Services Deployment

  4. Monitoring and Visualization

  5. Fluentd Log Collection

  6. Frontend and Streamlit Service

  7. Nginx Reverse Proxy Configuration

  8. Common Troubleshooting

  9. FAQ


System Overview and Design Goals

This AI system is based on a multi-agent architecture, designed for enterprise-grade AI workflows. Key objectives include:

  • Multi-container isolation: Each service runs in its own container for resource isolation and scalability.
  • High availability and auto-recovery: Using Docker Compose restart: unless-stopped and health checks ensures automatic container recovery.
  • Centralized monitoring and visualization: Prometheus collects metrics, Grafana provides real-time dashboards for performance analysis.
  • Centralized logging: Fluentd aggregates logs across all services for unified storage and analysis.
  • Web-based user interface: Streamlit offers an interactive frontend, accessible through Nginx reverse proxy.

Docker Compose Architecture

The system uses Docker Compose file format version '3.8'. Service overview:

| Service Name | Image / Build | Ports | Function |
| --- | --- | --- | --- |
| multi-agent-system | Local Dockerfile | 8000 (API) | Core AI processing |
| redis | redis:7-alpine | 6379 | Cache and message queue |
| postgres | postgres:15-alpine | 5432 | Database storage |
| prometheus | prom/prometheus:latest | 9090 | Metrics collection |
| grafana | grafana/grafana:latest | 3000 | Metrics visualization |
| nginx | nginx:alpine | 80/443 | Unified HTTP/HTTPS access |
| fluentd | fluent/fluentd:v1.16-debian-1 | 24224 | Log aggregation |
| streamlit | Local Dockerfile.streamlit | 8501 | Interactive web frontend |
| jupyter | Local Dockerfile.jupyter | 8888 | Optional development and debugging environment |

All containers share the multi-agent-network bridge network with subnet 172.20.0.0/16, allowing service discovery by container name.
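
For reference, a minimal sketch of the top-level network and volume declarations this layout assumes is shown below; the subnet and the redis_data volume come from this article, while any additional named volumes (for Postgres or Grafana, for example) would be declared the same way.

version: '3.8'

networks:
  multi-agent-network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.20.0.0/16

volumes:
  redis_data:   # persistent Redis storage referenced by the redis service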


Core Services Deployment

Multi-Agent System

Purpose:

  • Launches multiple AI agents for tasks such as research, analysis, and workflow execution
  • Integrates internal messaging via Message Bus
  • Supports Workflow Engine and performance monitoring

Docker Compose configuration:

multi-agent-system:
  build:
    context: .
    dockerfile: Dockerfile
  container_name: multi-agent-system
  ports:
    - "8000:8000"
  environment:
    - PYTHONPATH=/app
    - REDIS_HOST=redis
    - POSTGRES_HOST=postgres
    - LANGSMITH_API_KEY=${LANGSMITH_API_KEY:-}
  volumes:
    - ./logs:/app/logs
    - ./data:/app/data
  depends_on:
    - redis
    - postgres
  networks:
    - multi-agent-network
  restart: unless-stopped
  healthcheck:
    test: ["CMD", "python", "-c", "import requests; requests.get('http://localhost:8000/metrics')"]
    interval: 30s
    timeout: 10s
    retries: 3
    start_period: 40s

Key Notes:

  • Health checks should target a valid HTTP endpoint (/metrics) instead of a non-existent /health.
  • Expose Prometheus metrics with Python's prometheus_client start_http_server(8000) so Prometheus can scrape port 8000.

Redis Cache

  • Image: redis:7-alpine
  • Ports: 6379:6379
  • Volume: redis_data
  • Health check: redis-cli ping

Tip: Prometheus cannot scrape the raw Redis TCP port. Use redis-exporter (port 9121) to expose HTTP metrics.
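
A hedged sketch of such an exporter as a sidecar service is shown below; oliver006/redis_exporter is the commonly used exporter image, and the container name and REDIS_ADDR value are assumptions based on this setup, not configuration taken from the original project.

redis-exporter:
  image: oliver006/redis_exporter:latest
  container_name: multi-agent-redis-exporter    # assumed name
  environment:
    - REDIS_ADDR=redis://redis:6379             # points at the redis service by name
  ports:
    - "9121:9121"
  depends_on:
    - redis
  networks:
    - multi-agent-network
  restart: unless-stopped

With a standalone exporter like this, the Prometheus redis job should target whichever container actually serves port 9121 (for example redis-exporter:9121), not the Redis container itself.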


PostgreSQL Database

  • Image: postgres:15-alpine
  • Initialization: ./scripts/init_db.sql
  • Environment variables: POSTGRES_DB, POSTGRES_USER, POSTGRES_PASSWORD
  • Health check: pg_isready -U postgres

Tip: Use postgres-exporter (port 9187) for Prometheus to scrape metrics; the raw 5432 TCP port is not an HTTP endpoint.
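
Likewise, a hedged sketch of a postgres-exporter sidecar; prometheuscommunity/postgres-exporter is the community-maintained exporter image, and the DATA_SOURCE_NAME credentials are placeholders that must match your POSTGRES_USER, POSTGRES_PASSWORD, and POSTGRES_DB values.

postgres-exporter:
  image: prometheuscommunity/postgres-exporter:latest
  container_name: multi-agent-postgres-exporter   # assumed name
  environment:
    - DATA_SOURCE_NAME=postgresql://postgres:postgres@postgres:5432/postgres?sslmode=disable
  ports:
    - "9187:9187"
  depends_on:
    - postgres
  networks:
    - multi-agent-network
  restart: unless-stopped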


Monitoring and Visualization

Prometheus Configuration

monitoring/prometheus.yml example:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "multi-agent-system"
    static_configs:
      - targets: ["multi-agent-system:8000"]

  - job_name: "redis"
    static_configs:
      - targets: ["multi-agent-redis:9121"]

  - job_name: "postgres"
    static_configs:
      - targets: ["multi-agent-postgres:9187"]

  - job_name: "grafana"
    metrics_path: /metrics
    static_configs:
      - targets: ["multi-agent-grafana:3000"]

Experience Notes:

  • Remove scrape targets whose ports are not serving metrics (e.g., Streamlit on 8501 when that service is not running).
  • Always scrape HTTP metrics endpoints (Exporters) for Redis and Postgres.
  • Enable Grafana metrics: GF_METRICS_ENABLED=true.

Grafana Configuration

  • Admin password: admin
  • Dashboards directory: ./monitoring/grafana/dashboards
  • Datasources: ./monitoring/grafana/datasources
  • Network: multi-agent-network
  • Dependency: Prometheus must be UP

Ensure all Prometheus jobs report UP before relying on Grafana visualizations; a hedged compose sketch for the Grafana service follows below.
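
GF_SECURITY_ADMIN_PASSWORD and GF_METRICS_ENABLED are standard Grafana environment variables; the container-side mount paths below are assumptions, since only the host-side directories are given in this article.

grafana:
  image: grafana/grafana:latest
  container_name: multi-agent-grafana
  ports:
    - "3000:3000"
  environment:
    - GF_SECURITY_ADMIN_PASSWORD=admin   # admin password from this section
    - GF_METRICS_ENABLED=true            # lets Prometheus scrape Grafana's own /metrics
  volumes:
    - ./monitoring/grafana/dashboards:/var/lib/grafana/dashboards
    - ./monitoring/grafana/datasources:/etc/grafana/provisioning/datasources
  depends_on:
    - prometheus
  networks:
    - multi-agent-network
  restart: unless-stopped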


Fluentd Log Collection

fluentd.conf example:

<source>
  @type forward
  port 24224
  bind 0.0.0.0
</source>

<match **>
  @type file
  path /var/log/multi-agent
  append true
</match>

  • Collects logs from all containers (see the logging-driver sketch below)
  • Mount the container's log directory to the host for persistent storage
  • Ports: 24224 TCP/UDP
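
A service only ships its stdout/stderr to this Fluentd instance if it opts in to the fluentd logging driver. A hedged example for the core service, assuming port 24224 is published on the Docker host; the tag value is an arbitrary choice.

multi-agent-system:
  # ...existing configuration from the Core Services section...
  logging:
    driver: fluentd
    options:
      fluentd-address: localhost:24224   # the published Fluentd forward port
      tag: multi-agent.app               # arbitrary tag used to label and route logs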

Frontend and Streamlit Service

Add a standalone Streamlit service to avoid empty responses:

streamlit:
  build:
    context: .
    dockerfile: Dockerfile.streamlit
  container_name: multi-agent-streamlit
  ports:
    - "8501:8501"
  volumes:
    - ./ui:/app
  networks:
    - multi-agent-network
  restart: unless-stopped

Key Notes:

  1. Listen on 0.0.0.0:8501 to accept external traffic (see the compose sketch after this list).
  2. Nginx reverse-proxies /streamlit/ to streamlit:8501 for unified access.
  3. Avoid binding port 8501 in the main application container.
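
One hedged way to enforce point 1 is to set the server address explicitly in the compose command; app.py here is a placeholder for the actual entry script under ./ui. The same flags could equally live in the CMD of Dockerfile.streamlit.

streamlit:
  # ...other settings as shown above...
  command: streamlit run app.py --server.address=0.0.0.0 --server.port=8501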

Check port usage on Windows:

netstat -ano | findstr :8501

Nginx Reverse Proxy Configuration

nginx/nginx.conf example:

user  nginx;
worker_processes  auto;
error_log  /var/log/nginx/error.log warn;
pid        /var/run/nginx.pid;

events {
    worker_connections  1024;
}

http {
    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

    access_log  /var/log/nginx/access.log  main;

    sendfile        on;
    keepalive_timeout  65;

    upstream multi_agent_system {
        server multi-agent-system:8000;
    }

    upstream streamlit_app {
        server streamlit:8501;
    }

    server {
        listen 80;
        server_name localhost;

        location / {
            proxy_pass http://multi_agent_system;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }

        location /streamlit/ {
            proxy_pass http://streamlit_app/;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }
    }
}

Notes:

  • Defining log_format main fixes the "unknown log format" error.
  • Streamlit proxy ensures /streamlit/ path serves frontend correctly.

Common Troubleshooting

  1. Empty response (ERR_EMPTY_RESPONSE)

    • Verify service is listening on port
    • Check Nginx upstream configuration
    • Test network:
    docker exec -it multi-agent-nginx sh -c "apk add curl; curl -sI http://streamlit:8501"
    
  2. Prometheus scrape failure

    • Reason: Target port is non-HTTP
    • Solution: Use Redis/Postgres Exporters
  3. Streamlit not running

    • Check service status:
    docker compose ps
    docker logs --tail=50 multi-agent-streamlit
    
    • Ensure port 8501 is mapped and listening on 0.0.0.0

FAQ

Q1: Why does localhost:8501 return blank?
A1: The main application was not listening on 8501. A separate Streamlit service must run and Nginx must proxy /streamlit/.

Q2: Why can’t Prometheus scrape Redis/Postgres metrics?
A2: Raw TCP ports are not HTTP endpoints. Use redis-exporter (9121) and postgres-exporter (9187).

Q3: Nginx logs show unknown log format "main"?
A3: Fixed by defining log_format main in nginx.conf.

Q4: How to verify Multi-Agent System health?
A4: Check Prometheus metrics endpoint:

curl http://localhost:8000/metrics

A healthy service returns metrics in plain-text Prometheus format.


Summary and Key Insights

  1. Service isolation: Each container has a single responsibility to avoid conflicts.
  2. Port health check: Prometheus requires HTTP endpoints for scraping.
  3. Centralized logging: Fluentd aggregates logs for easy long-term analysis.
  4. Frontend proxy: Streamlit + Nginx provides a unified, stable access path.
  5. Container networking: Bridge network ensures service name resolution.
  6. Rebuild images without the build cache when necessary:
docker compose build --no-cache
docker compose up -d

With this deployment:

  • Multi-agent AI services run reliably
  • Prometheus monitors core metrics
  • Grafana dashboards visualize performance
  • Streamlit frontend is accessible
  • Fluentd centralizes logs for analysis

This configuration has been tested in practice to ensure container communication, port mapping, metrics scraping, and web access are fully functional.

