

Building a High-Availability Multi-Container AI System: Complete Guide from Docker Compose to Monitoring and Visualization

Summary

This article provides a comprehensive guide to deploying a multi-container AI system using Docker Compose, including core services, Prometheus monitoring, Fluentd log collection, Grafana visualization, and a Streamlit frontend, with full configuration examples and troubleshooting steps.


Table of Contents

  1. System Overview and Design Goals

  2. Docker Compose Architecture

  3. Core Services Deployment

  4. Monitoring and Visualization

  5. Fluentd Log Collection

  6. Frontend and Streamlit Service

  7. Nginx Reverse Proxy Configuration

  8. Common Troubleshooting

  9. FAQ


System Overview and Design Goals

This AI system is based on a multi-agent architecture, designed for enterprise-grade AI workflows. Key objectives include:

  • Multi-container isolation: Each service runs in its own container for resource isolation and scalability.
  • High availability and auto-recovery: Using Docker Compose restart: unless-stopped and health checks ensures automatic container recovery.
  • Centralized monitoring and visualization: Prometheus collects metrics, Grafana provides real-time dashboards for performance analysis.
  • Centralized logging: Fluentd aggregates logs across all services for unified storage and analysis.
  • Web-based user interface: Streamlit offers an interactive frontend, accessible through Nginx reverse proxy.

Docker Compose Architecture

The system uses Docker Compose file format version '3.8'. Service overview:

| Service Name | Image / Build | Ports | Function |
| --- | --- | --- | --- |
| multi-agent-system | Local Dockerfile | 8000 (API) | Core AI processing |
| redis | redis:7-alpine | 6379 | Cache and message queue |
| postgres | postgres:15-alpine | 5432 | Database storage |
| prometheus | prom/prometheus:latest | 9090 | Metrics collection |
| grafana | grafana/grafana:latest | 3000 | Metrics visualization |
| nginx | nginx:alpine | 80/443 | Unified HTTP/HTTPS access |
| fluentd | fluent/fluentd:v1.16-debian-1 | 24224 | Log aggregation |
| streamlit | Local Dockerfile.streamlit | 8501 | Interactive web frontend |
| jupyter | Local Dockerfile.jupyter | 8888 | Optional development and debugging environment |

All containers share the multi-agent-network bridge network with subnet 172.20.0.0/16, allowing service discovery by container name.
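
For reference, a minimal sketch of the top-level network and volume declarations this layout assumes is shown below; the subnet and the redis_data volume come from this article, while any additional named volumes (for Postgres or Grafana, for example) would be declared the same way.

version: '3.8'

networks:
  multi-agent-network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.20.0.0/16

volumes:
  redis_data:   # persistent Redis storage referenced by the redis service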


Core Services Deployment

Multi-Agent System

Purpose:

  • Launches multiple AI agents for tasks such as research, analysis, and workflow execution
  • Integrates internal messaging via Message Bus
  • Supports Workflow Engine and performance monitoring

Docker Compose configuration:

multi-agent-system:
  build:
    context: .
    dockerfile: Dockerfile
  container_name: multi-agent-system
  ports:
    - "8000:8000"
  environment:
    - PYTHONPATH=/app
    - REDIS_HOST=redis
    - POSTGRES_HOST=postgres
    - LANGSMITH_API_KEY=${LANGSMITH_API_KEY:-}
  volumes:
    - ./logs:/app/logs
    - ./data:/app/data
  depends_on:
    - redis
    - postgres
  networks:
    - multi-agent-network
  restart: unless-stopped
  healthcheck:
    test: ["CMD", "python", "-c", "import requests; requests.get('http://localhost:8000/metrics')"]
    interval: 30s
    timeout: 10s
    retries: 3
    start_period: 40s

Key Notes:

  • Health checks should target a valid HTTP endpoint (/metrics) instead of a non-existent /health.
  • Expose Prometheus metrics with Python's prometheus_client start_http_server(8000) so Prometheus can scrape port 8000.

Redis Cache

  • Image: redis:7-alpine
  • Ports: 6379:6379
  • Volume: redis_data
  • Health check: redis-cli ping

Tip: Prometheus cannot scrape the raw Redis TCP port. Use redis-exporter (port 9121) to expose HTTP metrics.
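
A hedged sketch of such an exporter as a sidecar service is shown below; oliver006/redis_exporter is the commonly used exporter image, and the container name and REDIS_ADDR value are assumptions based on this setup, not configuration taken from the original project.

redis-exporter:
  image: oliver006/redis_exporter:latest
  container_name: multi-agent-redis-exporter    # assumed name
  environment:
    - REDIS_ADDR=redis://redis:6379             # points at the redis service by name
  ports:
    - "9121:9121"
  depends_on:
    - redis
  networks:
    - multi-agent-network
  restart: unless-stopped

With a standalone exporter like this, the Prometheus redis job should target whichever container actually serves port 9121 (for example redis-exporter:9121), not the Redis container itself.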


PostgreSQL Database

  • Image: postgres:15-alpine
  • Initialization: ./scripts/init_db.sql
  • Environment variables: POSTGRES_DB, POSTGRES_USER, POSTGRES_PASSWORD
  • Health check: pg_isready -U postgres

Tip: Use postgres-exporter (port 9187) for Prometheus to scrape metrics; the raw 5432 TCP port is not an HTTP endpoint.
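
Likewise, a hedged sketch of a postgres-exporter sidecar; prometheuscommunity/postgres-exporter is the community-maintained exporter image, and the DATA_SOURCE_NAME credentials are placeholders that must match your POSTGRES_USER, POSTGRES_PASSWORD, and POSTGRES_DB values.

postgres-exporter:
  image: prometheuscommunity/postgres-exporter:latest
  container_name: multi-agent-postgres-exporter   # assumed name
  environment:
    - DATA_SOURCE_NAME=postgresql://postgres:postgres@postgres:5432/postgres?sslmode=disable
  ports:
    - "9187:9187"
  depends_on:
    - postgres
  networks:
    - multi-agent-network
  restart: unless-stopped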


Monitoring and Visualization

Prometheus Configuration

monitoring/prometheus.yml example:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "multi-agent-system"
    static_configs:
      - targets: ["multi-agent-system:8000"]

  - job_name: "redis"
    static_configs:
      - targets: ["multi-agent-redis:9121"]

  - job_name: "postgres"
    static_configs:
      - targets: ["multi-agent-postgres:9187"]

  - job_name: "grafana"
    metrics_path: /metrics
    static_configs:
      - targets: ["multi-agent-grafana:3000"]

Experience Notes:

  • Remove scrape targets whose ports are not serving metrics (e.g., Streamlit on 8501 when that service is not running).
  • Always scrape HTTP metrics endpoints (Exporters) for Redis and Postgres.
  • Enable Grafana metrics: GF_METRICS_ENABLED=true.

Grafana Configuration

  • Admin password: admin
  • Dashboards directory: ./monitoring/grafana/dashboards
  • Datasources: ./monitoring/grafana/datasources
  • Network: multi-agent-network
  • Dependency: Prometheus must be UP

Ensure all Prometheus jobs report UP before relying on Grafana visualizations; a hedged compose sketch for the Grafana service follows below.
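
GF_SECURITY_ADMIN_PASSWORD and GF_METRICS_ENABLED are standard Grafana environment variables; the container-side mount paths below are assumptions, since only the host-side directories are given in this article.

grafana:
  image: grafana/grafana:latest
  container_name: multi-agent-grafana
  ports:
    - "3000:3000"
  environment:
    - GF_SECURITY_ADMIN_PASSWORD=admin   # admin password from this section
    - GF_METRICS_ENABLED=true            # lets Prometheus scrape Grafana's own /metrics
  volumes:
    - ./monitoring/grafana/dashboards:/var/lib/grafana/dashboards
    - ./monitoring/grafana/datasources:/etc/grafana/provisioning/datasources
  depends_on:
    - prometheus
  networks:
    - multi-agent-network
  restart: unless-stopped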


Fluentd Log Collection

fluentd.conf example:

<source>
  @type forward
  port 24224
  bind 0.0.0.0
</source>

<match **>
  @type file
  path /var/log/multi-agent
  append true
</match>

  • Collects logs from all containers (see the logging-driver sketch below)
  • Mount the container's log directory to the host for persistent storage
  • Ports: 24224 TCP/UDP
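
A service only ships its stdout/stderr to this Fluentd instance if it opts in to the fluentd logging driver. A hedged example for the core service, assuming port 24224 is published on the Docker host; the tag value is an arbitrary choice.

multi-agent-system:
  # ...existing configuration from the Core Services section...
  logging:
    driver: fluentd
    options:
      fluentd-address: localhost:24224   # the published Fluentd forward port
      tag: multi-agent.app               # arbitrary tag used to label and route logs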

Frontend and Streamlit Service

Add a standalone Streamlit service to avoid empty responses:

streamlit:
  build:
    context: .
    dockerfile: Dockerfile.streamlit
  container_name: multi-agent-streamlit
  ports:
    - "8501:8501"
  volumes:
    - ./ui:/app
  networks:
    - multi-agent-network
  restart: unless-stopped

Key Notes:

  1. Listen on 0.0.0.0:8501 to accept external traffic (see the compose sketch after this list).
  2. Nginx reverse-proxies /streamlit/ to streamlit:8501 for unified access.
  3. Avoid binding port 8501 in the main application container.
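
One hedged way to enforce point 1 is to set the server address explicitly in the compose command; app.py here is a placeholder for the actual entry script under ./ui. The same flags could equally live in the CMD of Dockerfile.streamlit.

streamlit:
  # ...other settings as shown above...
  command: streamlit run app.py --server.address=0.0.0.0 --server.port=8501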

Check port usage on Windows:

netstat -ano | findstr :8501

Nginx Reverse Proxy Configuration

nginx/nginx.conf example:

user  nginx;
worker_processes  auto;
error_log  /var/log/nginx/error.log warn;
pid        /var/run/nginx.pid;

events {
    worker_connections  1024;
}

http {
    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

    access_log  /var/log/nginx/access.log  main;

    sendfile        on;
    keepalive_timeout  65;

    upstream multi_agent_system {
        server multi-agent-system:8000;
    }

    upstream streamlit_app {
        server streamlit:8501;
    }

    server {
        listen 80;
        server_name localhost;

        location / {
            proxy_pass http://multi_agent_system;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }

        location /streamlit/ {
            proxy_pass http://streamlit_app/;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }
    }
}

Notes:

  • Defining log_format main fixes the "unknown log format" error.
  • Streamlit proxy ensures /streamlit/ path serves frontend correctly.

Common Troubleshooting

  1. Empty response (ERR_EMPTY_RESPONSE)

    • Verify service is listening on port
    • Check Nginx upstream configuration
    • Test network:
    docker exec -it multi-agent-nginx sh -c "apk add curl; curl -sI http://streamlit:8501"
    
  2. Prometheus scrape failure

    • Reason: Target port is non-HTTP
    • Solution: Use Redis/Postgres Exporters
  3. Streamlit not running

    • Check service status:
    docker compose ps
    docker logs --tail=50 multi-agent-streamlit
    
    • Ensure port 8501 is mapped and listening on 0.0.0.0

FAQ

Q1: Why does localhost:8501 return blank?
A1: The main application was not listening on 8501. A separate Streamlit service must run and Nginx must proxy /streamlit/.

Q2: Why can’t Prometheus scrape Redis/Postgres metrics?
A2: Raw TCP ports are not HTTP endpoints. Use redis-exporter (9121) and postgres-exporter (9187).

Q3: Nginx logs show unknown log format "main"?
A3: Fixed by defining log_format main in nginx.conf.

Q4: How to verify Multi-Agent System health?
A4: Check Prometheus metrics endpoint:

curl http://localhost:8000/metrics

A healthy service returns metrics in plain-text Prometheus format.


Summary and Key Insights

  1. Service isolation: Each container has a single responsibility to avoid conflicts.
  2. Port health check: Prometheus requires HTTP endpoints for scraping.
  3. Centralized logging: Fluentd aggregates logs for easy long-term analysis.
  4. Frontend proxy: Streamlit + Nginx provides a unified, stable access path.
  5. Container networking: Bridge network ensures service name resolution.
  6. Rebuild images without the build cache when necessary:
docker compose build --no-cache
docker compose up -d

With this deployment:

  • Multi-agent AI services run reliably
  • Prometheus monitors core metrics
  • Grafana dashboards visualize performance
  • Streamlit frontend is accessible
  • Fluentd centralizes logs for analysis

This configuration has been tested in practice to ensure container communication, port mapping, metrics scraping, and web access are fully functional.

