Building a High-Availability Multi-Container AI System: Complete Guide from Docker Compose to Monitoring and Visualization
Summary
This article provides a comprehensive guide to deploying a multi-container AI system using Docker Compose, including core services, Prometheus monitoring, Fluentd log collection, Grafana visualization, and a Streamlit frontend, with full configuration examples and troubleshooting steps.
System Overview and Design Goals
This AI system is based on a multi-agent architecture, designed for enterprise-grade AI workflows. Key objectives include:
- Multi-container isolation: each service runs in its own container for resource isolation and scalability.
- High availability and auto-recovery: `restart: unless-stopped` and health checks in Docker Compose ensure automatic container recovery.
- Centralized monitoring and visualization: Prometheus collects metrics; Grafana provides real-time dashboards for performance analysis.
- Centralized logging: Fluentd aggregates logs across all services for unified storage and analysis.
- Web-based user interface: Streamlit offers an interactive frontend, accessible through an Nginx reverse proxy.
Docker Compose Architecture
The system uses the `version: '3.8'` Docker Compose file format. Service overview:
| Service Name | Image / Build | Ports | Function |
|---|---|---|---|
| multi-agent-system | Local Dockerfile | 8000 (API) | Core AI processing |
| redis | redis:7-alpine | 6379 | Cache and message queue |
| postgres | postgres:15-alpine | 5432 | Database storage |
| prometheus | prom/prometheus:latest | 9090 | Metrics collection |
| grafana | grafana/grafana:latest | 3000 | Metrics visualization |
| nginx | nginx:alpine | 80/443 | Unified HTTP/HTTPS access |
| fluentd | fluent/fluentd:v1.16-debian-1 | 24224 | Log aggregation |
| streamlit | Local Dockerfile | 8501 | Interactive web frontend |
| jupyter | Local Dockerfile.jupyter | 8888 | Optional development and debugging environment |
All containers share the multi-agent-network bridge network with subnet 172.20.0.0/16, allowing service discovery by container name.
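A sketch of the corresponding top-level `networks` definition in `docker-compose.yml`, assuming the default bridge driver with an explicit IPAM subnet:

```yaml
networks:
  multi-agent-network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.20.0.0/16
```

With this in place, any container on the network can reach another by its service or container name, e.g. `redis:6379` or `postgres:5432`.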
Core Services Deployment
Multi-Agent System
Purpose:
- Launches multiple AI agents for tasks such as research, analysis, and workflow execution
- Integrates internal messaging via a Message Bus
- Supports the Workflow Engine and performance monitoring
Docker Compose configuration:
```yaml
multi-agent-system:
  build:
    context: .
    dockerfile: Dockerfile
  container_name: multi-agent-system
  ports:
    - "8000:8000"
  environment:
    - PYTHONPATH=/app
    - REDIS_HOST=redis
    - POSTGRES_HOST=postgres
    - LANGSMITH_API_KEY=${LANGSMITH_API_KEY:-}
  volumes:
    - ./logs:/app/logs
    - ./data:/app/data
  depends_on:
    - redis
    - postgres
  networks:
    - multi-agent-network
  restart: unless-stopped
  healthcheck:
    test: ["CMD", "python", "-c", "import requests; requests.get('http://localhost:8000/metrics')"]
    interval: 30s
    timeout: 10s
    retries: 3
    start_period: 40s
```
Key Notes:
- Health checks should target a valid HTTP endpoint (`/metrics`) instead of a non-existent `/health`.
- Expose Prometheus metrics with Python's `start_http_server(8000)` so Prometheus can scrape them.
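To make the second note concrete, here is a minimal sketch of serving a `/metrics` endpoint in the Prometheus text format. A real service would typically just call `prometheus_client.start_http_server(8000)`; the hand-rolled handler and the `agent_requests_total` metric name below are illustrative stand-ins using only the standard library.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUEST_COUNT = 0  # illustrative counter; a real app would track actual requests

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            # Prometheus text exposition format: HELP/TYPE comments, then samples
            body = (
                "# HELP agent_requests_total Total requests handled.\n"
                "# TYPE agent_requests_total counter\n"
                f"agent_requests_total {REQUEST_COUNT}\n"
            ).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

# Port 0 asks the OS for any free port, so the sketch runs anywhere
server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/metrics"
text = urllib.request.urlopen(url).read().decode()
print(text.splitlines()[-1])  # -> agent_requests_total 0
server.shutdown()
```

The same plain-text response is what the container health check and Prometheus scrape against on port 8000.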
Redis Cache
- Image: redis:7-alpine
- Ports: 6379:6379
- Volume: redis_data
- Health check: redis-cli ping
Tip: Prometheus cannot scrape the raw Redis TCP port. Use `redis-exporter` on port 9121 to expose HTTP metrics.
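A sketch of such an exporter as a Compose service, assuming the widely used `oliver006/redis_exporter` image (the service name and port mapping are illustrative; the Prometheus target must point at whatever name you choose):

```yaml
redis-exporter:
  image: oliver006/redis_exporter:latest
  environment:
    - REDIS_ADDR=redis://redis:6379
  ports:
    - "9121:9121"
  networks:
    - multi-agent-network
  restart: unless-stopped
```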
PostgreSQL Database
- Image: postgres:15-alpine
- Initialization script: ./scripts/init_db.sql
- Environment variables: POSTGRES_DB, POSTGRES_USER, POSTGRES_PASSWORD
- Health check: pg_isready -U postgres
Tip: Use `postgres-exporter` (port 9187) for Prometheus to scrape metrics; the raw 5432 TCP port does not speak HTTP.
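A sketch of the Postgres exporter as a Compose service, assuming the `prometheuscommunity/postgres-exporter` image (the connection string and database name are illustrative):

```yaml
postgres-exporter:
  image: prometheuscommunity/postgres-exporter:latest
  environment:
    - DATA_SOURCE_NAME=postgresql://postgres:${POSTGRES_PASSWORD}@postgres:5432/postgres?sslmode=disable
  ports:
    - "9187:9187"
  networks:
    - multi-agent-network
  restart: unless-stopped
```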
Monitoring and Visualization
Prometheus Configuration
monitoring/prometheus.yml example:
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "multi-agent-system"
    static_configs:
      - targets: ["multi-agent-system:8000"]
  - job_name: "redis"
    static_configs:
      - targets: ["multi-agent-redis:9121"]
  - job_name: "postgres"
    static_configs:
      - targets: ["multi-agent-postgres:9187"]
  - job_name: "grafana"
    metrics_path: /metrics
    static_configs:
      - targets: ["multi-agent-grafana:3000"]
```
Experience Notes:
- Disable scraping of non-existent ports (e.g., Streamlit's 8501 if that service is down).
- Always scrape HTTP metrics endpoints (exporters) for Redis and Postgres.
- Enable Grafana's own metrics with GF_METRICS_ENABLED=true.
Grafana Configuration
- Admin password: admin
- Dashboards directory: ./monitoring/grafana/dashboards
- Datasources: ./monitoring/grafana/datasources
- Network: multi-agent-network
- Dependency: Prometheus must be UP
Ensure Prometheus jobs are UP before Grafana visualization.
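Putting these settings together, a possible Grafana service definition (the volume mount targets follow Grafana's standard provisioning layout; treat paths and env values as assumptions to adapt):

```yaml
grafana:
  image: grafana/grafana:latest
  container_name: multi-agent-grafana
  ports:
    - "3000:3000"
  environment:
    - GF_SECURITY_ADMIN_PASSWORD=admin
    - GF_METRICS_ENABLED=true
  volumes:
    - ./monitoring/grafana/dashboards:/var/lib/grafana/dashboards
    - ./monitoring/grafana/datasources:/etc/grafana/provisioning/datasources
  depends_on:
    - prometheus
  networks:
    - multi-agent-network
  restart: unless-stopped
```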
Fluentd Log Collection
fluentd.conf example:
```
<source>
  @type forward
  port 24224
  bind 0.0.0.0
</source>

<match **>
  @type file
  path /var/log/multi-agent
  append true
</match>
```
- Collects logs from all containers
- Mount /logs to the host for persistent storage
- Ports: 24224 TCP/UDP
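To route a service's stdout/stderr into this Fluentd instance, Docker's `fluentd` logging driver can be attached per service. A sketch (the tag format is illustrative; the address is `localhost:24224` because the Docker daemon on the host, not the container, opens the connection, so port 24224 must be published):

```yaml
multi-agent-system:
  logging:
    driver: fluentd
    options:
      fluentd-address: localhost:24224
      tag: multi-agent.{{.Name}}
```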
Frontend and Streamlit Service
Add a standalone Streamlit service to avoid empty responses:
```yaml
streamlit:
  build:
    context: .
    dockerfile: Dockerfile.streamlit
  container_name: multi-agent-streamlit
  ports:
    - "8501:8501"
  volumes:
    - ./ui:/app
  networks:
    - multi-agent-network
  restart: unless-stopped
```
Key Notes:
- Listen on 0.0.0.0:8501 to accept external traffic.
- Nginx reverse-proxies /streamlit/ → streamlit:8501 for unified access.
- Avoid occupying port 8501 in the main app container.
Check port usage on Windows:

```shell
netstat -ano | findstr :8501
```
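A possible `Dockerfile.streamlit` satisfying these notes; the base image, file layout, and `app.py` entry point are assumptions to adapt to your project:

```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt streamlit
COPY ui/ .
EXPOSE 8501
# Bind to 0.0.0.0 so traffic from outside the container is accepted
CMD ["streamlit", "run", "app.py", "--server.address=0.0.0.0", "--server.port=8501"]
```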
Nginx Reverse Proxy Configuration
nginx/nginx.conf example:
```nginx
user nginx;
worker_processes auto;
error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;

events {
    worker_connections 1024;
}

http {
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for"';

    access_log /var/log/nginx/access.log main;
    sendfile on;
    keepalive_timeout 65;

    upstream multi_agent_system {
        server multi-agent-system:8000;
    }

    upstream streamlit_app {
        server streamlit:8501;
    }

    server {
        listen 80;
        server_name localhost;

        location / {
            proxy_pass http://multi_agent_system;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }

        location /streamlit/ {
            proxy_pass http://streamlit_app/;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }
    }
}
```
Notes:
- Defining `log_format main` fixes the "unknown log format" error.
- The Streamlit proxy ensures the /streamlit/ path serves the frontend correctly.
Common Troubleshooting
- Empty response (ERR_EMPTY_RESPONSE)
  - Verify the service is listening on its port
  - Check the Nginx upstream configuration
  - Test the network from inside the Nginx container:
    `docker exec -it multi-agent-nginx sh -c "apk add curl; curl -sI http://streamlit:8501"`
- Prometheus scrape failure
  - Reason: the target port is not an HTTP endpoint
  - Solution: use the Redis/Postgres exporters
- Streamlit not running
  - Check the service status:
    `docker compose ps`
    `docker logs --tail=50 multi-agent-streamlit`
  - Ensure port 8501 is mapped and the app listens on 0.0.0.0
FAQ
Q1: Why does localhost:8501 return blank?
A1: The main application was not listening on 8501. A separate Streamlit service must run and Nginx must proxy /streamlit/.
Q2: Why can’t Prometheus scrape Redis/Postgres metrics?
A2: Raw TCP ports are not HTTP endpoints. Use redis-exporter (9121) and postgres-exporter (9187).
Q3: Nginx logs show unknown log format "main"?
A3: Fixed by defining log_format main in nginx.conf.
Q4: How to verify Multi-Agent System health?
A4: Check the Prometheus metrics endpoint:

```shell
curl http://localhost:8000/metrics
```

A healthy service responds in the Prometheus text exposition format.
Summary and Key Insights
- Service isolation: each container has a single responsibility, avoiding conflicts.
- Health-check endpoints: Prometheus requires HTTP endpoints for scraping.
- Centralized logging: Fluentd aggregates logs for long-term analysis.
- Frontend proxy: Streamlit behind Nginx provides a unified, stable access path.
- Container networking: the bridge network provides service-name resolution.
- Rebuild without cache when necessary:

```shell
docker compose build --no-cache
docker compose up -d
```
With this deployment:
- Multi-agent AI services run reliably
- Prometheus monitors core metrics
- Grafana dashboards visualize performance
- The Streamlit frontend is accessible
- Fluentd centralizes logs for analysis
This configuration has been tested in practice to ensure container communication, port mapping, metrics scraping, and web access are fully functional.

