Exploring the “Big Three Realtime Agents”: A Voice-Controlled AI Agent Orchestration System
Have you ever imagined directing multiple AI assistants to work together with just your voice? One writes code, another operates a browser to verify results, and all you have to do is speak? This might sound like science fiction, but the “Big Three Realtime Agents” project is turning this vision into reality. It’s a unified, voice-coordinated system that integrates three cutting-edge AIs—OpenAI, Anthropic Claude, and Google Gemini—to seamlessly dispatch different types of AI agents for complex digital tasks through natural conversation. This article will provide an in-depth analysis of this system’s architecture, how it works, and its practical applications. Whether you’re a developer, a tech enthusiast, or simply curious about AI automation, you’ll gain valuable, practical insights.
Abstract
The “Big Three Realtime Agents” is a unified multi-AI agent orchestration platform. At its core is a voice agent based on the OpenAI Realtime API, which understands user commands and dynamically coordinates two types of functional agents: the Claude Code agent (responsible for software development and file operations) and the Gemini Browser agent (responsible for web automation and validation). The system requires a Python 3.11+ environment, corresponding API keys, and the Playwright browser automation tool. Through modular tool calling and session management mechanisms, it implements complex orchestration logic spanning over 3,000 lines of code. All agent activities are executed within a user-configurable working directory (default: apps/content-gen/).
The System Core: How Do the Three Agents Collaborate?
The essence of this project lies in “orchestration.” It doesn’t simply bundle three AIs together but designs a clear chain of command.
1. The Conductor: OpenAI Realtime Voice Agent
This is your primary interface with the system. Utilizing OpenAI’s Realtime API, it continuously processes your voice or text input. Its core responsibilities are “understanding” and “decision-making”: comprehending your intent and then deciding which agent to create, what command to issue, or which status to query. You can think of it as the project’s “brain.”
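For a concrete sense of what this looks like in code, here is a minimal text-only round trip against the Realtime API, assuming the openai Python SDK's beta realtime client; this is an illustrative sketch, not the project's actual implementation:

import asyncio
from openai import AsyncOpenAI

async def main() -> None:
    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
    async with client.beta.realtime.connect(model="gpt-4o-realtime-preview") as conn:
        # Configure a text-only session; the real project also wires up audio I/O.
        await conn.session.update(session={"modalities": ["text"]})
        await conn.conversation.item.create(item={
            "type": "message", "role": "user",
            "content": [{"type": "input_text", "text": "List the active agents."}],
        })
        await conn.response.create()
        async for event in conn:
            if event.type == "response.text.delta":
                print(event.delta, end="")
            elif event.type == "response.done":
                break

asyncio.run(main())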
2. Executor One: Claude Code Agentic Coder
When a task involves writing code, creating files, or analyzing project structure, the brain calls upon this specialist. Based on Anthropic’s Claude model, it is designed as an agentic programmer. Its scope of activity is confined to a specified Working Directory, typically apps/content-gen/. Here, it can freely read, create, and modify backend and frontend code files. All operations are logged and tracked.
3. Executor Two: Gemini Browser Agent
When a task requires interaction with the real-world web—for example, “Go to my website and see if the login button is still there,” or “Search for the latest React documentation”—the brain dispatches this agent. It leverages Gemini’s “Computer Use” capability to navigate, click, fill forms, and validate results through a real browser controlled by Playwright. It sees what you see and can plan actions based on visual information.
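To make this concrete, here is a minimal sketch of the kind of Playwright-driven step the browser agent performs. The check_login_button helper and its selector are hypothetical; the project's actual GeminiBrowserAgent logic is far more elaborate:

from playwright.sync_api import sync_playwright

def check_login_button(url: str) -> bool:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)  # a real, visible browser
        page = browser.new_page()
        page.goto(url)
        # A screenshot like this is what gives the model its visual context.
        page.screenshot(path="output_logs/screenshots/login_check.png")
        visible = page.get_by_role("button", name="Login").is_visible()
        browser.close()
        return visible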
Deep Dive into System Architecture: From User Command to Task Completion
A picture is worth a thousand words, and the project’s architecture diagram clearly reveals the data and control flow. The entire process can be summarized in several key steps:
User Input: Everything begins with your voice or text command, e.g., “Create a new Claude agent and have it list all files in the working directory.”
Tool Calling & Task Dispatch: The voice agent has a built-in set of “Tools,” which are its means of commanding other agents. Key tools include:
- create_agent(tool, type, agent_name): Creates a new agent of the specified type.
- command_agent(agent_name, prompt): Sends specific instructions to an existing agent.
- list_agents(): Views all active agents and their statuses.
- browser_use(task, url): Issues a task directly to the browser.
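For illustration, tools of this shape might be declared roughly as follows using the Realtime API's function-calling schema; the exact definitions the project registers may differ:

# A hedged sketch of two tool declarations; list_agents and browser_use
# would be declared analogously.
TOOLS = [
    {
        "type": "function",
        "name": "create_agent",
        "description": "Create a new Claude Code or Gemini browser agent.",
        "parameters": {
            "type": "object",
            "properties": {
                "tool": {"type": "string"},        # e.g. "claude_code" or "gemini" (assumed values)
                "type": {"type": "string"},
                "agent_name": {"type": "string"},
            },
            "required": ["tool", "type", "agent_name"],
        },
    },
    {
        "type": "function",
        "name": "command_agent",
        "description": "Send an instruction to an existing agent.",
        "parameters": {
            "type": "object",
            "properties": {
                "agent_name": {"type": "string"},
                "prompt": {"type": "string"},
            },
            "required": ["agent_name", "prompt"],
        },
    },
]
# These would then be passed to the session, e.g.
# await conn.session.update(session={"tools": TOOLS, "modalities": ["text"]})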
Agent Execution & State Management: Each created agent receives a unique session ID. Its state (such as history, current task) is saved in dedicated registry files within the working directory (e.g., agents/claude_code/ or agents/gemini/). This allows sessions to be paused and resumed, enabling persistent work across interactions.
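A minimal sketch of such file-based session persistence might look like the following, assuming JSON registry files named after each agent; the project's real registry schema is not documented here:

import json
from pathlib import Path

REGISTRY_DIR = Path("apps/content-gen/agents/claude_code")

def save_session(agent_name: str, session_id: str, history: list[dict]) -> None:
    # Persist enough state that the session can be resumed later.
    REGISTRY_DIR.mkdir(parents=True, exist_ok=True)
    state = {"session_id": session_id, "history": history}
    (REGISTRY_DIR / f"{agent_name}.json").write_text(json.dumps(state, indent=2))

def load_session(agent_name: str) -> dict | None:
    path = REGISTRY_DIR / f"{agent_name}.json"
    return json.loads(path.read_text()) if path.exists() else None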
Result Return & Feedback: The executing agent saves operation results (like code files, browser screenshots, execution logs) to designated locations (e.g., output_logs/screenshots/). The coordinator then feeds a result summary back to you, completing the loop.
Highlight Feature: Multi-Agent Observability
Managing a single agent isn’t simple; coordinating the activities of multiple agents requires clear visibility. This project integrates Multi-Agent Observability, an out-of-the-box monitoring solution.
- How It Works: The system automatically captures every tool call, file operation, and session lifecycle event through Claude Code Hooks. These events are sent in real time to a separate observability server dashboard.
- What You See: Opening http://localhost:3000 reveals a real-time event stream, including AI-generated task summaries, API call costs, and complete chat transcripts. This provides unprecedented transparency for debugging and understanding system behavior.
- Zero-Configuration Startup: This feature is pre-configured. Simply clone and run the observability server, and your agent activity data will automatically flow into the dashboard with no additional setup required.
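As a rough illustration of the event flow, a hook could forward an event to the dashboard with something like this; the /events endpoint and payload fields are assumptions, not the server's documented API:

import json, time, urllib.request

def emit_event(event_type: str, payload: dict) -> None:
    # Serialize one lifecycle event and POST it to the observability server.
    body = json.dumps({
        "type": event_type,          # e.g. "tool_call", "file_write" (assumed names)
        "timestamp": time.time(),
        "payload": payload,
    }).encode()
    req = urllib.request.Request(
        "http://localhost:3000/events",  # hypothetical ingest endpoint
        data=body, headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)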
Core Components & Project Structure
To truly master this system, understanding its physical composition is key. The project’s backbone is a Python script of over 3,000 lines (big_three_realtime_agents.py), but its internals are highly modular:
- Lines 184-616: Define the GeminiBrowserAgent class, encapsulating all browser automation logic.
- Lines 617-1540: Define the ClaudeCodeAgenticCoder class, containing all the code agent's capabilities and session management.
- Lines 1541-2900: Define the OpenAIRealtimeVoiceAgent class, the coordination center of the entire system.
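In skeleton form, the three classes relate roughly like this; the method signatures are illustrative assumptions, and the real classes are far richer:

class GeminiBrowserAgent:
    def run_task(self, task: str, url: str | None = None) -> str: ...

class ClaudeCodeAgenticCoder:
    def execute(self, prompt: str) -> str: ...

class OpenAIRealtimeVoiceAgent:
    """Coordination center: owns the registry of worker agents it creates."""
    def __init__(self) -> None:
        self.agents: dict[str, GeminiBrowserAgent | ClaudeCodeAgenticCoder] = {}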
The project directory structure is also thoughtfully designed to isolate different concerns:
big-3-super-agent/
├── apps/
│ ├── content-gen/ # [Core] Agent Workspace
│ │ ├── agents/ # Agent Session Registries
│ │ ├── backend/ # Agents will write backend code here
│ │ ├── frontend/ # Agents will write frontend code here
│ │ └── logs/ # Execution Logs
│ └── realtime-poc/ # Directory containing coordinator code
│ ├── big_three_realtime_agents.py # Main Script
│ ├── prompts/ # System prompts for various agents
│ └── output_logs/ # Coordinator logs and screenshots
This structure ensures that agent work output is organized and easy to manage.
Getting Started: A Step-by-Step Setup Guide from Zero to One
Theory is exciting, but practice is crucial. Here is the complete setup procedure, drawn from the project's documentation:
Prerequisites:
- Ensure your Python version is 3.11 or higher.
- Prepare the three required API keys: OpenAI (for voice coordination), Anthropic (for Claude), and Google Gemini (for browser automation).
Step One: Install the uv Package Manager
The project recommends using the high-performance Python package manager uv. Execute the following command in your terminal to install it:
curl -LsSf https://astral.sh/uv/install.sh | sh
Step Two: Clone the Project and Configure the Environment
- Navigate to the project coordinator directory: cd apps/realtime-poc
- Copy the environment variable template: cp .env.sample .env
- Open the .env file with a text editor and fill in at least these three items:
  OPENAI_API_KEY=sk-your-actual-key
  ANTHROPIC_API_KEY=sk-ant-your-actual-key
  GEMINI_API_KEY=your-actual-key
- You can also set ENGINEER_NAME (your name) and AGENT_WORKING_DIRECTORY (a custom agent working directory; leave it empty to use the default apps/content-gen).
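At startup, the script plausibly validates this configuration along these lines; this is a sketch assuming the python-dotenv package, not the project's exact code:

import os, sys
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory (apps/realtime-poc/)

REQUIRED = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GEMINI_API_KEY"]
missing = [k for k in REQUIRED if not os.getenv(k)]
if missing:
    sys.exit(f"Missing required keys in .env: {', '.join(missing)}")

# Fall back to the documented default when the variable is unset or empty.
working_dir = os.getenv("AGENT_WORKING_DIRECTORY") or "apps/content-gen"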
Step Three: Install the Browser Automation Tool
The system uses Playwright to drive a real browser. Run the following command to install the required browsers:
playwright install
Step Four: Run the System
You can choose to start it in different modes based on your needs:
- Full-Featured Voice Mode (recommended for the full experience): uv run big_three_realtime_agents.py --voice
- Text-Only Test Mode: uv run big_three_realtime_agents.py --input text --output text
- Quick Automatic Task Mode: uv run big_three_realtime_agents.py --prompt "Create a Claude agent and list the files in the working directory"
- Economy/Fast Mode (uses smaller models): uv run big_three_realtime_agents.py --mini --voice
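These flags could be wired up with argparse roughly as follows; this is a hedged sketch of the CLI surface described above, not the script's actual option handling (defaults are assumptions):

import argparse

parser = argparse.ArgumentParser(description="Big Three Realtime Agents")
parser.add_argument("--voice", action="store_true", help="full voice mode")
parser.add_argument("--input", choices=["voice", "text"], default="voice")
parser.add_argument("--output", choices=["voice", "text"], default="voice")
parser.add_argument("--prompt", help="run one task automatically, then exit")
parser.add_argument("--mini", action="store_true", help="use smaller, cheaper models")
args = parser.parse_args()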
Once started, you can converse with the system via your microphone and command your AI team.
Frequently Asked Questions (FAQ) & Troubleshooting
Q1: I get the error “OPENAI_API_KEY not set” on startup. What should I do?
A: This indicates the system is not reading your API key. Please ensure: 1) You are in the apps/realtime-poc/ directory; 2) A correctly filled .env file exists in this directory; 3) The key format is correct.
Q2: When running a command, I get a “playwright not found” error or a browser-related error.
A: You need to fully execute the playwright install command. This command downloads browser kernels like Chromium, which are essential for the Gemini agent’s visual operations.
Q3: The agent I created doesn’t seem to be working in the directory I specified.
A: The agent’s working directory is controlled by the environment variable AGENT_WORKING_DIRECTORY. Check the value of this variable in your .env file. If it’s empty, it defaults to apps/content-gen/. Ensure the path you specify exists.
Q4: Voice input isn’t working. How can I debug this?
A: First, ensure your operating system has granted microphone access to the Python program. Second, try testing the core agent functionality in text-only mode (--input text --output text) to rule out voice hardware or permission issues.
Project Evolution and Future Directions
Any complex system has room for improvement, and the project’s README honestly lists its directions for evolution, providing a window into its design trade-offs and future potential:
Known Areas for Improvement:
- Code Refactoring: The main coordinator script exceeds 3,000 lines. Although modularized, it could be further split into smaller, single-responsibility modules for better maintainability.
- Robustness Enhancement: More comprehensive API-failure retry mechanisms and error-recovery processes are needed.
- Isolation & Security: Consider stricter sandbox isolation for agent file-system operations.
- Deepening Observability: Implement structured logging with trace IDs and finer-grained API cost tracking per agent and session (a sketch of what this could look like follows this list).
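As an illustration of that last point, trace-ID structured logging could look something like this; it is a sketch of the proposed improvement, not existing project code:

import json, logging, uuid

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        # Emit one JSON object per log line so the dashboard can index fields.
        return json.dumps({
            "level": record.levelname,
            "msg": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
            "agent": getattr(record, "agent", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("big3")
log.addHandler(handler)
log.setLevel(logging.INFO)

trace_id = uuid.uuid4().hex  # one ID threaded through a whole task
log.info("tool_call create_agent", extra={"trace_id": trace_id, "agent": "claude-1"})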
Future Possibilities:
The project envisions many exciting possibilities, including support for multimodal input (images/video), shareable pre-configured agent templates, a plugin system for custom agent tools, support for multi-person collaborative coding, and even containerized cloud deployment. These directions outline a more powerful, open, and user-friendly AI agent collaboration platform.
Conclusion: Mastering the Future-Oriented Development Paradigm
“Big Three Realtime Agents” is more than just a cool tech demo. It represents an emerging paradigm in software development—Agentic Coding. In this paradigm, the engineer’s role gradually shifts from a direct code writer to a coordinator of AI capabilities, a definer of tasks, and a controller of quality.
By deeply understanding this project’s architecture, setup, and workflow, you’re not just learning how to use a tool; you are experiencing and contemplating a critical question: As AI capabilities grow increasingly powerful, how do we organize these capabilities more efficiently and intelligently to solve real-world problems? This is perhaps a question every future-oriented technology practitioner needs to explore.
Further Learning:
If you wish to delve deeper into the patterns and best practices of this agentic coding approach, you can refer to the Tactical Agentic Coding resource recommended by the project author. You can also watch a live demonstration of the project in action on the IndyDevDan YouTube channel for a more intuitive feel.
