AgentOS 2 Live: A Hands-On Guide to Building Low-Latency Voice Assistants with OpenAI Realtime API
Quick Summary
AgentOS 2 Live is an open-source, full-stack platform for creating real-time voice assistants using OpenAI’s Realtime API (powered by GPT-4o realtime). It delivers end-to-end voice-to-voice conversations with very low latency, built-in voice activity detection (VAD), animated robot face visualization, modular tool calling, and even hardware control integration for OrionStar robots. The project uses a clean monorepo structure (npm workspaces) with React + TypeScript on the front end, Node.js + Express + WebSocket on the back end, and a dedicated Android WebView bridge for physical robot deployment.
Real-time voice AI has moved from science fiction to practical development in just a couple of years. Developers now ask: How can I build a natural, interruptible voice assistant without stacking separate speech-to-text, LLM, and text-to-speech services — and still keep latency under control?
AgentOS 2 Live gives you a ready-to-run answer. In this article we’ll walk through the architecture, key technical decisions, setup steps, robot deployment, and practical ways to extend it — all based directly on the project’s own documentation.
Why Real-Time Voice Matters (and Why Latency Is the Make-or-Break Factor)
Traditional voice assistants follow a multi-hop pipeline:
- Record audio → send to speech-to-text service
- Get text → send to LLM
- Get text reply → send to text-to-speech service
- Play synthesized audio
Each hop typically adds 200–800 ms (sometimes more on poor networks). The total round-trip easily exceeds 1.5–3 seconds — long enough for users to feel the system is “thinking too hard” or has lagged.
OpenAI’s Realtime API changes the equation by processing audio natively inside a multimodal model (gpt-4o-realtime-preview). The model hears raw audio, reasons, and streams audio back — no intermediate ASR or TTS steps. AgentOS 2 Live is built exactly around this capability, aiming for fluid, human-like turn-taking.
Project Architecture at a Glance
The codebase uses a monorepo managed by npm workspaces, which keeps shared types and the communication protocol consistent across layers.
| Folder | Responsibility | Main Technologies |
|---|---|---|
| client/ | Browser UI, audio capture/playback, VAD, animations | React, TypeScript, Tailwind CSS, Web Audio API, Opus codec, VAD |
| server/ | WebSocket relay, OpenAI session management, static hosting | Node.js, TypeScript, Express, WebSocket |
| shared/ | Type-safe protocol between client & server | TypeScript interfaces & enums |
| e2e_android/ | Android app that embeds the web UI + exposes robot hardware | Kotlin, Android WebView, OrionStar RobotService SDK |
Communication flows over a single WebSocket connection using a strictly typed protocol defined in shared/. This design reduces bugs when you add new features or change message shapes.
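To make that concrete, here is a minimal sketch of what such a shared, discriminated-union protocol can look like. The message names and fields below are illustrative; the real definitions live in shared/ and will differ.

```typescript
// Illustrative sketch of a typed client/server protocol (not the actual shared/ code).
export enum ClientMessageType {
  AudioChunk = 'audio_chunk',
  StartSession = 'start_session',
  EndSession = 'end_session',
}

export enum ServerMessageType {
  AudioOutput = 'audio_output',
  TextOutput = 'text_output',
  ToolCall = 'tool_call',
}

export interface AudioChunkMessage {
  type: ClientMessageType.AudioChunk;
  // Base64-encoded Opus packet captured in the browser
  payload: string;
}

export interface TextOutputMessage {
  type: ServerMessageType.TextOutput;
  text: string;
}

// Discriminated unions let both sides narrow on `type` safely.
export type ClientMessage = AudioChunkMessage /* | ... */;
export type ServerMessage = TextOutputMessage /* | ... */;
```

Because both client/ and server/ import these types from the same workspace package, a change to a message shape fails the build everywhere it matters instead of surfacing as a runtime bug.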
Standout Technical Features
1. Ultra-Low-Latency Voice Pipeline
Audio captured from the microphone is Opus-encoded on the client, streamed in real time to the server, forwarded almost untouched to OpenAI, and streamed back as Opus packets. The client decodes and plays them immediately. Because there is no intermediate text round-trip, the delay from the end of the user's speech to the first audio response can stay well under 1.5 seconds in good network conditions.
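The relay pattern at the heart of this pipeline is simple. The sketch below shows the idea with the ws package; the real server/ code adds session configuration, the typed protocol, and error handling, and the endpoint and model name are taken from OpenAI's Realtime API documentation at the time of writing.

```typescript
// Minimal sketch of the relay: browser <-> Node server <-> OpenAI Realtime API.
import { WebSocketServer, WebSocket } from 'ws';

const wss = new WebSocketServer({ port: 8081 });

wss.on('connection', (browser) => {
  // One upstream connection to OpenAI per browser session.
  const upstream = new WebSocket(
    'wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview',
    {
      headers: {
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        'OpenAI-Beta': 'realtime=v1',
      },
    }
  );

  // Forward data in both directions with as little processing as possible.
  browser.on('message', (data) => {
    if (upstream.readyState === WebSocket.OPEN) upstream.send(data);
  });
  upstream.on('message', (data) => {
    if (browser.readyState === WebSocket.OPEN) browser.send(data);
  });

  // Tear the pair down together.
  browser.on('close', () => upstream.close());
  upstream.on('close', () => browser.close());
});
```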
2. Front-End Voice Activity Detection (VAD)
Sending continuous microphone data (including silence) wastes bandwidth and delays turn detection. AgentOS 2 Live runs VAD directly in the browser; when the user stops speaking for a short period, the SDK automatically stops the input stream and tells the model to generate a reply. A postinstall script copies the required VAD model files into the client’s public folder so everything works out of the box.
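For intuition, the sketch below implements the same "stop streaming when the user goes quiet" control flow with a crude energy threshold on the Web Audio API. The real client uses a proper VAD model (the assets copied by the postinstall script), so treat this only as an illustration of the idea.

```typescript
// Simplified silence detection in the browser (illustrative, not the project's VAD).
async function watchForSilence(onSilence: () => void, silenceMs = 800) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const ctx = new AudioContext();
  const analyser = ctx.createAnalyser();
  ctx.createMediaStreamSource(stream).connect(analyser);

  const samples = new Float32Array(analyser.fftSize);
  let lastVoice = performance.now();

  const tick = () => {
    analyser.getFloatTimeDomainData(samples);
    const rms = Math.sqrt(samples.reduce((sum, s) => sum + s * s, 0) / samples.length);
    if (rms > 0.02) lastVoice = performance.now(); // crude "voice present" check
    if (performance.now() - lastVoice > silenceMs) {
      onSilence(); // e.g. stop the input stream and ask the model to reply
      lastVoice = performance.now();
    }
    requestAnimationFrame(tick);
  };
  tick();
}
```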
3. Animated Robot Face UI
The front-end includes a lively robot face component that changes expression and mouth shape based on four states:
- Listening (user speaking)
- Thinking (model processing)
- Speaking (assistant audio playing)
- Idle
These visual cues make conversations feel more natural — especially valuable when the interface is shown on a physical robot screen.
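Conceptually the component is small: it receives the current conversation state and swaps visual assets accordingly. The sketch below is a React/TypeScript illustration; the asset paths and class names are made up, and the real component uses richer SVG and animation assets.

```tsx
// Illustrative state-driven face component (not the project's actual implementation).
import React from 'react';

type FaceState = 'listening' | 'thinking' | 'speaking' | 'idle';

const FACE_ASSETS: Record<FaceState, string> = {
  listening: '/faces/listening.svg',
  thinking: '/faces/thinking.svg',
  speaking: '/faces/speaking.svg',
  idle: '/faces/idle.svg',
};

export function RobotFace({ state }: { state: FaceState }) {
  // Swapping the asset (or a CSS animation class) per state is enough
  // for the UI to mirror the conversation.
  return (
    <img
      className={`robot-face robot-face--${state}`}
      src={FACE_ASSETS[state]}
      alt={state}
    />
  );
}
```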
4. Two Ready-to-Use Demo Scenes
The project ships with two practical verticals:
- Face Register — enroll and later recognize faces (useful for access control, membership systems, etc.)
- Advice 3C — guide users through IT & consumer electronics purchases (smartphones, laptops, headphones, etc.)
Both scenes demonstrate how to tune system prompts and tool calls for domain-specific behavior.
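One way to keep such scenes manageable is to store each scene's prompt and tool list as data and feed it into the SDK configuration shown later in this article. The prompts and tool names below are placeholders, not the project's actual wording.

```typescript
// Illustrative scene configuration (placeholder prompts and tool names).
const SCENES = {
  faceRegister: {
    systemPrompt: 'You help visitors enroll their face and greet returning users by name.',
    tools: ['register_face', 'recognize_face'],
  },
  advice3C: {
    systemPrompt: 'You are a friendly 3C electronics shopping guide for phones, laptops and headphones.',
    tools: ['search_products', 'compare_products'],
  },
} as const;

type SceneId = keyof typeof SCENES;

function sceneConfig(id: SceneId) {
  return SCENES[id];
}
```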
5. Physical Robot Integration (OrionStar)
The e2e_android/ folder contains an Android app that loads the React web UI inside a WebView and exposes robot hardware through JavaScript bridges. Capabilities include:
- Head rotation
- Mobile base navigation
- Sensor reading
This turns a pure web voice assistant into a complete embodied agent.
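From the web side, a bridge call is just a guarded method call on an object the WebView injects. The bridge object and method names below are hypothetical; the real ones are defined in e2e_android/.

```typescript
// Hypothetical sketch of calling a WebView JavaScript bridge from the web UI.
// The actual bridge object and method names live in e2e_android/ and will differ.
declare global {
  interface Window {
    RobotBridge?: {
      rotateHead(angleDegrees: number): void;
      moveTo(waypoint: string): void;
    };
  }
}

export function nodHead() {
  // In a plain desktop browser the bridge is absent, so guard every call.
  window.RobotBridge?.rotateHead(15);
}
```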
Step-by-Step: Running the Project Locally
Prerequisites
- Node.js ≥ 18
- npm ≥ 9
- OpenAI API key with access to gpt-4o-realtime-preview
1. Environment Setup
Create .env in the project root:
OPENAI_API_KEY=sk-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
PORT=8081
USE_SSL=false
USE_SSL=true requires valid HTTPS certificates (for wss://); most local development uses plain ws://.
2. Install Dependencies
npm install
The postinstall hook automatically copies VAD assets.
3. Development Mode (Hot Reloading)
npm run dev
- Web UI → http://localhost:3000
- Server → http://localhost:8081
Open the browser URL and grant microphone permission to start talking.
4. Production Build & Run
npm run build
node server/dist/index.js
The server now serves both the bundled React app and the WebSocket endpoint from one process.
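The single-process setup boils down to attaching Express static hosting and the WebSocket server to the same HTTP server. The sketch below shows that wiring with illustrative paths; the real entry point is server/dist/index.js.

```typescript
// Sketch of serving the built client and the WebSocket endpoint from one process.
import express from 'express';
import { createServer } from 'http';
import { WebSocketServer } from 'ws';
import path from 'path';

const app = express();
// Illustrative path to the bundled React app.
app.use(express.static(path.join(process.cwd(), 'client/dist')));

const httpServer = createServer(app);
const wss = new WebSocketServer({ server: httpServer });

wss.on('connection', (socket) => {
  socket.on('message', (msg) => {
    /* relay to OpenAI as in the dev setup */
  });
});

httpServer.listen(Number(process.env.PORT ?? 8081), () => {
  console.log('Serving UI and WebSocket on one port');
});
```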
Deploying on an OrionStar Robot
Requirements
- Android Studio
- RobotService SDK .jar file (copy from the robot system into e2e_android/app/libs/)
Steps
- Open the e2e_android folder in Android Studio
- Sync Gradle
- Connect the robot via USB (enable Developer Options & USB debugging)
- Run the app
- Grant camera + microphone permissions on first launch
By default the WebView loads http://localhost:3000. To use a remote or different server:
- Edit DEFAULT_URL in MainActivity.kt
- Or pass the URL via an Android Intent
Using AgentSDK — The Easiest Way to Customize
The provided AgentSDK class hides most of the WebSocket and audio complexity:
import { AgentSDK } from './sdk';
const agent = new AgentSDK({
modelType: 'openai',
systemPrompt: 'You are a friendly 3C electronics shopping guide.',
voice: 'alloy',
tools: [
{
name: 'get_current_weather',
description: 'Fetch current weather for any location',
parameters: {
type: 'object',
properties: { location: { type: 'string' } }
}
}
]
});
agent.on('ready', () => console.log('Assistant ready'));
agent.on('text_output', ({ text }) => console.log('AI:', text));
agent.on('tool_call', (call) => {
console.log(`Tool requested: ${call.name}`);
// Implement your tool logic here
});
await agent.initialize();
agent.connect();
Listen to events, implement tool handlers, and you have a working custom voice agent.
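As an example, here is one way the get_current_weather tool declared above could be handled. The call.arguments and call.id fields and the sendToolResult method are assumptions for illustration; check the AgentSDK source in client/ for the actual field and method names.

```typescript
// Expanding the tool_call handler from the example above.
// call.arguments, call.id, and sendToolResult are assumed names, not confirmed API.
agent.on('tool_call', async (call) => {
  if (call.name === 'get_current_weather') {
    const { location } = call.arguments as { location: string }; // assumed field
    // Stub result; replace with a real weather API lookup.
    const result = { location, temperatureC: 21, condition: 'sunny' };
    console.log('Tool result to return to the model:', result);
    // agent.sendToolResult(call.id, JSON.stringify(result)); // hypothetical method
  }
});
```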
Frequently Asked Questions (FAQ)
Can I use a different model or provider?
The current implementation targets OpenAI Realtime API exclusively (modelType: 'openai'). The clean protocol in shared/ makes it feasible to extend support later.
What kind of latency should I expect?
Exact numbers depend on network quality, but the speech-to-speech design plus client-side VAD typically delivers the first audio chunk within 800–1500 ms under good conditions.
How is the robot face animation implemented?
A React component subscribes to conversation state changes (listening / thinking / speaking / idle) and swaps between different SVG / animation assets.
How do I add my own business logic or new scenes?
The simplest path is changing the systemPrompt. For more advanced routing, maintain multiple sessions on the server and switch based on user intent or keywords.
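A minimal version of that keyword-based switch might look like the sketch below; real deployments could let the model itself classify intent instead.

```typescript
// Illustrative keyword-based routing between the two demo scenes.
function pickScene(userText: string): 'faceRegister' | 'advice3C' {
  const t = userText.toLowerCase();
  if (/face|register|enroll|member/.test(t)) return 'faceRegister';
  return 'advice3C'; // default to the shopping-guide scene
}
```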
Is this production-ready or just a demo?
The README states clearly: this project is for demonstration and testing. Any commercial deployment must follow OpenAI’s usage policies — especially rules around disclosing synthetic voice.
Who Should Use AgentOS 2 Live?
This starter kit fits best if you:
- Want to experiment quickly with the OpenAI Realtime API
- Need an animated face visualization out of the box
- Plan to control physical robots (especially OrionStar models)
- Value very low end-to-end voice latency
You can go from git clone to a working microphone conversation in under half an hour. From there, customizing prompts, adding tools, and polishing animations becomes the main work.
If you’ve already tried AgentOS 2 Live or built something similar, feel free to share your experience in the comments — what worked well, what you extended, or any rough edges you hit. The real-time voice space is moving fast, and community feedback helps everyone move faster too.
