Comprehensive Guide to Agent-Browser: The Ultimate Headless Browser Automation CLI for AI Agents

「Agent-Browser is a high-performance headless browser automation Command Line Interface (CLI) designed specifically for AI agents. Built with a fast Rust CLI frontend and a Node.js fallback, it leverages Playwright to manage Chromium instances, supporting semantic locators, refs for deterministic element selection, and isolated sessions across macOS, Linux, and Windows platforms.」

Introduction: Bridging AI Agents and Web Automation

As AI agents take on more tasks, they need to interact with the web in a structured, reliable, and efficient way. Traditional browser automation tools often fall short when integrated with Large Language Models (LLMs) because of complex selector strategies, slow execution, or a lack of machine-readable output. Agent-Browser addresses these challenges with a specialized CLI that connects AI reasoning to browser control.

The tool uses a client-daemon architecture in which a Rust CLI handles command parsing and communication, while a Node.js daemon manages the Playwright browser instance. This hybrid approach combines the speed of a native binary with the flexibility of the Node.js ecosystem. Whether you are building an AI assistant that books flights, a data collection agent, or a testing bot, Agent-Browser provides the granular control required for sophisticated web interactions.

Technical Architecture and Platform Support

Understanding the underpinnings of Agent-Browser helps in appreciating its performance and reliability. The architecture is designed for speed and persistence.

Client-Daemon Architecture

The system consists of two cooperating components and a fallback path:

  1. 「Rust CLI:」 This serves as the fast native binary that parses incoming commands and handles communication with the backend.
  2. 「Node.js Daemon:」 This component manages the actual Playwright browser instance. It starts automatically upon the first command execution and persists in the background.
  3. 「Fallback Mechanism:」 In environments where the native binary is unavailable, the system falls back to running directly via Node.js.

Because the daemon persists between commands, subsequent operations are significantly faster: no new browser process has to be launched for each action.

Supported Platforms and Binaries

Agent-Browser offers broad support across different operating systems and architectures, ensuring compatibility regardless of your deployment environment.

| Platform | Binary Type | Fallback Option |
| --- | --- | --- |
| macOS ARM64 | Native Rust | Node.js |
| macOS x64 | Native Rust | Node.js |
| Linux ARM64 | Native Rust | Node.js |
| Linux x64 | Native Rust | Node.js |
| Windows x64 | Native Rust | Node.js |

The browser engine defaults to Chromium, though the daemon also supports Firefox and WebKit through the Playwright protocol for users requiring alternative rendering engines.

Installation and Setup

Getting Agent-Browser up and running is a straightforward process, with multiple installation methods catering to different development workflows.

Method 1: NPM Installation (Recommended)

The quickest way to get started is via npm, which handles the installation of the CLI package and the necessary browser binaries.

npm install -g agent-browser
agent-browser install

The agent-browser install command downloads the Chromium browser to your local environment.

Method 2: Installation from Source

For developers who prefer building from source or need to customize the build, the following steps are required. Note that this method requires the Rust toolchain.

git clone https://github.com/vercel-labs/agent-browser
cd agent-browser
pnpm install
pnpm build
pnpm build:native   # Requires Rust (https://rustup.rs)
pnpm link --global  # Makes agent-browser available globally
agent-browser install

Linux System Dependencies

Users deploying on Linux must ensure that the required system-level libraries are present for the browser to function correctly.

agent-browser install --with-deps

Alternatively, you can install these dependencies manually using Playwright’s native commands:

npx playwright install-deps chromium

Core Commands and Workflow

Once installed, Agent-Browser exposes a rich set of commands that cover every aspect of browser manipulation. The typical workflow involves navigation, inspection, interaction, and verification.

Navigation and Lifecycle Management

The entry point for any automation task is opening a URL.

agent-browser open <url>              # Navigate to URL

The open command has aliases such as goto and navigate. Once a session is complete, close the browser to free up resources.

agent-browser close                   # Close browser (aliases: quit, exit)

Basic Interaction Commands

Interacting with page elements is the core function of the tool. Agent-Browser provides explicit commands for various user actions.

  • 「Clicking:」 agent-browser click <sel> or agent-browser dblclick <sel> for double-clicks.
  • 「Input and Typing:」
    • agent-browser type <sel> <text> – Types text into an element without clearing existing content.
    • agent-browser fill <sel> <text> – Clears the field and fills it with the specified text.
  • 「Form Controls:」
    • agent-browser select <sel> <val> – Selects an option from a dropdown.
    • agent-browser check <sel> and agent-browser uncheck <sel> – Toggles checkboxes.
  • 「Mouse and Keyboard:」
    • agent-browser hover <sel> – Simulates mouse hover.
    • agent-browser press <key> – Presses a key (e.g., Enter, Tab, Control+a).
    • agent-browser drag <src> <tgt> – Performs drag-and-drop operations.

Advanced Navigation

For more complex page states, the tool supports scrolling and file uploads.

agent-browser scroll <dir> [px]       # Scroll (up/down/left/right)
agent-browser scrollintoview <sel>    # Scroll element into view
agent-browser upload <sel> <files>    # Upload files

The “Refs” System: The Optimal Workflow for AI

One of the most significant innovations in Agent-Browser is the “Refs” system. Traditional selectors like CSS or XPath can be brittle and difficult for LLMs to use reliably. Refs provide a deterministic way to interact with elements based on a snapshot of the page’s accessibility tree.

How Refs Work

  1. 「Capture Snapshot:」 Take a snapshot of the current page state. The tool returns a text-based representation of the accessibility tree and assigns a unique ref (shown as [ref=e1], [ref=e2], and so on) to each element in the output.

    agent-browser snapshot

    「Output Example:」

    - heading "Example Domain" [ref=e1] [level=1]
    - button "Submit" [ref=e2]
    - textbox "Email" [ref=e3]
    - link "Learn more" [ref=e4]

  2. 「Interact via Refs:」 Pass a ref, prefixed with @, to an action command. This is deterministic because the ref points to the exact element captured in the snapshot, so the AI does not have to guess at a CSS selector.

    agent-browser click @e2                   # Click the button
    agent-browser fill @e3 "test@example.com" # Fill the textbox
    agent-browser get text @e1                # Get heading text

「Why use refs?」

  • 「Deterministic:」 The ref points to the exact element from the snapshot.
  • 「Fast:」 No DOM re-query is needed for the interaction.
  • 「AI-Friendly:」 This snapshot + ref workflow is specifically optimized for Large Language Models.
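
An agent harness usually has to turn the snapshot text back into structured data before it can pick a ref. The following is a minimal sketch of that step, assuming the line format shown in the output example above. The names parseSnapshotRefs and SnapshotElement are hypothetical helpers, not part of Agent-Browser, and lines without a quoted name are simply skipped.

```typescript
// Hypothetical helper for illustration; not part of agent-browser itself.
interface SnapshotElement {
  ref: string;  // e.g. "e2"; written as "@e2" on the command line
  role: string; // e.g. "button"
  name: string; // accessible name, e.g. "Submit"
}

// Matches lines such as: - button "Submit" [ref=e2]
// Assumes the text format from the output example; lines without a quoted name are skipped.
const SNAPSHOT_LINE = /^\s*-\s+(\w+)\s+"([^"]*)".*\[ref=(\w+)\]/;

function parseSnapshotRefs(snapshot: string): SnapshotElement[] {
  const elements: SnapshotElement[] = [];
  for (const line of snapshot.split("\n")) {
    const match = SNAPSHOT_LINE.exec(line);
    if (match) {
      const [, role = "", name = "", ref = ""] = match;
      elements.push({ ref, role, name });
    }
  }
  return elements;
}

// Usage: find the Submit button and build the arguments for a click command.
const sampleSnapshot = [
  '- heading "Example Domain" [ref=e1] [level=1]',
  '- button "Submit" [ref=e2]',
  '- textbox "Email" [ref=e3]',
  '- link "Learn more" [ref=e4]',
].join("\n");

const submitButton = parseSnapshotRefs(sampleSnapshot).find(
  (el) => el.role === "button" && el.name === "Submit",
);
const clickArgs = submitButton ? ["click", "@" + submitButton.ref] : [];
```

Here clickArgs holds the arguments an agent would pass to agent-browser. Building an argument array, rather than a shell string, avoids quoting problems when element names contain spaces.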

Snapshot Options

To manage the amount of data sent to an AI, the snapshot command supports powerful filtering options.

| Option | Description |
| --- | --- |
| -i, --interactive | Only show interactive elements (buttons, links, inputs). |
| -c, --compact | Remove empty structural elements. |
| -d, --depth <n> | Limit the tree depth to n levels. |
| -s, --selector <sel> | Scope the snapshot to a specific CSS selector. |

You can combine these options. For example, to get a compact snapshot of only the interactive elements inside #main, limited to five levels of depth:

agent-browser snapshot -i -c -d 5 -s "#main"
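
When an agent chooses these filters at run time, it can help to build the argument list from a typed options object. This is a minimal sketch using only the flags listed above; SnapshotOptions and snapshotArgs are hypothetical names introduced for illustration.

```typescript
// Hypothetical options object; each field maps to one flag documented above.
interface SnapshotOptions {
  interactive?: boolean; // -i
  compact?: boolean;     // -c
  depth?: number;        // -d <n>
  selector?: string;     // -s <sel>
}

// Builds the argument list for "agent-browser snapshot ...".
function snapshotArgs(options: SnapshotOptions = {}): string[] {
  const args = ["snapshot"];
  if (options.interactive) args.push("-i");
  if (options.compact) args.push("-c");
  if (options.depth !== undefined) args.push("-d", String(options.depth));
  if (options.selector !== undefined) args.push("-s", options.selector);
  return args;
}

// Usage: the same combination as the command-line example.
const scopedSnapshotArgs = snapshotArgs({
  interactive: true,
  compact: true,
  depth: 5,
  selector: "#main",
});
```

The resulting array corresponds to snapshot -i -c -d 5 -s "#main". The selector needs no shell quoting because it is passed as a single array element.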

Semantic Locators

In addition to refs, Agent-Browser supports Semantic Locators. These allow finding elements based on their meaning (ARIA roles, labels, text) rather than their styling or structure. This is generally more robust against UI changes than CSS selectors.

agent-browser find role <role> <action> [value]
agent-browser find text <text> <action>
agent-browser find label <label> <action> [value]

「Supported Actions:」 click, fill, check, hover, text.
「Examples:」

agent-browser find role button click --name "Submit"
agent-browser find text "Sign In" click
agent-browser find label "Email" fill "test@test.com"

Other semantic finders include finding by placeholder, alt text, title attribute, or data-testid.

Information Retrieval and State Checking

Automation is not just about clicking buttons; it is about verifying results. Agent-Browser provides comprehensive commands for getting data and checking element states.

Getting Information

  • agent-browser get text <sel> – Retrieves the text content of an element.
  • agent-browser get html <sel> – Gets the inner HTML.
  • agent-browser get value <sel> – Gets the value of an input field.
  • agent-browser get attr <sel> <attr> – Retrieves a specific attribute value.
  • agent-browser get title – Returns the page title.
  • agent-browser get url – Returns the current URL.
  • agent-browser get box <sel> – Gets the bounding box coordinates.

Checking State

These commands return a boolean status indicating the condition of the element.

  • agent-browser is visible <sel>
  • agent-browser is enabled <sel>
  • agent-browser is checked <sel>

Waiting and Synchronization

Handling dynamic content is critical for reliable automation. Agent-Browser offers versatile waiting strategies.

  • 「Wait for Selector:」 agent-browser wait <selector> – Pauses execution until the element is visible.
  • 「Wait for Time:」 agent-browser wait <ms> – Pauses for a specific number of milliseconds.
  • 「Wait for Text:」 agent-browser wait --text "Welcome" – Waits until specific text appears on the page.
  • 「Wait for URL:」 agent-browser wait --url "**/dash" – Waits until the URL matches a glob pattern.
  • 「Wait for Network State:」 agent-browser wait --load networkidle – Waits until network requests are idle.
  • 「Wait for JS Condition:」 agent-browser wait --fn "window.ready === true" – Waits until a JavaScript condition evaluates to true.

Supported load states include load, domcontentloaded, and networkidle.
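
The waiting strategies above map naturally onto a discriminated union, which lets the compiler reject invalid combinations such as an unknown load state. This is a sketch under that assumption; WaitCondition and waitArgs are hypothetical names, and only the flags documented in the list are used.

```typescript
// Hypothetical type: one variant per waiting strategy listed above.
type WaitCondition =
  | { kind: "selector"; selector: string }
  | { kind: "time"; ms: number }
  | { kind: "text"; text: string }
  | { kind: "url"; pattern: string }
  | { kind: "load"; state: "load" | "domcontentloaded" | "networkidle" }
  | { kind: "fn"; expression: string };

// Builds the argument list for "agent-browser wait ...".
function waitArgs(condition: WaitCondition): string[] {
  switch (condition.kind) {
    case "selector":
      return ["wait", condition.selector];
    case "time":
      return ["wait", String(condition.ms)];
    case "text":
      return ["wait", "--text", condition.text];
    case "url":
      return ["wait", "--url", condition.pattern];
    case "load":
      return ["wait", "--load", condition.state];
    case "fn":
      return ["wait", "--fn", condition.expression];
  }
}

// Usage: wait for the dashboard URL, then for the network to go idle.
const waitSteps = [
  waitArgs({ kind: "url", pattern: "**/dash" }),
  waitArgs({ kind: "load", state: "networkidle" }),
];
```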

Session Management and Isolation

Agent-Browser allows you to run multiple, completely isolated browser instances simultaneously. This is crucial for scenarios where you need to perform tasks for different users or in different contexts concurrently.

Creating Sessions

You can specify a session name using the --session flag.

agent-browser --session agent1 open site-a.com
agent-browser --session agent2 open site-b.com

Alternatively, you can set the session via an environment variable:

AGENT_BROWSER_SESSION=agent1 agent-browser click "#btn"

Managing Sessions

Each session maintains its own isolated state:

  • Browser Instance
  • Cookies and Storage
  • Navigation History
  • Authentication State

To see all active sessions:

agent-browser session list

「Output:」

Active sessions:
-> default
   agent1

To view the current session, simply run:

agent-browser session
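
A supervising process may want to read the session list programmatically. The sketch below parses the plain-text output shown above. It assumes that the "->" marker denotes the current session and that every other non-header line is a session name; the source does not state either explicitly. parseSessionList is a hypothetical helper.

```typescript
// Hypothetical result shape for the parsed session list.
interface SessionList {
  current: string | null; // session marked with "->" (assumed to be the current one)
  sessions: string[];     // all session names in listed order
}

function parseSessionList(output: string): SessionList {
  const result: SessionList = { current: null, sessions: [] };
  for (const rawLine of output.split("\n")) {
    const line = rawLine.trim();
    // Skip blank lines and the "Active sessions:" header.
    if (line === "" || line.endsWith(":")) continue;
    if (line.startsWith("->")) {
      const name = line.slice(2).trim();
      result.current = name;
      result.sessions.push(name);
    } else {
      result.sessions.push(line);
    }
  }
  return result;
}

// Usage with the output shown above.
const listedSessions = parseSessionList("Active sessions:\n-> default\n   agent1");
```

If the --json flag is available for this command, parsing that output would be more robust than parsing text; the text parser here is only a fallback sketch.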

Browser Settings and Emulation

To match specific devices, locations, or test environments, Agent-Browser allows detailed configuration of the browser context.

agent-browser set viewport <w> <h>    # Set viewport size (e.g., 1280 720)
agent-browser set device <name>       # Emulate device (e.g., "iPhone 14")
agent-browser set geo <lat> <lng>     # Set geolocation coordinates
agent-browser set offline [on|off]    # Toggle offline mode
agent-browser set media [dark|light]  # Emulate color scheme

Authentication and Headers

You can authenticate without going through a login UI by setting HTTP headers scoped to a specific origin.

agent-browser open api.example.com --headers '{"Authorization": "Bearer <token>"}'

This approach is useful for:

  • 「Skipping Login Flows:」 Authenticate via headers instead of UI interactions.
  • 「Switching Users:」 Start new sessions with different auth tokens instantly.
  • 「API Testing:」 Access protected endpoints directly.
  • 「Security:」 Headers are scoped to the origin; they are not leaked to other domains.

For global headers applicable to all domains, use the set headers command:

agent-browser set headers '{"X-Custom-Header": "value"}'
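
To illustrate the "Switching Users" case, the following sketch builds the arguments for opening a URL in a named session with an origin-scoped Authorization header. It combines the --session and --headers options exactly as they appear in the examples in this guide; openAsUserArgs and the token values are hypothetical.

```typescript
// Hypothetical helper: one isolated session per user, authenticated via headers.
function openAsUserArgs(session: string, url: string, token: string): string[] {
  const headers = { Authorization: "Bearer " + token };
  // JSON.stringify produces the JSON object that --headers expects.
  return ["--session", session, "open", url, "--headers", JSON.stringify(headers)];
}

// Usage: two users, two sessions, two tokens (placeholder values).
const aliceArgs = openAsUserArgs("alice", "api.example.com", "token-for-alice");
const bobArgs = openAsUserArgs("bob", "api.example.com", "token-for-bob");
```

Passing the JSON as a single array element to a process-spawning API avoids the single-quote escaping needed when typing the command in a shell.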

Network Control and Interception

Advanced users can intercept, modify, or block network requests. This is useful for blocking trackers, testing API error states, or mocking responses.

agent-browser network route <url> --abort      # Block requests matching URL
agent-browser network route <url> --body <json>  # Mock response body
agent-browser network unroute [url]            # Remove routes
agent-browser network requests                 # View tracked requests

Cookies and Storage Management

Direct access to cookies and storage allows for complex state management.

  • 「Cookies:」 agent-browser cookies (list), agent-browser cookies set <name> <val> (set), agent-browser cookies clear (clear).
  • 「Local Storage:」 agent-browser storage local (get all), agent-browser storage local <key> (get specific), agent-browser storage local set <k> <v> (set), agent-browser storage local clear (clear).
  • 「Session Storage:」 Commands mirror local storage, e.g., agent-browser storage session.

Agent Mode: JSON Output for AI Integration

When integrating Agent-Browser with AI agents, structured output is essential. Using the --json flag converts the output into a machine-readable format.

agent-browser snapshot --json

「Returns:」

{
  "success": true,
  "data": {
    "snapshot": "...",
    "refs": {
      "e1": {"role": "heading", "name": "Title"},
      ...
    }
  }
}

This enables the AI to parse the page structure programmatically and make decisions based on the JSON data.
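
As a rough sketch of that parsing step, the code below validates the envelope and looks up a ref by role and name. It assumes only the success shape shown above; the shape of a failure response is not documented here, so anything else is rejected with an error. SnapshotResult, parseSnapshotJson, and findRef are hypothetical names.

```typescript
// Types mirror the success response shown above.
interface RefInfo {
  role: string;
  name: string;
}

interface SnapshotResult {
  success: true;
  data: { snapshot: string; refs: Record<string, RefInfo> };
}

// Parses the stdout of "agent-browser snapshot --json" and checks its shape.
function parseSnapshotJson(stdout: string): SnapshotResult {
  const parsed: unknown = JSON.parse(stdout);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("expected a JSON object");
  }
  const envelope = parsed as { success?: unknown; data?: unknown };
  if (envelope.success !== true) {
    throw new Error("command did not report success");
  }
  const data = envelope.data as { snapshot?: unknown; refs?: unknown } | undefined;
  if (!data || typeof data.snapshot !== "string" || typeof data.refs !== "object" || data.refs === null) {
    throw new Error("unexpected data shape");
  }
  return {
    success: true,
    data: { snapshot: data.snapshot, refs: data.refs as Record<string, RefInfo> },
  };
}

// Returns the command-line form of the first matching ref (e.g. "@e1"), if any.
function findRef(result: SnapshotResult, role: string, name: string): string | undefined {
  for (const key of Object.keys(result.data.refs)) {
    const info = result.data.refs[key];
    if (info && info.role === role && info.name === name) return "@" + key;
  }
  return undefined;
}

// Usage with a response shaped like the example above.
const sampleJson = '{"success":true,"data":{"snapshot":"...","refs":{"e1":{"role":"heading","name":"Title"}}}}';
const headingRef = findRef(parseSnapshotJson(sampleJson), "heading", "Title");
```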

Advanced Features: CDP, Streaming, and Serverless

Custom Browser Executables

You can instruct Agent-Browser to use a specific browser executable path instead of the bundled Chromium. This is particularly useful for 「serverless deployment」 (e.g., Vercel, AWS Lambda) where size constraints matter. For instance, you can use @sparticuz/chromium (~50MB) instead of the full bundle (~684MB).

agent-browser --executable-path /path/to/chromium open example.com

Or via environment variable:

AGENT_BROWSER_EXECUTABLE_PATH=/path/to/chromium agent-browser open example.com

CDP Mode (Chrome DevTools Protocol)

Agent-Browser can attach to an already running browser instance via the Chrome DevTools Protocol (CDP). This enables control over Electron apps, Chrome instances running with remote debugging, or WebView2 applications.

# Start Chrome with: google-chrome --remote-debugging-port=9222
agent-browser --cdp 9222 open about:blank

Streaming: Browser Preview

For scenarios requiring human oversight or “pair browsing,” Agent-Browser can stream the browser viewport via WebSocket.

「Enable Streaming:」

AGENT_BROWSER_STREAM_PORT=9223 agent-browser open example.com

This starts a WebSocket server on port 9223. You can connect to ws://localhost:9223 to receive base64-encoded JPEG frames of the viewport and send input events (mouse, keyboard, touch) back to the browser.

「Frame Data Structure:」

{
  "type": "frame",
  "data": "<base64-encoded-jpeg>",
  "metadata": {
    "deviceWidth": 1280,
    "deviceHeight": 720,
    "pageScaleFactor": 1,
    "offsetTop": 0,
    "scrollOffsetX": 0,
    "scrollOffsetY": 0
  }
}
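
A preview client needs to validate each message, display the image, and translate clicks on a scaled preview back to page coordinates. The sketch below covers those three steps for the frame structure above. The coordinate mapping assumes that input events use the deviceWidth by deviceHeight coordinate space, which the source does not confirm; parseFrame, frameToDataUrl, and toViewportPoint are hypothetical helpers.

```typescript
// Types mirror the frame structure shown above.
interface FrameMetadata {
  deviceWidth: number;
  deviceHeight: number;
  pageScaleFactor: number;
  offsetTop: number;
  scrollOffsetX: number;
  scrollOffsetY: number;
}

interface FrameMessage {
  type: "frame";
  data: string; // base64-encoded JPEG
  metadata: FrameMetadata;
}

// Returns the frame if the message has the expected shape, otherwise null.
function parseFrame(raw: string): FrameMessage | null {
  const msg = JSON.parse(raw) as Partial<FrameMessage>;
  if (msg.type !== "frame" || typeof msg.data !== "string" || !msg.metadata) {
    return null;
  }
  return { type: "frame", data: msg.data, metadata: msg.metadata };
}

// A data URL can be assigned directly to an <img> element's src attribute.
function frameToDataUrl(frame: FrameMessage): string {
  return "data:image/jpeg;base64," + frame.data;
}

// Assumption: input coordinates are expressed in deviceWidth x deviceHeight space.
function toViewportPoint(
  meta: FrameMetadata,
  previewWidth: number,
  previewHeight: number,
  x: number,
  y: number,
): { x: number; y: number } {
  return {
    x: Math.round((x * meta.deviceWidth) / previewWidth),
    y: Math.round((y * meta.deviceHeight) / previewHeight),
  };
}
```

For example, with a 1280 by 720 viewport shown in a 640 by 360 preview, a click at the preview's center (320, 180) maps to (640, 360).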

Programmatic API Usage

While this guide focuses on the CLI, Agent-Browser also exposes a programmatic API for TypeScript/JavaScript users, allowing direct control over the browser manager, screencasting, and event injection within custom applications.

import { BrowserManager } from 'agent-browser';

// Launch a headless browser and open a page
const browser = new BrowserManager();
await browser.launch({ headless: true });
await browser.navigate('https://example.com');

// Start screencast; the callback runs for each captured frame
await browser.startScreencast((frame) => {
  console.log('Frame received:', frame.metadata.deviceWidth, 'x', frame.metadata.deviceHeight);
}, {
  format: 'jpeg',
  quality: 80,
  maxWidth: 1280,
  maxHeight: 720,
});

// Inject a left mouse-button press at (100, 200)
await browser.injectMouseEvent({
  type: 'mousePressed',
  x: 100,
  y: 200,
  button: 'left',
});

// Stop streaming frames
await browser.stopScreencast();

Debugging and Troubleshooting

Debugging browser automation can be difficult. Agent-Browser provides several tools to aid developers.

  • 「Headed Mode:」 Run with a visible UI to see what the bot sees.

    agent-browser open example.com --headed
    
  • 「Console Logging:」 View browser console messages.

    agent-browser console
    agent-browser console --clear
    
  • 「Error Viewing:」 Inspect page errors.

    agent-browser errors
    
  • 「Tracing:」 Record a trace of the execution for detailed analysis.

    agent-browser trace start [path]
    # ... perform actions ...
    agent-browser trace stop [path]
    
  • 「Highlighting:」 Visually highlight an element to verify selectors.

    agent-browser highlight <sel>
    
  • 「State Management:」 Save and load authentication states to avoid repeated logins during development.

    agent-browser state save <path>
    agent-browser state load <path>
    

Global Options Reference

Several global flags can be appended to commands to modify behavior.

| Option | Description |
| --- | --- |
| --session <name> | Use an isolated session name. |
| --headers <json> | Set HTTP headers scoped to the URL’s origin. |
| --executable-path <path> | Path to custom browser executable. |
| --json | Output results in JSON format. |
| --full, -f | Capture full page screenshot. |
| --headed | Show browser window (non-headless). |
| --cdp <port> | Connect via Chrome DevTools Protocol port. |
| --debug | Enable debug output. |

Frequently Asked Questions (FAQ)

「How does Agent-Browser differ from standard tools like Selenium or Puppeteer?」

Agent-Browser is specifically architected with a hybrid Rust/Node.js model to maximize speed for AI agents. It introduces features like “Refs” and accessibility tree snapshots which are optimized for LLM consumption, reducing the ambiguity found in standard CSS selectors. Furthermore, its built-in session isolation and JSON output modes make it more “AI-native” than traditional testing frameworks.

「Can I use Agent-Browser in a serverless environment like AWS Lambda?」

Yes. Agent-Browser supports custom browser executables. You can configure it to use a lightweight Chromium build such as @sparticuz/chromium (approximately 50MB) instead of the standard bundled version (approximately 684MB), making it suitable for the size and memory constraints of serverless functions.

「What is the best way to select elements for an AI agent?」

The recommended method is using the “Refs” system. First, execute agent-browser snapshot -i to get an interactive element tree with reference IDs. Then, use those IDs (e.g., @e2, @e3) for interaction. This method is deterministic and faster than re-querying the DOM with CSS selectors.

「How do I handle authentication without manually logging in every time?」

You can use the --headers flag to inject Authorization headers when opening a URL. This allows you to bypass the UI login flow entirely. Additionally, you can save and load the authentication state using agent-browser state save <path> and agent-browser state load <path>.

「Is it possible to watch what the AI agent is doing in real-time?」

Yes. By setting the AGENT_BROWSER_STREAM_PORT environment variable (e.g., 9223), Agent-Browser starts a WebSocket server that streams the browser viewport. You can connect to this port to receive live JPEG frames of the browser session and even inject mouse/keyboard events for “pair browsing.”

「What platforms are supported by the native Rust binary?」

The native Rust binary is supported on macOS (ARM64 and x64), Linux (ARM64 and x64), and Windows (x64). If the native binary is not available for a specific environment, the tool automatically falls back to a Node.js implementation.

「How can I intercept or block network requests?」

Use the network route commands. For example, agent-browser network route <url> --abort will block requests matching the URL pattern. You can also use --body <json> to mock responses. Use agent-browser network unroute <url> to remove the interception.

「Can I run multiple automation tasks simultaneously?」

Yes, Agent-Browser supports session isolation. By using the --session <name> flag or the AGENT_BROWSER_SESSION environment variable, you can run multiple isolated browser instances at the same time, each with its own cookies, storage, and state.