Skyvern: The Complete Guide to Browser Workflow Automation Using AI and Computer Vision

高效码农

2 months ago

Introduction

In our daily work, we often need to repeatedly perform various browser operations—filling out forms, downloading files, extracting data, completing login processes, and more. Traditional automation methods rely on writing scripts for specific websites, using XPath or CSS selectors to locate elements. However, any minor change in website layout can cause these scripts to fail.

Now, a smarter solution has emerged. Skyvern fundamentally changes how browser automation is implemented by combining Large Language Models (LLMs) and computer vision technology. It can “see” and understand web page content like a human, comprehend task requirements, and autonomously decide how to operate—all without writing specific code for each website.

This article provides an in-depth look at Skyvern’s working principles, installation and usage methods, core features, and practical application scenarios, helping you fully understand this revolutionary automation tool.

What is Skyvern?

Skyvern is an AI-based browser automation platform that uses LLMs and computer vision to automate various browser workflows. Unlike traditional methods, Skyvern doesn’t require pre-written scripts for specific websites. Instead, it understands web pages’ visual elements and text content to make autonomous decisions and execute operations.

Key Features:

No need to write website-specific code
Resilient to website layout changes
Capable of handling never-before-seen websites
Supports complex reasoning and decision-making

How Skyvern Works

Skyvern’s design draws inspiration from task-driven autonomous agent architectures like BabyAGI and AutoGPT, but adds a crucial capability: interacting with websites through browser automation libraries like Playwright.

Multi-Agent System Architecture

Skyvern uses a team of specialized agents that collaborate to complete tasks:

Understanding Agent: Analyzes web page content and identifies interactive elements
Planning Agent: Develops the sequence of steps needed to complete the task
Execution Agent: Actually performs browser operations like clicking, typing, and scrolling
Validation Agent: Confirms whether operation results meet expectations

This division of labor enables Skyvern to handle complex workflows and adjust strategies when encountering unexpected situations.
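
To make this division of labor concrete, here is a minimal sketch of an understand, plan, execute, validate loop. Everything in it is hypothetical: the function names, the dictionary-based "page", and the retry count are illustrative stand-ins, not Skyvern's internals or API.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Toy stand-in for a web page: form field name -> current value."""
    fields: dict = field(default_factory=dict)


def understand(page: Page) -> list:
    """Understanding step: list the interactive fields that are still empty."""
    return [name for name, value in page.fields.items() if value == ""]


def plan(empty_fields: list, goal: dict) -> list:
    """Planning step: pair each empty field with the value the goal asks for."""
    return [(name, goal[name]) for name in empty_fields if name in goal]


def execute(page: Page, actions: list) -> None:
    """Execution step: 'type' each planned value into its field."""
    for name, value in actions:
        page.fields[name] = value


def validate(page: Page, goal: dict) -> bool:
    """Validation step: confirm every goal value is now on the page."""
    return all(page.fields.get(name) == value for name, value in goal.items())


def run(page: Page, goal: dict, max_rounds: int = 3) -> bool:
    """Repeat the four steps until the goal is met or the rounds run out."""
    for _ in range(max_rounds):
        if validate(page, goal):
            return True
        execute(page, plan(understand(page), goal))
    return validate(page, goal)
```

The point of the loop is that validation feeds back into planning: if a step did not produce the expected state, the next round re-reads the page instead of replaying a fixed script.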

Comparison with Traditional Methods

Traditional browser automation typically relies on:

DOM parsing and XPath selectors
Pre-written scripts and workflows
Custom code tailored to specific websites

The main weakness of these methods is their fragility—minor changes in website layout can break automation workflows.
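
The fragility is easy to reproduce with a small, self-contained sketch. The two HTML snippets below are invented for illustration: the same form before and after a redesign that renames the input's id. A lookup hard-coded to the old id breaks, while a lookup keyed on the visible label text still works. The label lookup is only a rough stand-in for "what a human sees"; it is not how Skyvern locates elements.

```python
from html.parser import HTMLParser

# Hypothetical pages: the same form before and after a layout change.
OLD_PAGE = '<form><label for="email-input">Email</label><input id="email-input"></form>'
NEW_PAGE = '<form><div><label for="field-7">Email</label><input id="field-7"></div></form>'


class FormIndex(HTMLParser):
    """Collects input ids and maps visible label text to the id it points at."""

    def __init__(self):
        super().__init__()
        self.input_ids = set()
        self.label_targets = {}
        self._current_for = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and "id" in attrs:
            self.input_ids.add(attrs["id"])
        elif tag == "label":
            self._current_for = attrs.get("for")

    def handle_data(self, data):
        # Text seen while inside a <label> is the label's visible text.
        if self._current_for and data.strip():
            self.label_targets[data.strip()] = self._current_for

    def handle_endtag(self, tag):
        if tag == "label":
            self._current_for = None


def build_index(html: str) -> FormIndex:
    parser = FormIndex()
    parser.feed(html)
    return parser


def find_by_id(html: str, element_id: str):
    """Selector-style lookup: depends on an id the site may rename."""
    return element_id if element_id in build_index(html).input_ids else None


def find_by_label(html: str, label_text: str):
    """Lookup by what the user reads on the page."""
    return build_index(html).label_targets.get(label_text)
```

Here `find_by_id(NEW_PAGE, "email-input")` returns `None`, while `find_by_label(NEW_PAGE, "Email")` still resolves to the renamed input.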

Skyvern’s fundamentally different approach includes:

Visual understanding instead of code-based selectors
Strong adaptability to handle layout changes
Reasoning capabilities to manage complex situations

For example, when obtaining a car insurance quote from Geico, Skyvern can infer the answer to “Were you eligible to drive at 18?” from the fact that the driver received their license at age 16, without needing explicit instructions.

Performance and Evaluation

In the WebBench benchmark tests, Skyvern demonstrates outstanding performance with an overall accuracy rate of 64.4%. Particularly in “write” tasks (such as form filling, login, file downloads, etc.), Skyvern is the best-performing agent, which is especially important for Robotic Process Automation (RPA) related tasks.

These results suggest that Skyvern is competitive on real-world automation tasks, especially the "write" tasks that RPA depends on.

Getting Started with Skyvern

Skyvern Cloud Service

For users who don’t want to handle infrastructure management, Skyvern Cloud offers a fully managed cloud service. It includes features like running multiple Skyvern instances in parallel, anti-bot detection mechanisms, proxy networks, and CAPTCHA solutions.

To try Skyvern Cloud, simply visit app.skyvern.com to create an account.

Local Installation and Usage

Environment Requirements

Before starting, ensure your system meets the following requirements:

Python 3.11.x (3.12 is also supported; 3.13 is not yet supported)
NodeJS and NPM
Additional requirements for Windows users:
- Rust
- VS Code with C++ development tools and Windows SDK

Installation Steps

Install Skyvern
```
pip install skyvern
```
Initialize Skyvern

For first-time runs, database setup and migrations are needed:
```
skyvern quickstart
```
Run Skyvern Service
```
skyvern run all
```
Once completed, visit http://localhost:8080 to use the web interface for creating and managing tasks.

Running Tasks via Code

Besides the web interface, you can also use Skyvern through Python code:

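import asyncio
from skyvern import Skyvern

async def main():
    skyvern = Skyvern()
    # run_task is awaited, so it has to run inside an async function
    task = await skyvern.run_task(prompt="Find today's top post on HackerNews")
    print(task)

asyncio.run(main())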

Skyvern executes tasks in a browser window that pops up, automatically closing when the task is complete. You can view task history at http://localhost:8080/history.

Advanced Usage Techniques

Using Your Own Chrome Browser

Note: Starting from Chrome 136, the default user data directory refuses any CDP connections. To use your browser data, Skyvern copies the default user data directory to ./tmp/user_data_dir when first connecting to your local browser.

Control via Code

import asyncio
from skyvern import Skyvern

# Chrome path example for Mac systems
browser_path = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"

async def main():
    skyvern = Skyvern(
        base_url="http://localhost:8000",
        api_key="YOUR_API_KEY",
        browser_path=browser_path,
    )
    # run_task is awaited, so it has to run inside an async function
    task = await skyvern.run_task(
        prompt="Find today's top post on HackerNews",
    )

asyncio.run(main())

Control via Skyvern Service

Add the following variables to your .env file:

# Chrome path example for Mac systems
CHROME_EXECUTABLE_PATH="/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
BROWSER_TYPE=cdp-connect

After restarting the Skyvern service, you can run tasks through the UI or code.

Connecting to Remote Browsers

Get the CDP connection URL and pass it to Skyvern:

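import asyncio
from skyvern import Skyvern

async def main():
    skyvern = Skyvern(cdp_url="Your CDP connection URL")
    # run_task is awaited, so it has to run inside an async function
    task = await skyvern.run_task(
        prompt="Find today's top post on HackerNews",
    )

asyncio.run(main())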
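
If you are unsure where the CDP connection URL comes from: a Chromium-based browser started with the `--remote-debugging-port` flag serves a small JSON document at `/json/version` that includes a `webSocketDebuggerUrl` field. The sketch below reads that field using only the standard library. The host, port 9222, and the sample JSON are assumptions for illustration, and this article does not say whether Skyvern expects the WebSocket URL or the HTTP endpoint, so check the Skyvern documentation before relying on either.

```python
import json
from urllib.request import urlopen


def parse_cdp_url(version_json: str) -> str:
    """Extract the WebSocket debugger URL from a /json/version response body."""
    info = json.loads(version_json)
    return info["webSocketDebuggerUrl"]


def fetch_cdp_url(host: str = "127.0.0.1", port: int = 9222) -> str:
    """Ask a running browser for its CDP URL (host and port are assumed values)."""
    with urlopen(f"http://{host}:{port}/json/version", timeout=5) as response:
        return parse_cdp_url(response.read().decode("utf-8"))


# Made-up example of what the endpoint returns; real responses have more fields.
SAMPLE_RESPONSE = (
    '{"Browser": "Chrome/136.0.0.0", '
    '"webSocketDebuggerUrl": "ws://127.0.0.1:9222/devtools/browser/abc123"}'
)
```

Calling `fetch_cdp_url()` requires a browser that is already running with remote debugging enabled; `parse_cdp_url(SAMPLE_RESPONSE)` shows the parsing step on its own.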

Getting Structured Output

By specifying a data extraction schema, you can ensure output conforms to a specific format:

import asyncio
from skyvern import Skyvern

async def main():
    skyvern = Skyvern()
    # run_task is awaited, so it has to run inside an async function
    task = await skyvern.run_task(
        prompt="Find today's top post on HackerNews",
        data_extraction_schema={
            "type": "object",
            "properties": {
                "title": {
                    "type": "string",
                    "description": "The title of the top post"
                },
                "url": {
                    "type": "string",
                    "description": "The URL of the top post"
                },
                "points": {
                    "type": "integer",
                    "description": "Number of points the post has received"
                }
            }
        }
    )

asyncio.run(main())
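
To see what "conforms to a specific format" means in practice, the following sketch checks a result against the schema above. It is a deliberately tiny checker that handles only the three types used here; a real project would more likely use a full JSON Schema validator. The sample output values are invented, and the exact shape of what Skyvern returns is not shown in this article.

```python
SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "url": {"type": "string"},
        "points": {"type": "integer"},
    },
}

# Only the JSON Schema types that appear in the example schema.
TYPE_MAP = {"string": str, "integer": int, "object": dict}


def conforms(value, schema: dict) -> bool:
    """Return True if value matches the (very small) schema subset above."""
    kind = schema["type"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if kind == "integer" and isinstance(value, bool):
        return False
    if not isinstance(value, TYPE_MAP[kind]):
        return False
    if kind == "object":
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value and not conforms(value[key], sub_schema):
                return False
    return True


# Hypothetical extraction result, for illustration only.
sample_output = {"title": "Example post", "url": "https://example.com", "points": 123}
```

A check like this is useful as a guard before passing extracted data to downstream code that assumes, for example, that `points` is a number.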

Common Debugging Commands

# Start Skyvern server separately
skyvern run server

# Start Skyvern UI
skyvern run ui

# Check Skyvern service status
skyvern status

# Stop all Skyvern services
skyvern stop all

# Stop Skyvern UI
skyvern stop ui

# Stop Skyvern server
skyvern stop server

Docker Compose Deployment

For users who prefer containerized deployment, Skyvern provides Docker Compose configuration:

Ensure Docker Desktop is installed and running
Make sure no other Postgres instance is already running locally (check with the docker ps command)
Clone the repository and navigate to the root directory
Run skyvern init llm to generate a .env file (this will be copied to the Docker image)
Fill in the LLM provider key in docker-compose.yml
Run the following command:
```
docker compose up -d
```
Access http://localhost:8080 in your browser to start using the UI

Important Note: Only one Postgres container can run on port 5432 at a time. If switching from CLI-managed Postgres to Docker Compose, you must first remove the original container:
docker rm -f postgresql-container
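
Before running docker compose up, it can help to confirm that nothing is already listening on port 5432. The sketch below does that with a plain TCP connection attempt. It assumes Postgres is bound to 127.0.0.1, and it only reports that the port is taken, not which process holds it.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0
```

If `port_in_use(5432)` returns `True`, stop or remove the existing Postgres container before starting the Docker Compose stack.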

Skyvern Core Features

Task Management

Tasks are the fundamental building blocks in Skyvern. Each task represents a single request, instructing Skyvern to navigate a website and complete a specific goal.

Creating a task requires specifying:

url: Target website address
prompt: Task description
Optional data schema: If output needs to conform to a specific structure
Optional error codes: If you want to stop execution under specific conditions
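
The four inputs above can be pictured as a small request object. This is a sketch only: `url`, `prompt`, and `data_extraction_schema` are names used in this article, while the `TaskRequest` class and the `error_codes` field name are hypothetical and may not match Skyvern's actual API.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class TaskRequest:
    """Illustrative container for the inputs a task needs."""
    url: str
    prompt: str
    data_extraction_schema: Optional[dict] = None
    # Hypothetical field: maps a code to the condition that should stop the run.
    error_codes: Optional[dict] = None

    def to_payload(self) -> dict:
        """Drop unset optional fields so only provided values are sent."""
        return {key: value for key, value in asdict(self).items() if value is not None}


request = TaskRequest(
    url="https://news.ycombinator.com",
    prompt="Find today's top post on HackerNews",
    error_codes={"login_required": "Stop if the page asks for a login"},
)
```

Keeping the optional fields out of the payload when they are unset mirrors the list above: only the URL and the prompt are mandatory.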

Workflow Design

Workflows allow chaining multiple tasks together to form coherent work units.

Typical Workflow Examples:

Invoice Download Workflow:
- Navigate to invoice page
- Filter to show invoices after January 1st
- Extract list of eligible invoices
- Iterate through each invoice and download
E-commerce Purchase Workflow:
- Navigate to target product page
- Add product to shopping cart
- Navigate to cart and validate state
- Complete checkout process
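
The invoice workflow above can be sketched as four chained steps, with the last one looping over the extracted list. The functions and invoice data below are placeholders written in plain Python; in Skyvern each step would be a workflow block driving a real browser, and none of these names come from its API.

```python
from datetime import date

# Invented sample data standing in for what a vendor portal would list.
INVOICES = [
    {"id": "INV-001", "issued": date(2023, 12, 15)},
    {"id": "INV-002", "issued": date(2024, 1, 10)},
    {"id": "INV-003", "issued": date(2024, 2, 3)},
]


def navigate_to_invoice_page() -> list:
    """Step 1: open the invoice page (here: return the sample list)."""
    return list(INVOICES)


def filter_after(invoices: list, cutoff: date) -> list:
    """Step 2: keep only invoices issued after the cutoff date."""
    return [invoice for invoice in invoices if invoice["issued"] > cutoff]


def extract_ids(invoices: list) -> list:
    """Step 3: extract the identifiers of the eligible invoices."""
    return [invoice["id"] for invoice in invoices]


def download(invoice_id: str) -> str:
    """Step 4 (per item): pretend to download and return the file name."""
    return f"{invoice_id}.pdf"


def run_invoice_workflow(cutoff: date) -> list:
    """Chain the steps; the final step loops over every eligible invoice."""
    invoices = navigate_to_invoice_page()
    eligible = filter_after(invoices, cutoff)
    return [download(invoice_id) for invoice_id in extract_ids(eligible)]
```

Each step consumes the output of the previous one, which is what makes the sequence a single unit of work rather than four unrelated tasks.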

Supported Workflow Features:

Browser tasks
Browser actions
Data extraction
Validation
Loops
File parsing
Email sending
Text prompts
HTTP request blocks
Custom code blocks
Uploading files to block storage
(Coming soon) Conditional statements

Live Streaming

Skyvern allows streaming the browser viewport to your local machine in real time, letting you watch Skyvern’s operations on web pages as they happen. This is extremely useful for debugging and understanding how Skyvern interacts with websites, allowing for intervention when necessary.

Form Filling

Skyvern natively supports filling out form inputs on websites. By passing information through the navigation_goal, Skyvern can comprehend the information and fill out forms accordingly.

Data Extraction

Skyvern can also extract data from websites. You can pass a data_extraction_schema, written in JSONC format, alongside the main prompt to tell Skyvern exactly what data you want to extract from the website. Skyvern’s output will be structured according to the provided schema.

File Downloading

Skyvern supports downloading files from websites. All downloaded files are automatically uploaded to block storage (if configured), and you can access them through the UI.

Authentication Support

Skyvern supports multiple authentication methods, making it easier to automate tasks behind logins. If you’d like to try this feature, contact the Skyvern team via email or Discord.

Two-Factor Authentication (2FA) Support

Skyvern supports multiple 2FA methods, allowing you to automate workflows that require 2FA:

QR code-based 2FA (like Google Authenticator, Authy)
Email-based 2FA
SMS-based 2FA

Password Manager Integration

Skyvern currently supports the following password manager integrations:

[x] Bitwarden
[ ] 1Password (in development)
[ ] LastPass (in development)

Model Context Protocol (MCP) Support

Skyvern supports the Model Context Protocol (MCP), allowing you to use any LLM that supports MCP.

Zapier / Make.com / N8N Integration

Skyvern integrates with Zapier, Make.com, and N8N, allowing you to connect Skyvern workflows to other applications.

Real-World Application Cases

Here are some practical examples of Skyvern in real-world scenarios:

Multi-Website Invoice Downloading

Businesses often need to download invoices from multiple vendor portals, each with different interfaces and navigation flows. Skyvern can automate this process without writing specific code for each website.

Job Application Automation

Job seekers can use Skyvern to automate the process of submitting resumes and filling out application forms, saving significant time.

Manufacturing Material Procurement

Manufacturing companies can use Skyvern to automate the process of finding and procuring raw materials, comparing prices and inventory across multiple supplier websites.

Government Website Account Registration and Form Filling

Skyvern can handle complex registration and form-filling processes on government websites, which often have unique interfaces and validation processes.

Contact Form Filling

Businesses can use Skyvern to automate filling out contact forms across multiple websites for lead generation or partner outreach.

Multi-Language Insurance Quote Retrieval

Insurance companies or comparison websites can use Skyvern to obtain quotes from multiple insurance providers, even when websites use different languages.

Supported LLM Providers

Skyvern supports multiple LLM providers, allowing you to choose the right model based on your requirements, budget, and performance needs.

Provider	Supported Models
OpenAI	gpt4-turbo, gpt-4o, gpt-4o-mini
Anthropic	Claude 3 (Haiku, Sonnet, Opus), Claude 3.5 (Sonnet)
Azure OpenAI	Any GPT models, better performance with multimodal LLMs (azure/gpt4-o)
AWS Bedrock	Anthropic Claude 3 (Haiku, Sonnet, Opus), Claude 3.5 (Sonnet)
Gemini	Gemini 2.5 Pro and Flash, Gemini 2.0
Ollama	Run any locally hosted model via Ollama
OpenRouter	Access models through OpenRouter
OpenAI-compatible	Any custom API endpoint following OpenAI API format (via LiteLLM)

Environment Variable Configuration

OpenAI

Variable	Description	Type	Sample Value
`ENABLE_OPENAI`	Register OpenAI models	Boolean	`true`, `false`
`OPENAI_API_KEY`	OpenAI API Key	String	`sk-1234567890`
`OPENAI_API_BASE`	OpenAI API Base URL, optional	String	`https://openai.api.base`
`OPENAI_ORGANIZATION`	OpenAI Organization ID, optional	String	`your-org-id`

Recommended LLM_KEY: OPENAI_GPT4O, OPENAI_GPT4O_MINI, OPENAI_GPT4_1, OPENAI_O4_MINI, OPENAI_O3

Anthropic

Variable	Description	Type	Sample Value
`ENABLE_ANTHROPIC`	Register Anthropic models	Boolean	`true`, `false`
`ANTHROPIC_API_KEY`	Anthropic API Key	String	`sk-1234567890`

Recommended LLM_KEY: ANTHROPIC_CLAUDE3.5_SONNET, ANTHROPIC_CLAUDE3.7_SONNET, ANTHROPIC_CLAUDE4_OPUS, ANTHROPIC_CLAUDE4_SONNET

Azure OpenAI

Variable	Description	Type	Sample Value
`ENABLE_AZURE`	Register Azure OpenAI models	Boolean	`true`, `false`
`AZURE_API_KEY`	Azure deployment API key	String	`sk-1234567890`
`AZURE_DEPLOYMENT`	Azure OpenAI deployment name	String	`skyvern-deployment`
`AZURE_API_BASE`	Azure deployment API base URL	String	`https://skyvern-deployment.openai.azure.com/`
`AZURE_API_VERSION`	Azure API version	String	`2024-02-01`

Recommended LLM_KEY: AZURE_OPENAI

AWS Bedrock

Variable	Description	Type	Sample Value
`ENABLE_BEDROCK`	Register AWS Bedrock models. To use AWS Bedrock, make sure your AWS configurations are set up correctly first	Boolean	`true`, `false`

Recommended LLM_KEY: BEDROCK_ANTHROPIC_CLAUDE3.7_SONNET_INFERENCE_PROFILE, BEDROCK_ANTHROPIC_CLAUDE4_OPUS_INFERENCE_PROFILE, BEDROCK_ANTHROPIC_CLAUDE4_SONNET_INFERENCE_PROFILE

Gemini

Variable	Description	Type	Sample Value
`ENABLE_GEMINI`	Register Gemini models	Boolean	`true`, `false`
`GEMINI_API_KEY`	Gemini API Key	String	`your_google_gemini_api_key`

Recommended LLM_KEY: GEMINI_2.5_PRO_PREVIEW, GEMINI_2.5_FLASH_PREVIEW

Ollama

Variable	Description	Type	Sample Value
`ENABLE_OLLAMA`	Register local models via Ollama	Boolean	`true`, `false`
`OLLAMA_SERVER_URL`	Ollama server URL	String	`http://host.docker.internal:11434`
`OLLAMA_MODEL`	Ollama model name	String	`qwen2.5:7b-instruct`

Recommended LLM_KEY: OLLAMA

Note: Ollama doesn’t support vision capabilities yet.

OpenRouter

Variable	Description	Type	Sample Value
`ENABLE_OPENROUTER`	Register OpenRouter models	Boolean	`true`, `false`
`OPENROUTER_API_KEY`	OpenRouter API key	String	`sk-1234567890`
`OPENROUTER_MODEL`	OpenRouter model name	String	`mistralai/mistral-small-3.1-24b-instruct`
`OPENROUTER_API_BASE`	OpenRouter API base URL	String	`https://api.openrouter.ai/v1`

Recommended LLM_KEY: OPENROUTER

OpenAI-Compatible

Variable	Description	Type	Sample Value
`ENABLE_OPENAI_COMPATIBLE`	Register custom OpenAI-compatible API endpoint	Boolean	`true`, `false`
`OPENAI_COMPATIBLE_MODEL_NAME`	OpenAI-compatible endpoint model name	String	`yi-34b`, `gpt-3.5-turbo`, `mistral-large`, etc.
`OPENAI_COMPATIBLE_API_KEY`	OpenAI-compatible endpoint API key	String	`sk-1234567890`
`OPENAI_COMPATIBLE_API_BASE`	OpenAI-compatible endpoint base URL	String	`https://api.together.xyz/v1`, `http://localhost:8000/v1`, etc.
`OPENAI_COMPATIBLE_API_VERSION`	OpenAI-compatible endpoint API version, optional	String	`2023-05-15`
`OPENAI_COMPATIBLE_MAX_TOKENS`	Maximum tokens for completion, optional	Integer	`4096`, `8192`, etc.
`OPENAI_COMPATIBLE_TEMPERATURE`	Temperature setting, optional	Float	`0.0`, `0.5`, `0.7`, etc.
`OPENAI_COMPATIBLE_SUPPORTS_VISION`	Whether model supports vision, optional	Boolean	`true`, `false`

Supported LLM Key: OPENAI_COMPATIBLE

General LLM Configuration

Variable	Description	Type	Sample Value
`LLM_KEY`	The name of the model you want to use	String	See supported LLM keys above
`SECONDARY_LLM_KEY`	The name of the model for mini agents Skyvern runs with	String	See supported LLM keys above
`LLM_CONFIG_MAX_TOKENS`	Override the max tokens used by the LLM	Integer	`128000`
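
Because each provider needs its own set of variables, a quick pre-flight check can catch a missing key before the service starts. The sketch below encodes the tables above as a dictionary and reports which variables are absent for the chosen LLM_KEY. Two assumptions are baked in: every variable not marked "optional" in the tables is treated as required, and the provider is inferred from the prefix of the LLM_KEY value. Neither rule comes from Skyvern itself.

```python
# Variables taken from the tables above; "optional" ones are left out.
REQUIRED_BY_PROVIDER = {
    "OPENAI": ["ENABLE_OPENAI", "OPENAI_API_KEY"],
    "ANTHROPIC": ["ENABLE_ANTHROPIC", "ANTHROPIC_API_KEY"],
    "AZURE": ["ENABLE_AZURE", "AZURE_API_KEY", "AZURE_DEPLOYMENT",
              "AZURE_API_BASE", "AZURE_API_VERSION"],
    "BEDROCK": ["ENABLE_BEDROCK"],
    "GEMINI": ["ENABLE_GEMINI", "GEMINI_API_KEY"],
    "OLLAMA": ["ENABLE_OLLAMA", "OLLAMA_SERVER_URL", "OLLAMA_MODEL"],
    "OPENROUTER": ["ENABLE_OPENROUTER", "OPENROUTER_API_KEY",
                   "OPENROUTER_MODEL", "OPENROUTER_API_BASE"],
    "OPENAI_COMPATIBLE": ["ENABLE_OPENAI_COMPATIBLE", "OPENAI_COMPATIBLE_MODEL_NAME",
                          "OPENAI_COMPATIBLE_API_KEY", "OPENAI_COMPATIBLE_API_BASE"],
}


def provider_for(llm_key: str) -> str:
    """Infer the provider from the LLM_KEY prefix (an assumed convention)."""
    # Try the longest prefixes first so OPENAI_COMPATIBLE is not read as OPENAI.
    for prefix in sorted(REQUIRED_BY_PROVIDER, key=len, reverse=True):
        if llm_key.startswith(prefix):
            return prefix
    raise ValueError(f"Unrecognized LLM_KEY: {llm_key!r}")


def missing_vars(env: dict) -> list:
    """List the required variables that are unset or empty for the chosen LLM_KEY."""
    provider = provider_for(env.get("LLM_KEY", ""))
    return [name for name in REQUIRED_BY_PROVIDER[provider] if not env.get(name)]
```

For a live check, pass `dict(os.environ)` (after `import os`) or the parsed contents of your .env file to `missing_vars`.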

Developer Setup

For developers who want to contribute code or customize Skyvern, here are the steps to set up the development environment:

Make sure you have uv installed.

Create virtual environment (.venv)
```
uv sync --group dev
```
Perform initial server configuration
```
uv run skyvern quickstart
```
Access http://localhost:8080 in your browser to start using the UI

Skyvern CLI supports Windows, WSL, macOS, and Linux environments.

Feature Roadmap

The Skyvern team has a clear development plan. Here are the main goals for the coming months:

[x] Open Source – Open source Skyvern core codebase
[x] Workflow Support – Support chaining multiple Skyvern calls together
[x] Improved Context Understanding – Enhance Skyvern’s ability to understand content around interactive elements by providing relevant label context through text prompts
[x] Cost Optimization – Improve stability and reduce running costs by optimizing the context tree passed to Skyvern
[x] Self-Service UI – Replace Streamlit UI with React-based UI components allowing users to launch new tasks in Skyvern
[x] Workflow UI Builder – Introduce UI allowing users to visually build and analyze workflows
[x] Chrome Viewport Streaming – Introduce method to stream Chrome viewport to user’s browser in real time
[x] Historical Run UI – Replace Streamlit UI with React-based UI allowing visualization of historical runs and their results
[x] Auto Workflow Builder (“Observer” Mode) – Allow Skyvern to automatically generate workflows while browsing the web, making it easier to build new workflows
[x] Prompt Caching – Introduce caching layer for LLM calls, significantly reducing Skyvern running costs
[x] Web Evaluation Dataset – Integrate Skyvern with public benchmark tests to track model quality over time
[ ] Improved Debug Mode – Allow Skyvern to plan actions and get “approval” before execution, facilitating debugging and prompt iteration
[ ] Chrome Extension – Allow users to interact with Skyvern through Chrome extension
[ ] Skyvern Action Recorder – Allow Skyvern to observe users completing tasks and automatically generate workflows
[ ] Interactive Live Streaming – Allow users to interact with streams in real time for intervention when necessary
[ ] Integrated LLM Observability Tools – Integrate LLM observability tools allowing backtesting of prompt changes with specific datasets
[x] Langchain Integration – Create integration in langchain_community to use Skyvern as a “tool”

Frequently Asked Questions

How is Skyvern different from traditional RPA tools?

Traditional RPA tools typically rely on recording and playback techniques or scripts based on XPath/CSS selectors—methods that often fail when website layouts change. Skyvern uses LLMs and computer vision to understand web page content, adapt to layout changes, handle never-before-seen websites, and employ reasoning capabilities for complex situations.

Can Skyvern handle websites that require login?

Yes, Skyvern supports multiple authentication methods, including username/password login and two-factor authentication (2FA). It supports QR code-based, email-based, and SMS-based 2FA, and can integrate with password managers like Bitwarden.

How does Skyvern ensure data security?

When using local deployment, all data remains in your environment. Skyvern’s open-source version doesn’t include the anti-bot detection features available in the cloud service, but the core automation logic is identical. If you have licensing questions, you can contact the support team.

Which browsers does Skyvern support?

Skyvern is primarily optimized for Chromium-based browsers (like Google Chrome, Microsoft Edge) and interacts with browsers through Chrome DevTools Protocol (CDP). It supports connecting to both local and remote browser instances.

What if Skyvern gets stuck while executing a task?

Skyvern provides multiple debugging tools:

Live streaming functionality lets you observe the execution process
Detailed task history allows reviewing each operation step
You can intervene in task execution through UI or code
Comprehensive logging helps diagnose issues

How well does Skyvern perform?

According to WebBench benchmark tests, Skyvern achieves 64.4% accuracy on overall tasks, with particularly outstanding performance on “write” tasks (like form filling, login, file downloads, etc.), which are core requirements for RPA scenarios.

Can Skyvern’s behavior be customized?

Yes, Skyvern offers multiple customization methods:

Define output format through data extraction schemas
Define stopping conditions through error codes
Support for custom workflows combining multiple tasks
Integration of custom code blocks

Conclusion

Skyvern represents a significant advancement in the field of browser automation. By combining LLMs and computer vision, it addresses the fundamental limitations of traditional automation methods. It doesn’t require writing specific code for each website, can adapt to website layout changes, and possesses the ability to handle complex situations.

Whether you’re a business looking to automate repetitive workflows or a developer seeking more reliable browser automation solutions, Skyvern is worth trying. Its open-source version provides complete core functionality, while the cloud service offers convenience for users who don’t want to manage infrastructure.

As AI technology continues to evolve, tools like Skyvern have the potential to fundamentally change how we interact with web applications, freeing people from repetitive tasks and allowing them to focus on more valuable work.