What is UniVLA and How It Enables Robots to Truly Understand and Execute Complex Tasks

Imagine you’re teaching a robot to “put the screwdriver back in the toolbox.” Traditional approaches require writing precise motion commands for that specific robot: lift arm 15 centimeters, rotate wrist 30 degrees, apply 2 newtons of grip force. Switch to a different robotic arm, and every parameter must be recalibrated. It’s like teaching a person to do something by first explaining how to contract every muscle—inefficient and lacking universal applicability.

UniVLA (Unified Vision-Language-Action) directly addresses this core challenge. It aims to enable robots to understand the essence of tasks like humans do, rather than memorizing mechanical motions. By creating a unified action space, UniVLA allows robots to “draw inferences about other cases from one instance,” transferring learned skills across different devices and scenarios. This deep technical guide walks you through UniVLA’s mechanics and how to leverage it to build genuinely intelligent robotic systems.

Three Core Innovations: The Technical Foundation of UniVLA

According to the project documentation, UniVLA’s breakthrough rests on three interconnected innovations. These aren’t simple feature additions but a fundamental restructuring of the robot learning framework.

1. A Unified, Embodiment-Agnostic Action Space

What is the biggest pain point of traditional robot policy models? Every robot requires training from scratch. Robotic arms, wheeled robots, and humanoid robots have completely different action definitions, preventing data sharing and experience transfer. UniVLA proposes a bold idea: can we create a “universal action language” that all robots understand?

The project achieves this through task-centric latent actions. Instead of defining actions as “how many degrees a joint rotates,” actions are defined as “key steps to complete the task.” For instance, the latent action of “grabbing the screwdriver” can be understood by robots with two-finger grippers or five-finger dexterous hands, each executing it in its own way. This abstraction layer makes cross-device knowledge transfer possible.
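
To make the abstraction concrete, here is a toy sketch (hypothetical names and values, not project code) of how a single latent "grasp" token could be decoded differently per embodiment:

# Toy example: the same abstract token, realized differently by two embodiments
from typing import Dict, List

LATENT_GRASP = 7  # an abstract "grasp the object" token from a shared codebook

def decode_for_gripper(token: int) -> Dict[str, float]:
    # A two-finger gripper realizes "grasp" as a simple close command
    return {"gripper_close": 1.0} if token == LATENT_GRASP else {}

def decode_for_dexterous_hand(token: int) -> List[float]:
    # A five-finger hand realizes the same token as a joint-angle preshape
    return [0.8, 0.8, 0.8, 0.8, 0.6] if token == LATENT_GRASP else [0.0] * 5

# The high-level policy only reasons about LATENT_GRASP; the embodiment-specific
# decoders absorb the hardware differences
print(decode_for_gripper(LATENT_GRASP))
print(decode_for_dexterous_hand(LATENT_GRASP))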

2. Extracting Task Essence from Cross-Embodiment Videos

Another critical challenge is data scarcity. High-quality robot manipulation data is extremely expensive, but human operation videos and demonstrations from different robots are widely available. UniVLA’s designed mechanism extracts shared action logic from diverse video sources.

The project leverages robotic manipulation and navigation data from the Open X-Embodiment (OXE) dataset, along with curated subsets of human daily activity videos from the Ego4D dataset. Through specialized learning algorithms, the system automatically ignores irrelevant details like “whether this person uses their left or right hand” and focuses on task-critical logic like “why they unscrew the bottle cap before pouring water.” It’s like teaching AI to understand the key steps in instructional videos rather than mechanically mimicking every motion.

3. Extreme Optimization of Computational Efficiency

When it comes to large models, massive compute is assumed. However, the UniVLA team explicitly highlights their efficiency advantage: full-scale training requires approximately 960 A100 GPU-hours, just 5% of OpenVLA’s resource consumption. If training only on the BridgeV2 or pure human video data, it takes merely 200 GPU-hours.

This efficiency boost isn’t achieved by shrinking the model but through the compact representation of the latent action space. Since actions are encoded as discrete “intention tokens” rather than continuous high-dimensional vectors, the model doesn’t waste resources learning redundant physical details. The post-training action decoder has only about 12 million parameters. Using parameter-efficient fine-tuning with LoRA rank 32, the total trainable parameters are just 123 million—lightweight enough to run on standard laboratory servers.

Technical Architecture Deep Dive: The Three-Stage Training Pipeline

UniVLA’s training isn’t a single step but a carefully designed three-stage process. This pipeline ensures learning quality while keeping each stage’s objectives clear and verifiable.

Stage 1: Building a Universal Action Vocabulary

This stage trains the Latent Action Model, essentially creating an “action dictionary.” The project uses a VQ-VAE (Vector Quantized Variational AutoEncoder) architecture to compress continuous action sequences into discrete codebook entries.

How does it work? The documentation provides clear instructions:

  • Overall batch size of 512
  • 100,000 optimization steps
  • Training data mix of OXE robot data, navigation data, and Ego4D human videos

After this stage, the model can convert any video stream into a compact “action token” sequence. Like tokenizing a sentence into words, each token represents an action unit with clear intent (e.g., “grasp,” “move,” “place”), but without specifying the execution method.
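
For intuition, here is a toy version of the vector-quantization step at the heart of such a model; the codebook size and dimensions are made up, and the real implementation in latent_action_model/ is more involved:

# Toy vector quantization: map continuous latents to their nearest codebook entries
import torch

def quantize(z: torch.Tensor, codebook: torch.Tensor):
    """z: (B, D) continuous latents; codebook: (C, D) learned entries."""
    dists = torch.cdist(z, codebook)           # pairwise distances, shape (B, C)
    tokens = dists.argmin(dim=-1)              # discrete action tokens, shape (B,)
    z_q = codebook[tokens]                     # quantized latents fed to the decoder
    # Commitment loss keeps encoder outputs close to their chosen codes
    commitment = torch.mean((z - z_q.detach()) ** 2)
    return tokens, z_q, commitment

codebook = torch.randn(16, 64)                 # toy codebook: 16 codes of dimension 64
z = torch.randn(4, 64)                         # latents extracted from 4 video clips
tokens, z_q, commit_loss = quantize(z, codebook)
print(tokens)                                  # e.g. tensor([ 3,  9,  3, 12])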

Stage 2: Making Actions Task-Aware

The action tokens learned in Stage 1 are universal but may lack precision. Stage 2 performs task-centric fine-tuning on top of the Stage 1 checkpoint. The key difference lies in the loss function design: the model must not only predict action sequences but also ensure they optimally complete tasks.

The technical documentation emphasizes that the Stage 2 checkpoint is the final version for policy training. The benefit of separating the two stages is that you can directly use the team’s pre-trained checkpoint (qwbu/univla-latent-action-model), skipping the most time-consuming universal learning process and focusing directly on task-specific adaptation.

Stage 3: Teaching the Large Model to “Command”

With the action vocabulary ready, the next step is training the high-level policy model. UniVLA is built on the LLaMA architecture, mapping visual inputs (camera images) and language instructions (“put the apple in the plate”) into action token sequences.

The clever aspect of this process is pseudo-labeling: the trained latent action model generates action token labels for all video data, and the VLA model learns the mapping of “what I see + what I hear → what I should do.” Since labels are discrete integer sequences, training uses the standard next-token prediction objective, which is well-established in language modeling.

For technical details, the project maps VQ-VAE codebook indices to special tokens in the LLaMA vocabulary: {ACT_0, ACT_1, ..., ACT_C}. Thus, the VLA model generates abstract plans like “ACT_5 ACT_12 ACT_3” instead of specific joint angles.
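
As a rough illustration of the mapping (the ACT_* naming follows the documentation; the codebook size and helper below are toy assumptions):

# Illustrative mapping from codebook indices to special action tokens
CODEBOOK_SIZE = 16                                   # toy value; the real C is a hyperparameter
ACTION_TOKENS = [f"ACT_{i}" for i in range(CODEBOOK_SIZE)]

def indices_to_tokens(indices):
    """Pseudo-labels from the latent action model -> special tokens for the LLM."""
    return [ACTION_TOKENS[i] for i in indices]

sequence = indices_to_tokens([5, 12, 3])             # ["ACT_5", "ACT_12", "ACT_3"]

# Standard next-token prediction: given the prefix (image + instruction + previous
# action tokens), the model is trained to emit the next action token
inputs  = sequence[:-1]                              # ["ACT_5", "ACT_12"]
targets = sequence[1:]                               # ["ACT_12", "ACT_3"]
print(sequence, inputs, targets, sep="\n")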

Final Step: Installing “Actuators” for Specific Robots

The pre-trained UniVLA can only “make plans” but cannot “act.” To execute on real robots requires adding an action decoder. This post-training process is documented with three key points:

  • The decoder is extremely lightweight (12 million parameters)
  • Uses efficient LoRA fine-tuning (rank 32)
  • Total trainable parameters are only 123 million

This means you can freeze the massive VLA backbone and only train this small decoder to adapt to specific robots. For example, the LIBERO dataset requires 30k-40k fine-tuning steps, while CALVIN needs 100k steps. This modular design allows UniVLA to be quickly deployed to new devices without retraining the entire system.
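
A hedged sketch of what this recipe can look like with the peft library; the target modules, hidden sizes, and decoder head below are illustrative assumptions, not the project's actual training script:

# Sketch: frozen VLA backbone + LoRA adapters (rank 32) + small trainable decoder head
import torch.nn as nn
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq

vla = AutoModelForVision2Seq.from_pretrained("qwbu/univla-7b", trust_remote_code=True)

lora_cfg = LoraConfig(r=32, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])  # assumed modules
vla = get_peft_model(vla, lora_cfg)      # only the LoRA adapters remain trainable

class ActionDecoder(nn.Module):
    """Tiny head mapping latent-action embeddings to robot commands (dims assumed)."""
    def __init__(self, hidden_dim: int = 4096, action_dim: int = 7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden_dim, 1024), nn.ReLU(),
                                 nn.Linear(1024, action_dim))

    def forward(self, latent_action_embeddings):
        return self.net(latent_action_embeddings)

decoder = ActionDecoder()                # trained from scratch during post-training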

Environment Setup and Installation: From Zero to Running

Theory is valuable only when put into practice. The UniVLA team documents a complete installation process, which the following walkthrough organizes step by step.

Basic Environment Configuration

The project recommends using conda to manage environments, avoiding dependency conflicts:

# Create an independent Python 3.10 environment
conda create -n univla python=3.10 -y
conda activate univla

Next, install PyTorch. The documentation specifically warns: ensure you choose the correct command based on your CUDA version. The team’s experiments used the torch 2.2.0 + cuda 12.1 combination. If unsure of your CUDA version, run nvidia-smi first to check.

# Example: Installation command for CUDA 12.1
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
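
A quick sanity check from Python after installation (standard PyTorch calls):

# Confirm the installed torch build and the CUDA runtime it was compiled against
import torch

print(torch.__version__)           # expected to report something like 2.2.0
print(torch.version.cuda)          # expected 12.1 for the cu121 wheel
print(torch.cuda.is_available())   # should be True on a GPU machine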

Project Code and Dependency Installation

Obtain the source code and install project dependencies:

# Clone repository (requires SSH key configuration)
git clone git@github.com:OpenDriveLab/UniVLA.git
cd UniVLA
pip install -e .

The -e flag installs the package in editable mode, so local code changes take effect without reinstalling. The command also pulls in the project’s base dependencies.

Training Acceleration Component Configuration

To train such a large model, Flash Attention acceleration must be enabled:

# Install build tools first
pip install packaging ninja
ninja --version; echo $?  # Verify Ninja is ready, should return 0

# Install Flash Attention 2.5.5
pip install "flash-attn==2.5.5" --no-build-isolation

This step may take considerable time as it requires compiling CUDA kernels. If errors occur, they’re usually CUDA version or compiler-related. Check the official GitHub issue section for solutions.

Training Your Own UniVLA Model: Complete Workflow

Once the environment is installed, the next step is data preparation and training. The documentation provides detailed paths and parameter explanations, organized into a practical guide.

Data Preparation: Starting from the Source

Obtaining Robot Data

The OXE (Open X-Embodiment) dataset is the core training source. The documentation recommends a modified download script:

# Reference script from the rlds_dataset_mod repository (fetch the raw file, not the GitHub HTML page)
wget https://raw.githubusercontent.com/moojink/rlds_dataset_mod/ad83e6c0efad5823540c0f6d3a05529596ead0b5/prepare_open_x.sh
bash prepare_open_x.sh

This script automatically downloads and converts multiple robot datasets into RLDS (Reinforcement Learning Dataset) format, which is the standard structure required for UniVLA training.
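
If you want to inspect the downloaded data, tensorflow_datasets can open an RLDS directory directly; the path and field names below are illustrative and depend on the dataset you fetched:

# Peek at one RLDS episode (episode -> steps -> observation/action/language)
import tensorflow_datasets as tfds

builder = tfds.builder_from_directory("/path/to/rlds/bridge_orig/1.0.0")
ds = builder.as_dataset(split="train[:1]")

for episode in ds:                        # RLDS stores one full episode per record
    for step in episode["steps"]:         # field names vary across OXE datasets
        image = step["observation"]["image"]
        print(image.shape, step["language_instruction"].numpy())
        break
    break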

Converting Human Video Data

To include Ego4D human daily activity videos, additional conversion steps are needed. The documentation points to specific instructions:

# Conversion process documented in vla-scripts/extern/ego4d_rlds_dataset_builder/ego4d/README.md
# Main steps include video segmentation and action annotation format conversion

Important Note: While latent action model training relies on massive data, the team strongly recommends using their pre-trained checkpoints directly to save hundreds of hours of compute time, unless you’re researching the impact of data itself.

Training the Latent Action Model: Two-Stage Practice

Stage 1: Learning Universal Action Representations

The configuration file is located at latent_action_model/config/lam-stage-1.yaml. Key parameters include:

  • data_mix: Data mixture ratio, defined in prismatic/vla/datasets/rlds/oxe/mixtures.py
  • Batch size 512, corresponding to 64 per GPU across 8 GPUs

Launch command example (single node, 8 GPUs):

cd latent_action_model
torchrun --standalone --nnodes 1 --nproc-per-node 8 main.py fit \
    --config config/lam-stage-1.yaml \
    2>&1 | tee lam-stage-1.log

Training takes approximately 3-5 days, depending on data volume and GPU speed. Logs display reconstruction loss, commitment loss, and other metrics in real-time.

Stage 2: Focusing on Task-Centric Actions

Modify stage_one_ckpt in lam-stage-2.yaml to point to Stage 1’s output path, then:

torchrun --standalone --nnodes 1 --nproc-per-node 8 main.py fit \
    --config config/lam-stage-2.yaml \
    2>&1 | tee lam-stage-2.log

Stage 2 training is faster since most representation learning is complete. The output model will be used for subsequent VLA training.

Pretraining the Generalist Policy: Teaching AI to Plan

This is the most complex part of the system. The documentation provides a runnable script vla-scripts/train.sh. We break down the key parameters:

# 32-GPU cluster configuration example
GPUS_PER_NODE=8
NNODES=4
MASTER_PORT=28596
MASTER_ADDR="127.0.0.1"
RANK=0

# Launch distributed training
torchrun --nproc_per_node ${GPUS_PER_NODE} --nnodes ${NNODES} --node_rank ${RANK} \
         --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} train.py \
         --vla.type prism-dinosiglip-224px+mx-oxe-magic-soup-plus \
         --run_root_dir "vla_log"

Key parameter explanations:

  • vla.type: Selects VLM backbone and training data combination. +mx-oxe-magic-soup-plus indicates using all OXE data
  • run_root_dir: Path for saving logs and checkpoints

For training only on BridgeV2 subsets, change vla.type to prism-dinosiglip-224px+mx-bridge; for pure human data, use +mx-human.

After training completes, convert Prismatic-format weights to HuggingFace standard format:

python vla-scripts/extern/convert_openvla_weights_to_hf.py \
    --openvla_model_path_or_id /path/to/vla_log/checkpoints/latest.ckpt \
    --ckpt_name final_ckpt.pt \
    --output_hf_model_local_path /path/to/hf_format_model

The converted model is compatible with AutoModelForVision2Seq and can be loaded for inference with standard HuggingFace tooling.
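
Loading the converted checkpoint could then look like the following (mirrors common OpenVLA-style HuggingFace usage; dtype and device choices are assumptions):

# Load the HF-format model with standard tooling
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor

model_path = "/path/to/hf_format_model"
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda")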

Practical Deployment: From Simulation to Real Hardware

A pre-trained model is a “generalist”; turning it into a “specialist” for a specific scenario requires post-training. The documentation covers four evaluation setups, which we walk through below, starting with the two most representative simulation benchmarks.

LIBERO Simulation Benchmark: Kitchen Scene Testing

Environment Setup

LIBERO is a task suite simulating kitchen-style tabletop manipulation. UniVLA is evaluated on its four standard sub-suites (10 tasks each), which stress different capabilities:

  • Spatial: Spatial relationship understanding (e.g., “place the pot on the stove”)
  • Object: Object manipulation (e.g., “open the drawer”)
  • Goal: Goal-directed tasks (e.g., “put the apple in the plate”)
  • Long: Long-horizon multi-step tasks (e.g., “make a sandwich”)

First, download the converted RLDS-format data:

# Download from HuggingFace
git clone https://huggingface.co/datasets/openvla/modified_libero_rlds

Fine-Tuning Training

Modify three key paths in vla-scripts/finetune_libero.py:

  1. vla_path: Path to pre-trained UniVLA (e.g., qwbu/univla-7b)
  2. lam_path: Path to Stage 2 latent action model
  3. data_root_dir: LIBERO RLDS data root directory

Launch 8-GPU training (using the most difficult Long tasks as an example):

# Train for 40k steps (30k for other subsets)
torchrun --standalone --nnodes 1 --nproc-per-node 8 finetune_libero.py \
         --dataset_name "libero_10_no_noops" \
         --max_steps 40000 \
         --run_root_dir "libero_log"

Performance Evaluation

After training completes, you’ll obtain two files: the fine-tuned VLA weights and the action decoder. The evaluation script runs 500 independent trials (50 per task):

# Install LIBERO environment
pip install -r experiments/robot/libero/libero_requirements.txt

# Start evaluation
python experiments/robot/libero/run_libero_eval.py \
    --task_suite_name libero_10 \
    --action_decoder_path /path/to/action_decoder.pt \
    --pretrained_checkpoint /path/to/libero_finetuned_univla \
    --num_trials_per_task 50 \
    --save_video True \
    --seed 7

save_video=True records videos for visual analysis of failure cases.

CALVIN Long-Horizon Task Chains: Continuous Operation Challenges

The CALVIN benchmark features multi-step task chains, such as “open drawer → take out cup → place at target location → close drawer.” This tests the model’s long-horizon planning capability and state retention ability.

Training Configuration

Similarly, modify path parameters in vla-scripts/finetune_calvin.py. Notably, window_size=12 indicates the model can remember the recent 12 steps of history, crucial for understanding task progress.
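
Conceptually, window_size=12 amounts to keeping a rolling buffer of the most recent steps and feeding it to the policy each cycle; a minimal sketch (not the training code itself):

# Rolling 12-step history: old steps fall off automatically
from collections import deque

WINDOW_SIZE = 12
history = deque(maxlen=WINDOW_SIZE)

for t in range(30):                          # pretend control loop
    history.append({"t": t, "obs": f"frame_{t}"})
    context = list(history)                  # at most the 12 most recent steps
print(len(context), context[0]["t"], context[-1]["t"])   # 12 18 29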

Launch command:

torchrun --standalone --nnodes 1 --nproc-per-node 8 finetune_calvin.py \
         --vla_path /path/to/univla-7b \
         --lam_path /path/to/lam-stage-2.ckpt \
         --calvin_root /path/to/calvin_dataset \
         --max_steps 100000 \
         --batch_size 8 \
         --grad_accumulation_steps 2 \
         --window_size 12 \
         --run_root_dir "calvin_log"

Distributed Evaluation

CALVIN evaluation supports multi-GPU parallelization to accelerate large-scale testing:

torchrun --standalone --nnodes 1 --nproc-per-node 8 \
         experiments/robot/calvin/run_calvin_eval_ddp.py \
         --calvin_root /path/to/calvin_dataset \
         --action_decoder_path /path/to/action_decoder.pt \
         --pretrained_checkpoint /path/to/calvin_finetuned_univla \
         --seed 7

DDP (Distributed Data Parallel) cuts evaluation time to roughly 1/8 of the single-GPU runtime when spread across 8 GPUs.

SimplerEnv: Cross-Platform Simulation Validation

SimplerEnv, built on the ManiSkill3 engine, provides high-fidelity physics simulation. UniVLA’s evaluation covers four core tasks:

  • Put Spoon on Towel
  • Put Carrot on Plate
  • Stack Green Block on Yellow Block
  • Put Eggplant in Yellow Basket

Environment Configuration

# Install ManiSkill3 branch version
git clone -b maniskill3 https://github.com/simpler-env/SimplerEnv.git
cd SimplerEnv
pip install --upgrade git+https://github.com/haosulab/ManiSkill.git
pip install -e .

# Copy UniVLA policy implementation
cp -r /path/to/UniVLA/experiments/robot/simpler-bridge/policies/univla \
      simpler_env/policies/
cp /path/to/UniVLA/experiments/robot/simpler-bridge/real2sim_eval_maniskill3.py \
   simpler_env/

Single-Task Evaluation Example

action_decoder_path="/path/to/univla-7b-224-sft-simpler-bridge/action_decoder.pt"
ckpt_path="/path/to/univla-7b-224-sft-simpler-bridge"

CUDA_VISIBLE_DEVICES=0 XLA_PYTHON_CLIENT_PREALLOCATE=false \
python real2sim_eval_maniskill3.py \
    --model="univla" \
    -e "PutSpoonOnTableClothInScene-v1" \
    -s 0 \
    --num-episodes 24 \
    --num-envs 1 \
    --action_decoder_path ${action_decoder_path} \
    --ckpt_path ${ckpt_path}

XLA_PYTHON_CLIENT_PREALLOCATE=false prevents JAX from pre-allocating GPU memory, avoiding conflicts with PyTorch.

Real-World Robot Deployment: From Simulation to Reality

The documentation provides dedicated real-world deployment guidelines (docs/real-world-deployment.md) based on the AgileX platform. While interfaces vary across platforms, the core workflow remains consistent:

  1. Camera Calibration: Ensure visual inputs match training distribution
  2. Action Decoder Adaptation: Map to actual robotic arm joints
  3. Inference Optimization: Accelerate generation using TensorRT or vLLM
  4. Safety Monitoring: Set action range and force feedback thresholds (a minimal clamping sketch follows the code below)

The documentation’s key snippets show how to load the models and run real-time inference. The loop below is schematic: load_vla, robot, and task_language stand in for platform-specific code.

from prismatic.vla.action_utils import load_action_decoder
from prismatic.vla.datasets.rlds.utils import prepare_vla_obs

# Load fine-tuned VLA and decoder
vla = load_vla("/path/to/finetuned_univla")
action_decoder = load_action_decoder("/path/to/action_decoder.pt")

# Loop to read camera and language instructions
for obs in robot.camera.stream():
    # Preprocess observation
    vla_input = prepare_vla_obs(obs, task_language)
    
    # Generate latent action sequence
    latent_actions = vla.generate(vla_input, max_new_tokens=20)
    
    # Decode to actual joint actions
    robot_actions = action_decoder.decode(latent_actions, obs)
    
    # Execute with safety monitoring
    robot.execute(robot_actions, safe_mode=True)
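
Point 4 of the workflow above (safety monitoring) can start as simple clamping of decoded commands before execution; the limits below are hypothetical placeholders that must be replaced with values verified on your hardware:

# Hypothetical safety guard: clip actions into verified joint ranges and cap per-step motion
import numpy as np

JOINT_LIMITS_LOW  = np.array([-1.5, -1.0, -1.0, -2.0, -2.0, -2.0, 0.0])   # placeholder values
JOINT_LIMITS_HIGH = np.array([ 1.5,  1.0,  1.0,  2.0,  2.0,  2.0, 1.0])
MAX_STEP_DELTA    = 0.05   # cap per-cycle joint motion

def make_safe(action: np.ndarray, previous: np.ndarray) -> np.ndarray:
    action = np.clip(action, JOINT_LIMITS_LOW, JOINT_LIMITS_HIGH)
    delta = np.clip(action - previous, -MAX_STEP_DELTA, MAX_STEP_DELTA)
    return previous + delta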

Performance Breakdown: What the Numbers Really Mean

The documentation provides three detailed performance comparisons. Let’s interpret their practical significance one by one.

LIBERO Benchmark: Dominant Advantage

First, examine the full dataset training results:

| Model | LIBERO-Spatial | LIBERO-Object | LIBERO-Goal | LIBERO-Long | Average Success Rate |
| --- | --- | --- | --- | --- | --- |
| Diffusion Policy | 78.3% | 92.5% | 68.3% | 50.5% | 72.4% |
| Octo | 78.9% | 85.7% | 84.6% | 51.1% | 75.1% |
| OpenVLA | 84.7% | 88.4% | 79.2% | 53.7% | 76.5% |
| TraceVLA | 84.6% | 85.2% | 75.1% | 54.1% | 74.8% |
| UniVLA (Ours) | 96.5% | 96.8% | 95.6% | 92.0% | 95.2% |

Interpretation: On long-horizon tasks (Long), UniVLA’s 92.0% success rate far exceeds other models’ 50-54% levels. This demonstrates that the latent action space fundamentally improves multi-step planning. Other models may perform adequately on local operations but lack the capacity for abstract, global task understanding.

Data Efficiency: Victory in Few-Shot Learning

More remarkable is the performance under data-constrained scenarios. On LIBERO-Goal and LIBERO-Long tasks, trained with only 10%, 20%, and 50% of data:

Model Goal 10% Goal 20% Goal 50% Long 10% Long 20% Long 50%
ATM 64.3% 77.1% 36.5% 39.1%
OpenVLA 61.4% 66.0% 77.0% 11.6% 22.4% 36.6%
OpenVLA-OFT 76.8% 88.2% 91.1% 43.0% 62.2% 77.8%
UniVLA 86.3% 90.4% 93.1% 62.4% 71.4% 87.0%

Interpretation: With only 10% of data, UniVLA achieves 62.4% success on the most challenging Long tasks, while base OpenVLA reaches only 11.6%. This indicates that the abstract representation of latent actions significantly reduces data requirements. The model doesn’t need to watch ten thousand “open drawer” demonstrations to understand the essence of the action.

SimplerEnv: Real Physical Interaction Tests

In SimplerEnv’s four tasks, evaluation is broken down into “grasp success rate” and “final task success rate”:

| Model | Spoon: Grasp | Spoon: Task Success | Carrot: Grasp | Carrot: Task Success | Green Block: Grasp | Green Block: Task Success | Eggplant: Grasp | Eggplant: Task Success | Overall Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RT-1-X | 16.7% | 0.0% | 20.8% | 4.2% | 8.3% | 0.0% | 0.0% | 0.0% | 1.1% |
| Octo-Base | 34.7% | 12.5% | 52.8% | 8.3% | 31.9% | 0.0% | 66.7% | 43.1% | 16.0% |
| Octo-Small | 77.8% | 47.2% | 27.8% | 9.7% | 40.3% | 4.2% | 87.5% | 56.9% | 30.0% |
| OpenVLA | 4.1% | 0.0% | 33.3% | 0.0% | 12.5% | 0.0% | 8.3% | 4.1% | 1.0% |
| RoboVLM | 54.2% | 29.2% | 25.0% | 25.0% | 45.8% | 12.5% | 58.3% | 58.3% | 31.3% |
| UniVLA | 76.4% | 52.8% | 79.2% | 55.6% | 66.7% | 2.8% | 93.0% | 80.6% | 47.9% |

Note: In the “Stack Green Block on Yellow Block” task, all models perform poorly (UniVLA only 2.8% success), indicating that precise stacking remains challenging for current methods. However, UniVLA demonstrates strong robustness in grasping and placement tasks, particularly with the eggplant task reaching 80.6% success, far surpassing other models.

In-Depth Frequently Asked Questions

Based on documented content and potential user confusion, we compile the following Q&A.

Q1: What is the fundamental difference between UniVLA and OpenVLA?

A: The core difference lies in action representation. OpenVLA directly predicts low-level joint actions, like memorizing a dictionary. UniVLA first predicts high-level intentions (latent actions), then decodes them into specific actions, like understanding grammar before forming sentences. This gives UniVLA clear advantages in cross-robot transfer and data efficiency. The documentation explicitly states training costs are only 5% of OpenVLA’s, yet performance nearly doubles on LIBERO-Long.

Q2: Which robot embodiments does the pre-trained model support?

A: The full UniVLA (univla-7b) is trained on OXE’s diverse manipulation and navigation datasets plus Ego4D human videos, so in principle it supports any robot that provides visual observations and language instructions. However, real-world deployment requires fine-tuning the action decoder. The documentation provides deployment guidelines for the AgileX platform; other platforms require custom adaptation.

Q3: How can I reproduce results with limited computational resources?

A: The most resource-efficient path is: use pre-trained checkpoints directly + fine-tune only the action decoder. The LIBERO and CALVIN fine-tuning described in the documentation supports 8 A100 GPUs. For fewer resources:

  • Use subset pre-trained models like univla-7b-bridge-pt or univla-7b-human-pt
  • Reduce batch_size in finetune_libero.py and increase grad_accumulation_steps
  • Use lower LoRA rank (e.g., 8 or 16)

Q4: How does the codebook size of the latent action model affect performance?

A: While the documentation doesn’t explicitly state the codebook size C, the action token design {ACT_0, ACT_1, ..., ACT_C} indicates this is a hyperparameter. Generally, larger codebooks yield finer action representations but increase training difficulty. In practice, start with 512 or 1024 and adjust based on reconstruction loss and downstream performance. The two-stage training design also makes codebook optimization more stable.

Q5: Why is the improvement most significant on long-horizon tasks (Long)?

A: Because latent actions naturally suit hierarchical planning. Imagine a 10-step task: low-level methods must precisely predict every joint angle, with errors accumulating; UniVLA only needs to predict 10 high-level intentions (e.g., “grasp → move → place → return”), with each intention corresponding to a short action sequence, drastically reducing planning difficulty. This explains the overwhelming 92.0% vs. 53.7% advantage on LIBERO-Long.
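
A back-of-the-envelope illustration of the compounding effect (the reliability numbers are invented; only the trend matters):

# Invented per-step reliabilities: long low-level chains collapse, short intention chains survive
p_low, p_high = 0.99, 0.97
low_level_steps, high_level_steps = 300, 10

print(p_low ** low_level_steps)    # ~0.049
print(p_high ** high_level_steps)  # ~0.737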

Q6: How can we verify the interpretability of latent actions?

A: While no direct visualization tools are provided in the documentation, you can analyze them through:

  1. Fixing video input, perturbing latent action tokens, and observing changes in decoded physical actions
  2. Extracting tokens from Ego4D videos and clustering to analyze token distributions for similar actions
  3. Using t-SNE to visualize codebook vectors and see if semantics like “grasp” and “push” separate
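
For item 3, a minimal sketch with scikit-learn; the codebook export path is hypothetical, and perplexity must stay below the number of codes:

# Project the learned codebook to 2D and inspect how codes cluster
import torch
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

codebook = torch.load("/path/to/exported_codebook.pt")        # (C, D) tensor, assumed export
embedded = TSNE(n_components=2, perplexity=5).fit_transform(codebook.detach().cpu().numpy())

plt.scatter(embedded[:, 0], embedded[:, 1])
for i, (x, y) in enumerate(embedded):
    plt.annotate(f"ACT_{i}", (x, y))
plt.title("t-SNE of latent action codebook")
plt.savefig("codebook_tsne.png")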

Q7: Does the model support multimodal inputs like depth images or tactile feedback?

A: The current documentation only mentions RGB images and language instructions. However, the architecture design allows VQ-VAE encoders to be extended. To add depth images, modify prepare_vla_obs to input depth as additional channels; tactile data can be encoded as token sequences concatenated with visual features. This requires retraining or fine-tuning Stage 1.

Q8: What is the latency of the action decoder? Can it meet real-time control requirements?

A: Specific latency numbers aren’t provided, but the documentation emphasizes the decoder’s small size (12 million parameters). On modern GPUs, inference for such a network typically takes 5-10 milliseconds. The main latency comes from the VLA backbone’s autoregressive generation (around 100-200ms). For non-high-speed scenarios (e.g., tabletop manipulation), this is sufficiently real-time.

Q9: How does the model handle action frequency differences across robots in training data?

A: This is a key challenge for cross-robot learning. The latent action model naturally aligns different frequencies through discretization: regardless of whether original data is 10Hz or 50Hz, it’s encoded into similar-length token sequences. The “cross-embodiment video” learning mentioned in the documentation refers to this. Temporal downsampling/upsampling is automatically handled by the VQ-VAE encoder/decoder.

Q10: Can the model handle dynamic environments and moving targets?

A: LIBERO and CALVIN are static tabletop environments; SimplerEnv has some dynamic objects. Architecturally, UniVLA’s VLM backbone (DINO-SigLip + LLaMA) can understand dynamic scenes since Ego4D training data contains extensive human motion. However, actual performance depends on whether fine-tuning data covers similar distributions. To make robots chase moving targets, include such examples in post-training data.

Conclusion: UniVLA’s Present and Future

After thoroughly reviewing the documentation, UniVLA’s core value can be summarized as: using abstraction to combat complexity. It doesn’t try to make robots memorize every detail of the world but learns an action language that captures commonalities. This philosophy yields three practical benefits:

  1. Data Efficiency: Achieving higher performance with less data, crucial in robotics
  2. Transfer Capability: Train once, deploy everywhere, reducing marginal costs for new scenarios
  3. Scalability: The lightweight decoder enables universities and small labs to participate in research

Technically, the three-stage pipeline clearly separates representation learning, policy planning, and execution adaptation, with clear inputs/outputs for each stage, facilitating debugging and improvement. The 9 model versions provided in the model zoo (from base to downstream task fine-tunes) form a complete research loop.

Of course, challenges remain. The documentation mentions poor “stacking task” performance and OpenVLA’s strange behavior on some grasping tasks (only 4.1% success), highlighting current method limitations—fine force control and physical reasoning remain difficult. Additionally, real-world deployment complexity (lighting variations, mechanical wear, sensor noise) requires more engineering effort.

For teams looking to adopt UniVLA, the recommended path is:

  • Research Validation: Start with LIBERO or CALVIN fine-tuning to quickly verify algorithmic improvements
  • Industrial Application: Fine-tune on BridgeV2 data for common robotic arms like WidowX
  • Frontier Exploration: Apply latent actions to integrated navigation-manipulation tasks for humanoid robots

The team credits OpenVLA in its acknowledgments, and the field advances by standing on the shoulders of giants. UniVLA isn’t the endpoint but a critical step toward making Vision-Language-Action models practical. In the future, we may see robots truly understand that “cleaning the room” means a combination of tidying, sweeping, and organizing, not just empty words.


Model Downloads and Code Repository:

  • Pre-trained Models: https://huggingface.co/qwbu
  • Source Code: https://github.com/OpenDriveLab/UniVLA
  • Technical Paper: https://arxiv.org/pdf/2505.06111

To cite this content, please reference:

@article{bu2025univla,
  title={Univla: Learning to act anywhere with task-centric latent actions},
  author={Bu, Qingwen and Yang, Yanting and Cai, Jisong and Gao, Shenyuan and Ren, Guanghui and Yao, Maoqing and Luo, Ping and Li, Hongyang},
  journal={arXiv preprint arXiv:2505.06111},
  year={2025}
}