HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents
Core Question: What is HY-Embodied-0.5, what core capabilities does it deliver, and how to deploy & use it for real-world embodied intelligence and robotic control tasks?
1. Model Overview
Core Question: What is the positioning and core value of HY-Embodied-0.5?
HY-Embodied-0.5 is a dedicated suite of embodied foundation models developed by Tencent Robotics X and HY Vision Team, built exclusively to power real-world embodied intelligence systems. It closes the critical performance gap between generic Vision-Language Models (VLMs) and the strict operational demands of physical agents, with specialized enhancements for spatial-temporal visual perception and complex embodied reasoning (prediction, interaction, and planning).
Officially released on April 9, 2026, HY-Embodied-0.5 open-sources the HY-Embodied-0.5 MoT-2B weights on Hugging Face, paired with full official inference code. The model family includes two purpose-built variants:
- A highly efficient 2B model optimized for edge device deployment
- A powerful 32B model designed for high-complexity reasoning tasks
Through self-evolving post-training and large-to-small on-policy distillation, the compact MoT-2B outperforms state-of-the-art peer models across 16 benchmarks. The 32B variant achieves frontier-level performance comparable to Gemini 3.0 Pro. At its core, HY-Embodied serves as a robust “brain” for Vision-Language-Action (VLA) pipelines, delivering reliable results for real-world physical robot control.
2. Core Features
Core Question: What unique advantages make HY-Embodied-0.5 stand out from similar embodied intelligence models?
2.1 Evolved Mixture-of-Transformers (MoT) Architecture
The MoT architecture uses latent tokens for modality-specific computing, maximizing efficiency while preserving full visual precision. The MoT-2B variant has a total parameter count of 4B, but only 2.2B activated parameters during inference. By prioritizing modality-specific computing in the vision pathway, it matches the inference speed of a dense 2B model while producing superior fine-grained perceptual representations.
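The release does not publish the routing internals, but the modality-specific dispatch idea can be sketched as follows: each token carries a modality tag, and only the matching expert's parameters run for that token. The module and shapes below are illustrative stand-ins, not the HY-Embodied-0.5 implementation.

```python
import torch
import torch.nn as nn

class ModalityRoutedBlock(nn.Module):
    """Toy Mixture-of-Transformers-style block: one expert per modality."""

    def __init__(self, dim: int):
        super().__init__()
        # Stand-in "experts" (0 = text, 1 = vision); real experts would be
        # full transformer sub-layers rather than single linear maps.
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(2)])

    def forward(self, x: torch.Tensor, modality: torch.Tensor) -> torch.Tensor:
        # x: (seq, dim); modality: (seq,) with values in {0, 1}
        out = torch.empty_like(x)
        for m, expert in enumerate(self.experts):
            mask = modality == m
            if mask.any():
                # Only this expert's weights are activated for these tokens,
                # which is why activated parameters < total parameters.
                out[mask] = expert(x[mask])
        return out

block = ModalityRoutedBlock(dim=8)
x = torch.randn(5, 8)
modality = torch.tensor([0, 0, 1, 1, 0])  # mixed text/vision sequence
y = block(x, modality)
print(y.shape)
```

Because each token touches only one expert's weights, per-token compute tracks the activated parameter count (2.2B) rather than the total (4B), matching the dense-2B inference-speed claim above.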
2.2 High-Quality Mixed Chain Reasoning
An advanced iterative, self-evolving post-training pipeline powers robust reasoning. Using on-policy distillation, the compact 2B model directly inherits sophisticated step-by-step reasoning, planning, and high-quality “thinking” capabilities from the 32B flagship model.
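The exact recipe is not public, but the core of on-policy distillation can be sketched with stand-in logits: the student is trained on sequences it generated itself, with the teacher's distribution as the target at each of those student-chosen positions. The tensors below are random placeholders for real model outputs.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, seq = 16, 6

# Per-token logits from student and teacher, evaluated on a rollout the
# student itself generated (hence "on-policy").
student_logits = torch.randn(seq, vocab, requires_grad=True)
teacher_logits = torch.randn(seq, vocab)

# Reverse KL, KL(student || teacher), penalizes the student for putting
# mass where the teacher does not, pulling it toward the teacher's modes.
log_p_student = F.log_softmax(student_logits, dim=-1)
log_p_teacher = F.log_softmax(teacher_logits, dim=-1)
kl = (log_p_student.exp() * (log_p_student - log_p_teacher)).sum(-1).mean()

kl.backward()  # gradient flows only into the student
print(f"reverse KL on student rollout: {kl.item():.4f}")
```

Training on the student's own rollouts (rather than teacher-generated text) keeps the supervision aligned with the states the small model actually visits at inference time.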
2.3 Large-Scale Embodied Pre-training
The model is trained on a curated dataset of over 100 million embodied and spatial-specific data points, with a total training corpus exceeding 200 billion tokens. This foundation builds native, deep understanding of 3D spaces, physical object interactions, and agent dynamics.
2.4 Optimized for VLA Pipeline Integration
Beyond academic benchmarks, HY-Embodied-0.5 is engineered as the core cognitive engine for physical robots. It integrates seamlessly into Vision-Language-Action (VLA) frameworks, acting as a stable, high-performance brain to drive high success rates in complex real-world robotic control tasks.
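Schematically, the division of labor in such a pipeline looks like the loop below. The function names and message formats are hypothetical stand-ins to show where the VLM "brain" sits, not the HY-Embodied-0.5 API.

```python
def vla_step(observation: dict, instruction: str, vlm, action_head) -> dict:
    # 1. The VLM reasons over the camera frame + instruction and emits a
    #    short-horizon sub-goal (e.g. "grasp the red mug handle").
    sub_goal = vlm(image=observation["rgb"], text=instruction)
    # 2. A lightweight action head grounds that sub-goal into low-level
    #    motor commands conditioned on the current proprioceptive state.
    return action_head(sub_goal, observation["proprio"])

# Toy stand-ins so the loop is runnable end to end.
vlm = lambda image, text: f"plan for: {text}"
action_head = lambda goal, state: {"joints": [0.0] * 7, "goal": goal}

obs = {"rgb": "frame-0", "proprio": [0.1] * 7}
action = vla_step(obs, "open the fridge", vlm, action_head)
print(action["goal"])
```

The design point is that the VLM handles perception and planning while a smaller, faster head handles control, so the brain's latency budget is decoupled from the robot's control frequency.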
3. Project Roadmap
Core Question: What functional updates are planned for HY-Embodied-0.5 in future releases?
- [x] Transformers Inference
- [ ] vLLM Inference
- [ ] Online Gradio Demo
4. Dependencies & Installation Guide
Core Question: What hardware/software requirements are needed to deploy HY-Embodied-0.5, and how to complete the installation correctly?
4.1 Prerequisites
- Operating System: Linux (officially recommended)
- Python: 3.12 or later (tested and validated)
- CUDA: 12.6
- PyTorch: 2.8.0
- GPU: NVIDIA GPU with full CUDA support
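A quick way to sanity-check the environment against these prerequisites is to compare versions as integer tuples rather than strings. This is a stdlib-only sketch; extend it with checks on `torch.__version__` and `torch.version.cuda` as needed.

```python
import sys

def version_tuple(v: str) -> tuple:
    """Parse '3.12' or '2.8.0' into comparable integer tuples. Naive string
    comparison would rank '2.10.0' below '2.8.0', which is wrong."""
    return tuple(int(p) for p in v.split("."))

REQUIRED_PYTHON = "3.12"

ok = sys.version_info[:2] >= version_tuple(REQUIRED_PYTHON)
print(f"Python {sys.version.split()[0]}: {'OK' if ok else 'needs >= ' + REQUIRED_PYTHON}")
```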
4.2 Step-by-Step Installation
1. Install the custom Transformers version required for model compatibility:

   ```shell
   pip install git+https://github.com/huggingface/transformers@9293856c419762ebf98fbe2bd9440f9ce7069f1a
   ```

   Note: All custom improvements will be merged into the main Transformers branch in future updates.

2. Install the remaining project dependencies:

   ```shell
   pip install -r requirements.txt
   ```
4.3 Quick Start
1. Clone the official repository:

   ```shell
   git clone https://github.com/Tencent-Hunyuan/HY-Embodied
   cd HY-Embodied/
   ```

2. Install the full dependencies:

   ```shell
   pip install -r requirements.txt
   ```

3. Run the inference script:

   ```shell
   python inference.py
   ```

The demo script supports both single-generation and batch-generation workflows.
4.4 Model Download
Model weights (tencent/HY-Embodied-0.5) are automatically downloaded from the Hugging Face Hub. Ensure 8 GB of free disk space for the weight files.
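Before the first run it can save a failed download to confirm the cache disk has room. This is a stdlib sketch; point it at your actual Hugging Face cache directory, which defaults to `~/.cache/huggingface`.

```python
import shutil

# Check free space where the Hugging Face cache lives: the weights alone
# need roughly 8 GB, plus headroom for temporary download files.
free_gb = shutil.disk_usage(".").free / 1e9  # replace "." with your cache path
print(f"Free space: {free_gb:.1f} GB "
      f"({'sufficient' if free_gb >= 8 else 'insufficient'} for the weights)")
```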
4.5 Hardware Requirements
- GPU: NVIDIA GPU with at least 16 GB VRAM (recommended for optimal performance)
- CPU: Supported, but with significantly slower inference speed
- System Memory: 16 GB RAM minimum (recommended)
- Storage: 20 GB+ free space for model files and dependencies
5. Inference Examples with Transformers
Core Question: How to implement single-sample and batch inference for HY-Embodied-0.5?
5.1 Basic Single Inference
```python
import os

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

# Core configuration
MODEL_PATH = "tencent/HY-Embodied-0.5"
DEVICE = "cuda"
THINKING_MODE = False
TEMPERATURE = 0.8

# Load processor
processor = AutoProcessor.from_pretrained(MODEL_PATH)

# Load chat template if available
chat_template_path = os.path.join(MODEL_PATH, "chat_template.jinja")
if os.path.exists(chat_template_path):
    with open(chat_template_path) as f:
        processor.chat_template = f.read()

# Load model with bfloat16 precision
model = AutoModelForImageTextToText.from_pretrained(MODEL_PATH, torch_dtype=torch.bfloat16)
model.to(DEVICE).eval()

# Prepare input (image + text prompt)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "./figures/example.jpg"},
            {"type": "text", "text": "Describe the image in detail."},
        ],
    }
]

# Process input with chat template
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    enable_thinking=THINKING_MODE,
).to(model.device)

# Generate output without gradient computation
with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=32768,
        use_cache=True,
        temperature=TEMPERATURE,
        do_sample=TEMPERATURE > 0,
    )

# Decode and print the final result, stripping the prompt tokens
output_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```
5.2 Batch Inference
```python
import os

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

# Core configuration
MODEL_PATH = "tencent/HY-Embodied-0.5"
DEVICE = "cuda"
THINKING_MODE = False
TEMPERATURE = 0.8

# Load processor and model
processor = AutoProcessor.from_pretrained(MODEL_PATH)
chat_template_path = os.path.join(MODEL_PATH, "chat_template.jinja")
if os.path.exists(chat_template_path):
    with open(chat_template_path) as f:
        processor.chat_template = f.read()
model = AutoModelForImageTextToText.from_pretrained(MODEL_PATH, torch_dtype=torch.bfloat16)
model.to(DEVICE).eval()

# Batch input set (image-text & text-only)
messages_batch = [
    # Sample 1: image + text prompt
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "./figures/example.jpg"},
                {"type": "text", "text": "Describe the image in detail."},
            ],
        }
    ],
    # Sample 2: text-only prompt
    [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "How to open a fridge?"},
            ],
        }
    ],
]

# Process each input independently
all_inputs = []
for msgs in messages_batch:
    inp = processor.apply_chat_template(
        msgs,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
        enable_thinking=THINKING_MODE,
    )
    all_inputs.append(inp)

# Left-pad so all sequences end at the same position for batch generation
batch = processor.pad(all_inputs, padding=True, padding_side="left").to(model.device)

# Run batch generation
with torch.no_grad():
    batch_generated_ids = model.generate(
        **batch,
        max_new_tokens=32768,
        use_cache=True,
        temperature=TEMPERATURE,
        do_sample=TEMPERATURE > 0,
    )

# Decode and print results, skipping the shared padded prompt length
padded_input_len = batch["input_ids"].shape[1]
for i, msgs in enumerate(messages_batch):
    out_ids = batch_generated_ids[i][padded_input_len:]
    print(f"\n--- Sample {i} ---")
    print(processor.decode(out_ids, skip_special_tokens=True))
```
6. Performance Evaluation
Core Question: How does HY-Embodied-0.5 MoT-2B perform on standard embodied intelligence benchmarks?
HY-Embodied-0.5 MoT-2B was evaluated on 22 embodied-relevant benchmarks against models of similar size. All HY-Embodied-0.5 results use thinking mode; for the other models, the better of their thinking and non-thinking scores is reported.
6.1 Visual Performance
| Benchmark | HY-Embodied 0.5 MoT-2B | Qwen3-VL 2B | Qwen3-VL 4B | RoboBrain 2.5 4B | MiMo-Embodied 7B |
|---|---|---|---|---|---|
| CV-Bench | 89.2 | 80.0 | 85.7 | 86.9 | 88.8 |
| DA-2K | 92.3 | 69.5 | 76.5 | 79.4 | 72.2 |
6.2 Embodied Understanding
| Benchmark | HY-Embodied 0.5 MoT-2B | Qwen3-VL 2B | Qwen3-VL 4B | RoboBrain 2.5 4B | MiMo-Embodied 7B |
|---|---|---|---|---|---|
| ERQA | 54.5 | 41.8 | 47.3 | 43.3 | 46.8 |
| EmbSpatial-Bench | 82.8 | 75.9 | 80.7 | 73.8 | 76.2 |
| RoboBench-MCQ | 49.2 | 36.9 | 45.8 | 44.4 | 43.6 |
| RoboBench-Planning | 54.2 | 36.2 | 36.4 | 39.2 | 58.7 |
| RoboSpatial-Home | 55.7 | 45.3 | 63.2 | 62.3 | 61.8 |
| ShareRobot-Aff. | 26.8 | 19.8 | 25.5 | 25.5 | 9.0 |
| ShareRobot-Traj. | 73.3 | 41.6 | 62.2 | 81.4 | 50.6 |
| Ego-Plan2 | 45.5 | 35.5 | 38.8 | 52.6 | 39.9 |
6.3 Spatial Understanding
| Benchmark | HY-Embodied 0.5 MoT-2B | Qwen3-VL 2B | Qwen3-VL 4B | RoboBrain 2.5 4B | MiMo-Embodied 7B |
|---|---|---|---|---|---|
| 3DSRBench | 57.0 | 39.9 | 43.9 | 44.8 | 42.0 |
| All-Angles Bench | 55.1 | 42.3 | 46.7 | 43.8 | 49.0 |
| MindCube | 66.3 | 28.4 | 31.0 | 26.9 | 36.2 |
| MMSI-Bench | 33.2 | 23.6 | 25.1 | 20.5 | 31.9 |
| RefSpatial-Bench | 45.8 | 28.9 | 45.3 | 56.0 | 48.0 |
| SAT | 76.7 | 45.3 | 56.7 | 51.3 | 78.7 |
| SIBench-mini | 58.2 | 42.0 | 50.9 | 47.3 | 53.1 |
| SITE-Bench-Image | 62.7 | 52.3 | 61.0 | 57.9 | 49.9 |
| SITE-Bench-Video | 63.5 | 52.2 | 58.0 | 54.8 | 58.9 |
| ViewSpatial | 53.1 | 37.2 | 41.6 | 36.6 | 36.1 |
| VSIBench | 60.5 | 48.0 | 55.2 | 41.7 | 48.5 |
| Where2Place | 68.0 | 45.0 | 59.0 | 65.0 | 63.6 |
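As a rough summary of the spatial-understanding table, one can average the per-benchmark scores into a single number per model. These are unweighted means over benchmarks that differ in scale and difficulty, so treat them only as a coarse comparison; two columns are shown for brevity.

```python
# Per-benchmark scores copied from the spatial-understanding table above,
# in row order (3DSRBench ... Where2Place).
scores = {
    "HY-Embodied-0.5 MoT-2B": [57.0, 55.1, 66.3, 33.2, 45.8, 76.7,
                               58.2, 62.7, 63.5, 53.1, 60.5, 68.0],
    "Qwen3-VL 2B":            [39.9, 42.3, 28.4, 23.6, 28.9, 45.3,
                               42.0, 52.3, 52.2, 37.2, 48.0, 45.0],
}

for model, vals in scores.items():
    print(f"{model}: {sum(vals) / len(vals):.1f}")
```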
7. Citation
If you use HY-Embodied-0.5 in your research or industrial applications, please cite the official paper:
```bibtex
@article{tencent2026hyembodied05,
  title={HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents},
  author={Tencent Robotics X and HY Vision Team},
  journal={arXiv preprint arXiv:2604.07430},
  year={2026}
}
```
8. Frequently Asked Questions
Core Question: What are the most frequently asked questions about HY-Embodied-0.5?
Q1: What is the difference between HY-Embodied-0.5 MoT-2B and 32B models?
A: The 2B variant is optimized for edge deployment with 2.2B activated parameters and fast inference. The 32B variant handles complex reasoning with frontier-level performance matching Gemini 3.0 Pro, requiring higher computational resources.
Q2: Is a dedicated GPU required to run HY-Embodied-0.5?
A: CPU inference is supported but extremely slow. An NVIDIA GPU with 16GB+ VRAM is strongly recommended for real-time task performance.
Q3: Why must I install a specific version of Transformers?
A: HY-Embodied-0.5 uses custom MoT architecture features not yet merged into the main Transformers branch, ensuring full model functionality and compatibility.
Q4: What does the THINKING_MODE parameter do in inference?
A: Enabling THINKING_MODE activates step-by-step reasoning logic, which is the mode used for all official benchmark results of HY-Embodied-0.5.
Q5: Does HY-Embodied-0.5 support Windows operating systems?
A: Linux is the only officially tested and recommended OS. Windows may cause dependency conflicts, driver mismatches, and runtime errors.
Q6: Where can I download HY-Embodied-0.5 model weights?
A: Weights are automatically downloaded from the Hugging Face Hub (tencent/HY-Embodied-0.5); no manual download is required.
Q7: Why use left-padding in batch inference?
A: Left-padding ensures consistent token alignment across variable-length inputs, preventing garbled outputs and logical errors in batch generation.
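A toy example makes the alignment issue concrete (0 stands in for the pad token id):

```python
seqs = [[5, 6, 7, 8], [9, 10]]
width = max(len(s) for s in seqs)

# Left-padding: every row ends on a real token, so batched generation
# appends new tokens right after actual content for all rows.
left_padded = [[0] * (width - len(s)) + s for s in seqs]
# Right-padding: the shorter row ends on pads, so generation would
# continue from pad tokens, corrupting its output.
right_padded = [s + [0] * (width - len(s)) for s in seqs]

print(left_padded)   # all rows end on real tokens
print(right_padded)  # row 2 ends on pad tokens
```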
9. Conclusion
HY-Embodied-0.5 redefines embodied foundation models by balancing edge efficiency, real-world performance, and VLA pipeline compatibility. Its innovative MoT architecture, large-scale embodied pre-training, and cross-model distillation make it a production-ready cognitive core for physical robots.
The open-sourced MoT-2B variant lowers the barrier to edge deployment for researchers and engineers, while the 32B variant pushes the limits of complex embodied reasoning. Upcoming updates—including vLLM inference support and an online Gradio demo—will further expand accessibility and scalability for global embodied intelligence projects.
