FastTD3: Simple, Fast, and Powerful Reinforcement Learning for Humanoid Control
Reinforcement learning has dramatically advanced robotics capabilities in recent years, particularly for humanoid control tasks that require complex movement and manipulation. However, traditional RL algorithms often suffer from long training times and implementation complexity that hinder practical application and rapid iteration.
Addressing these challenges, researchers have developed FastTD3 – a high-performance variant of the Twin Delayed Deep Deterministic Policy Gradient algorithm specifically optimized for complex humanoid control tasks. What makes FastTD3 remarkable isn’t algorithmic complexity but rather its strategic combination of proven techniques that deliver unprecedented training speeds without sacrificing performance.
The Challenge: Why Humanoid Control Needs Faster RL
Humanoid robots present unique control challenges with their high degrees of freedom, complex dynamics, and stringent balance requirements. Traditional approaches like Proximal Policy Optimization (PPO) have been widely used because they train quickly with massively parallel simulation. However, as on-policy algorithms, they lack sample efficiency, making them unsuitable for fine-tuning during real-world deployment or initializing training with demonstrations.
Off-policy algorithms like SAC and TD3 offer better sample efficiency but typically require much longer training times and greater implementation complexity. This trade-off has forced researchers and practitioners to choose between rapid iteration and sample efficiency – until now.
FastTD3 bridges this gap by delivering both speed and efficiency, enabling training of complex humanoid control policies in under 3 hours on a single A100 GPU for various tasks from popular benchmarks including HumanoidBench, IsaacLab, and MuJoCo Playground.
What Makes FastTD3 Different?
FastTD3 builds upon the TD3 foundation with several key optimizations that collectively deliver dramatic performance improvements:
Parallel Environment Simulation
By utilizing massively parallel environments (typically 1000+), FastTD3 significantly increases data throughput. The randomness from parallel environments enhances data diversity, allowing TD3 to better leverage its strength in value function exploitation while mitigating its exploration limitations.
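To make the data flow concrete, here is a minimal sketch of a vectorized rollout loop. The batched environment below is a made-up stand-in (not the actual FastTD3 wrapper API), and the dimensions are illustrative:

import torch
import torch.nn as nn

class DummyBatchedEnv:
    """Illustrative stand-in for a vectorized simulator that steps every environment at once."""
    def __init__(self, num_envs, obs_dim, act_dim):
        self.num_envs, self.obs_dim, self.act_dim = num_envs, obs_dim, act_dim
    def reset(self):
        return torch.zeros(self.num_envs, self.obs_dim)
    def step(self, actions):
        next_obs = torch.randn(self.num_envs, self.obs_dim)
        rewards = torch.randn(self.num_envs)
        dones = torch.zeros(self.num_envs, dtype=torch.bool)
        return next_obs, rewards, dones

num_envs, obs_dim, act_dim = 1024, 48, 12
envs = DummyBatchedEnv(num_envs, obs_dim, act_dim)
actor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, act_dim), nn.Tanh())

obs = envs.reset()
for _ in range(100):
    with torch.no_grad():
        actions = actor(obs) + 0.1 * torch.randn(num_envs, act_dim)  # exploration noise
    next_obs, rewards, dones = envs.step(actions)
    # each simulator step yields num_envs transitions for the replay buffer
    obs = next_obs

The key point is that one forward pass of the actor and one simulator step produce a thousand-plus transitions at once, which is what drives the high data throughput.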
Large-Batch Training
FastTD3 uses unusually large batch sizes of 32,768 for network updates. While this increases per-update computation time, the high data diversity in each gradient update provides more stable learning signals for the critic network, ultimately reducing total training time through improved training efficiency.
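As a rough illustration of what a single large-batch critic update looks like, the snippet below runs one gradient step on 32,768 transitions. The random tensors stand in for samples drawn from the replay buffer, and the layer sizes and learning rate are assumptions rather than the exact FastTD3 settings:

import torch
import torch.nn as nn

batch_size, obs_dim, act_dim = 32_768, 48, 12
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 1024), nn.ReLU(),
                       nn.Linear(1024, 512), nn.ReLU(),
                       nn.Linear(512, 1))
optimizer = torch.optim.AdamW(critic.parameters(), lr=3e-4)

# Random tensors stand in for a minibatch sampled from the replay buffer.
obs = torch.randn(batch_size, obs_dim)
actions = torch.randn(batch_size, act_dim)
td_targets = torch.randn(batch_size, 1)  # placeholder for r + gamma * min(Q1', Q2')

q_values = critic(torch.cat([obs, actions], dim=-1))
loss = nn.functional.mse_loss(q_values, td_targets)  # one gradient signal averaged over 32,768 transitions
optimizer.zero_grad()
loss.backward()
optimizer.step()

Averaging the loss over such a large, diverse batch is what smooths the learning signal for the critic.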
Distributional Critic
Incorporating distributional reinforcement learning, FastTD3 models value distributions rather than just expected values. This better captures environmental uncertainty and improves learning efficiency, though it introduces two additional hyperparameters (v_min and v_max) that require tuning.
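One common way to implement such a critic is a C51-style categorical head that places probability mass on a fixed grid of atoms between v_min and v_max. The sketch below shows the general shape of that idea; the atom count and value range are placeholders, not the tuned FastTD3 values:

import torch
import torch.nn as nn

class CategoricalQCritic(nn.Module):
    """Q(s, a) modeled as a categorical distribution over fixed atoms in [v_min, v_max]."""
    def __init__(self, obs_dim, act_dim, num_atoms=101, v_min=-250.0, v_max=250.0):
        super().__init__()
        self.register_buffer("atoms", torch.linspace(v_min, v_max, num_atoms))
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, num_atoms),
        )

    def forward(self, obs, action):
        logits = self.net(torch.cat([obs, action], dim=-1))
        probs = torch.softmax(logits, dim=-1)        # per-sample distribution over returns
        q_value = (probs * self.atoms).sum(dim=-1)   # expected value, e.g. for the actor loss
        return probs, q_value

critic = CategoricalQCritic(obs_dim=48, act_dim=12)
probs, q = critic(torch.randn(4, 48), torch.randn(4, 12))

Because the atom grid is fixed, v_min and v_max must roughly bracket the achievable returns of the task, which is why they become hyperparameters to tune.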
Architectural Simplicity
Despite recent trends toward more complex architectures, FastTD3 uses standard MLPs with descending layer sizes (1024-512-256 for critics, 512-256-128 for actors). Experiments showed that smaller models degrade both time and sample efficiency, while additional complexity like residual connections or layer normalization slowed training without significant gains.
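A minimal sketch of those architectures in PyTorch might look as follows, assuming illustrative observation and action dimensions:

import torch.nn as nn

def mlp(sizes, activation=nn.ReLU):
    # Plain feed-forward trunk: Linear -> activation for each consecutive pair of sizes.
    layers = []
    for in_dim, out_dim in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(in_dim, out_dim), activation()]
    return nn.Sequential(*layers)

obs_dim, act_dim = 48, 12  # illustrative dimensions

# Critic trunk with descending hidden sizes 1024-512-256, followed by a scalar Q head.
critic = nn.Sequential(mlp([obs_dim + act_dim, 1024, 512, 256]), nn.Linear(256, 1))

# Actor trunk with descending hidden sizes 512-256-128 and a tanh head to bound actions.
actor = nn.Sequential(mlp([obs_dim, 512, 256, 128]), nn.Linear(128, act_dim), nn.Tanh())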
Implementation and Technical Details
Environment Support and Wrappers
FastTD3 supports multiple popular robotics simulation platforms through customized wrappers that provide consistent interfaces:
HumanoidBench: Uses Stable Baselines3’s SubprocVecEnv for multiprocessing parallelism. The team contributed a pull request to disable default GPU-based rendering, enabling more than 100 parallel environments.
MuJoCo Playground: Employs a custom wrapper that converts JAX tensors to PyTorch tensors and follows the RSL-RL API. Additional functionality was implemented to save final observations before environment resets.
IsaacLab: Features a lightweight wrapper compliant with the RSL-RL API. Note that rendering during training isn’t currently supported due to IsaacLab’s limitations on concurrent simulations.
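Returning to the MuJoCo Playground wrapper: the JAX-to-PyTorch handoff can in principle avoid copying through host memory by using DLPack. The snippet below is a hedged sketch of that conversion, not the project's actual wrapper code; the exact calls depend on your JAX and PyTorch versions:

import jax.numpy as jnp
import torch

def jax_to_torch(x):
    # DLPack lets the two frameworks share the underlying buffer without a host round-trip.
    # Recent JAX arrays expose __dlpack__ directly; older versions may require
    # jax.dlpack.to_dlpack(x) combined with torch.utils.dlpack.from_dlpack.
    return torch.from_dlpack(x)

obs_jax = jnp.ones((1024, 48), dtype=jnp.float32)  # what a Playground step might return
obs_torch = jax_to_torch(obs_jax)                  # torch.Tensor with the same shape and dtype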
Asymmetric Actor-Critic Design
For environments providing privileged state information (IsaacLab and MuJoCo Playground), FastTD3 implements asymmetric actor-critic architecture. This allows the critic network to leverage additional state information while the actor operates on partial observations only.
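Conceptually, the asymmetric setup just means the two networks receive different inputs. A minimal sketch, with made-up dimensions and without the distributional head, might look like this (exactly what the privileged input contains depends on the environment):

import torch
import torch.nn as nn

obs_dim, priv_dim, act_dim = 48, 100, 12  # illustrative sizes

# Actor sees only the partial observation that would be available on the real robot.
actor = nn.Sequential(nn.Linear(obs_dim, 512), nn.ReLU(),
                      nn.Linear(512, 256), nn.ReLU(),
                      nn.Linear(256, act_dim), nn.Tanh())

# Critic additionally consumes privileged simulator state; it is only needed during training.
critic = nn.Sequential(nn.Linear(priv_dim + act_dim, 1024), nn.ReLU(),
                       nn.Linear(1024, 512), nn.ReLU(),
                       nn.Linear(512, 1))

obs = torch.randn(8, obs_dim)
priv_state = torch.randn(8, priv_dim)  # e.g. full state from the simulator
action = actor(obs)
q = critic(torch.cat([priv_state, action], dim=-1))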
Performance Optimization Techniques
Automatic Mixed Precision (AMP): Using bfloat16 precision training provides up to 40% speedup without observed instability.
Torch.Compile: Built on the LeanRL framework, this delivers up to 35% additional speedup through just-in-time compilation.
Combined Optimization: Using both AMP and torch.compile together can achieve up to 70% training speed improvement.
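A hedged sketch of how these two pieces can be combined in a single update is shown below; the shapes, learning rate, and update body are illustrative, not the actual FastTD3 training loop:

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
critic = nn.Sequential(nn.Linear(60, 1024), nn.ReLU(), nn.Linear(1024, 1)).to(device)
optimizer = torch.optim.AdamW(critic.parameters(), lr=3e-4)

@torch.compile  # JIT-compiles the update into fused kernels after the first call
def critic_update(batch_inputs, td_targets):
    # bfloat16 autocast for the forward pass; parameters stay in float32
    with torch.autocast(device_type=device, dtype=torch.bfloat16, enabled=(device == "cuda")):
        q = critic(batch_inputs)
        loss = nn.functional.mse_loss(q.float(), td_targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss

inputs = torch.randn(4096, 60, device=device)
targets = torch.randn(4096, 1, device=device)
loss = critic_update(inputs, targets)

Note that bfloat16 does not require gradient scaling, which keeps the AMP integration simple compared to float16.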
Replay Buffer Design
Instead of a fixed-size global replay buffer, FastTD3 sets the buffer size to N × num_envs. Scaling the buffer with the number of parallel environments keeps the amount of trajectory data stored per environment constant, so adding environments does not crowd out each environment's history. All data resides on the GPU to avoid CPU-GPU transfer bottlenecks.
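The sketch below captures the idea of a per-environment, GPU-resident circular buffer; it is a simplified illustration under assumed shapes, not the project's actual buffer class:

import torch

class PerEnvReplayBuffer:
    """Stores the last N transitions from each of num_envs environments, entirely on one device."""
    def __init__(self, num_envs, capacity_per_env, obs_dim, act_dim, device):
        self.num_envs, self.capacity = num_envs, capacity_per_env
        self.ptr, self.size = 0, 0
        self.obs = torch.zeros(num_envs, capacity_per_env, obs_dim, device=device)
        self.actions = torch.zeros(num_envs, capacity_per_env, act_dim, device=device)
        self.rewards = torch.zeros(num_envs, capacity_per_env, device=device)
        self.next_obs = torch.zeros(num_envs, capacity_per_env, obs_dim, device=device)
        self.dones = torch.zeros(num_envs, capacity_per_env, device=device)

    def add(self, obs, actions, rewards, next_obs, dones):
        # One batched write per simulator step: a column of num_envs transitions.
        self.obs[:, self.ptr] = obs
        self.actions[:, self.ptr] = actions
        self.rewards[:, self.ptr] = rewards
        self.next_obs[:, self.ptr] = next_obs
        self.dones[:, self.ptr] = dones
        self.ptr = (self.ptr + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        env_idx = torch.randint(self.num_envs, (batch_size,), device=self.obs.device)
        time_idx = torch.randint(self.size, (batch_size,), device=self.obs.device)
        return (self.obs[env_idx, time_idx], self.actions[env_idx, time_idx],
                self.rewards[env_idx, time_idx], self.next_obs[env_idx, time_idx],
                self.dones[env_idx, time_idx])

buffer = PerEnvReplayBuffer(num_envs=1024, capacity_per_env=1024, obs_dim=48, act_dim=12,
                            device="cuda" if torch.cuda.is_available() else "cpu")

Because sampling is just tensor indexing on device, minibatches of any size can be drawn without touching host memory.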
Experimental Results and Performance
FastTD3 has been extensively evaluated across multiple benchmarks:
HumanoidBench Performance
Across 39 tasks in HumanoidBench, FastTD3 reaches the success threshold on most tasks within 3 hours of training on a single A100 GPU. The algorithm particularly excels at tasks involving dexterous hand manipulation and whole-body coordination.
IsaacLab and MuJoCo Playground Results
In high-dimensional control tasks from these platforms, FastTD3 matches or exceeds PPO’s training speed while maintaining better sample efficiency. The advantage is especially pronounced on rough terrain tasks with domain randomization.
Sim-to-Real Transfer
In a significant milestone, policies trained with FastTD3 in MuJoCo Playground were successfully transferred to a physical Booster T1 humanoid robot. This represents the first documented instance of a policy trained with off-policy RL being successfully deployed on a full-size humanoid robot.
Getting Started with FastTD3
System Requirements and Dependencies
Before installation, ensure your system has these prerequisites:
- Conda for environment management
- Git LFS (for IsaacLab support)
- CMake (for IsaacLab)
- System packages: libglfw3, libgl1-mesa-glx, libosmesa6, git-lfs, cmake
Environment Setup
FastTD3 requires different environments for different simulation platforms:
HumanoidBench Environment:
conda create -n fasttd3_hb -y python=3.10
conda activate fasttd3_hb
pip install --editable git+https://github.com/carlosferrazza/humanoid-bench.git#egg=humanoid-bench
pip install -r requirements/requirements.txt
MuJoCo Playground Environment:
conda create -n fasttd3_playground -y python=3.10
conda activate fasttd3_playground
pip install -r requirements/requirements_playground.txt
IsaacLab Environment:
conda create -n fasttd3_isaaclab -y python=3.10
conda activate fasttd3_isaaclab
pip install 'isaacsim[all,extscache]==4.5.0' --extra-index-url https://pypi.nvidia.com
git clone https://github.com/isaac-sim/IsaacLab.git
cd IsaacLab
./isaaclab.sh --install
cd ..
pip install -r requirements/requirements.txt
Running Training Experiments
Basic HumanoidBench Training:
conda activate fasttd3_hb
python fast_td3/train.py \
--env_name h1hand-hurdle-v0 \
--exp_name FastTD3 \
--render_interval 5000 \
--seed 1
MuJoCo Playground Training:
conda activate fasttd3_playground
python fast_td3/train.py \
--env_name T1JoystickFlatTerrain \
--exp_name FastTD3 \
--render_interval 5000 \
--seed 1
FastTD3 + SimbaV2 Configuration
For best performance, the authors recommend using FastTD3 with SimbaV2:
python fast_td3/train.py \
--env_name T1JoystickFlatTerrain \
--exp_name FastTD3 \
--render_interval 5000 \
--agent fasttd3_simbav2 \
--batch_size 8192 \
--critic_learning_rate_end 3e-5 \
--actor_learning_rate_end 3e-5 \
--weight_decay 0.0 \
--critic_hidden_dim 512 \
--critic_num_blocks 2 \
--actor_hidden_dim 256 \
--actor_num_blocks 1 \
--seed 1
Multi-GPU Training Support
FastTD3 supports multi-GPU training through a parallelized approach where identical experiments run independently on each GPU rather than distributing parameters across devices. This design means:
- Effective environment count = num_envs × num_gpus
- Effective batch size = batch_size × num_gpus
- Effective buffer size = buffer_size × num_gpus
To launch multi-GPU training:
CUDA_VISIBLE_DEVICES=0,1,2,3 python train_multigpu.py \
--env_name Isaac-Velocity-Flat-G1-v0 \
--exp_name FastTD3-MultiGPU \
--seed 1
This approach scales data collection and training throughput roughly in proportion to the GPU count, so results comparable to a single-GPU run with a correspondingly higher environment count can be reached in less wall-clock time.
Practical Tips and Troubleshooting
Improving Training Performance
If FastTD3 gets stuck in local optima during early training:
- Increase num_updates: This helps the agent better exploit value functions, especially useful for easier tasks or those with low-dimensional action spaces.
- Adjust num_steps: Try num_steps=3 or disable use_cdq if the agent fails to make progress.
- Modify reward penalties: For tasks with penalty terms (torque, energy, action rate), consider starting with lower penalties and gradually increasing them. Curriculum learning approaches often work well (a minimal scheduling sketch follows this list).
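As a concrete example of the penalty curriculum idea, a hypothetical helper like the one below linearly ramps a penalty coefficient over training; it is illustrative and not part of the FastTD3 codebase:

def penalty_weight(step, total_steps, start=0.1, end=1.0):
    """Linearly ramp a reward-penalty coefficient from a small value to its final value."""
    frac = min(step / total_steps, 1.0)
    return start + frac * (end - start)

# Example: a torque penalty grows from 10% to 100% of its final weight over training.
# reward = task_reward - penalty_weight(step, total_steps) * torque_penalty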
Managing Memory Usage
Since FastTD3 stores the entire replay buffer on GPU, memory can become limiting. To reduce GPU memory usage:
- First decrease buffer_size
- Then reduce batch_size
- Finally reduce num_envs
This order preserves data diversity while managing memory constraints.
Optimizing Training Speed
- Use --compile_mode max-autotune for long training runs (provides ~10% speedup after initial compilation)
- Ensure proper GPU rendering configuration (rendering should take <5 seconds per episode)
- For cloud instances, install NVIDIA user-space libraries if needed:
sudo apt install -y kmod
sudo sh NVIDIA-Linux-x86_64-<your_driver_version>.run -s --no-kernel-module --ui=none --no-questions
Reward Function Considerations
An important finding from the FastTD3 development is that different RL algorithms may require different reward functions to produce desirable behaviors. During MuJoCo Playground experiments, the team discovered that:
- PPO-tuned reward functions produced unnatural gaits when used with FastTD3
- Reward functions specifically tuned for FastTD3 (with stronger penalty terms) produced stable, visually appealing gaits
- These FastTD3-tuned rewards caused PPO to learn overly conservative, slow-walking policies
This suggests that the common practice of using episode return as the primary performance metric may not adequately capture policy quality for real-world deployment. Researchers should consider algorithm-specific reward tuning, especially for sim-to-real transfer.
Beyond TD3: FastSAC Experiments
To test whether the FastTD3 approach generalizes to other algorithms, the team developed FastSAC – a variant of Soft Actor-Critic incorporating the same optimizations. Results showed:
- FastSAC trains significantly faster than vanilla SAC
- It remains slower than FastTD3 overall
- FastSAC shows greater training instability, likely due to difficulties maximizing action entropy in high-dimensional spaces
This suggests that while the FastTD3 recipe provides benefits across algorithms, TD3 remains particularly well-suited to this optimized approach.
Future Directions and Applications
FastTD3’s design makes it orthogonal to many recent RL advances, meaning it can potentially incorporate developments from algorithms like SR-SAC, BBF, BRO, Simba, NaP, TDMPC2, and others. This positions FastTD3 as a strong foundation for future innovation.
Potential application areas include:
Demo-Driven RL: FastTD3’s off-policy nature makes it ideal for learning from human demonstrations, potentially accelerating real-world deployment.
Real-World Fine-Tuning: The algorithm’s sample efficiency supports fine-tuning simulation-trained policies through actual robot interactions.
Iterative Inverse RL: Combined with language models as reward generators, FastTD3 could help address reward design challenges in humanoid control.
Conclusion
FastTD3 demonstrates that sometimes the most significant advances come not from algorithmic innovation but from thoughtful integration and optimization of existing techniques. By combining parallel simulation, large-batch training, distributional RL, and careful hyperparameter tuning, FastTD3 delivers exceptional training speeds without compromising on final performance.
The project’s open-source implementation lowers barriers to entry for researchers and practitioners working on humanoid control problems. With support for multiple simulation platforms, preconfigured hyperparameters, and features like rendering support and checkpointing, FastTD3 provides a solid foundation for both research and application development.
As reinforcement learning continues to evolve, approaches like FastTD3 that prioritize practical efficiency alongside algorithmic performance will play increasingly important roles in bridging the gap between simulation research and real-world robotics deployment.
Frequently Asked Questions
What hardware do I need to run FastTD3?
All experiments were conducted on a single NVIDIA A100 80GB GPU with 16 CPU cores. For less demanding tasks, high-end consumer GPUs like the RTX 4090 may suffice.
How does FastTD3 compare to PQL (Parallel Q-Learning)?
While both leverage parallel simulation, FastTD3 uses a simpler synchronous approach compared to PQL’s asynchronous parallelism. FastTD3 focuses on providing an easier-to-implement, maintain, and extend codebase while delivering similar performance benefits.
Can I use FastTD3 for custom tasks?
Yes, the codebase is designed to be extensible. You can implement custom environment wrappers following the RSL-RL API pattern used for the supported platforms.
Does FastTD3 support vision-based inputs?
The current implementation focuses on state-based inputs. Adding vision support would require architectural modifications beyond the scope of the current codebase.
How often should I evaluate policies during training?
The default setting evaluates every 5,000 environment steps, but this can be adjusted based on task complexity and training duration.
What’s the best way to handle environment-specific hyperparameters?
The codebase includes preconfigured hyperparameters for supported environments. For new environments, start with settings from similar tasks and iterate based on performance.
Resources and References
- Project Website: https://younggyo.me/fast_td3
- arXiv Paper: https://arxiv.org/abs/2505.22642
- GitHub Repository: https://github.com/younggyoseo/fast_td3