InteractVLM: 3D Interaction Reasoning from 2D Foundational Models
Introduction
In the fields of computer vision and artificial intelligence, accurately inferring 3D interaction information from 2D images has long been a challenging problem. InteractVLM emerges as a promising solution to this issue. It can estimate 3D contact points on both human bodies and objects from single in-the-wild images, enabling accurate joint 3D reconstruction of humans and objects. This article will provide a detailed overview of InteractVLM, including its core concepts, model architecture, installation and usage methods, training and evaluation processes, and more.
An Overview of InteractVLM
Research Team and Background
InteractVLM is the result of collaborative efforts by researchers from several renowned institutions: Sai Kumar Dwivedi, Shashank Tripathi, Omid Taheri, and Michael J. Black from the Max Planck Institute for Intelligent Systems; Dimitrije Antić and Dimitrios Tzionas from the University of Amsterdam; and Cordelia Schmid from Inria. The work is scheduled to be presented at CVPR 2025.
Core Functions and Innovations
The core function of InteractVLM is to estimate 3D contact points on both human bodies and objects from single in-the-wild images, thereby enabling accurate joint 3D reconstruction of humans and objects. It introduces an innovative task called Semantic Human Contact, which goes beyond traditional Binary Human Contact by inferring object-specific contact points on the human body.
By leveraging the rich visual knowledge of large Vision-Language Models, InteractVLM addresses the limitation of limited availability of ground-truth 3D interaction data for training, allowing it to better generalize to various real-world interaction scenarios.
Application Scenarios of InteractVLM
Joint Human-Object Reconstruction
InteractVLM excels in joint human-object reconstruction. Given a single in-the-wild image, it accurately estimates the 3D contact points between the human and the object, which in turn enables accurate joint 3D reconstruction of both. The example images in the project README show high-quality joint reconstructions produced from ordinary input photographs.

Semantic Human Contact
Semantic Human Contact is an important task introduced by InteractVLM. Traditional Binary Human Contact can only determine whether a human body is in contact with an object, while Semantic Human Contact can infer the specific areas of the human body that make contact with a particular object. For instance, in an image of a person holding a bottle, Semantic Human Contact can accurately identify which parts of the human body are in contact with the bottle. This functionality has potential applications in various fields such as human-computer interaction and virtual reality.
Model Zoo Introduction
Model List
InteractVLM offers several pre-trained models, each designed for specific purposes and trained on particular datasets. Here are some of the models in the model zoo:
# | Model | Type | Training Datasets | Comment | Status |
---|---|---|---|---|---|
1 | interactvlm-3d-hcontact-damon | hcontact | DAMON | Winner of RHOBIN Human Contact Challenge (CVPR 2025) | Available |
2 | interactvlm-3d-hcontact-wScene-damon-lemon-rich | hcontact | DAMON + LEMON-HU + RICH | Best in-the-wild 3D Human Contact Estimation (with foot-ground contact) | Available |
3 | interactvlm-3d-oafford-lemon-piad | oafford | LEMON-OBJ + PIAD | Estimates Object Affordance | Available |
4 | interactvlm-2d-hcontact | h2dcontact | Extended LISA by projecting DAMON contact on images | 2D Human Contact Segmentation via Referring Segmentation | Available |
5 | interactvlm-3d-hcontact-ocontact | hcontact + ocontact | DAMON + LEMON-HU + RICH + LEMON-OBJ + PIAD + PICO + HOI-VQA | Single Model for Joint 3D Human-Object Contact Estimation | Available |
Model Features and Uses
- interactvlm-3d-hcontact-damon: Winner of the RHOBIN Human Contact Challenge (CVPR 2025). Primarily used for 3D human contact estimation; trained on the DAMON dataset.
- interactvlm-3d-hcontact-wScene-damon-lemon-rich: Suited to in-the-wild 3D human contact estimation, including foot contact with the ground. Trained on the DAMON, LEMON-HU, and RICH datasets.
- interactvlm-3d-oafford-lemon-piad: Estimates object affordances; trained on the LEMON-OBJ and PIAD datasets.
- interactvlm-2d-hcontact: Performs 2D human contact segmentation via referring segmentation. Its training data extends LISA by projecting DAMON contacts onto images.
- interactvlm-3d-hcontact-ocontact: A single model for joint 3D human-object contact estimation, trained on DAMON, LEMON-HU, RICH, LEMON-OBJ, PIAD, PICO, and HOI-VQA.
Installation and Environment Configuration
Installing Micromamba
If Micromamba is not already installed, you can install it using the following commands:
curl -Ls https://micro.mamba.pm/api/download/linux-64/latest | tar -xvj bin/micromamba
sudo mv bin/micromamba /usr/local/bin/
Creating and Activating the Environment
Use Micromamba to create an environment named interactvlm and activate it:
micromamba create -n interactvlm python=3.10 -c conda-forge
micromamba activate interactvlm
Installing PyTorch
Install PyTorch with CUDA 12.1:
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121
Cloning the Repository
Clone the InteractVLM code repository and navigate into the directory:
git clone https://github.com/saidwivedi/InteractVLM.git
cd InteractVLM
Installing Dependencies
Install the required dependencies for the project:
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
DS_BUILD_FUSED_ADAM=1 pip install deepspeed==0.15.1
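As an optional sanity check (our suggestion, not a documented step), a quick import confirms that the compiled extensions installed correctly:
# Both packages should import without errors; DeepSpeed should report 0.15.1
python -c "import flash_attn, deepspeed; print(deepspeed.__version__)"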
Environment Setup
Before running demo, training, or evaluation scripts, ensure CUDA is properly configured:
export CUDA_HOME=/usr/local/cuda # or your CUDA installation path
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
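As another optional check (again not part of the official instructions), verify that the CUDA toolkit and the CUDA-enabled PyTorch build are both visible before running any scripts:
# Should report the toolkit installed under $CUDA_HOME
nvcc --version
# Should print 2.1.0+cu121 (or your installed version) and True
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"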
Code Structure
The InteractVLM codebase is organized into clearly separated modules, each with its own responsibility:
InteractVLM/
├── 📁 model/ # Core model implementation
├── 📁 datasets/ # Data loading and processing
├── 📁 utils/ # Utility functions
├── 📁 preprocess_data/ # Data preprocessing scripts
├── 📁 scripts/ # Execution scripts
├── 📁 data/ # Dataset folders, Body models, Demo samples
├── 📁 trained_models/ # Trained models
├── 📄 train.py # Main training script
├── 📄 evaluate.py # Main evaluation script
├── 📄 run_demo.py # Run Demo
└── 📄 requirements.txt # Python dependencies
- model/: Core model implementation of InteractVLM.
- datasets/: Data loading and processing for training, testing, and demos.
- utils/: Commonly used helper functions.
- preprocess_data/: Scripts for preprocessing raw data.
- scripts/: Scripts for training, evaluation, and demos.
- data/: Datasets, human body models, and demo samples.
- trained_models/: Directory for storing trained models.
- train.py: Main training script for the InteractVLM model.
- evaluate.py: Main evaluation script for assessing the performance of trained models.
- run_demo.py: Script for running demos on your own images.
- requirements.txt: Python dependencies required by the project.
Data and Model Downloads
Essential Data Files
To run InteractVLM, you need to download essential data files and pre-trained models. The project provides a convenient script, fetch_data.sh, to handle this process.
Using the Download Script
- First, register at https://interactvlm.is.tue.mpg.de/login.php to obtain access credentials.
- Then, run the download script:
bash fetch_data.sh
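Besides the default invocation above, fetch_data.sh also accepts an argument selecting what to download; the two arguments referenced later in this article are shown below (any further options are best checked in the script itself):
# Scene-aware human contact model (used in the demo section)
bash fetch_data.sh hcontact-wScene
# Preprocessed DAMON dataset (used in the training section)
bash fetch_data.sh damon-dataset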
Running Demonstrations
Demo Commands
You can run demos on your own images; both human and object interaction estimation modes are supported:
# For 3D human contact estimation
bash scripts/run_demo.sh hcontact data/demo_samples folder
# For 2D human contact segmentation
bash scripts/run_demo.sh h2dcontact data/demo_samples file
# For 3D object affordance estimation
bash scripts/run_demo.sh oafford data/demo_samples folder
Demo Requirements
- Human Contact Demo: The canonical human mesh and rendered input are already provided, so simply run the script to estimate 3D contact points on the human body. The latest released model also supports human contact estimation with scenes (e.g., the ground or undefined objects); download it using the hcontact-wScene argument in fetch_data.sh and pass the same argument when running the demo script (see the example after this list). The object name in the image filename serves as the query object for contact estimation (e.g., “bottle” or “chair”). To estimate contact with the scene or ground, use “scene” as the query or prefix the filename with “scene”.
- 2D Human Contact Demo: Performs 2D contact segmentation directly on the input image using referring segmentation, extending LISA’s capabilities to human-object contact detection in 2D space. The object name in the image filename again serves as the query object.
- Object Affordance Demo: The code expects an object mesh as input; the script automatically renders multiple views of the object for affordance prediction.
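As a minimal sketch of the scene-aware workflow mentioned in the first bullet, assuming the hcontact-wScene argument is passed in the same position as hcontact in the demo commands shown earlier:
# Download the scene-aware human contact model
bash fetch_data.sh hcontact-wScene
# Run the human contact demo with the same argument (exact argument handling may differ; check scripts/run_demo.sh)
bash scripts/run_demo.sh hcontact-wScene data/demo_samples folder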
Input Modes
The demo supports two input structures (an illustrative layout follows this list):
- Folder-based mode (default): each sample lives in its own folder. Required for 3D human contact and object affordance estimation.
- File-based mode: all samples are files in a single folder. Supported for:
  - 2D Human Contact (h2dcontact): direct segmentation on input images.
  - 3D Human Contact (hcontact): estimating human contact for video frames.
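To make the two modes concrete, here is a purely illustrative layout; the folder and file names are invented, and only the filename convention for the query object follows the notes above:
data/demo_samples/
├── person_bottle/            # folder-based mode: one folder per sample
│   └── person_bottle.jpg     # "bottle" in the filename is the contact query
├── scene_sitting/
│   └── scene_sitting.jpg     # "scene" prefix queries contact with the ground/scene
└── video_frames/             # file-based mode: all samples as files in one folder
    ├── person_chair_001.jpg
    └── person_chair_002.jpg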
Sample Data
The data/demo_samples/ directory contains ready-to-use samples for testing both human contact and object affordance estimation. Running the demo should produce results similar to those shown in the README file.

Training and Evaluation
Data Generation
To generate the data needed for training, run the following script. Currently, only the preprocessed dataset for DAMON is provided; datasets for LEMON, PIAD, and PICO will be released in the future.
# Generate preprocessed data
bash scripts/run_datagen.sh
Training
To train the 3D human contact estimation model on the DAMON dataset, first download the preprocessed dataset with the command below and place it under the data/damon directory, then run the training script:
# Download preprocessed DAMON dataset
bash fetch_data.sh damon-dataset
# Train human contact with DAMON dataset
bash scripts/run_train.sh hcontact-damon
Evaluation
Model Weight Preparation
If you have trained a new model, prepare the weights for evaluation:
# Prepare weights for model 0 (adjust number as needed)
bash scripts/run_prepare_weights.sh 0
Running Evaluation
Run evaluation on pre-trained models:
# Evaluate the model on either DAMON or PIAD. Adjust the configuration accordingly
bash scripts/run_eval.sh
Code Release Status
Released Components
- 3D Human Contact Estimation: training, evaluation, and demo code are available.
- 3D Object Contact/Affordance Estimation: training, evaluation, and demo code are available.
Pending Releases
- Object Shape Retrieval from a Single Image: code release pending.
- Optimization Pipeline for Joint Reconstruction: code release pending.
Acknowledgments and Citation
Acknowledgments
The research team thanks Alpár Cseke for assistance with evaluating joint human-object reconstruction, Tsvetelina Alexiadis and Taylor Obersat for MTurk evaluation, Yao Feng, Peter Kulits, and Markos Diomataris for their valuable feedback, and Benjamin Pellkofer for IT support. SKD is supported by the International Max Planck Research School for Intelligent Systems (IMPRS-IS). The UvA part of the team is supported by an ERC Starting Grant (STRIPES, 101165317, PI: D. Tzionas).
Citation
If you use InteractVLM code in your research, please cite the following paper:
@inproceedings{dwivedi_interactvlm_2025,
title = {{InteractVLM}: {3D} Interaction Reasoning from {2D} Foundational Models},
author = {Dwivedi, Sai Kumar and Antić, Dimitrije and Tripathi, Shashank and Taheri, Omid and Schmid, Cordelia and Black, Michael J. and Tzionas, Dimitrios},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2025},
}
License and Contact Information
License
This code is available for non-commercial scientific research purposes as defined in the LICENSE file. By downloading and using this code, you agree to the terms in the LICENSE. Third-party datasets and software are subject to their respective licenses.
Contact Information
- For code-related questions, please contact sai.dwivedi@tuebingen.mpg.de.
- For commercial licensing (and all related questions for business applications), please contact ps-licensing@tue.mpg.de.
Conclusion
InteractVLM provides a powerful solution for 3D interaction reasoning from 2D images. By introducing the Semantic Human Contact task and leveraging the knowledge of vision-language models, it achieves impressive results in joint human-object reconstruction and contact estimation. This article has covered its core features, model zoo, installation, usage, training, and evaluation to help readers understand and apply the technology.
InteractVLM represents a significant step toward bridging the gap between 2D images and 3D interaction understanding. Its ability to generalize to real-world scenarios, its comprehensive set of pre-trained models, and its detailed documentation make it a valuable tool for researchers and developers in computer vision, human-computer interaction, and related fields; by following the installation and setup instructions above, users can quickly run demos, train custom models, and evaluate performance on their own datasets. As the pending components (object shape retrieval and the optimization pipeline for joint reconstruction) are released, the system should become even more versatile. In summary, InteractVLM combines innovation, practicality, and accessibility, and it is likely to play an important role in future advances in 3D interaction reasoning.