Reducing Hallucinations in Multimodal Large Language Models for Video Understanding Through Counterfactual Video Generation
Have you ever wondered why multimodal large language models sometimes give answers that sound logical but don’t match what’s actually happening in a video? For instance, if a video shows an object suddenly vanishing, the model might insist it’s still there, relying more on everyday common sense than on the visual evidence right in front of it. This is known as a “visually ungrounded hallucination.” In this article, we’ll explore an innovative approach that uses specially generated counterfactual videos to help these models better understand video content and minimize those hallucinations.
Picture this: You’re watching a video where a gift box is placed on a shelf, but then it inexplicably disappears. This defies logic, yet that’s what the video depicts. Standard models might overlook the anomaly and respond based on language habits, saying “the gift is still on the shelf.” But what if we could train models to spot these irregularities? That’s the heart of this research: employing a framework called DualityForge to create such videos and use them for training.
Challenges in Video Understanding for Multimodal Large Language Models
Let’s start with the core issue. Multimodal large language models (MLLMs) have made impressive strides in video understanding. However, they have a key weakness: an over-reliance on language priors. This means the models prefer reasoning based on textual patterns they’ve learned, rather than truly grounding their responses in the video’s visual content. This problem intensifies with counterfactual videos—those that violate common sense, like objects defying physical laws—leading to hallucinations.
Why does this happen? Training data often has an imbalance where text is far more abundant and diverse than video, causing models to take shortcuts in video tasks. For example, in a video where a girl places a gift on a shelf and walks away, if edited to show the gift vanishing, the model might still claim “the gift remains,” because language priors suggest gifts don’t disappear without reason.
The research highlights that these hallucinations stem from data imbalance. To fix it, we need counterfactual data that strengthens visual perception. But creating this data is challenging: editing videos is resource-intensive, and generating high-quality question-answer (QA) pairs typically relies on the very models we are trying to improve, creating a circular dependency.
DualityForge: A Controllable Framework for Counterfactual Video Generation
So, how do we address this? The researchers introduce DualityForge, a controllable video editing framework based on diffusion models. It transforms real-world videos into counterfactual scenarios while automatically generating paired QA data for contrastive training.
In simple terms, this framework uses diffusion models to edit videos—for example, making an object disappear midway to simulate a common-sense violation. What’s innovative is that it embeds structured context, like event type and timing, into the editing process. This helps models grasp these counterfactual phenomena, enabling scalable, high-quality QA pair creation.
The framework includes three editing pipelines:
- ❀ Visual Anomaly Pipeline: Pixel-level editing using OpenCV. A multimodal model selects an object to edit, generates a mask, applies VACE-based editing, and verifies the result with majority voting across multiple strong models (see the sketch after this list).
- ❀ Semantic Anomaly Pipeline: A multimodal model proposes common-sense violations, FLUX-Kontext edits key frames, multiple models verify the edits, and VACE interpolates the final video.
- ❀ Common Sense Anomaly Pipeline: Follows the same structure, but focuses specifically on common-sense violations.
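To make the visual anomaly pipeline more concrete, here is a minimal Python sketch of its four steps. Every helper used here (`select_object`, `make_object_mask`, `run_vace_edit`, `query_mllm`) is a hypothetical placeholder for the paper's actual components, since the code has not been released yet.

```python
from collections import Counter

def build_visual_anomaly_sample(video_frames):
    # 1) A multimodal model picks an object whose removal would be visually salient.
    target = select_object(video_frames)                      # hypothetical MLLM call

    # 2) Build a per-frame binary mask for that object (pixel-level, OpenCV-style).
    masks = [make_object_mask(frame, target) for frame in video_frames]  # hypothetical helper

    # 3) Apply VACE-based editing so the object vanishes midway through the clip.
    edited = run_vace_edit(video_frames, masks, vanish_at=len(video_frames) // 2)  # hypothetical

    # 4) Verify the edit with majority voting across several strong multimodal judges.
    question = f"Does the {target} disappear partway through the video?"
    votes = [query_mllm(judge, edited, question)              # hypothetical judge call
             for judge in ("judge_a", "judge_b", "judge_c")]
    accepted = Counter(votes).most_common(1)[0][0] == "yes"

    return edited if accepted else None
```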
This process naturally produces paired data: original video vs. edited video. For the same QA question, the model must provide different answers, forcing it to focus on visual evidence over language priors.
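Concretely, one contrastive training record might look like the following. The dataset's real schema isn't published in this article, so all field names are illustrative, reusing the gift example from earlier.

```python
# Illustrative structure of one paired, contrastive QA sample; field names are
# assumptions, not DualityVidQA's actual schema.
paired_sample = {
    "question": "What happened to the gift after the girl walked away?",
    "real": {
        "video": "videos/gift_original.mp4",
        "answer": "The gift is still on the shelf.",
    },
    "counterfactual": {
        "video": "videos/gift_edited.mp4",
        "answer": "The gift vanished from the shelf partway through the video.",
        # Structured context recorded by DualityForge during editing.
        "context": {"event_type": "object_disappearance", "onset_sec": 4.0},
    },
}
```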
DualityVidQA: A Large-Scale Dataset Designed to Reduce Hallucinations
Building on DualityForge, the researchers created DualityVidQA, a comprehensive video understanding dataset. It includes 104K supervised fine-tuning samples and 40K reinforcement learning samples, totaling 144K training samples from 81K unique videos, with a combined duration of about 100 hours.
The dataset’s standout feature is paired videos and contrastive QA: Each pair consists of a real video and a counterfactual one, with identical questions but differing answers. This trains models to differentiate based on visuals.
For evaluation, they developed DualityVidQA-Test, a rigorous benchmark with 600 manually curated paired samples, divided into four granular counterfactual categories.
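As a rough illustration of how such a paired benchmark can be scored, the sketch below computes per-category accuracy, counting a pair as correct only when both the real and the counterfactual sides are answered correctly. The item schema and this scoring rule are assumptions; the paper's exact protocol may differ.

```python
from collections import defaultdict

def paired_accuracy(test_items, model):
    """test_items: dicts shaped like paired_sample above, plus a 'category' field."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in test_items:
        cat = item["category"]  # one of the four counterfactual categories
        q = item["question"]
        real_ok = model.answer(item["real"]["video"], q) == item["real"]["answer"]
        cf_ok = model.answer(item["counterfactual"]["video"], q) == item["counterfactual"]["answer"]
        # Requiring both sides blocks language-prior shortcuts that only
        # score on the real video.
        correct[cat] += int(real_ok and cf_ok)
        total[cat] += 1
    return {cat: correct[cat] / total[cat] for cat in total}
```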
DNA-Train: A Two-Stage Training Regime
Data alone isn’t enough; a tailored training method is key. The researchers propose Duality-Normalized Advantage Training (DNA-Train), a two-stage approach: supervised fine-tuning (SFT) followed by reinforcement learning (RL).
- ❀ SFT Stage: Trains on a mix of real and counterfactual videos so the model learns to detect anomalies without degrading performance on real videos.
- ❀ RL Stage: Reinforces this with pairwise contrastive tasks. During RL, ℓ1 normalization is applied to the advantages of each real-counterfactual pair, which keeps gradients stable and avoids biasing the model toward real videos.

This leverages the data’s contrastive nature for balanced optimization; a minimal sketch of the pairwise normalization follows.
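Here is a minimal sketch of what the pairwise ℓ1 normalization could look like in a GRPO-style setup, where each real/counterfactual pair contributes a group of sampled rollouts. This is one plausible reading of DNA-Train's normalization step, not the paper's exact recipe.

```python
import numpy as np

def dna_normalize(adv_real, adv_cf, eps=1e-8):
    """adv_real, adv_cf: per-rollout advantages for the real and counterfactual videos of one pair."""
    adv = np.concatenate([adv_real, adv_cf])
    # Scale the pair so its advantages have unit l1 norm: both sides then carry
    # comparable gradient weight, which avoids biasing updates toward real videos.
    return adv / (np.abs(adv).sum() + eps)

# The counterfactual rollouts may have much larger raw advantages, but after
# normalization every pair contributes a bounded, balanced signal.
print(dna_normalize(np.array([0.2, -0.1]), np.array([1.5, -0.9])))
```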
Experiments show that on DualityVidQA-Test, a model based on Qwen2.5-VL-7B achieves a 24.0% relative improvement. Gains extend to other benchmarks like EventHallusion, TempCompass, MVBench, TOMATO, and TVBench, demonstrating strong generalization.
Reviewing Related Work
Before diving deeper, let’s look at the background in this field.
The Role of Language Priors in MLLMs
MLLMs inherit robust language priors from large language models, which can result in outputs that seem reasonable but clash with visual evidence. Training-free methods like contrastive decoding mitigate this by comparing original logits to auxiliary distributions, such as through image masking or instruction perturbation. However, these add inference costs, are hyperparameter-sensitive, and unstable for video tasks.
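For intuition, training-free contrastive decoding typically adjusts the next-token logits with something like the sketch below, where the auxiliary pass comes from masked frames or a perturbed instruction, and alpha is the hyperparameter these methods are sensitive to. This is the generic formulation, not any specific paper's implementation.

```python
import numpy as np

def contrastive_logits(logits_original, logits_auxiliary, alpha=1.0):
    # Tokens that remain likely even without reliable visual evidence get
    # down-weighted relative to the visually grounded pass.
    return (1.0 + alpha) * logits_original - alpha * logits_auxiliary
```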
Training-based approaches build specialized datasets, like altering video captions, but require costly prompting and annotation. In contrast, this framework is automated and scalable, ideal for videos.
Datasets for Video Understanding
Existing datasets cover action recognition (e.g., Kinetics, ActivityNet) and captioning (e.g., MSR-VTT, WebVid-10M). But annotation is expensive due to spatiotemporal complexity, which limits scale. Recent efforts use vision-language models to synthesize data, like LLaVA-Hound prompting GPT-4 for QA pairs. Yet these rely on real videos and therefore miss rare events and counterfactual scenarios.
Visual Reinforcement Learning
Recent work extends RL to multimodal settings, such as Vision-RL using multimodal chain-of-thought corpora and GRPO. Most focus on textual traces rather than visual evidence, limiting robustness to counterfactual content. This method emphasizes that video understanding requires distinguishing visually plausible from counterfactual cues.
Problem Formulation
The central issue is MLLMs favoring language priors over visual evidence, causing hallucinations. The goal is to create a large-scale video QA dataset with visually salient counterfactual events.
Each counterfactual video depicts a visually salient anomaly, like an object disappearing, and is paired with its original version so that correct answers must be grounded in the visuals.
Detailed Video Editing Pipelines
As illustrated in Figure 2 (video editing pipelines overview), the three pipelines work as follows:
- Visual Anomaly: Pixel-level editing. Select an object, generate a mask, edit, verify.
- Semantic Anomaly: Propose violations, edit key frames, verify, interpolate the final video.
- Common Sense Anomaly: The same structure, focused on common-sense violations.
These pipelines ensure precise edits and embed context for QA generation.
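Because event type and timing are recorded during editing, QA pairs can be templated directly from that context. The sketch below shows the idea for the object-disappearance case; the template wording and field names are assumptions, not the framework's actual implementation.

```python
def qa_from_context(context):
    """Turn the structured context recorded during editing into a contrastive QA pair."""
    event, obj, onset = context["event_type"], context["object"], context["onset_sec"]
    if event == "object_disappearance":
        return {
            "question": f"What happens to the {obj} during the video?",
            "answer_real": f"The {obj} remains visible for the whole video.",
            "answer_counterfactual": f"The {obj} disappears around {onset:.0f} seconds in.",
        }
    raise NotImplementedError(f"No QA template for event type: {event}")

print(qa_from_context({"event_type": "object_disappearance",
                       "object": "gift box", "onset_sec": 4.0}))
```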
Analyzing Experimental Results
The experiments validate the method’s effectiveness:
- ❀ A 24% reduction in hallucinations on DualityVidQA-Test.
- ❀ Improvements on EventHallusion.
- ❀ Gains on general benchmarks like TempCompass (temporal understanding), MVBench (multiple-choice), TOMATO, and TVBench.
This shows that generation can enhance understanding.
How to Implement This Approach
If you’re a researcher looking to apply this framework, here’s a step-by-step guide (an end-to-end sketch follows the list):
1. Set Up DualityForge: Install the diffusion-based video editing tools it builds on (e.g., VACE and FLUX-Kontext).
2. Generate Videos: Run the editing pipelines on real videos, embedding structured context such as event type and timing.
3. Create QA Pairs: Automatically generate contrastive QA from the embedded context.
4. Train the Model: Apply DNA-Train: SFT on mixed real/counterfactual data, then RL with ℓ1-normalized advantages.
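Putting the four steps together, an end-to-end run might look like the sketch below. Every function name here is a hypothetical placeholder standing in for components of the not-yet-released code.

```python
def build_and_train(real_videos, base_model):
    # Steps 1-3: edit real videos into counterfactual versions and derive
    # contrastive QA from the structured context recorded during editing.
    pairs = []
    for video in real_videos:
        edited, context = duality_forge_edit(video)      # hypothetical editing call
        if edited is not None:                           # edit passed verification
            pairs.append({"real": video, "counterfactual": edited,
                          "qa": qa_from_context(context)})

    # Step 4: DNA-Train. First SFT on a mix of real and counterfactual samples,
    # then RL with pairwise l1-normalized advantages.
    model = run_sft(base_model, pairs)                   # hypothetical trainer
    model = run_rl_dna(model, pairs)                     # hypothetical trainer
    return model
```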
The code and dataset will be open-sourced soon.
FAQ: Common Questions Answered
What are hallucinations in multimodal large language models?
Hallucinations are outputs that aren’t grounded in the input. In videos, they often occur when models ignore the visual evidence and fall back on language-based common sense.
What are counterfactual videos?
These are videos that violate common sense, such as objects disappearing or defying physics. They’re created by editing real videos.
How does DualityForge work?
It uses diffusion models to edit videos, embedding structured context like event types to generate high-quality counterfactual scenarios and QA pairs.
How large is the DualityVidQA dataset?
It has 144K samples from 81K videos, totaling around 100 hours.
What makes DNA-Train unique?
It’s a two-stage process: SFT with mixed real/counterfactual data, followed by RL using pairwise ℓ1-normalized advantages for balanced learning.
Does this method generalize well?
Yes, it improves performance on multiple benchmarks beyond just hallucinations.
How do you evaluate hallucinations?
Using DualityVidQA-Test, a benchmark of 600 manually curated paired samples across four counterfactual categories.
Why use contrastive training?
It forces models to focus on visual differences rather than language priors.
What are the video editing pipelines?
There are three: visual, semantic, and common sense anomalies, each with steps like object selection, editing, and verification.
How much improvement does it offer?
A 24% relative boost on the test set, with notable gains on other benchmarks.
Deeper Insights: Why This Method Succeeds
Let’s consider why this approach works so well. Traditional training exposes models to abundant text patterns but limited video data, leading to biases. Introducing counterfactual videos acts like a stress test, compelling the model to pay attention to the visuals.
For example, consider a pair: the original video shows a girl placing a gift and leaving, while the edited version shows the gift vanishing. For the question “What happened to the gift?”, the correct answer for the original video is that it’s still there, and for the edited one that it disappeared. The model has to differentiate the two based on what it sees.
In RL, ℓ1 normalization ensures each pair contributes equally, preventing oversight of counterfactuals.
Potential Applications
This method doesn’t just curb hallucinations; it boosts overall video understanding and can be applied to video QA, action recognition, and more.
For instance, it enhances temporal reasoning on TempCompass and multiple-choice accuracy on MVBench.
Challenges and Solutions
One challenge is generating high-quality counterfactual data at scale; the solution is to embed structured context during editing so QA pairs can be generated automatically. Another is training instability on paired data; DNA-Train’s pairwise ℓ1 normalization addresses it.
Conclusion
By leveraging DualityForge and DualityVidQA, training MLLMs with counterfactual videos significantly reduces hallucinations and improves understanding. DNA-Train ensures effective optimization. This underscores how generation can refine perception.
If you’re into video AI, this is worth exploring. Once the code is open-sourced, you can experiment yourself.