
Action100M: A Deep Dive into a Million-Scale Video Action Understanding Dataset

In the field of artificial intelligence, particularly computer vision and video understanding, high-quality, large-scale datasets are the critical foundation for driving technological progress. Today, we take an in-depth look at a significant resource released by Meta FAIR in collaboration with several top academic institutions: Action100M, a project aimed at advancing fine-grained video action understanding through a massive dataset. This article walks through the dataset’s composition and core features, and then shows how to use it in practice.

Dataset Overview: Scale and Source

Action100M, as the name suggests, targets a scale of one million annotated video segments. Currently, the research team has released a preview version on Hugging Face (facebook/action100m-preview). This preview contains 120,000 rows of complete data, representing 10% of the full dataset, allowing the community to experiment and conduct research.

The data is sourced from publicly available instructional and demonstration videos on YouTube, covering an exceptionally wide range of topics from life hacks, crafts, cooking, and gardening to car repair and project management. Each data sample corresponds to one complete YouTube video, identified by its unique video_uid (an 11-character string). The core value of this dataset lies not only in its scale but also in its unique and finely detailed hierarchical annotation structure—the Tree-of-Captions.

Core Data Structure: Three Tiers of Information and Hierarchical Annotation

When you load a data sample from Hugging Face, you receive a dictionary containing three core fields. Understanding these fields is the key to using Action100M.

1. Video Identifier and Metadata

  • video_uid (string): This is the source YouTube video’s ID, the unique credential for tracing the original video.
  • metadata (dictionary): Contains video-level information, typically including:
    • title: The video title.
    • description: The video description.
    • ASR transcript: The automatically generated speech-to-text transcript (if available).

  This metadata provides rich textual information for understanding the video’s global context.

2. The Essence: Hierarchical “Node” Annotations

  • nodes (list[dict]): This is the core and most innovative part of the dataset. It is a list in which each element represents a temporal segment within the video. The list length is dynamic: depending on the video’s content and complexity, the number of nodes per video ranges from 9 to over 5,570.

    Each node is a dictionary containing a comprehensive, multi-angle description of that time segment. Let’s break down each field within a node:

    • Spatio-Temporal Localization:

      • start, end (float): Precise timestamps in seconds, defining the segment’s start and end points within the full video.
      • node_id (string): A globally unique identifier for the segment.
      • parent_id (string or null): Encodes the hierarchical structure: each segment points to a parent segment (a coarser-grained segment), while the root node (representing the entire video) has a parent_id of null.
      • level (integer): Indicates the depth in the hierarchy. A smaller level value indicates a coarser, longer segment; a larger level value indicates a finer, shorter segment. This forms a tree-like decomposition view from the entire video down to specific action steps.
    • Multi-Model Generated Descriptions:

      • plm_caption (string or null): A descriptive caption for this segment generated by the PLM-3B model.
      • plm_action (string or null): A short action label produced by the PLM-3B model.
      • llama3_caption (string or null): For the finest-grained “leaf nodes” only: an image description of the segment’s middle frame, generated by the Llama-3.2-Vision-11B model.
    • Core GPT Annotations:

      • gpt (dictionary or null): This is the primary annotation source for Action100M, available for segments that are not too short. It is further subdivided into two dimensions:
        • Summary (summary):
          • brief: A one-sentence concise summary.
          • detailed: A longer, paragraph-length detailed summary.
        • Action (action):
          • brief: A short verb phrase naming the current step (e.g., “cutting the cucumber”).
          • detailed: A detailed imperative-style instruction describing how the action is performed (e.g., “Use a sharp knife to slice the cucumber into thin rounds”).
          • actor: Who or what performs the action (a noun phrase, e.g., “the chef”).

This “trunk-branch-leaf” annotation system enables Action100M to answer not just “what is happening in the video,” but also “who is performing what specific action step, in what manner, during seconds X to Y of the video.” This significantly advances research in fine-grained video understanding and step-by-step reasoning models.
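To make this schema concrete, here is a minimal sketch of what a single node dictionary might look like. The field names follow the descriptions above, but every value shown is hypothetical and for illustration only.

# A hypothetical node from a cooking video; the field names follow the schema
# described above, while all values are invented for illustration.
example_node = {
    "node_id": "n_0042",           # hypothetical identifier format
    "parent_id": "n_0007",         # null/None only for the root node
    "level": 2,                    # larger level = finer, shorter segment
    "start": 45.2,                 # seconds
    "end": 68.5,                   # seconds
    "plm_caption": "A person cuts a head of cauliflower on a wooden board.",
    "plm_action": "cutting cauliflower",
    "llama3_caption": None,        # populated only for leaf nodes
    "gpt": {                       # may be None for very short segments
        "summary": {
            "brief": "The cook cuts the cauliflower into florets.",
            "detailed": "The cook places the cauliflower on a cutting board and separates it into evenly sized florets before moving on to the batter.",
        },
        "action": {
            "brief": "cut the cauliflower into florets",
            "detailed": "Use a sharp knife to separate the cauliflower head into bite-sized florets.",
            "actor": "the cook",
        },
    },
}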

How to Get Started Quickly

Accessing and using Action100M is very straightforward, thanks to excellent support from the Hugging Face datasets library. Here is the recommended way to load the preview dataset:

from datasets import load_dataset

# Load using streaming mode, suitable for browsing and iterating over large datasets
dataset = load_dataset(
    "parquet",
    data_files="hf://datasets/facebook/Action100M-preview/data/*.parquet",
    streaming=True, # Enable streaming, don't download all data at once
)
# Get an iterator for the training set (preview has only one split)
data_iterator = iter(dataset["train"])

# Get the first sample
first_sample = next(data_iterator)

# Explore the sample structure
print(f"Video ID: {first_sample['video_uid']}")
print(f"Video Title: {first_sample['metadata'].get('title')}")
print(f"This video has {len(first_sample['nodes'])} annotated nodes")
# View the GPT action brief for the first node
if first_sample['nodes'][0].get('gpt'):
    print(f"First node action: {first_sample['nodes'][0]['gpt']['action']['brief']}")

Using the code above, you can begin iterating through the dataset, accessing the hierarchical annotation information for each video. The project’s official GitHub repository also provides examples for loading from local Parquet files and visualizing annotations, helping you understand the data more intuitively.
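If you prefer working offline, the following is a minimal sketch of that local-loading path, assuming you have already downloaded the preview’s Parquet shards into a local directory (the directory name below is hypothetical); the official repository documents the exact download and visualization steps.

from datasets import load_dataset

# Assumes the Parquet shards were downloaded to ./action100m_parquet/
# (a hypothetical local path); adjust it to wherever you stored the files.
local_dataset = load_dataset(
    "parquet",
    data_files="./action100m_parquet/*.parquet",
    split="train",
)

print(local_dataset)                    # number of rows and column names
print(local_dataset[0]["video_uid"])    # first sample's YouTube video ID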

From Data to Insight: Example Interpretations

To give you a more concrete sense of the annotation content, let’s walk through two hypothetical examples of the kind of content the dataset contains (based on the official descriptions):

  1. A Cooking Video: A video for making “Cauliflower Buffalo Wings.” Its tree annotation might be organized like this:

    • Level 0 (Root Node): The entire recipe preparation process.
    • Level 1: Major phases, like “Preparing the Cauliflower,” “Making the Batter,” “Baking,” “Making the Sauce.”
    • Level 2 (Leaf Node): A specific step. For example, under the “Preparing the Cauliflower” phase, there might be a node with start=45.2, end=68.5, where gpt["action"]["brief"] is “Cut the cauliflower into florets,” and actor is “the cook.”
  2. A Car Repair Video: A video on “Checking Battery & Alternator Issues.”

    • One leaf node might be located at start=120.5, end=149.8. Its gpt["action"]["detailed"] might describe: “Set the multimeter to the voltage setting. Connect the red probe to the battery’s positive terminal and the black probe to the negative terminal. Read the resting voltage value.” The actor would be “the mechanic.”

The example GIFs provided by the developers also demonstrate this correspondence: video clips are precisely synchronized with the action described by gpt["action"]["brief"] (e.g., “add flour,” “mix the ingredients”), validating the temporal accuracy and relevance of the annotations.
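To explore this structure on real data yourself, the sketch below reuses first_sample from the loading code above and prints an indented outline of a video’s action tree using parent_id and level. It is only an illustration of the traversal: gpt can be missing on very short segments, so it falls back to plm_action.

from collections import defaultdict

def print_action_tree(sample, max_level=2):
    """Print an indented outline of a video's action tree using parent_id links."""
    children = defaultdict(list)
    roots = []
    for node in sample["nodes"]:
        if node["parent_id"] is None:
            roots.append(node)                       # the root covers the whole video
        else:
            children[node["parent_id"]].append(node)

    def visit(node):
        if node["level"] > max_level:                # stop at the requested granularity
            return
        gpt = node.get("gpt")
        label = (gpt["action"]["brief"] if gpt else node.get("plm_action")) or "(no caption)"
        print(f"{'  ' * node['level']}[{node['start']:.1f}s-{node['end']:.1f}s] {label}")
        for child in sorted(children[node["node_id"]], key=lambda c: c["start"]):
            visit(child)

    for root in sorted(roots, key=lambda r: r["start"]):
        visit(root)

print_action_tree(first_sample)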


Frequently Asked Questions (FAQ)

Q: Is the Action100M dataset free for commercial use?
A: No. According to the official information, Action100M is released under the “FAIR Noncommercial Research License.” This means it is for non-commercial research purposes only. Any commercial use requires additional licensing. Before using the dataset, you must read and comply with the terms in its license file.

Q: I’m only researching video action recognition. Do I need to care about all the annotation fields?
A: Not necessarily; you can choose flexibly based on your research goals. For example:

  • If researching action localization and classification, focus on start/end, gpt["action"]["brief"], and actor (a short extraction sketch follows this list).
  • If researching video segment summarization, look at the fields under gpt["summary"].
  • If researching the quality of descriptions generated by multimodal models, you could compare plm_caption, llama3_caption, and gpt["summary"].
  • If researching video hierarchical structure parsing, you must utilize level and parent_id to build the tree relationships.
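
As a concrete example of the first point, the sketch below flattens a sample’s nodes into (start, end, action, actor) tuples of the kind a temporal action localization model might train on. It reuses first_sample from the loading code above and simply skips segments without a gpt annotation.

def localization_examples(sample):
    """Collect (start, end, action label, actor) tuples for temporal action localization."""
    examples = []
    for node in sample["nodes"]:
        gpt = node.get("gpt")
        if not gpt:                  # very short segments carry no GPT annotation
            continue
        examples.append(
            (node["start"], node["end"], gpt["action"]["brief"], gpt["action"]["actor"])
        )
    return examples

for start, end, action, actor in localization_examples(first_sample)[:5]:
    print(f"{start:7.1f}s - {end:7.1f}s | {actor}: {action}")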

Q: Why is the number of “nodes” per video so variable (9 to 5,570+)?
A: This accurately reflects the diverse complexity of real-world videos. A simple “unboxing video” might have only a few steps, while a complete “garden shed assembly tutorial” or “software programming course” might be decomposed into thousands of fine-grained steps. This is precisely where the value of a large-scale dataset lies—it can cover real-world scenarios of varying complexity.

Q: How can I effectively utilize this hierarchical structure in my own model?
A: You can treat the hierarchy as a form of strong supervision signal. For example:

  • When training a model, you could design a loss function that brings nodes sharing the same parent_id closer together in the feature space (a minimal sibling-grouping sketch follows this list).
  • For step prediction, you could use predictions from coarse-grained (high level) nodes to constrain or initialize predictions for fine-grained (low level) nodes.
  • Transform the tree structure into input for a Graph Neural Network (GNN) to explicitly model the temporal and hierarchical relationships between segments.
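
As a minimal sketch of the first idea, the snippet below groups segments that share a parent_id so that sibling nodes can be treated as positive pairs in a contrastive-style objective. The pairing strategy itself is an assumption for illustration, not something the dataset prescribes.

from collections import defaultdict
from itertools import combinations

def sibling_pairs(sample):
    """Group nodes sharing a parent_id and emit (node_id, node_id) sibling pairs."""
    groups = defaultdict(list)
    for node in sample["nodes"]:
        if node["parent_id"] is not None:        # skip the root, which has no parent
            groups[node["parent_id"]].append(node["node_id"])
    pairs = []
    for siblings in groups.values():
        pairs.extend(combinations(siblings, 2))  # every pair of siblings is a positive
    return pairs

pairs = sibling_pairs(first_sample)
print(f"{len(pairs)} sibling pairs that could serve as positives in a contrastive loss")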

Academic Impact and Citation

Action100M, as an emerging large-scale benchmark dataset, is expected to have a significant impact on the video understanding community. By providing massive annotations with fine temporal granularity, rich semantic descriptions, and a clear hierarchical structure, it lays the groundwork for training more powerful and intelligent video understanding models.

If you use the Action100M dataset in your research, please cite its preprint as follows:

@article{chen2026action100m,
  title={Action100M: A Large-scale Video Action Dataset},
  author={Chen, Delong and Kasarla, Tejaswi and Bang, Yejin and Shukor, Mustafa and Chung, Willy and Yu, Jade and Bolourchi, Allen and Moutakanni, Théo and Fung, Pascale},
  journal={arXiv preprint arXiv:2601.xxxxx},
  year={2026}
}

Summary

Action100M is more than just a “video-label” paired dataset; it is a structured video knowledge base. It deconstructs lengthy, continuous video streams into an organized, queryable tree of action steps, each step accompanied by diverse descriptions generated by advanced AI models and validated by human processes. For researchers and developers working on video action recognition, temporal action localization, step-by-step reasoning, video summarization, instruction following, and multimodal large language models, Action100M provides an unprecedented experimental platform with both depth and breadth.

By embracing such open datasets, we can collectively advance machine understanding of the visual world, moving from “seeing what is there” to “understanding how it is happening step-by-step,” ultimately building intelligent systems capable of collaborating with humans on complex tasks.


Summary: Action100M is a large-scale video action dataset released by Meta FAIR and other institutions, with a preview containing 120,000 data points. Its core innovation lies in providing time-based hierarchical “Tree-of-Captions” annotations. Each video segment includes precise timestamps, hierarchical relationships, and action summaries, detailed instructions, and actor information generated by models like GPT, aiming to advance research in fine-grained video step understanding and reasoning. The dataset can be loaded and used directly via the Hugging Face platform.
