Introduction

Humor is a hallmark of human intelligence. It reflects our ability to grasp context, abstract meaning, and social nuance. Yet for artificial intelligence, humor remains a steep challenge.

Large Multimodal Models (LMMs) have advanced quickly in recent years, integrating text and visual inputs to solve increasingly complex tasks. But can these systems truly understand humor in online comics?

To explore this question, researchers developed PixelHumor, a benchmark dataset of 2,800 multi-panel comics. The goal is to test whether LMMs can interpret humor, classify styles, and reconstruct narrative sequences. Results show that while these models excel in detecting the presence of humor, they often fail to capture the deeper reasoning behind a joke.


Why Humor Matters in AI

Humor plays an important role in communication, creativity, and cognitive growth. Humans use humor to bond socially, relieve tension, and express abstract ideas.

For AI, being able to recognize and explain humor represents more than just entertainment. It signals progress toward social intelligence—the capacity to understand human interaction at a deeper level.

While language models such as GPT-4o perform strongly in text-based reasoning, they struggle with humor. Comics combine images, words, timing, and cultural context. Misinterpretations are common—for example, a rocket narrowly missing Santa Claus might be read by a model as Santa hijacking the rocket.

This gap highlights why a benchmark like PixelHumor is needed.


The PixelHumor Dataset

Data Sources

PixelHumor includes 2,800 comics drawn from seven different creators, each contributing unique humor styles:

  • Cyanide and Happiness – Known for dark, provocative humor.
  • Peanuts – Famous for gentle, character-driven stories.
  • Garfield – Centered on everyday humor and exaggerated personalities.
  • XKCD – Focused on science, technology, and satire.
  • PhD Comics – Satirical takes on academic life.
  • They Can Talk – Featuring anthropomorphized animals.
  • Saturday Morning Breakfast Cereal (SMBC) – Absurdist and philosophical humor.

This mix ensures diversity across humor genres, cultural references, and visual formats.


Humor Styles

PixelHumor classifies jokes into eight categories:

  • Comparison – Highlights similarities or contrasts between ideas.
  • Personification – Gives human qualities to animals or objects.
  • Exaggeration – Uses absurd overstatement for comedic effect.
  • Pun – Relies on wordplay and linguistic ambiguity.
  • Sarcasm – Expresses the opposite of the intended meaning.
  • Silliness – Includes nonsense or absurd scenarios.
  • Surprise – Delivers unexpected twists.
  • Dark – Plays on taboo or uncomfortable topics.

This taxonomy provides a structured way to evaluate how well LMMs recognize different humor mechanisms.
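
For readers who want to score model predictions against this taxonomy programmatically, a minimal Python sketch might encode the eight styles as an enum. The identifier names below are illustrative, not the dataset's official label format.

```python
from enum import Enum

class HumorStyle(Enum):
    # Illustrative identifiers for the eight PixelHumor styles; the dataset's
    # actual label names or release format may differ.
    COMPARISON = "comparison"
    PERSONIFICATION = "personification"
    EXAGGERATION = "exaggeration"
    PUN = "pun"
    SARCASM = "sarcasm"
    SILLINESS = "silliness"
    SURPRISE = "surprise"
    DARK = "dark"

# Example: validating a predicted label against the taxonomy
print(HumorStyle("surprise"))  # HumorStyle.SURPRISE
```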


Annotation Process

Eight trained undergraduate annotators labeled the dataset. The process included:

  • Training period: Two weeks with practice comics.
  • Paired review: Each set labeled by two annotators, with disagreements resolved by a third.
  • Batch quality checks: Random samples were reviewed to verify labeling accuracy.
  • Final aggregation: Labels confirmed by majority voting.

Annotators could also skip dark or offensive material. Together, these steps kept the labels consistent while respecting individual comfort levels.
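
As a rough sketch of how paired review and majority voting could be combined in code, consider the following; the record format and tie-breaking rule are assumptions, not the authors' exact pipeline.

```python
from collections import Counter

def aggregate_labels(annotator_labels, tiebreak_label=None):
    """Aggregate per-comic style labels by majority vote.

    annotator_labels: labels from the two primary annotators
    tiebreak_label: label from a third annotator, used when the pair disagrees
    (This mirrors the paired-review step described above, but the exact
    procedure used for PixelHumor may differ.)
    """
    counts = Counter(annotator_labels)
    label, freq = counts.most_common(1)[0]
    if freq > len(annotator_labels) / 2:
        return label           # clear majority among the pair
    return tiebreak_label      # disagreement: defer to the third annotator

# Example: the two annotators disagree, so the third annotator resolves it
print(aggregate_labels(["surprise", "sarcasm"], tiebreak_label="sarcasm"))  # sarcasm
```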


Dataset Analysis

Key findings from the annotation phase include:

  1. Sound Effects

    • 85% of comics had no sound effects.
    • In those that did, 70% used onomatopoeia (e.g., “BAM!”, “POW!”), often tied to action sequences.
  2. Role of Text vs Visuals

    • 52% of humor depended mainly on text.
    • 32% relied on both text and visuals working together.
    • 16% were labeled as non-humorous.
  3. Distribution of Humor Styles

    • Surprise was the most frequent (35%).
    • Personification followed at 28%.
    • Dark humor was rare (5%), mostly in Cyanide and Happiness.

These insights confirm that successful humor comprehension requires integrating multiple modalities.


Experiment Design

Task Definitions

PixelHumor tests four main capabilities:

  1. Humor Identification – Can the model detect whether a comic is funny, identify sound effects, and locate the panel responsible for the punchline?
  2. Humor Classification – Can the model categorize humor into the correct style?
  3. Humor Interpretation – Can the model explain why the comic is funny in natural language?
  4. Sequence Recognition – Can the model reconstruct the correct order of panels and text to preserve narrative flow?

These tasks span perception, reasoning, and narrative comprehension.


Models Evaluated

Both closed- and open-source models were tested:

  • Closed-source: GPT-4o, Gemini-1.5-Pro.
  • Open-source (large): Qwen2-VL-72B, Gemma3-27B.
  • Open-source (small): LLaVA-OneVision-7B, Qwen2-VL-7B.

Models were evaluated in zero-shot settings, using standardized prompts without fine-tuning.
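
To make the zero-shot setup concrete, here is a minimal sketch of what a standardized classification prompt could look like. The wording and the helper function are hypothetical, not the prompts used in the paper.

```python
# Hypothetical zero-shot prompt template for the humor classification task.
STYLES = ["comparison", "personification", "exaggeration", "pun",
          "sarcasm", "silliness", "surprise", "dark"]

def build_classification_prompt(num_panels: int) -> str:
    """Build a style-classification prompt for a comic with num_panels panels."""
    return (
        f"You are shown a comic with {num_panels} panels.\n"
        f"Classify its humor style as one of: {', '.join(STYLES)}.\n"
        "Answer with the style name only."
    )

print(build_classification_prompt(4))
```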


Evaluation Metrics

  • Precision, Recall, F1-score – For humor detection and classification.
  • Human ratings – For humor interpretation.
  • Accuracy, word error rate (WER), and character error rate (CER) – For sequence recognition.

This mix of automated and human evaluation provides both quantitative and qualitative insights.
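
As a rough illustration of the automated metrics, the sketch below computes exact-match accuracy for panel ordering plus word and character error rates via edit distance. This is a generic formulation under my own assumptions, not necessarily the paper's scoring script.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution / match
    return dp[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance normalized by reference length."""
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / max(len(ref), 1)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: the same idea, computed over characters."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

def sequence_accuracy(gold_orders, predicted_orders):
    """Fraction of comics whose panel order is reproduced exactly."""
    correct = sum(g == p for g, p in zip(gold_orders, predicted_orders))
    return correct / len(gold_orders)

# Toy example: one of two predicted panel orders matches the gold order.
print(sequence_accuracy([[1, 2, 3, 4], [1, 2, 3]], [[1, 2, 3, 4], [2, 1, 3]]))  # 0.5
print(wer("the cat jumps over the dog", "the cat leaps over the dog"))          # ~0.167
```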


Experiment Results

Humor Identification

  • All models achieved near-perfect results in detecting whether humor was present (F1 > 0.98).
  • However, identifying sound effects or the most important panel proved difficult.
  • GPT-4o performed best overall, but even it struggled to attribute humor accurately across modalities.

Humor Classification

  • GPT-4o again led the field, followed closely by Gemini-1.5-Pro.
  • Open-source models showed biases, often predicting only one humor style.
  • Personification was classified most reliably, while sarcasm and dark humor caused the most confusion.

Humor Interpretation

  • Human explanations were judged superior in 69% of cases.
  • GPT-4o produced the most coherent AI-generated interpretations, yet still fell short of human reasoning.
  • Smaller models often hallucinated details or gave formulaic answers.

Sequence Recognition

  • Narrative sequencing was the hardest task.
  • Best closed-source models achieved around 60% accuracy.
  • Open-source models lagged far behind, often defaulting to simple left-to-right reading patterns.

Discussion

The results reveal several challenges for LMMs:

  • Overreliance on surface cues – Models focus on words or objects rather than deeper context.
  • Weakness in long-sequence reasoning – Comics often build humor gradually across multiple panels.
  • Poor cross-modal fusion – Humor emerges from text and visuals interacting, not isolated elements.
  • Cultural bias – Models misinterpret sarcasm or dark humor due to limited cultural grounding.

True humor comprehension requires causal reasoning, narrative tracking, and cultural sensitivity—areas where current LMMs remain limited.


Limitations

  • Subjectivity – Humor varies across individuals, making full agreement difficult.
  • Scope – The dataset focuses on static comics, not animated or video humor.
  • Cultural representation – Sources are primarily Western, limiting cross-cultural analysis.

Ethical Considerations

  • Some comics include violent, explicit, or dark content. These are clearly flagged as potentially harmful.
  • Annotators were paid fairly, trained thoroughly, and allowed to skip uncomfortable material.
  • To respect intellectual property, only URLs are shared, not raw images.

Frequently Asked Questions

Q1: What is PixelHumor used for?
It serves as a benchmark to evaluate how well multimodal AI models can understand humor in comics.

Q2: Can PixelHumor be used to train AI?
No. The dataset is for evaluation only. Researchers access comics through original URLs.
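
As a minimal sketch of what URL-based access could look like in practice, assuming a hypothetical JSON file with a "url" field per comic (not the dataset's actual release format):

```python
import json
import requests

# Hypothetical release file: a JSON list of records, each with a "url" field.
# The actual PixelHumor distribution format may differ.
with open("pixelhumor_urls.json") as f:
    records = json.load(f)

for record in records[:5]:
    response = requests.get(record["url"], timeout=30)
    response.raise_for_status()
    filename = record["url"].rsplit("/", 1)[-1]
    with open(filename, "wb") as out:
        out.write(response.content)  # save the comic image locally
```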

Q3: Why do models perform poorly on sarcasm and dark humor?
These humor types rely heavily on cultural knowledge, irony, and subtle context that models cannot yet capture.

Q4: Which humor style do models recognize best?
Personification, because anthropomorphic animals and objects provide clear visual-text cues.

Q5: How do humans compare to AI in humor interpretation?
Human-written explanations are still preferred in the majority of cases, especially for complex narratives.


Conclusion

PixelHumor highlights a critical gap between human and machine humor understanding. While LMMs detect humor’s presence reliably, they struggle with deeper reasoning:

  • Humor style classification remains shallow, with poor recognition of nuanced categories.
  • Interpretation is formulaic compared to human creativity.
  • Narrative sequencing exposes weaknesses in long-context modeling.

This benchmark sets the stage for future research into multimodal reasoning, cultural adaptation, and narrative comprehension. For now, when it comes to humor, machines still have a long way to go before they can laugh with us.