Building a Multi-Modal Research Assistant with Gemini 2.5: Auto-Generate Reports and Podcasts

Need instant expert analysis on any topic? Want to transform research into engaging podcasts? Discover how Google’s Gemini 2.5 models create comprehensive research workflows with zero manual effort.

What Makes This Research Assistant Unique?

This innovative system combines LangGraph workflow orchestration with Google Gemini 2.5’s multimodal capabilities to automate knowledge synthesis. Provide a research topic and optional YouTube link, and it delivers:

  1. Web research with verified sources
  2. Video content analysis
  3. Structured markdown report
  4. Natural-sounding podcast dialogue

Core Technology Integration

| Capability | Technical Implementation | Output |
| --- | --- | --- |
| 🎥 Video Processing | Native YouTube understanding | Video transcripts & insights |
| 🔍 Real-time Research | Built-in Google Search tool | Citations with source URLs |
| 📝 Report Synthesis | Multi-source data fusion | Structured markdown document |
| 🎙️ Audio Production | Multi-speaker TTS technology | Dialogue podcast (.wav) |

Getting Started in 15 Minutes

Essential Preparation

Before you begin, you'll need:

  • A Google Gemini API key
  • Python 3.11
  • Basic command-line familiarity

Setup Walkthrough

# 1. Clone repository
git clone https://github.com/langchain-ai/multi-modal-researcher
cd multi-modal-researcher

# 2. Configure environment
cp .env.example .env
# Add your Gemini API key to .env
GEMINI_API_KEY=your_actual_key_here

# 3. Install uv (Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 4. Launch the development server
uvx --refresh --from "langgraph-cli[inmem]" --with-editable . --python 3.11 langgraph dev --allow-blocking

Successful launch displays:

╦  ┌─┐┌┐┌┌─┐╔═╗┬─┐┌─┐┌─┐┬ ┬
║  ├─┤││││ ┬║ ╦├┬┘├─┤├─┘├─┤
╩═╝┴ ┴┘└┘└─┘╚═╝┴└─┴ ┴┴  ┴ ┴

- API: http://127.0.0.1:2024
- Studio UI: https://smith.langchain.com/studio/?baseUrl=http://127.0.0.1:2024

Practical Example

Input parameters:

{
  "topic": "Are LLMs becoming the new operating systems?",
  "video_url": "https://youtu.be/LCEmiRjPEtQ?si=raeMN2Roy5pESNG2"
}

Output samples:

(Screenshot: the workflow running in the LangGraph Studio UI)

System Architecture Explained

Workflow Logic

The graph runs its nodes in sequence, skipping video analysis when no URL is supplied:

search_research → analyze_video (only if a video URL is provided) → create_report → create_podcast → END

Four Processing Modules

  1. Search Research Module

    • Uses Gemini’s native Google Search integration
    • Retrieves current web content
    • Automatically attaches source references
  2. Video Analysis Module (conditional)

    • Processes YouTube videos through Gemini 2.5
    • Extracts key arguments and data points
    • Supports long-form video understanding
  3. Report Generation Module

    • Combines web and video insights
    • Structures findings in markdown format
    • Includes executive summary and citations
  4. Podcast Creation Module

    • Generates dialogue scripts
    • Implements dual-voice system:

      • Dr. Sarah (domain expert)
      • Mike (interviewer)
    • Outputs studio-quality WAV audio

Customization Guide

Model Selection Options

# configuration.py settings
class Configuration:
    search_model = "gemini-2.5-flash"         # Web research
    synthesis_model = "gemini-2.5-flash"      # Report generation
    video_model = "gemini-2.5-flash"          # Video processing
    tts_model = "gemini-2.5-flash-preview-tts" # Speech synthesis

Temperature Settings

| Task Type | Default | Effect |
| --- | --- | --- |
| Factual Search | 0.0 | Maximizes accuracy |
| Report Synthesis | 0.3 | Balances creativity/reliability |
| Podcast Scripting | 0.4 | Enhances conversational flow |
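To see how a per-task temperature reaches the model, here is a sketch using the `google-genai` SDK (the lazy import keeps the snippet loadable without the package installed; `GEMINI_API_KEY` must be set in the environment):

```python
def generate(prompt: str, model: str = "gemini-2.5-flash", temperature: float = 0.0) -> str:
    """Call Gemini with an explicit temperature:
    0.0 for factual search, 0.3 for synthesis, 0.4 for podcast scripting."""
    from google import genai
    from google.genai import types  # lazy import: sketch loads without the SDK

    client = genai.Client()  # reads the API key from the environment
    response = client.models.generate_content(
        model=model,
        contents=prompt,
        config=types.GenerateContentConfig(temperature=temperature),
    )
    return response.text
```

Each workflow module would call this with its own temperature default rather than sharing one global setting.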

Voice Customization

# TTS configuration
mike_voice = "Kore"    # Interviewer voice profile
sarah_voice = "Puck"   # Expert voice profile
audio_format = ...     # Professional audio settings

Project Structure Overview

multi-modal-researcher/
├── src/agent/
│   ├── state.py         # Workflow state definitions
│   ├── configuration.py # Runtime settings
│   ├── utils.py         # Core functionality
│   └── graph.py         # LangGraph implementation
├── langgraph.json       # Deployment configuration
├── pyproject.toml       # Dependency management
└── .env                 # API credentials

Key Functions

  • display_gemini_response(): Processes model output with metadata
  • create_podcast_discussion(): Generates dialogue + audio
  • create_research_report(): Synthesizes multi-source data
  • wave_file(): Audio encoding utility
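The wave_file() utility needs nothing beyond the standard library: Gemini's TTS returns raw 16-bit PCM at 24 kHz, which only has to be wrapped in a WAV container. A sketch (parameter names are assumptions, not the repo's exact signature):

```python
import wave


def wave_file(filename: str, pcm_data: bytes,
              channels: int = 1, rate: int = 24000, sample_width: int = 2) -> None:
    """Write raw PCM bytes into a .wav container (defaults match Gemini TTS output)."""
    with wave.open(filename, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)   # 2 bytes = 16-bit samples
        wf.setframerate(rate)           # 24 kHz sample rate
        wf.writeframes(pcm_data)
```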

Deployment Options

| Environment | Command | Use Case |
| --- | --- | --- |
| Local Development | langgraph dev | Testing & debugging |
| LangGraph Platform | Cloud deployment | Production use |
| Self-Hosted | Docker container | Enterprise installation |

Technical FAQ

What programming skills are required?

  • Basic command-line knowledge for setup
  • Python familiarity for advanced customizations

Which video platforms are supported?

  • Currently only YouTube
  • Requires public video URLs

Can I modify podcast voices?

  • 8 predefined voice options available
  • Adjust voice assignments in configuration.py

What is the maximum video length?

  • Handles videos up to 2 hours
  • Gemini 2.5 supports million-token context

What report formats are supported?

  • Default output: Standard markdown
  • Custom templates via utils.py modifications
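A custom report template would be a small function along these lines (a hypothetical sketch of what a `utils.py` modification might look like, not the repo's actual code):

```python
def render_report(topic: str, summary: str, findings: list[str], sources: list[str]) -> str:
    """Render results as the markdown structure described earlier:
    title, executive summary, findings, and citations."""
    lines = [f"# Research Report: {topic}", "", "## Executive Summary", summary, "", "## Findings"]
    lines += [f"- {item}" for item in findings]
    lines += ["", "## Sources"]
    lines += [f"{i}. {url}" for i, url in enumerate(sources, 1)]
    return "\n".join(lines)
```

Swapping in a different template (e.g., adding a "Video Insights" section) only requires changing this one function.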

Dependencies

# pyproject.toml core dependencies
[project]
dependencies = [
    "langgraph>=0.2.6",    # Workflow engine
    "google-genai",        # Gemini SDK
    "langchain>=0.3.19",   # AI integration framework
    "rich",                # Terminal formatting
    "python-dotenv",       # Environment management
]