Building a Multi-Modal Research Assistant with Gemini 2.5: Auto-Generate Reports and Podcasts
Need instant expert analysis on any topic? Want to transform research into engaging podcasts? Discover how Google’s Gemini 2.5 models create comprehensive research workflows with zero manual effort.
What Makes This Research Assistant Unique?
This innovative system combines LangGraph workflow orchestration with Google Gemini 2.5’s multimodal capabilities to automate knowledge synthesis. Provide a research topic and optional YouTube link, and it delivers:
- Web research with verified sources
- Video content analysis
- A structured markdown report
- Natural-sounding podcast dialogue
Core Technology Integration
| Capability | Technical Implementation | Output |
|---|---|---|
| 🎥 Video Processing | Native YouTube understanding | Video transcripts & insights |
| 🔍 Real-time Research | Built-in Google Search tool | Citations with source URLs |
| 📝 Report Synthesis | Multi-source data fusion | Structured markdown document |
| 🎙️ Audio Production | Multi-speaker TTS technology | Dialogue podcast (.wav) |
Getting Started in 15 Minutes
Essential Preparation
- Python 3.11+ environment
- uv package manager
- Google Gemini API key
Setup Walkthrough
```bash
# 1. Clone the repository
git clone https://github.com/langchain-ai/multi-modal-researcher
cd multi-modal-researcher

# 2. Configure the environment
cp .env.example .env
# Add your Gemini API key to .env:
# GEMINI_API_KEY=your_actual_key_here

# 3. Install uv and launch the dev server
curl -LsSf https://astral.sh/uv/install.sh | sh
uvx --refresh --from "langgraph-cli[inmem]" --with-editable . --python 3.11 langgraph dev --allow-blocking
```

4. Access the interface. A successful launch displays:

```
╦ ┌─┐┌┐┌┌─┐╔═╗┬─┐┌─┐┌─┐┬ ┬
║ ├─┤││││ ┬║ ╦├┬┘├─┤├─┘├─┤
╩═╝┴ ┴┘└┘└─┘╚═╝┴└─┴ ┴┴ ┴ ┴

- API: http://127.0.0.1:2024
- Studio UI: https://smith.langchain.com/studio/?baseUrl=http://127.0.0.1:2024
```
Practical Example
Input parameters:

```json
{
  "topic": "Are LLMs becoming the new operating systems?",
  "video_url": "https://youtu.be/LCEmiRjPEtQ?si=raeMN2Roy5pESNG2"
}
```
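Before handing this payload to the workflow, it can be useful to check the two fields up front: `topic` is required, while `video_url` is optional and must point at a public YouTube video. This is an illustrative sketch of that check, not the project's actual validation logic:

```python
# Minimal input validation for the assistant's two inputs.
# Field names ("topic", "video_url") match the example payload above;
# the URL check is a hypothetical helper, not code from the repository.
from urllib.parse import urlparse

def validate_input(payload: dict) -> dict:
    """Require a non-empty topic; accept an optional YouTube URL."""
    topic = payload.get("topic", "").strip()
    if not topic:
        raise ValueError("'topic' is required")
    video_url = payload.get("video_url")
    if video_url is not None:
        host = urlparse(video_url).netloc.lower()
        if not any(h in host for h in ("youtube.com", "youtu.be")):
            raise ValueError("only public YouTube URLs are supported")
    return {"topic": topic, "video_url": video_url}

result = validate_input({
    "topic": "Are LLMs becoming the new operating systems?",
    "video_url": "https://youtu.be/LCEmiRjPEtQ?si=raeMN2Roy5pESNG2",
})
```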
For this input, the run produces two outputs: a structured markdown research report and a multi-speaker podcast audio file.
System Architecture Explained
Workflow Logic
```
search_research → analyze_video (skipped when no video_url is provided) → generate_report → create_podcast
```
Four Processing Modules
1. **Search Research Module**
   - Uses Gemini's native Google Search integration
   - Retrieves current web content
   - Automatically attaches source references
2. **Video Analysis Module** (conditional)
   - Processes YouTube videos through Gemini 2.5
   - Extracts key arguments and data points
   - Supports long-form video understanding
3. **Report Generation Module**
   - Combines web and video insights
   - Structures findings in markdown format
   - Includes an executive summary and citations
4. **Podcast Creation Module**
   - Generates dialogue scripts
   - Implements a dual-voice system: Dr. Sarah (domain expert) and Mike (interviewer)
   - Outputs studio-quality WAV audio
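The conditional branch between modules 1 and 2 can be sketched as a plain routing function: after search, the workflow visits video analysis only when a `video_url` is present, otherwise it goes straight to report generation. Node names here are illustrative; the real wiring lives in `graph.py` as LangGraph conditional edges.

```python
# Hypothetical router mirroring the workflow's conditional branch.
# In the actual project this decision is expressed as a LangGraph
# conditional edge; this sketch only shows the decision logic.
def route_after_search(state: dict) -> str:
    """Return the next node name based on the optional video_url."""
    return "analyze_video" if state.get("video_url") else "generate_report"
```

With a `video_url` in the state, the router selects `"analyze_video"`; without one, the video step is skipped entirely.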
Customization Guide
Model Selection Options
```python
# configuration.py settings
class Configuration:
    search_model = "gemini-2.5-flash"            # Web research
    synthesis_model = "gemini-2.5-flash"         # Report generation
    video_model = "gemini-2.5-flash"             # Video processing
    tts_model = "gemini-2.5-flash-preview-tts"   # Speech synthesis
```
Temperature Settings
| Task Type | Default | Effect |
|---|---|---|
| Factual Search | 0.0 | Maximizes accuracy |
| Report Synthesis | 0.3 | Balances creativity/reliability |
| Podcast Scripting | 0.4 | Enhances conversational flow |
Voice Customization
```python
# TTS configuration
mike_voice = "Kore"     # Interviewer voice profile
sarah_voice = "Puck"    # Expert voice profile
audio_format = ...      # Professional audio settings
```
Project Structure Overview
```
multi-modal-researcher/
├── src/agent/
│   ├── state.py          # Workflow state definitions
│   ├── configuration.py  # Runtime settings
│   ├── utils.py          # Core functionality
│   └── graph.py          # LangGraph implementation
├── langgraph.json        # Deployment configuration
├── pyproject.toml        # Dependency management
└── .env                  # API credentials
```
Key Functions
- `display_gemini_response()`: Processes model output with metadata
- `create_podcast_discussion()`: Generates dialogue + audio
- `create_research_report()`: Synthesizes multi-source data
- `wave_file()`: Audio encoding utility
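Of these, `wave_file()` is the simplest to sketch: raw PCM bytes from the TTS step need a WAV container before they can play anywhere. A minimal stdlib version might look like the following; the 24 kHz / 16-bit / mono defaults are an assumption about the TTS output, not a documented guarantee.

```python
# Minimal sketch of a wave_file() helper using only the stdlib `wave`
# module. Parameter defaults (mono, 24 kHz, 16-bit) are assumptions
# about the TTS audio format; adjust to match the actual output.
import wave

def wave_file(filename: str, pcm_bytes: bytes,
              channels: int = 1, rate: int = 24000,
              sample_width: int = 2) -> None:
    """Wrap raw PCM bytes in a WAV container and write it to disk."""
    with wave.open(filename, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(rate)
        wf.writeframes(pcm_bytes)
```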
Deployment Options
| Environment | Command | Use Case |
|---|---|---|
| Local Development | `langgraph dev` | Testing & debugging |
| LangGraph Platform | Cloud deployment | Production use |
| Self-Hosted | Docker container | Enterprise installation |
Technical FAQ
What programming skills are required?
- Basic command-line knowledge for setup
- Python familiarity for advanced customizations
Which video platforms are supported?
- Currently only YouTube
- Requires public video URLs
Can I modify podcast voices?
- 8 predefined voice options available
- Adjust voice assignments in configuration.py
How long can processed videos be?
- Handles videos up to 2 hours
- Gemini 2.5 supports a million-token context
What report formats are supported?
- Default output: standard markdown
- Custom templates via utils.py modifications
Dependencies
```toml
# pyproject.toml core packages
langgraph = ">=0.2.6"     # Workflow engine
google-genai = "*"        # Gemini SDK
langchain = ">=0.3.19"    # AI integration framework
rich = "*"                # Terminal formatting
python-dotenv = "*"       # Environment management
```