PAPER2WEB: Bringing Your Academic Papers to Life

An integrated guide for turning static PDFs into interactive, structured academic websites and presentation materials.


Table of Contents

  1. Introduction

  2. What’s New

  3. Installation Guide

  4. Configuration

  5. Quick Start

  6. Generating Academic Presentation Videos (Paper2Video)

  7. Paper2Web Dataset Overview

  8. Benchmarking Paper2Web

  9. Contributing

  10. Acknowledgments

  11. FAQ


1. Introduction

Academic papers are highly structured and information-dense, but their PDF format often limits discoverability and interactivity. Researchers, students, and project teams face challenges such as:

  • Difficulty navigating complex content
  • Static figures and tables
  • Time-consuming manual website creation

PAPER2WEB addresses these challenges by providing an autonomous pipeline that converts academic papers into interactive, explorable project websites. The pipeline iteratively refines both content and layout, producing engaging websites that showcase the research in a structured, readable, and interactive format.

Key features include:

  • Automatic layout-aware content generation
  • Interactive navigation for users
  • Support for posters, presentation videos, and PR materials
  • Integration with advanced aesthetic agents (EvoPresent)

2. What’s New

Recent updates from the project:

  • EvoPresent Integration: Adds self-improving aesthetic agents for academic presentations.
  • Paper2Web Dataset & Benchmark: Tens of thousands of categorized papers, including metadata and citation counts, available for analysis and model training.
  • Paper2ALL Pipeline Release: Incorporates Paper2Video, Paper2Poster, and AutoPR, creating a unified toolchain for promotional materials.


3. Installation Guide

3.1 Prerequisites

Before installing, ensure the following:

  • Python ≥ 3.11
  • Conda (recommended for environment management)
  • LibreOffice (required for document conversion)
  • Poppler-utils (PDF rendering and parsing)

Tip: Conda environments help isolate dependencies and avoid conflicts between Python packages.
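
To confirm these prerequisites are in place, a quick self-check can help. The sketch below is not part of the repository; it assumes LibreOffice and Poppler expose their usual soffice and pdftoppm binaries:

# check_prereqs.py -- minimal sketch, not shipped with Paper2Web
import shutil
import sys

required = {
    "soffice": "LibreOffice (document conversion)",
    "pdftoppm": "Poppler (PDF rendering)",
}
missing = [f"{name} ({why})" for name, why in required.items()
           if shutil.which(name) is None]
if sys.version_info < (3, 11):
    missing.append(f"Python >= 3.11 (found {sys.version.split()[0]})")

if missing:
    print("Missing prerequisites:", ", ".join(missing))
else:
    print("All prerequisites found.")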


3.2 Creating Conda Environment

conda create -n p2w python=3.11
conda activate p2w

This creates an isolated environment named p2w for all Paper2Web dependencies.


3.3 Installing Dependencies

pip install -r requirements.txt

This installs Python packages required for the pipeline, including libraries for PDF processing, LLM interaction, and website generation.


3.4 System Dependencies

LibreOffice

sudo apt install libreoffice

If sudo is unavailable, download a portable build from the LibreOffice website and add it to your system PATH.

Poppler

conda install -c conda-forge poppler

Poppler is used for PDF parsing and rendering, enabling conversion from LaTeX/PDF to HTML content.
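
To verify that Poppler is wired up correctly, you can render a PDF to images. This sketch assumes the pdf2image package (pip install pdf2image), a thin Python wrapper around Poppler's pdftoppm:

# Render each page of a PDF to a PNG via Poppler.
from pdf2image import convert_from_path

pages = convert_from_path("paper.pdf", dpi=150)  # one PIL image per page
for i, page in enumerate(pages):
    page.save(f"page_{i + 1}.png", "PNG")
print(f"Rendered {len(pages)} pages.")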


4. Configuration

Before running the pipeline, configure your API credentials in a .env file:

# OpenAI API
OPENAI_API_KEY=your_openai_api_key_here
OPENAI_API_BASE=https://api.openai.com/v1

# Optional: OpenRouter (use these values instead of the OpenAI ones above; do not set both)
OPENAI_API_BASE=https://openrouter.ai/api/v1
OPENAI_API_KEY=sk-or-your-openrouter-key-here

AutoPR Component:

cp AutoPR/.env.example AutoPR/.env

Edit credentials as needed.

Optional: Google Search API (for logo search):

GOOGLE_SEARCH_API_KEY=your_google_search_api_key
GOOGLE_SEARCH_ENGINE_ID=your_search_engine_id
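
For reference, here is a minimal sketch of how the .env credentials are typically consumed from Python; it assumes the python-dotenv and openai packages, with variable names matching the keys above:

# Load .env and build an OpenAI-compatible client.
import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # reads .env from the current working directory
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.environ.get("OPENAI_API_BASE", "https://api.openai.com/v1"),
)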

5. Quick Start

5.1 Input Directory Structure

The pipeline automatically detects target platforms based on folder names:

papers/
├── 12345/                    # Numeric → Twitter (English)
│   └── paper.pdf
└── research_project/         # Alphanumeric → Xiaohongshu (Chinese)
    └── paper.pdf
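
The detection rule can be illustrated with a short, hypothetical helper; the pipeline's actual implementation may differ, but the convention is the one shown in the tree above:

# Hypothetical illustration of the folder-name convention.
from pathlib import Path

def detect_platform(paper_dir: Path) -> str:
    # Purely numeric names -> Twitter (English); otherwise Xiaohongshu (Chinese).
    return "Twitter" if paper_dir.name.isdigit() else "Xiaohongshu"

for d in sorted(Path("papers").iterdir()):
    if d.is_dir():
        print(f"{d.name}: {detect_platform(d)}")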

5.2 Running All Modules

python pipeline_all.py --input-dir "path/to/papers" --output-dir "path/to/output"

For a specific PDF:

python pipeline_all.py \
  --input-dir "path/to/papers" \
  --output-dir "path/to/output" \
  --pdf-path "path/to/paper.pdf"

5.3 Running Specific Modules

  • Website Generation Only:
    python pipeline_all.py --model-choice 1
  • Poster Generation Only (default 48×36 inches):
    python pipeline_all.py --model-choice 2
  • Poster with Custom Size:
    python pipeline_all.py --model-choice 2 --poster-width-inches 60 --poster-height-inches 40
  • PR Material Generation Only:
    python pipeline_all.py --model-choice 3

6. Generating Academic Presentation Videos (Paper2Video)

Paper2Video converts LaTeX papers into full presentation videos, including:

  • Slides
  • Subtitles
  • Audio narration
  • Cursor animations
  • Optional talking-head avatars


6.1 Environment Setup

cd paper2all/Paper2Video/src
conda create -n p2v python=3.10
conda activate p2v
pip install -r requirements.txt
conda install -c conda-forge tectonic ffmpeg poppler

6.2 Optional: Talking-Head Generation

A separate environment is recommended to avoid package conflicts:

cd hallo2
conda create -n hallo python=3.10
conda activate hallo
pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
huggingface-cli download fudan-generative-ai/hallo2 --local-dir ../pretrained_models

Find the Python executable path:

which python

Use this path as --talking_head_env in the pipeline.


6.3 Inference Pipeline

The pipeline consumes:

  • LaTeX sources of papers
  • Reference images
  • Reference audio

The pipeline outputs a complete academic presentation video. Minimum recommended GPU: NVIDIA A6000 (48GB).
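
Before launching a long run, it is worth checking that all three inputs exist. Below is a minimal pre-flight sketch; the paths are placeholders mirroring the example commands that follow:

# Pre-flight check for Paper2Video inputs.
from pathlib import Path

inputs = {
    "LaTeX project": Path("/path/to/latex_proj"),
    "reference image": Path("/path/to/ref_img.png"),
    "reference audio": Path("/path/to/ref_audio.wav"),
}
for label, path in inputs.items():
    if not path.exists():
        raise SystemExit(f"Missing {label}: {path}")
print("All Paper2Video inputs present.")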


6.4 Example Commands

Fast generation (without talking-head):

python pipeline_light.py \
  --model_name_t gpt-4.1 \
  --model_name_v gpt-4.1 \
  --result_dir /path/to/output \
  --paper_latex_root /path/to/latex_proj \
  --ref_img /path/to/ref_img.png \
  --ref_audio /path/to/ref_audio.wav \
  --gpu_list [0,1,2,3,4,5,6,7]

Full generation (with talking-head):

python pipeline.py \
  --model_name_t gpt-4.1 \
  --model_name_v gpt-4.1 \
  --model_name_talking hallo2 \
  --result_dir /path/to/output \
  --paper_latex_root /path/to/latex_proj \
  --ref_img /path/to/ref_img.png \
  --ref_audio /path/to/ref_audio.wav \
  --talking_head_env /path/to/hallo2_env \
  --gpu_list [0,1,2,3,4,5,6,7]

7. Paper2Web Dataset Overview

The dataset includes:

  • Metadata for papers with and without project websites
  • Citation counts
  • 13 main categories, listed below:

Category                                Description
3D Vision & Computational Graphics      Papers on 3D reconstruction and graphics
Multimodal Learning                     Learning across images, text, and audio
Generation Models                       Generative AI models
Speech & Audio                          Processing and understanding audio signals
AI for Science                          AI applied to scientific domains
ML System & Infrastructure              Frameworks and tools for ML
Deep Learning Architectures             Neural network design
Probabilistic Inference                 Probabilistic reasoning methods
Natural Language Understanding          NLP and language models
Information Retrieval & Recommendation  Search engines, recommender systems
Reinforcement Learning                  RL algorithms and applications
Trustworthy AI                          Safety, fairness, explainability
ML Theory & Optimization                Theoretical and optimization research
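
As a starting point for analysis, the sketch below tallies papers per category. It assumes a JSON Lines file with a "category" field; the file name and schema are assumptions, not the released format:

# Count dataset records per category (schema is assumed, not official).
import json
from collections import Counter

categories = Counter()
with open("paper2web_dataset.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        categories[record["category"]] += 1

for category, count in categories.most_common():
    print(f"{category}: {count}")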

8. Benchmarking Paper2Web

The benchmark includes:

  • Original website source URLs
  • Paper metadata
  • Partial results from PWAgent
  • Visual comparison of original vs. generated websites


Evaluation metrics:

  • Informative quality
  • Aesthetic quality
  • QA accuracy
  • Content completeness
  • Connectivity and interactivity
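
For a rough single-number summary across these metrics, an unweighted mean is one option. The field names and 0-1 score range below are assumptions for illustration, not the benchmark's official aggregation:

# Illustrative aggregation of per-metric scores (assumed 0-1 range).
scores = {
    "informative_quality": 0.82,
    "aesthetic_quality": 0.74,
    "qa_accuracy": 0.90,
    "content_completeness": 0.85,
    "connectivity_interactivity": 0.70,
}
overall = sum(scores.values()) / len(scores)  # simple unweighted mean
print(f"Overall score: {overall:.3f}")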

9. Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Modify code
  4. Add tests (if applicable)
  5. Submit a pull request

10. Acknowledgments

Thanks to:

  • Authors and guiding advisors
  • Open-source community
  • Paper2AI ecosystem contributors
  • Paper2Video, Paper2Poster, AutoPR, EvoPresent teams

11. FAQ

Q1: How does PAPER2WEB determine the platform?

  • Numeric folder name → Twitter (English)
  • Alphanumeric → Xiaohongshu (Chinese)

Q2: Is LibreOffice required?

Yes, it is required for document conversion. If sudo is unavailable, download a portable build and add it to your PATH.


Q3: What is Poppler used for?

PDF parsing and rendering.


Q4: Can I use OpenRouter instead of OpenAI API?

Yes, configure OPENAI_API_BASE and OPENAI_API_KEY in .env.


Q5: Minimum GPU requirement for Paper2Video?

An NVIDIA A6000 (48 GB) is the recommended minimum.


Q6: Can I skip talking-head generation?

Yes, use pipeline_light.py, which does not require the hallo2 environment.


Q7: Poster default and custom sizes?

Default: 48×36 inches
Custom: Use --poster-width-inches and --poster-height-inches.


Q8: What data is in the Paper2Web dataset?

  • Metadata
  • Website existence
  • Citation counts
  • Categories (13 classes)

Q9: What can the benchmark do?

  • Compare original vs. generated websites
  • Evaluate visual design, content completeness, connectivity

Q10: How to contribute?

Follow standard open-source workflow: fork → branch → modify → test → PR


Next Steps / Tips for Users

  • Always use a separate Conda environment per module to avoid dependency conflicts
  • Prepare LaTeX, images, and audio references before running Paper2Video
  • Use the benchmark dataset to evaluate and improve generated websites
  • Keep .env files secure; never commit API keys to version control or share them publicly