From Screenshot to Website: A Complete, Plain-English Guide to ScreenCoder

Keywords: UI-to-code, visual language model, front-end automation, HTML/CSS generation, ScreenCoder tutorial


Why This Guide Exists

Designers send screenshots. Engineers still rebuild them by hand.
ScreenCoder closes that gap.
It is an open-source toolkit that turns any UI image into clean, production-ready HTML/CSS.
Below you will find everything you need to understand, install, and extend it—no PhD required.


1. Three-Minute Overview: How ScreenCoder Works

Stage                       | What It Does           | Plain-English Analogy                               | Core Tech
Grounding Agent             | Sees the picture       | “That box is the sidebar, this one is the header.”  | Vision-language model + bounding boxes
Planning Agent              | Draws a floor plan     | Puts each box in the right room of a house          | Spatial heuristics + CSS Grid
Generation Agent            | Writes the blueprints  | Turns the plan into actual walls and paint          | Prompt-driven language model
Placeholder Mapping (bonus) | Hangs real photos      | Swaps gray squares for the original images          | UI element detector + crop-and-replace

2. Deep Dive: Each Stage Explained

2.1 Grounding Agent — “Where Is Everything?”

  • Labels it looks for
    sidebar, header, navigation, main_content.
  • How it finds them
    It asks the vision-language model questions such as “Where is the sidebar?”
    The model returns bounding boxes with pixel coordinates.
  • Edge cases handled (see the sketch after this list)

    • Duplicate boxes → non-maximum suppression.
    • Missing header → fallback rules (wide box near top = header).
    • Main content → largest empty rectangle left after other boxes are removed.
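
A minimal sketch of this de-duplication and fallback logic is below. The dictionary fields (score, xyxy), the 0.5 IoU threshold, and the “near the top, spans most of the width” cutoffs are illustrative assumptions, not ScreenCoder’s actual code.

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    if inter == 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def dedupe(boxes, threshold=0.5):
    """Greedy non-maximum suppression: keep the most confident box and
    drop any later box that overlaps a kept one too much."""
    boxes = sorted(boxes, key=lambda b: b["score"], reverse=True)
    kept = []
    for box in boxes:
        if all(iou(box["xyxy"], k["xyxy"]) < threshold for k in kept):
            kept.append(box)
    return kept

def guess_header(boxes, image_width, image_height):
    """Fallback rule: a wide box whose top edge sits near the top of the
    screenshot is treated as the header if the model missed it."""
    for box in boxes:
        x1, y1, x2, y2 = box["xyxy"]
        if y1 < 0.1 * image_height and (x2 - x1) > 0.8 * image_width:
            return box
    return None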

2.2 Planning Agent — “How Do Pieces Fit?”

  1. Converts raw pixels into percentages so the layout works on any screen size.
  2. Creates a root container that fills the viewport.
  3. Drops each labeled box into the container using absolute positioning.
  4. If a box has sub-components, it adds an inner <div class="container grid">; the children then use Tailwind utility classes (grid-cols-3, gap-4, etc.).
  5. Outputs a small JSON tree that the next stage can read (see the sketch below).
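
To make the pixel-to-percentage step concrete, here is a minimal sketch of how a layout node could be built and serialized. The field names (label, left, top, width, height, children) are illustrative assumptions and may differ from the JSON tree ScreenCoder actually emits.

import json

def to_layout_node(label, box, image_size, children=None):
    """Convert an absolute pixel box into viewport percentages so the
    layout scales to any screen size."""
    width, height = image_size
    x1, y1, x2, y2 = box
    return {
        "label": label,
        "left": round(100 * x1 / width, 2),   # percentages, not pixels
        "top": round(100 * y1 / height, 2),
        "width": round(100 * (x2 - x1) / width, 2),
        "height": round(100 * (y2 - y1) / height, 2),
        "children": children or [],
    }

# A 1280x800 screenshot with a header box and a sidebar box.
root = {
    "label": "root",
    "children": [
        to_layout_node("header", (0, 0, 1280, 96), (1280, 800)),
        to_layout_node("sidebar", (0, 96, 256, 800), (1280, 800)),
    ],
}
print(json.dumps(root, indent=2))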

2.3 Generation Agent — “Write the Code”

  • For every node in the tree, it builds a short prompt:

    • semantic label (e.g., “header”)
    • layout context (e.g., “spans full width, sits at top”)
    • optional user note (e.g., “make it sticky on scroll”)
  • The language model returns a block of HTML/CSS.
  • Blocks are stitched together in document order, keeping the hierarchy intact (see the sketch below).
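
The sketch below shows one way the per-node prompt and the stitching could look. build_prompt, generate_page, and the caller-supplied call_llm function are hypothetical names, and real nesting of child blocks is more involved than this flat concatenation.

def build_prompt(node, user_note=None):
    """Assemble the short per-node prompt: semantic label, layout context,
    and an optional user instruction."""
    lines = [
        f"Component: {node['label']}",
        f"Layout: left {node['left']}%, top {node['top']}%, "
        f"width {node['width']}%, height {node['height']}%",
        "Output: a self-contained HTML snippet styled with Tailwind classes.",
    ]
    if user_note:
        lines.append(f"User note: {user_note}")
    return "\n".join(lines)

def generate_page(tree, call_llm):
    """Walk the layout tree depth-first, ask the model for each block,
    and join the returned HTML fragments in document order."""
    blocks = [] if tree["label"] == "root" else [call_llm(build_prompt(tree))]
    for child in tree.get("children", []):
        blocks.append(generate_page(child, call_llm))
    return "\n".join(blocks)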

2.4 Placeholder Mapping — “Put Real Images Back”

  • Runs a lightweight UI element detector on the original screenshot.
  • Matches detected images to the gray placeholders in the generated HTML using the Hungarian algorithm (see the sketch after this list).
  • Crops the real images and replaces the gray squares, so the final page looks like the original design.
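
As a rough illustration of the matching step, the sketch below pairs detected image regions with placeholders using SciPy’s Hungarian-algorithm solver, with the distance between box centers as the cost. The cost function and the (x1, y1, x2, y2) box format are assumptions, not ScreenCoder’s exact implementation.

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_placeholders(detected, placeholders):
    """Return (detected_index, placeholder_index) pairs that minimize the
    total center-to-center distance across all assignments."""
    def center(box):
        x1, y1, x2, y2 = box
        return np.array([(x1 + x2) / 2, (y1 + y2) / 2])

    cost = np.zeros((len(detected), len(placeholders)))
    for i, d in enumerate(detected):
        for j, p in enumerate(placeholders):
            cost[i, j] = np.linalg.norm(center(d) - center(p))

    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    return list(zip(rows.tolist(), cols.tolist()))

# Each pair means: crop detected[i] from the screenshot and paste it where
# placeholders[j] sits in the generated page.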

3. Performance in Plain Numbers

Model                 | Block Match ↑ | Text Match ↑ | Position ↑ | CLIP ↑
GPT-4o                | 0.730         | 0.926        | 0.811      | 0.869
Gemini-2.5-Pro        | 0.727         | 0.924        | 0.809      | 0.867
ScreenCoder (Agentic) | 0.755         | 0.946        | 0.840      | 0.877

Tested on 3,000 real-world screenshots.
Higher is better. ScreenCoder tops all open-source models and edges out GPT-4o on layout and text fidelity.


4. Installation Guide — Works on macOS, Linux, Windows

4.1 Clone the Repository

git clone https://github.com/leigest519/ScreenCoder.git
cd ScreenCoder

4.2 Create a Virtual Environment

python3 -m venv .venv
source .venv/bin/activate   # On Windows: .venv\Scripts\activate

4.3 Install Dependencies

pip install -r requirements.txt

4.4 Pick Your Model & Add API Key

  • Open block_parsor.py and html_generator.py.
  • Choose one: doubao, qwen, gpt, or gemini.
  • In the project root, create a plain-text file named after your choice:

    • doubao_api.txt, qwen_api.txt, gpt_api.txt, or gemini_api.txt
  • Paste your single-line API key inside. The file is already ignored by .gitignore.

4.5 One-Command Run

python main.py

The script walks through all four stages and drops a finished final.html in the project folder.


5. Step-by-Step Manual Mode

Step                       | Command                        | Output File
Detect UI elements         | python UIED/run_single.py      | components.json
Generate placeholder page  | python html_generator.py       | index_with_placeholder.html
Detect placeholders        | python image_box_detection.py  | placeholders.json
Map images to placeholders | python mapping.py              | mapping table
Replace placeholders       | python image_replacer.py       | final.html

Use this pipeline when you want to inspect or tweak any intermediate file.


6. Frequently Asked Questions

Q1. Does it only work for websites?

No. The same pipeline can be pointed at mobile or desktop screenshots by swapping the label set and layout rules.

Q2. Will low-resolution images break it?

The benchmark includes blurry, compressed screenshots. Accuracy drops slightly but still beats prior work. For extreme cases, upscale first.

Q3. Can I get React or Vue code?

Today the output is plain HTML + Tailwind. Pipe it through tools like `html-to-jsx` if you need components.

Q4. Do I need a GPU?

Inference calls external APIs, so a laptop CPU is enough. Training your own model would require a GPU.

Q5. Is my design data safe?

Everything except the model call runs locally; only the screenshot is sent to the API provider you choose. For sensitive work, host the model yourself.

Q6. How about right-to-left (RTL) languages?

Layout percentages adapt automatically. You can add a prompt note “direction: rtl” to flip flex order.

Q7. Can I change the color theme later?

Yes. Add a user prompt such as “switch to dark mode” and regenerate only the affected blocks.

Q8. Why did it miss a tiny button?

Very small buttons sometimes fall below the detector’s threshold. You can manually add a node in the JSON layout and rerun generation.

Q9. Where does the training data come from?

The authors generated 50,000 pairs from public websites, dashboards, e-commerce pages, and synthetic templates. All personal data was removed.

Q10. Can I retrain the model with my own data?

Absolutely. The repo includes a data engine that produces (image, code) pairs. Use supervised fine-tuning followed by reinforcement learning as described in the paper.

Q11. Is there a Figma plugin?

Not yet, but community developers have prototyped a Figma widget that exports screenshots directly into ScreenCoder.

Q12. License details?

The main repo is MIT-licensed. The UIED sub-folder keeps its original permissive license; commercial use is allowed.


7. Continuous Learning Loop for Teams

  1. Designer exports screenshot →
  2. ScreenCoder produces first draft →
  3. Engineer tweaks code, commits to repo →
  4. Screenshot + final code are fed back into the data engine →
  5. New model version is fine-tuned overnight →
  6. Next morning everyone gets a slightly smarter assistant.

8. Key Takeaway

Upload a screenshot, grab a coffee, and come back to production-ready HTML/CSS that you can hand to any junior developer.
If you want more, use the same pipeline to train a model that understands your company’s unique design language.


References & Further Reading

  • Jiang Y. et al., “ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents,” arXiv:2507.22827, 2025
  • GitHub repository: https://github.com/leigest519/ScreenCoder
  • Live demo: https://huggingface.co/spaces/Jimmyzheng-10/ScreenCoder