Keywords: LEGO accelerator, automatic RTL generation, spatial accelerator, tensor applications, AI chip design, Gemmini comparison, data-flow fusion, MIT Han Lab
TL;DR
LEGO is an open-source toolchain released by MIT Han Lab in 2025.
Feed it a plain tensor loop (GEMM, Conv2D, Attention, MTTKRP) and it returns production-grade Verilog—no human-written templates, no HLS headaches.
On a 28 nm test chip, LEGO beats the state-of-the-art Gemmini generator by 3.2× in speed and 2.4× in energy while using the same MAC count and on-chip memory.
What you will learn in 12 minutes
- Why even Google still hand-tunes TPU blocks, and where that hurts
- How LEGO removes the “human RTL” bottleneck in three steps
- Side-by-side numbers: LEGO vs. Gemmini, Eyeriss, NVDLA
- Exact install commands (Ubuntu & macOS) and a first-run example
- FAQ covering license, FPGA/ASIC flows, sparse ops, power tricks, and road-maps
1. The hidden cost of “template” accelerators
| Pain point | What happens in practice | Business impact |
|---|---|---|
| Fixed topology | Systolic or 2-D array only | New operators (e.g. sparse attention) need a new chip |
| Control bloat | Every PE keeps its own address counter | Area ×2.0, power ×2.6 (MIT silicon numbers) |
| Manual RTL loop | Designer writes Verilog → debug → synthesis → place & route | 4-6 months; algorithm team moves on after 2 weeks |
| Data-flow mismatch | One kernel → one mapping | MobileNet + BERT on the same die = two accelerators |
LEGO attacks all four: automatic RTL, any topology, shared control, one fused array for many kernels.
2. LEGO workflow in plain English
Step 1: Describe the loop (30 seconds)
```python
# GEMM pseudo-code, any language
for i in range(M):
    for j in range(N):
        for k in range(K):
            Y[i][j] += X[i][k] * W[k][j]
```
Save it as gemm.json through the helper script, or write MAESTRO/TENET format; LEGO ingests both.
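The exact on-disk schema is not documented here; as an illustration, the GEMM nest above might serialize to something like the following. All field names in this sketch are hypothetical, not LEGO's actual format:

```python
import json

# Hypothetical loop description for the 3-level GEMM nest above.
# Field names are illustrative only, not LEGO's actual schema.
gemm = {
    "loops": [
        {"index": "i", "bound": 64},
        {"index": "j", "bound": 64},
        {"index": "k", "bound": 512},
    ],
    "statement": "Y[i][j] += X[i][k] * W[k][j]",
    "tensors": {
        "X": ["i", "k"],  # access indices per tensor
        "W": ["k", "j"],
        "Y": ["i", "j"],
    },
}

with open("gemm.json", "w") as f:
    json.dump(gemm, f, indent=2)
```

In practice the bundled write_loop.py helper emits this file for you; the point is simply that a loop nest reduces to bounds plus per-tensor index expressions.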
Step 2: Front-end (≈1 min on laptop)
- Affine analysis spots “who re-uses which element when”
- A solver returns legal Δs (spatial offsets) and Δt (time offsets)
- Chu-Liu minimum-spanning-tree pruning drops redundant edges
- Output: an ADG (Architecture Description Graph); nodes = FUs, edges = direct wires or FIFOs
No human draws schematics; the graph is mathematically optimal for reuse distance.
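The core of the reuse analysis can be stated in a few lines. A tensor whose index expression omits a loop variable is fully reused along that variable, so its data can be forwarded between FUs instead of re-fetched. This is a simplified sketch of the idea, not LEGO's solver:

```python
# Simplified reuse analysis: a tensor whose access omits a loop
# variable is reused along that axis (sketch, not LEGO's solver).
loop_vars = ["i", "j", "k"]
accesses = {"X": ["i", "k"], "W": ["k", "j"], "Y": ["i", "j"]}

def reuse_directions(access, loops):
    """Loop axes along which every element of the tensor is reused."""
    return [v for v in loops if v not in access]

for tensor, idx in accesses.items():
    print(tensor, "reused along", reuse_directions(idx, loop_vars))
# X is reused along j, W along i, and Y (the accumulator) along k
```

LEGO's front-end goes further, solving for the concrete Δs/Δt offsets that realize each reuse direction as a wire or FIFO, but the dependence structure it works from is exactly this.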
Step 3: Back-end (≈40 s + LP time)
- Lower the ADG to a DAG (Detailed Architecture Graph): registers, muxes, SRAM banks, address generators
- A linear program inserts the minimum number of pipeline registers while meeting the clock target
- Three bonus passes:
  - Broadcast → forwarding chain (saves 15 % area)
  - Long adder chain → balanced tree (saves 5 % power)
  - Pin remapping across data-flows (saves another 5 %)
- SpinalHDL emits Verilog ready for DC/Vivado

Deliverable: gemm.v, gemm_sram.v, gemm_top.v, plus a constraints file.
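The “long adder chain → balanced tree” pass trades a linear chain of adders for a logarithmic-depth reduction. A minimal sketch of the arithmetic behind it (not LEGO's implementation):

```python
import math

def chain_depth(n):
    # Naive accumulation: n inputs need n-1 sequential adds.
    return n - 1

def tree_depth(n):
    # Balanced binary reduction: ceil(log2(n)) adder levels.
    return math.ceil(math.log2(n))

def reduce_tree(values):
    """Sum values with a balanced binary tree (same result, fewer levels)."""
    while len(values) > 1:
        values = [sum(values[i:i + 2]) for i in range(0, len(values), 2)]
    return values[0]

# 16-input reduction: 15 chained adds become 4 tree levels.
print(chain_depth(16), tree_depth(16))  # 15 4
print(reduce_tree(list(range(16))))     # 120
```

Shorter critical paths mean fewer pipeline registers are needed to meet the clock, which is where the quoted 5 % power saving comes from.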
3. Benchmark results (same silicon budget)
Test setup: TSMC 28 nm, 256 MACs, 256 kB SRAM, 128-bit LPDDR4 16 GB/s.
| Workload | Gemmini latency (ms) | LEGO latency (ms) | Speed-up | Energy vs. Gemmini |
|---|---|---|---|---|
| ResNet-50 | 8.7 | 2.7 | 3.2× | –59 % |
| MobileNetV2 | 12.1 | 3.8 | 3.2× | –58 % |
| BERT-Base (16 tokens) | 6.5 | 2.1 | 3.1× | –57 % |
| GPT-2 (1k-token generation) | 95 | 92 | 1.03× | –9 % (memory-bound) |
Area: LEGO uses 1.76 mm², 2 % of which is post-processing units (softmax, layernorm).
4. Installation guide
4.1 Prerequisites
- OS: Ubuntu ≥ 20.04 or macOS ≥ 12
- Python ≥ 3.8
- C++ compiler with C++17 support (g++-9, clang-12, Apple Clang 14)
- CMake ≥ 3.20
- Java 11 (for SpinalHDL)
- (Optional) Synopsys Design Compiler or Xilinx Vivado for synthesis
- 8 GB free RAM, 2 GB disk
4.2 Dependency install
```shell
git clone https://github.com/mit-han-lab/lego.git
cd lego
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
./scripts/setup.sh  # downloads HiGHS, SpinalHDL, builds C++ binaries
```
4.3 Run your first GEMM
```shell
# create loop description
python examples/gemm/write_loop.py --M=64 --N=64 --K=512 -o gemm.json
# generate ADG
./build/bin/lego-front --input gemm.json --output gemm.adg
# generate RTL
./build/bin/lego-back --adg gemm.adg --verilog gemm_rtl
# simulate (iverilog example)
cd gemm_rtl
iverilog -o sim gemm*.v tb_gemm.v && vvp sim
```
Expected console output: “Test PASSED: 65536/65536 correct.”
4.4 Directory map
```text
lego/
├── examples/   # GEMM, Conv2D, Attention, MTTKRP ready-to-run
├── docs/       # affine math, DAG algorithms (PDF)
├── scripts/
│   ├── synth.tcl  # Vivado batch flow
│   ├── dc.tcl     # Design Compiler flow
│   └── power.py   # switching activity → CACTI
├── spinal/     # SpinalHDL primitives
└── build/      # CMake output, added to .gitignore
```
5. Deep dive: how LEGO keeps area & power low
| Technique | Saving mechanism | Average gain |
|---|---|---|
| Shared address generator | Control signal propagated systolically | –60 % control-logic area |
| MST + FIFO-depth optimisation | Minimises pipeline registers | –15 % area |
| Balanced reduction tree | Cuts adder levels | –5 % power |
| Pin reuse across data-flows | Smaller muxes vs. full adders | –5 % area & power |
| Clock-gating unused edges | Lower toggle rate | –1.5 % total power (–9 % on Attention) |
Total back-end optimisation: –35 % area, –28 % power compared to naive delay-matching.
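The shared address generator is the biggest single win: instead of every PE keeping a private counter, one generator emits the address stream and each PE sees it one cycle later than its neighbour. A toy simulation of that propagation (a sketch of the concept, not LEGO's RTL):

```python
def systolic_control(addresses, num_pes):
    """One shared address stream, delayed by one cycle per PE hop.

    Returns schedule[cycle][pe] = address seen by that PE (None before
    the stream reaches it). One generator plus a pipeline register per
    hop replaces num_pes private address counters.
    """
    total_cycles = len(addresses) + num_pes - 1
    schedule = [[None] * num_pes for _ in range(total_cycles)]
    for t, addr in enumerate(addresses):
        for pe in range(num_pes):
            schedule[t + pe][pe] = addr
    return schedule

sched = systolic_control([0, 1, 2, 3], num_pes=3)
# PE0 sees address 0 at cycle 0; PE2 sees the same address at cycle 2.
```

Because the delayed copies are just registers on an existing wire, the per-PE control logic shrinks to almost nothing, which is where the quoted –60 % control-area figure comes from.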
6. Comparison with hand-written designs
Same data-flow, same tech node, silicon proven:
| Accelerator | Source | Data-flow | Area (mm²) | Power (mW) | LEGO gain |
|---|---|---|---|---|---|
| Eyeriss | hand RTL | KH-OH | 9.6 | 278 | –23 % area, –60 % power |
| NVDLA | hand RTL | IC-OC | 1.7 | 300* | –12 % area, –30 % power |
*NVDLA 16 nm projected to 28 nm.
LEGO achieves comparable PPA without months of manual effort.
7. Scalability: from 64 FUs to 16 384 FUs
| #FUs | Array size | L2 NoC | Gen. time (s) | Area (mm²) | Power (mW) | Energy eff. (GOP/s/W) |
|---|---|---|---|---|---|---|
| 64 | 8×8 | 1×1 | 13.1 | 0.02 | 29 | 4404 |
| 256 | 16×16 | 1×1 | 28.7 | 0.06 | 106 | 4816 |
| 1 024 | 32×32 | 1×1 | 111.2 | 0.24 | 422 | 4853 |
| 4 096 | 32×32 | 2×3 | 120.3 | 1.05 | 1748 | 4688 |
| 16 384 | 32×32 | 4×5 | 134.3 | 4.21 | 6987 | 4690 |
Generation time stays under 3 min; energy efficiency saturates because DRAM bandwidth becomes the bottleneck.
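The saturation follows from a simple roofline argument: attainable throughput is the lesser of peak compute and memory bandwidth times arithmetic intensity. A back-of-envelope check, assuming the 16 GB/s link from the test setup, a 1 GHz clock (the 1 ns default target), and an arithmetic intensity of 50 ops/byte (an assumed workload value):

```python
def attainable_gops(n_fus, freq_ghz=1.0, bw_gbs=16.0, intensity=50.0):
    """Roofline model: min(peak compute, bandwidth bound), in GOP/s.

    Assumptions: 1 GHz clock, 16 GB/s DRAM link, 50 ops/byte
    arithmetic intensity; one MAC counts as 2 ops.
    """
    peak = n_fus * 2 * freq_ghz       # compute roof
    bw_bound = bw_gbs * intensity     # memory roof
    return min(peak, bw_bound)

for n in (64, 256, 1024, 4096, 16384):
    print(n, attainable_gops(n))
# Beyond ~400 FUs the 16 GB/s link, not the array, sets throughput.
```

Under these assumptions the compute roof passes the memory roof at about 400 FUs, which is consistent with the flat GOP/s/W numbers from 1 024 FUs upward in the table.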
8. Multi-data-flow fusion example
MobileNetV2 layer sequence needs both IC-OC and OH-OW parallelism on the same silicon.
| Approach | Power (mW) | MBV2 perf. (GOP/s) | Eff. (GOP/s/W) |
|---|---|---|---|
| Single IC-OC | 123 | 213 | 1732 |
| Single OH-OW | 155 | 293 | 1890 |
| Naïve merge | 196 | 313 | 1597 |
| LEGO fused | 163 | 313 | 1920 |
LEGO's heuristic chain planning keeps the merged design's performance while recovering the energy efficiency of the lighter single-data-flow designs.
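The win can be seen with a toy model: on a fused array, each layer runs under whichever data-flow suits it, while a fixed single-data-flow design pays the penalty on mismatched layers. The per-layer numbers below are illustrative, not MobileNetV2 measurements:

```python
# Hypothetical per-layer throughput (GOP/s) under two data-flows.
layers = [
    {"work_gop": 1.0, "ic_oc": 250, "oh_ow": 150},  # 1x1 conv: channel-parallel wins
    {"work_gop": 0.5, "ic_oc": 100, "oh_ow": 300},  # depthwise conv: spatial wins
]

def total_latency(layers, pick):
    """Seconds to run all layers when `pick` chooses the data-flow per layer."""
    return sum(l["work_gop"] / l[pick(l)] for l in layers)

fixed_ic_oc = total_latency(layers, lambda l: "ic_oc")
fixed_oh_ow = total_latency(layers, lambda l: "oh_ow")
fused = total_latency(layers, lambda l: max(("ic_oc", "oh_ow"), key=l.get))
print(fixed_ic_oc, fixed_oh_ow, fused)  # fused is the fastest of the three
```

What the toy model cannot show is the energy side: LEGO's pin remapping shares wires and muxes between the two data-flows, so the fused array avoids most of the naïve merge's power overhead.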
9. Frequently asked questions
Q1. Is LEGO only for academics?
No. MIT license allows commercial use; the generated RTL is yours.
Q2. My lab uses FPGA. Any proven board?
Xilinx Ultrascale+ U280 is validated. A 64-FU design compiles in 20 min on Vivado 2023.1.
Q3. When will sparse kernels be supported?
The team lists “sparse data-flow” support for late 2025. Today you can still feed CSR/COO loops, but the reuse patterns will be treated as dense.
Q4. How does LEGO handle floating-point?
Present release targets INT8/INT16. FP32 MACs can be dropped in by swapping the FU primitive in spinal/mac_fpu.scala.
Q5. Can I import a DSE result from Timeloop/MAESTRO?
Yes. LEGO reads the .map file and builds the exact data-flow; then it emits RTL instead of a cost report.
Q6. What if synthesis fails timing?
Increase the target_clock parameter (default 1 ns). The LP phase will insert more pipeline registers automatically.
Q7. Where do I post bugs?
GitHub issues board; response time < 48 h based on repo history.
10. Take-aways
- LEGO is the first template-free framework that goes from tensor loops to synthesizable Verilog in minutes.
- On identical silicon it outruns Gemmini, today's most popular open DNN generator, by 3.2× while using 2.4× less energy.
- Area and power benefits come from control-signal sharing, MST-based wiring, and LP-guided pipelining, not from magic silicon.
- Installation needs only CMake + Python; example designs compile on both Ubuntu and macOS.
- Commercial use is MIT-licensed; FPGA and ASIC flows are silicon-proven at 28 nm.
If your roadmap includes a custom AI accelerator and you cannot afford a six-month RTL cycle, download LEGO, run the GEMM example, and decide later if you still want to hand-write those pipelines.
Citation
If you use LEGO in research or product documentation, please cite:
```bibtex
@inproceedings{lin2025lego,
  title={LEGO: Spatial Accelerator Generation and Optimization for Tensor Applications},
  author={Lin, Yujun and Zhang, Zhekai and Han, Song},
  booktitle={2025 IEEE International Symposium on High Performance Computer Architecture (HPCA)},
  pages={1335--1347},
  year={2025},
  organization={IEEE}
}
```
Related links (all from source files)
- LEGO project page
- arXiv paper
- GitHub repository (expected public mirror)
- Gemmini baseline