Keywords: LEGO accelerator, automatic RTL generation, spatial accelerator, tensor applications, AI chip design, Gemmini comparison, data-flow fusion, MIT Han Lab
TL;DR
LEGO is an open-source toolchain released by MIT Han Lab in 2025.
Feed it a plain tensor loop (GEMM, Conv2D, Attention, MTTKRP) and it returns production-grade Verilog—no human-written templates, no HLS headaches.
On a 28 nm test chip, LEGO beats the state-of-the-art Gemmini generator by 3.2× in speed and 2.4× in energy while using the same MAC count and on-chip memory.
What you will learn in 12 minutes
- Why even Google still hand-tunes TPU blocks, and where that hurts
- How LEGO removes the “human RTL” bottleneck in three steps
- Side-by-side numbers: LEGO vs. Gemmini, Eyeriss, NVDLA
- Exact install commands (Ubuntu & macOS) and a first-run example
- FAQ covering license, FPGA/ASIC flows, sparse ops, power tricks, and road-maps
1. The hidden cost of “template” accelerators
| Pain point | What happens in practice | Business impact |
|---|---|---|
| Fixed topology | Systolic or 2-D array only | New operators (e.g. sparse attention) need a new chip |
| Control bloat | Every PE keeps its own address counter | Area ×2.0, power ×2.6 (MIT silicon numbers) |
| Manual RTL loop | Designer writes Verilog → debug → synthesis → place & route | 4-6 months; algorithm team moves on after 2 weeks |
| Data-flow mismatch | One kernel → one mapping | MobileNet + BERT on the same die = two accelerators |
LEGO attacks all four: automatic RTL, any topology, shared control, one fused array for many kernels.
2. LEGO workflow in plain English
Step 1: Describe the loop (30 seconds)
```python
# GEMM pseudo-code, any language
for i in range(M):
    for j in range(N):
        for k in range(K):
            Y[i][j] += X[i][k] * W[k][j]
```
Save it as gemm.json through the helper script, or write MAESTRO/TENET format; LEGO ingests both.
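The exact on-disk schema is not documented here; as an illustration, the GEMM nest above might serialize to something like the following. All field names in this sketch are hypothetical, not LEGO's actual format:

```python
import json

# Hypothetical loop description for the 3-level GEMM nest above.
# Field names are illustrative only, not LEGO's actual schema.
gemm = {
    "loops": [
        {"index": "i", "bound": 64},
        {"index": "j", "bound": 64},
        {"index": "k", "bound": 512},
    ],
    "statement": "Y[i][j] += X[i][k] * W[k][j]",
    "tensors": {
        "X": ["i", "k"],  # access indices per tensor
        "W": ["k", "j"],
        "Y": ["i", "j"],
    },
}

with open("gemm.json", "w") as f:
    json.dump(gemm, f, indent=2)
```

In practice the bundled write_loop.py helper emits this file for you; the point is simply that a loop nest reduces to bounds plus per-tensor index expressions.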
Step 2: Front-end (≈1 min on laptop)
- Affine analysis spots “who re-uses which element when”
- A solver returns legal Δs (spatial offsets) and Δt (time offsets)
- Chu-Liu minimum-spanning-tree pruning drops redundant edges
- Output: an ADG (Architecture Description Graph); nodes = FUs, edges = direct wires or FIFOs
No human draws schematics; the graph is mathematically optimal for reuse distance.
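The core of the reuse analysis can be stated in a few lines. A tensor whose index expression omits a loop variable is fully reused along that variable, so its data can be forwarded between FUs instead of re-fetched. This is a simplified sketch of the idea, not LEGO's solver:

```python
# Simplified reuse analysis: a tensor whose access omits a loop
# variable is reused along that axis (sketch, not LEGO's solver).
loop_vars = ["i", "j", "k"]
accesses = {"X": ["i", "k"], "W": ["k", "j"], "Y": ["i", "j"]}

def reuse_directions(access, loops):
    """Loop axes along which every element of the tensor is reused."""
    return [v for v in loops if v not in access]

for tensor, idx in accesses.items():
    print(tensor, "reused along", reuse_directions(idx, loop_vars))
# X is reused along j, W along i, and Y (the accumulator) along k
```

LEGO's front-end goes further, solving for the concrete Δs/Δt offsets that realize each reuse direction as a wire or FIFO, but the dependence structure it works from is exactly this.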
Step 3: Back-end (≈40 s + LP time)
- Lower the ADG to a DAG (Detailed Architecture Graph): registers, muxes, SRAM banks, address generators
- A linear program inserts the minimum number of pipeline registers while meeting the clock target
- Three bonus passes:
  - Broadcast → forwarding chain (saves 15 % area)
  - Long adder chain → balanced tree (saves 5 % power)
  - Pin remapping across data-flows (saves another 5 %)
- SpinalHDL emits Verilog ready for DC/Vivado

Deliverable: gemm.v, gemm_sram.v, gemm_top.v, plus a constraints file.
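The “long adder chain → balanced tree” pass trades a linear chain of adders for a logarithmic-depth reduction. A minimal sketch of the arithmetic behind it (not LEGO's implementation):

```python
import math

def chain_depth(n):
    # Naive accumulation: n inputs need n-1 sequential adds.
    return n - 1

def tree_depth(n):
    # Balanced binary reduction: ceil(log2(n)) adder levels.
    return math.ceil(math.log2(n))

def reduce_tree(values):
    """Sum values with a balanced binary tree (same result, fewer levels)."""
    while len(values) > 1:
        values = [sum(values[i:i + 2]) for i in range(0, len(values), 2)]
    return values[0]

# 16-input reduction: 15 chained adds become 4 tree levels.
print(chain_depth(16), tree_depth(16))  # 15 4
print(reduce_tree(list(range(16))))     # 120
```

Shorter critical paths mean fewer pipeline registers are needed to meet the clock, which is where the quoted 5 % power saving comes from.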
3. Benchmark results (same silicon budget)
Test setup: TSMC 28 nm, 256 MACs, 256 kB SRAM, 128-bit LPDDR4 16 GB/s.
| Workload | Gemmini latency (ms) | LEGO latency (ms) | Speed-up | Energy vs. Gemmini |
|---|---|---|---|---|
| ResNet-50 | 8.7 | 2.7 | 3.2× | –59 % |
| MobileNetV2 | 12.1 | 3.8 | 3.2× | –58 % |
| BERT-Base (16 tokens) | 6.5 | 2.1 | 3.1× | –57 % |
| GPT-2 (1k-token generation) | 95 | 92 | 1.03× | –9 % (memory-bound) |
Area: LEGO uses 1.76 mm², 2 % of which is post-processing units (softmax, layernorm).
4. Installation guide
4.1 Prerequisites
- OS: Ubuntu ≥ 20.04 or macOS ≥ 12
- Python ≥ 3.8
- C++ compiler with C++17 support (g++-9, clang-12, Apple Clang 14)
- CMake ≥ 3.20
- Java 11 (for SpinalHDL)
- (Optional) Synopsys Design Compiler or Xilinx Vivado for synthesis
- 8 GB free RAM, 2 GB disk
4.2 Dependency install
```shell
git clone https://github.com/mit-han-lab/lego.git
cd lego
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
./scripts/setup.sh  # downloads HiGHS, SpinalHDL, builds C++ binaries
```
4.3 Run your first GEMM
```shell
# create loop description
python examples/gemm/write_loop.py --M=64 --N=64 --K=512 -o gemm.json
# generate ADG
./build/bin/lego-front --input gemm.json --output gemm.adg
# generate RTL
./build/bin/lego-back --adg gemm.adg --verilog gemm_rtl
# simulate (iverilog example)
cd gemm_rtl
iverilog -o sim gemm*.v tb_gemm.v && vvp sim
```
Expected console output: “Test PASSED: 65536/65536 correct.”
4.4 Directory map
```text
lego/
├── examples/   # GEMM, Conv2D, Attention, MTTKRP ready-to-run
├── docs/       # affine math, DAG algorithms (PDF)
├── scripts/
│   ├── synth.tcl  # Vivado batch flow
│   ├── dc.tcl     # Design Compiler flow
│   └── power.py   # switching activity → CACTI
├── spinal/     # SpinalHDL primitives
└── build/      # CMake output, added to .gitignore
```
5. Deep dive: how LEGO keeps area & power low
| Technique | Saving mechanism | Average gain |
|---|---|---|
| Shared address generator | Control signal propagated systolically | –60 % control-logic area |
| MST + FIFO-depth optimisation | Minimises pipeline registers | –15 % area |
| Balanced reduction tree | Cuts adder levels | –5 % power |
| Pin reuse across data-flows | Smaller muxes vs. full adders | –5 % area & power |
| Clock-gating unused edges | Lower toggle rate | –1.5 % total power (–9 % on Attention) |
Total back-end optimisation: –35 % area, –28 % power compared to naive delay-matching.
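The shared address generator is the biggest single win: instead of every PE keeping a private counter, one generator emits the address stream and each PE sees it one cycle later than its neighbour. A toy simulation of that propagation (a sketch of the concept, not LEGO's RTL):

```python
def systolic_control(addresses, num_pes):
    """One shared address stream, delayed by one cycle per PE hop.

    Returns schedule[cycle][pe] = address seen by that PE (None before
    the stream reaches it). One generator plus a pipeline register per
    hop replaces num_pes private address counters.
    """
    total_cycles = len(addresses) + num_pes - 1
    schedule = [[None] * num_pes for _ in range(total_cycles)]
    for t, addr in enumerate(addresses):
        for pe in range(num_pes):
            schedule[t + pe][pe] = addr
    return schedule

sched = systolic_control([0, 1, 2, 3], num_pes=3)
# PE0 sees address 0 at cycle 0; PE2 sees the same address at cycle 2.
```

Because the delayed copies are just registers on an existing wire, the per-PE control logic shrinks to almost nothing, which is where the quoted –60 % control-area figure comes from.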
6. Comparison with hand-written designs
Same data-flow, same tech node, silicon proven:
| Accelerator | Source | Data-flow | Area (mm²) | Power (mW) | LEGO gain |
|---|---|---|---|---|---|
| Eyeriss | hand RTL | KH-OH | 9.6 | 278 | –23 % area, –60 % power |
| NVDLA | hand RTL | IC-OC | 1.7 | 300* | –12 % area, –30 % power |
*NVDLA 16 nm projected to 28 nm.
LEGO achieves comparable PPA without months of manual effort.
7. Scalability: from 64 FUs to 16 384 FUs
| #FUs | Array size | L2 NoC | Gen. time (s) | Area (mm²) | Power (mW) | Energy eff. (GOP/s/W) |
|---|---|---|---|---|---|---|
| 64 | 8×8 | 1×1 | 13.1 | 0.02 | 29 | 4404 |
| 256 | 16×16 | 1×1 | 28.7 | 0.06 | 106 | 4816 |
| 1 024 | 32×32 | 1×1 | 111.2 | 0.24 | 422 | 4853 |
| 4 096 | 32×32 | 2×3 | 120.3 | 1.05 | 1748 | 4688 |
| 16 384 | 32×32 | 4×5 | 134.3 | 4.21 | 6987 | 4690 |
Generation time stays under 3 min; energy efficiency saturates because DRAM bandwidth becomes the bottleneck.
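The saturation follows from a simple roofline argument: attainable throughput is the lesser of peak compute and memory bandwidth times arithmetic intensity. A back-of-envelope check, assuming the 16 GB/s link from the test setup, a 1 GHz clock (the 1 ns default target), and an arithmetic intensity of 50 ops/byte (an assumed workload value):

```python
def attainable_gops(n_fus, freq_ghz=1.0, bw_gbs=16.0, intensity=50.0):
    """Roofline model: min(peak compute, bandwidth bound), in GOP/s.

    Assumptions: 1 GHz clock, 16 GB/s DRAM link, 50 ops/byte
    arithmetic intensity; one MAC counts as 2 ops.
    """
    peak = n_fus * 2 * freq_ghz       # compute roof
    bw_bound = bw_gbs * intensity     # memory roof
    return min(peak, bw_bound)

for n in (64, 256, 1024, 4096, 16384):
    print(n, attainable_gops(n))
# Beyond ~400 FUs the 16 GB/s link, not the array, sets throughput.
```

Under these assumptions the compute roof passes the memory roof at about 400 FUs, which is consistent with the flat GOP/s/W numbers from 1 024 FUs upward in the table.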
8. Multi-data-flow fusion example
MobileNetV2 layer sequence needs both IC-OC and OH-OW parallelism on the same silicon.
| Approach | Power (mW) | MBV2 perf. (GOP/s) | Eff. (GOP/s/W) |
|---|---|---|---|
| Single IC-OC | 123 | 213 | 1732 |
| Single OH-OW | 155 | 293 | 1890 |
| Naïve merge | 196 | 313 | 1597 |
| LEGO fused | 163 | 313 | 1920 |
LEGO's heuristic chain planning keeps the merged design's performance while recovering the energy efficiency of the lighter single-data-flow designs.
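The win can be seen with a toy model: on a fused array, each layer runs under whichever data-flow suits it, while a fixed single-data-flow design pays the penalty on mismatched layers. The per-layer numbers below are illustrative, not MobileNetV2 measurements:

```python
# Hypothetical per-layer throughput (GOP/s) under two data-flows.
layers = [
    {"work_gop": 1.0, "ic_oc": 250, "oh_ow": 150},  # 1x1 conv: channel-parallel wins
    {"work_gop": 0.5, "ic_oc": 100, "oh_ow": 300},  # depthwise conv: spatial wins
]

def total_latency(layers, pick):
    """Seconds to run all layers when `pick` chooses the data-flow per layer."""
    return sum(l["work_gop"] / l[pick(l)] for l in layers)

fixed_ic_oc = total_latency(layers, lambda l: "ic_oc")
fixed_oh_ow = total_latency(layers, lambda l: "oh_ow")
fused = total_latency(layers, lambda l: max(("ic_oc", "oh_ow"), key=l.get))
print(fixed_ic_oc, fixed_oh_ow, fused)  # fused is the fastest of the three
```

What the toy model cannot show is the energy side: LEGO's pin remapping shares wires and muxes between the two data-flows, so the fused array avoids most of the naïve merge's power overhead.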
9. Frequently asked questions
Q1. Is LEGO only for academics?
No. MIT license allows commercial use; the generated RTL is yours.
Q2. My lab uses FPGA. Any proven board?
Xilinx Ultrascale+ U280 is validated. A 64-FU design compiles in 20 min on Vivado 2023.1.
Q3. When will sparse kernels be supported?
The team lists “sparse data-flow” support for late 2025. Today you can still feed CSR/COO loops, but the reuse patterns will be treated as dense.
Q4. How does LEGO handle floating-point?
Present release targets INT8/INT16. FP32 MACs can be dropped in by swapping the FU primitive in spinal/mac_fpu.scala.
Q5. Can I import a DSE result from Timeloop/MAESTRO?
Yes. LEGO reads the .map file and builds the exact data-flow; then it emits RTL instead of a cost report.
Q6. What if synthesis fails timing?
Increase the target_clock parameter (default 1 ns). The LP phase will insert more pipeline registers automatically.
Q7. Where do I post bugs?
GitHub issues board; response time < 48 h based on repo history.
10. Take-aways
- LEGO is the first template-free framework that goes from tensor loops to synthesizable Verilog in minutes.
- On identical silicon it outruns Gemmini, today's most popular open DNN generator, by 3.2× while using 2.4× less energy.
- Area and power benefits come from control-signal sharing, MST-based wiring, and LP-guided pipelining, not from magic silicon.
- Installation needs only CMake + Python; example designs compile on both Ubuntu and macOS.
- Commercial use is MIT-licensed; FPGA and ASIC flows are silicon-proven at 28 nm.
If your roadmap includes a custom AI accelerator and you cannot afford a six-month RTL cycle, download LEGO, run the GEMM example, and decide later if you still want to hand-write those pipelines.
Citation
If you use LEGO in research or product documentation, please cite:
```bibtex
@inproceedings{lin2025lego,
  title={LEGO: Spatial Accelerator Generation and Optimization for Tensor Applications},
  author={Lin, Yujun and Zhang, Zhekai and Han, Song},
  booktitle={2025 IEEE International Symposium on High Performance Computer Architecture (HPCA)},
  pages={1335--1347},
  year={2025},
  organization={IEEE}
}
```
Related links (all from source files)
- LEGO project page
- arXiv paper
- GitHub repository (expected public mirror)
- Gemmini baseline