
LEGO: The Open-Source Framework That Turns AI Loops into Silicon—No RTL Templates Required

Keywords: LEGO accelerator, automatic RTL generation, spatial accelerator, tensor applications, AI chip design, Gemmini comparison, data-flow fusion, MIT Han Lab


TL;DR

LEGO is an open-source accelerator-generation toolchain released by MIT Han Lab in 2025.
Feed it a plain tensor loop (GEMM, Conv2D, Attention, MTTKRP) and it returns production-grade Verilog: no human-written templates, no HLS headaches.
On a 28 nm test chip, LEGO beats the state-of-the-art Gemmini generator by 3.2× in speed and 2.4× in energy efficiency while using the same MAC count and on-chip memory.


What you will learn in 12 minutes

  1. Why even Google still hand-tunes TPU blocks—and where that hurts
  2. How LEGO removes the “human RTL” bottleneck in three steps
  3. Side-by-side numbers: LEGO vs. Gemmini, Eyeriss, NVDLA
  4. Exact install commands (Ubuntu & macOS) and first-run example
  5. FAQ covering license, FPGA/ASIC flows, sparse ops, power tricks, road-maps

1. The hidden cost of “template” accelerators

| Pain point | What happens in practice | Business impact |
|---|---|---|
| Fixed topology | Systolic or 2-D array only | New operators (e.g. sparse attention) need a new chip |
| Control bloat | Every PE keeps its own address counter | Area ×2.0, power ×2.6 (MIT silicon numbers) |
| Manual RTL loop | Designer writes Verilog → debug → synthesis → place-and-route | 4-6 months; algorithm team moves on after 2 weeks |
| Data-flow mismatch | One kernel → one mapping | MobileNet + BERT on same die = two accelerators |

LEGO attacks all four: automatic RTL, any topology, shared control, one fused array for many kernels.


2. LEGO workflow in plain English

Step 1: Describe the loop (30 seconds)

# GEMM pseudo-code (any language); Y is assumed zero-initialized
for i in range(M):
  for j in range(N):
    for k in range(K):
      Y[i][j] += X[i][k] * W[k][j]

Save it as gemm.json through the helper script, or write MAESTRO/TENET format; LEGO ingests both.
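For readers who want to script the description themselves, here is a minimal sketch of what such a loop description might look like. The JSON field names below are assumptions for illustration only, not the schema actually emitted by examples/gemm/write_loop.py:

```python
import json

# Hypothetical loop-description schema -- the real gemm.json produced by
# examples/gemm/write_loop.py may use different field names.
gemm = {
    "name": "gemm",
    "loops": [                      # loop bounds, outermost first
        {"var": "i", "bound": 64},
        {"var": "j", "bound": 64},
        {"var": "k", "bound": 512},
    ],
    "statement": "Y[i][j] += X[i][k] * W[k][j]",
    "tensors": {
        "X": ["i", "k"],            # index variables per tensor
        "W": ["k", "j"],
        "Y": ["i", "j"],
    },
}

with open("gemm.json", "w") as f:
    json.dump(gemm, f, indent=2)
```

The point is only that the input is a declarative loop nest, a few dozen lines of JSON rather than thousands of lines of RTL.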

Step 2: Front-end (≈1 min on laptop)

  1. Affine analysis spots “who re-uses which element when”
  2. A solver returns legal Δs (spatial offsets) and Δt (time offsets)
  3. The Chu-Liu/Edmonds minimum spanning arborescence algorithm drops redundant edges
  4. Output: ADG (Architecture Description Graph), where nodes are FUs and edges are direct wires or FIFOs

No human draws schematics; the graph is mathematically optimal for reuse distance.
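The core of the affine analysis can be illustrated on the GEMM above. A loop dimension whose unit step leaves a tensor's subscripts unchanged is a reuse direction for that tensor; the front-end solves the general version of this to derive Δs and Δt for every operand. A toy sketch (my simplification, not LEGO's actual solver):

```python
# Toy affine reuse check for the GEMM loop: X is indexed by (i, k), so
# stepping loop j leaves X's address unchanged -- X is reused along j.

# Access matrices: rows = tensor subscripts, columns = loop vars (i, j, k).
ACCESS = {
    "X": [[1, 0, 0],   # subscript i
          [0, 0, 1]],  # subscript k
    "W": [[0, 0, 1],   # subscript k
          [0, 1, 0]],  # subscript j
    "Y": [[1, 0, 0],   # subscript i
          [0, 1, 0]],  # subscript j
}

def reuse_loops(matrix, loop_vars=("i", "j", "k")):
    """Return the loop variables whose unit step leaves the address fixed."""
    reused = []
    for col, var in enumerate(loop_vars):
        if all(row[col] == 0 for row in matrix):
            reused.append(var)
    return reused

for tensor, mat in ACCESS.items():
    print(tensor, "reused along", reuse_loops(mat))
# X is reused along j, W along i, Y (the accumulator) along k
```

Each reuse direction becomes a candidate wire or FIFO in the ADG; the arborescence pass then keeps only the cheapest set of edges.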

Step 3: Back-end (≈40 s + LP time)

  1. Lower ADG to DAG (Detailed Architecture Graph): registers, muxes, SRAM banks, address generators
  2. Linear program inserts the minimum pipeline registers while meeting clock
  3. Three bonus passes:
    • Broadcast → forwarding chain (save 15 % area)
    • Long adder chain → balanced tree (save 5 % power)
    • Pin-remapping across data-flows (save another 5 %)
  4. SpinalHDL emits Verilog ready for DC/Vivado

Deliverable: gemm.v, gemm_sram.v, gemm_top.v + constraints file.
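The register-insertion step can be pictured with a toy stand-in. The sketch below replaces LEGO's linear program with a greedy longest-path pass (a simplification for illustration, not the actual back-end): it assigns each functional unit a pipeline stage under a combinational-delay budget and registers every edge that crosses a stage boundary.

```python
# Simplified stand-in for LP-based pipelining: walk the DAG in topological
# order, start a new pipeline stage whenever accumulated combinational delay
# would exceed the per-cycle budget, then register every stage-crossing edge.
# The real back-end solves a linear program that minimises total registers.

def pipeline(nodes, edges, delay, budget):
    """nodes in topological order; edges = {(u, v)}; delay[n] = gate delay (ns)."""
    preds = {n: [u for (u, v) in edges if v == n] for n in nodes}
    stage, used = {}, {}           # used = combinational delay consumed so far
    for n in nodes:
        if not preds[n]:
            stage[n], used[n] = 0, delay[n]
            continue
        s = max(stage[p] for p in preds[n])
        d = max(used[p] for p in preds[n] if stage[p] == s) + delay[n]
        if d > budget:             # would violate the clock: cut here
            s, d = s + 1, delay[n]
        stage[n], used[n] = s, d
    regs = [(u, v) for (u, v) in edges if stage[u] != stage[v]]
    return stage, regs

# A 3-adder chain, 1 ns per add, 2 ns clock budget:
nodes = ["a0", "a1", "a2"]
edges = {("a0", "a1"), ("a1", "a2")}
stage, regs = pipeline(nodes, edges, {n: 1.0 for n in nodes}, budget=2.0)
print(stage, regs)   # a2 lands in a new stage; one register on (a1, a2)
```

Tightening the budget (the target_clock parameter mentioned in the FAQ) forces more cuts, i.e. more pipeline registers, which is exactly the knob the LP phase turns automatically.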


3. Benchmark results (same silicon budget)

Test setup: TSMC 28 nm, 256 MACs, 256 kB SRAM, 128-bit LPDDR4 16 GB/s.

| Workload | Gemmini latency (ms) | LEGO latency (ms) | Speed-up | Energy vs. Gemmini |
|---|---|---|---|---|
| ResNet50 | 8.7 | 2.7 | 3.2× | −59 % |
| MobileNetV2 | 12.1 | 3.8 | 3.2× | −58 % |
| BERT-Base (16 tokens) | 6.5 | 2.1 | 3.1× | −57 % |
| GPT-2 1k-token generate | 95 | 92 | 1.03× | −9 % (memory-bound) |

Area: LEGO uses 1.76 mm², 2 % of which is post-processing units (softmax, layernorm).

(Figure: LEGO vs. Gemmini benchmark comparison)

4. Installation guide (zero external knowledge added)

4.1 Prerequisites

  • OS: Ubuntu ≥20.04 or macOS ≥12
  • Python ≥3.8
  • C++ compiler with C++17 support (g++-9, clang-12, Apple Clang 14)
  • CMake ≥3.20
  • Java 11 (HiGHS solver)
  • (Optional) Synopsys Design Compiler or Xilinx Vivado for synthesis
  • 8 GB RAM free, 2 GB disk

4.2 One-line dependency install

git clone https://github.com/mit-han-lab/lego.git
cd lego
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
./scripts/setup.sh      # downloads HiGHS, SpinalHDL, builds C++ binaries

4.3 Run your first GEMM

# create loop description
python examples/gemm/write_loop.py --M=64 --N=64 --K=512 -o gemm.json

# generate ADG
./build/bin/lego-front --input gemm.json --output gemm.adg

# generate RTL
./build/bin/lego-back --adg gemm.adg --verilog gemm_rtl

# simulate (iverilog example)
cd gemm_rtl
iverilog -o sim gemm*.v tb_gemm.v && vvp sim

Expected console output: “Test PASSED: 65536/65536 correct.”

4.4 Directory map

lego/
├── examples/          # GEMM, Conv2D, Attention, MTTKRP ready-to-run
├── docs/              # affine math, DAG algorithms (PDF)
├── scripts/
│   ├── synth.tcl      # Vivado batch flow
│   ├── dc.tcl         # Design Compiler flow
│   └── power.py       # switching activity → CACTI
├── spinal/            # SpinalHDL primitives
└── build/             # CMake output, added to .gitignore

5. Deep dive: how LEGO keeps area & power low

| Technique | Saving mechanism | Average gain |
|---|---|---|
| Shared address generator | Control signal propagated systolically | −60 % control-logic area |
| MST + FIFO depth optimisation | Minimises pipeline registers | −15 % area |
| Balanced reduction tree | Cuts adder levels | −5 % power |
| Pin reuse across data-flows | Smaller muxes vs. full adders | −5 % area & power |
| Clock-gating unused edges | Lower toggle rate | −1.5 % total power (−9 % on Attention) |

Total back-end optimisation: –35 % area, –28 % power compared to naive delay-matching.
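The balanced-tree saving is easy to see with a depth count. A chain of N−1 adders has combinational depth N−1, so early partial sums glitch through every later adder each cycle; a balanced tree has depth ⌈log₂ N⌉, shortening the paths that toggle:

```python
# Adder-level depth of a reduction over n inputs: linear chain vs. balanced
# tree. Shorter combinational paths mean fewer glitch transitions, which is
# where the back-end's ~5 % power saving comes from.
import math

def chain_depth(n):
    return n - 1                    # each input adds one adder level

def tree_depth(n):
    return math.ceil(math.log2(n))  # pairwise reduction halves each level

for n in (8, 64, 256):
    print(f"{n} inputs: chain depth {chain_depth(n)}, tree depth {tree_depth(n)}")
# 256 inputs: 255 adder levels as a chain, only 8 as a balanced tree
```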


6. Comparison with hand-written designs

Same data-flow, same tech node, silicon proven:

| Accelerator | Source | Data-flow | Area (mm²) | Power (mW) | LEGO gain |
|---|---|---|---|---|---|
| Eyeriss | hand RTL | KH-OH | 9.6 | 278 | −23 % area, −60 % power |
| NVDLA | hand RTL | IC-OC | 1.7 | 300* | −12 % area, −30 % power |

*NVDLA 16 nm projected to 28 nm.

LEGO achieves comparable PPA without months of manual effort.


7. Scalability: from 64 FUs to 16 384 FUs

| #FUs | Array size | L2 NoC | Gen. time (s) | Area (mm²) | Power (mW) | Energy eff. (GOP/s/W) |
|---|---|---|---|---|---|---|
| 64 | 8×8 | 1×1 | 13.1 | 0.02 | 29 | 4404 |
| 256 | 16×16 | 1×1 | 28.7 | 0.06 | 106 | 4816 |
| 1 024 | 32×32 | 1×1 | 111.2 | 0.24 | 422 | 4853 |
| 4 096 | 32×32 | 2×3 | 120.3 | 1.05 | 1748 | 4688 |
| 16 384 | 32×32 | 4×5 | 134.3 | 4.21 | 6987 | 4690 |

Generation time stays under 3 min; energy efficiency saturates because DRAM bandwidth becomes the bottleneck.
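The saturation point falls out of a back-of-envelope roofline. Assuming a 1 GHz clock and 2 ops per MAC (both assumptions for illustration; only the 16 GB/s LPDDR4 figure comes from the test setup above), peak compute grows linearly with FU count, but attainable throughput is capped by DRAM bandwidth times the kernel's arithmetic intensity:

```python
# Roofline sketch: attainable throughput = min(compute peak, bandwidth bound).
# Assumed numbers: 1 GHz clock, 2 ops per MAC; 16 GB/s DRAM as in the setup.
def attainable_gops(n_fus, intensity_ops_per_byte,
                    clock_ghz=1.0, bw_gbs=16.0):
    peak = n_fus * 2 * clock_ghz                  # GOP/s of the MAC array
    memory_bound = bw_gbs * intensity_ops_per_byte
    return min(peak, memory_bound)

# e.g. a kernel performing ~500 ops per DRAM byte fetched:
for fus in (256, 1024, 4096, 16384):
    print(fus, "FUs ->", attainable_gops(fus, 500), "GOP/s")
# growth is linear until the 8 000 GOP/s bandwidth ceiling, then flat
```

Under these assumptions the 4 096- and 16 384-FU arrays hit the same bandwidth ceiling, which is why energy efficiency plateaus in the table above.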


8. Multi-data-flow fusion example

MobileNetV2 layer sequence needs both IC-OC and OH-OW parallelism on the same silicon.

| Approach | Power (mW) | MBV2 perf. (GOP/s) | Eff. (GOP/s/W) |
|---|---|---|---|
| Single IC-OC | 123 | 213 | 1732 |
| Single OH-OW | 155 | 293 | 1890 |
| Naïve merge | 196 | 313 | 1597 |
| LEGO fused | 163 | 313 | 1920 |

LEGO's heuristic chain planning keeps the merged design's performance while recovering most of the energy efficiency of the lighter single-data-flow designs.


9. Frequently asked questions

Q1. Is LEGO only for academics?
No. MIT license allows commercial use; the generated RTL is yours.

Q2. My lab uses FPGA. Any proven board?
Xilinx Ultrascale+ U280 is validated. A 64-FU design compiles in 20 min on Vivado 2023.1.

Q3. When will sparse kernels be supported?
The team lists “sparse data-flow” support for late 2025. Today you can still feed CSR/COO loops, but the reuse patterns will be treated as dense.

Q4. How does LEGO handle floating-point?
The present release targets INT8/INT16. FP32 MACs can be dropped in by swapping the FU primitive in spinal/mac_fpu.scala.

Q5. Can I import a DSE result from Timeloop/MAESTRO?
Yes. LEGO reads the .map file and builds the exact data-flow; then it emits RTL instead of a cost report.

Q6. What if synthesis fails timing?
Increase the target_clock parameter (default 1 ns). The LP phase will insert more pipeline registers automatically.

Q7. Where do I post bugs?
GitHub issues board; response time < 48 h based on repo history.


10. Take-aways

  1. LEGO is the first template-free framework that goes from tensor loops to synthesizable Verilog in minutes.
  2. On identical silicon it outruns Gemmini, today's most popular open DNN accelerator generator, by 3.2× while using 2.4× less energy.
  3. Area and power benefits come from control-signal sharing, MST-based wiring, and LP-guided pipelining, not from magic silicon.
  4. Installation needs only CMake + Python; example designs compile on both Ubuntu and macOS.
  5. Commercial use is MIT licensed; FPGA or ASIC flows are production proven at 28 nm.

If your roadmap includes a custom AI accelerator and you cannot afford a six-month RTL cycle, download LEGO, run the GEMM example, and decide later if you still want to hand-write those pipelines.


Citation

If you use LEGO in research or product documentation, please cite:

@inproceedings{lin2025lego,
  title={LEGO: Spatial Accelerator Generation and Optimization for Tensor Applications},
  author={Lin, Yujun and Zhang, Zhekai and Han, Song},
  booktitle={2025 IEEE International Symposium on High Performance Computer Architecture (HPCA)},
  pages={1335--1347},
  year={2025},
  organization={IEEE}
}

