Exploring Powerful Ways to Generate: Autoregression, Diffusion, and Beyond
Have you ever wondered how AI models like those behind chatbots or code generators create new content? It’s not magic—it’s all about the generation process, the step-by-step method the model uses to build sequences like sentences, puzzles, or even graphs. Traditional approaches, like predicting the next word one at a time, work well for everyday language but can stumble on tougher tasks, such as solving complex puzzles or designing molecular structures. A recent paper dives deep into this, comparing classic autoregressive models with newer masked diffusion techniques and proposing an enhanced version called Any-Process Masked Diffusion Models (AP-MDM). As someone who’s spent years unpacking these ideas for engineers and students, I’ll walk you through it simply—think of it as a guided tour, no PhD required. We’ll cover the basics, the breakthroughs, and even how to try it yourself with code snippets from the paper’s repository.
Why Generation Processes Matter More Than You Think
Picture this: nature’s building blocks—from spoken words to DNA strands—don’t always follow a straight line. Language flows left to right, sure, but solving a puzzle or editing code often means jumping around, fixing mistakes, or adding bits midstream. That’s the core idea here. Auto-Regressive Models (ARM), the backbone of tools like GPT, stick to next-token prediction: given what came before, guess what’s next. This mirrors how we talk, building causally on prior context. Scaled up with massive data, ARMs handle general tasks and reasoning impressively.
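To ground the idea, here is a minimal sketch of next-token decoding. The `model` callable is a hypothetical stand-in for any causal LM that maps a token prefix to next-token probabilities; nothing here is specific to the paper.

```python
# Minimal sketch of autoregressive (next-token) decoding.
# `model` is a hypothetical stand-in: given a prefix, it returns a list
# of probabilities over the vocabulary for the next token.

def generate_autoregressive(model, prompt_tokens, max_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = model(tokens)  # P(next token | everything so far)
        next_token = max(range(len(probs)), key=probs.__getitem__)  # greedy pick
        tokens.append(next_token)  # append-only: the past is never revised
    return tokens
```

The append-only loop is exactly the limitation discussed next: there is no way to go back and fix an earlier token.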
But real-world challenges get messy. Humans don’t just append; we revise, backtrack, and restructure. Code needs balanced brackets and types from the start—slip up, and the whole thing crumbles. In biology, proteins or genes are graphs or trees with physical rules, best tweaked by swapping parts or inserting motifs, not linear stacking. Current ARMs struggle here, as do many large language models venturing beyond text.
The paper asks: how do we measure and improve these generation methods? It abstracts away hardware details, focusing on computational limits and learning ease. Enter Masked Diffusion Models (MDM): instead of sequential guesses, start with a fully masked sequence (all unknowns) and iteratively reveal tokens in any order, often in parallel. This borrows diffusion ideas from image generation, adapted for discrete text, enabling faster decoding (up to 10 times quicker) and better handling of order-sensitive puzzles like reversed poems or Sudoku.
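For contrast, here is a rough sketch of one MDM decoding loop, again with a hypothetical interface: `model` returns a (confidence, token) guess for every position. The point is that several positions can be revealed per step, in any order.

```python
# Sketch of masked-diffusion decoding: start fully masked, then reveal
# the most confident positions a few at a time, in parallel.
MASK = -1  # sentinel for an undecoded position

def generate_mdm(model, length, steps):
    tokens = [MASK] * length  # start from all-mask "noise"
    per_step = max(1, length // steps)
    for _ in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        guesses = model(tokens)  # hypothetical: (confidence, token) per position
        masked.sort(key=lambda i: guesses[i][0], reverse=True)
        for i in masked[:per_step]:  # several positions unmask in one step
            tokens[i] = guesses[i][1]  # once decoded, a token stays frozen
    return tokens
```

Note the last comment: standard MDM never rewrites a decoded token, which is precisely the wall the paper targets.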
Yet, as we’ll see, MDMs shine in parallelism but hit walls on rewrites and space. The paper’s big idea? Any-Process Generation, extending MDMs with remask (rewrite), insert (add space), and delete (trim waste). This lets models edit like a human editor, scaling to harder problems with less resources.
// Figure 1
This figure sketches the shift: ARM builds left to right, standard MDM unmasks anywhere, and AP-MDM adds edits for true flexibility.
Breaking Down the Players: ARM, MDM, and the New Kid
Let’s start with the familiar: ARM. Rooted in Claude Shannon’s 1951 work on prediction, it trains on huge datasets to forecast the next token. Models like the GPT series excel at tasks from writing to reasoning, thanks to causal attention in Transformers. But for non-linear domains? Not so much. Proving ARM Turing-complete (able to compute anything computable) requires intermediate reasoning steps, and even then it runs serially: time linear in the number of generated tokens.
MDM flips this. From a noise of masks, it denoises in parallel steps. Recent giants like Gemini Diffusion match ARMs on language while speeding up. Why? Parallelism: simulate parallel machines (PRAM) in optimal time, solving graph connectivity in log n steps versus ARM’s linear slog. But flexibility has limits—MDM can’t rewrite decoded tokens, bloating space to total work (processors times time), capping at P-class problems (polynomial time) with polynomial context.
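To get a feel for why parallel rounds help on connectivity, here is a simplified label-propagation toy: every node updates its component label simultaneously each round. This is my illustration of PRAM-style parallelism, not the paper's O(log n) construction (this version converges in O(diameter) rounds), but it shows a whole layer of work happening per step.

```python
# Toy PRAM-style connectivity: all nodes update labels "at once" each
# round, so sequential depth is measured in rounds, not in nodes.

def connected_components(n, edges, max_rounds=64):
    labels = list(range(n))  # every node starts as its own component
    for _ in range(max_rounds):
        new_labels = labels[:]  # all updates read the old labels: one parallel round
        for u, v in edges:
            new_labels[u] = min(new_labels[u], labels[v])
            new_labels[v] = min(new_labels[v], labels[u])
        if new_labels == labels:  # no change anywhere: converged
            break
        labels = new_labels
    return labels

print(connected_components(5, [(0, 1), (1, 2), (3, 4)]))  # [0, 0, 0, 3, 3]
```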
Any-order generation (MDM’s hallmark) sounds freer than left-to-right, but once you control for parallelism and architecture, it can be reorganized into ARM order. Theorem 3 proves it: any one-token-per-step AO-MDM computation can be simulated by a Masked-ARM (a padded ARM) with O(S(n)) length and constantly many extra layers. No new power, just efficiency.
Enter AP-MDM: same backbone, plus three edit ops selected by 3-bit controls per position (a toy sketch follows the list):
- Remask: overwrite a decoded token with a mask, enabling backtracking and self-correction.
- Insert: add a mask anywhere, growing the length (even exponentially).
- Delete: remove masks, freeing space.
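Here is a toy sketch of those three edits on a plain Python list. It is my illustration of the mechanics only; in the actual model, a learned per-position control head chooses the operation.

```python
# Toy illustration of AP-MDM's three edit operations on a token list.
# In the real model a 3-bit control per position selects the op; here we
# apply the edits by hand to show what each one does to the sequence.
MASK = "?"

def remask(tokens, i):
    tokens[i] = MASK  # reopen a decoded position for self-correction
    return tokens

def insert(tokens, i):
    tokens.insert(i, MASK)  # grow the sequence with a fresh mask
    return tokens

def delete(tokens, i):
    assert tokens[i] == MASK  # only masked positions get removed
    del tokens[i]  # shrink the sequence, freeing space
    return tokens

seq = ["(", ")", ")"]
seq = remask(seq, 2)   # ['(', ')', '?']       reopen a suspected mistake
seq = insert(seq, 0)   # ['?', '(', ')', '?']  make room for new content
seq = delete(seq, 3)   # ['?', '(', ')']       drop a mask we no longer need
```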
There are no physics-style constraints tying the process down: step counts, sequence lengths, and stopping points are all flexible. Theoretically, AP-MDM simulates PRAM with optimal time and space (O(S(n))), reaching PSPACE (polynomial space) versus P. Empirically? A game-changer.
| Aspect | ARM | Standard MDM | AP-MDM |
|---|---|---|---|
| Parallel Time | Serial (linear in length) | O(T(n)) for PRAM | O(T(n)) for PRAM |
| Space | O(P(n)·T(n)) full history kept | O(P(n)·T(n)) total work | O(S(n)) true space |
| Hard Tasks | P-class limit | P-class limit | PSPACE via scaling |
| Edits | None | Unmask only | Remask/Insert/Delete |
This table highlights why AP-MDM edges out: it rewrites without waste.
Theoretical Wins: From Parallel Speed to Space Smarts
Theory first: MDMs parallelize well (Theorem 1), matching PRAM time but at a space cost proportional to total work. For NC problems (parallelizable ones, like connectivity), log n steps beat ARM’s n. But on genuinely hard tasks? Both are capped: any such run can be simulated serially with at most cubic overhead (Theorem 2), so without superpolynomial context there are no NP-hard wins.
Any-order? Limited (Theorem 3)—reorder to left-to-right, simulate with Masked-ARM. AP-MDM breaks through (Theorem 4): optimal time/space, PSPACE power. Constant-depth AP-MDM evades constant-depth ARM simulation (Theorem 6, under TC0 ≠ NC1).
For code’s Dyck languages (matched brackets), ARM fails at arbitrary lengths: recognition is NC1-hard, beyond constant-depth TC0 models. AP-MDM? It inserts pairs and remasks mismatches, exactly generating two-sided Dyck-k (Theorem 5).
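To see why insertion makes balanced brackets easy, here is a hedged sketch of the pattern behind that result: always insert an opening bracket together with its match, so the string is balanced by construction at every step. Any Dyck word can be built this way, since every balanced string contains an adjacent matched pair.

```python
# Sketch of Dyck-k generation by matched-pair insertion: each edit adds
# a bracket and its partner together, so balance is invariant.
import random

PAIRS = [("(", ")"), ("[", "]")]  # Dyck-2 for illustration

def generate_dyck(num_pairs, seed=0):
    rng = random.Random(seed)
    s = []
    for _ in range(num_pairs):
        left, right = rng.choice(PAIRS)
        i = rng.randrange(len(s) + 1)  # any position, even inside existing pairs
        s[i:i] = [left, right]  # insert the matched pair as a unit
    return "".join(s)

print(generate_dyck(5))  # always prints a balanced Dyck-2 string
```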
These aren’t abstractions; they guide real designs for coding, math, science.
// Note: Placeholder for Figure 2 from paper
Figure 2 shows AP-MDM on Sudoku (backtrack via remask), Dyck (insert pairs), parity (delete matches), DNA splicing (insert/delete), graphs (parallel edits).
Hands-On Results: Sudoku, Graphs, Parity, and More
Experiments ground the theory on controlled tasks.
Sudoku (NP-complete): the test is generalizing to n²×n² grids. AP-MDM (1.2M params, 100 samples) hits 99.28% accuracy; ARM (42M params, 1.8M samples) reaches 87.18%; AO-MDM 89.49%. Remask enables the scaling: learn easy fills first, then backtrack out of dead ends. Losses converge fast for unmasking, more slowly for ordering (Figure 3).
// Figure 3
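Here is the same backtracking idea written as ordinary code, a sketch of the strategy rather than the trained model: commit forced cells first, guess when stuck, and on a contradiction reopen (remask) the cell instead of keeping the bad value.

```python
# Plain backtracking written to mirror the remask view of Sudoku:
# fill the most constrained cell, and un-fill (remask) it on failure.

def candidates(grid, r, c, n=9, b=3):
    used = set(grid[r]) | {grid[i][c] for i in range(n)}
    br, bc = b * (r // b), b * (c // b)
    used |= {grid[i][j] for i in range(br, br + b) for j in range(bc, bc + b)}
    return [v for v in range(1, n + 1) if v not in used]

def solve(grid, n=9):
    empties = [(r, c) for r in range(n) for c in range(n) if grid[r][c] == 0]
    if not empties:
        return True
    r, c = min(empties, key=lambda rc: len(candidates(grid, *rc)))  # easy fills first
    for v in candidates(grid, r, c):
        grid[r][c] = v  # unmask: commit a value
        if solve(grid):
            return True
        grid[r][c] = 0  # remask: reopen the cell and backtrack
    return False
```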
Dyck Languages: ARM can’t guarantee arbitrary matched brackets (Theorem 5). AP-MDM inserts/remasks for exact support.
Graph Editing: min-cut via Edmonds-Karp. AP-MDM edits structures directly (insert for expansion, delete for cuts, remask for reversals) and stays near 100% as graphs scale. ARM degrades as size grows (Table 1). The output sequences differ too: AP-MDM represents the computation as in-place state transitions, while ARM must spell out explicit operations.
| Nodes | Edges | AP-MDM Avg Len | AP-MDM Max Len | ARM Avg Len | ARM Max Len |
|---|---|---|---|---|---|
| 4 | 12 | 56 | 65 | 932 | 932 |
| 5 | 17 | 81 | 88 | 1,375 | 1,403 |
| … | … | … | … | … | … |
| 10 | 50 | 270 | 305 | 4,915 | 5,392 |
ARM bloats with explicit steps.
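For reference, here is a compact Edmonds-Karp (shortest augmenting paths via BFS) for max-flow/min-cut, the classical algorithm this task is built on. The table above reflects my reading of the comparison: AP-MDM keeps a short, edited trace of this computation, while ARM writes out every step.

```python
# Edmonds-Karp max-flow with min-cut extraction (adjacency-matrix form).
from collections import deque

def edmonds_karp(capacity, source, sink):
    n = len(capacity)
    flow = [[0] * n for _ in range(n)]
    while True:
        parent = [-1] * n  # BFS for a shortest augmenting path
        parent[source] = source
        q = deque([source])
        while q and parent[sink] == -1:
            u = q.popleft()
            for v in range(n):
                if parent[v] == -1 and capacity[u][v] - flow[u][v] > 0:
                    parent[v] = u
                    q.append(v)
        if parent[sink] == -1:
            break  # no augmenting path left: flow is maximal
        bottleneck, v = float("inf"), sink
        while v != source:  # find the path's bottleneck capacity
            bottleneck = min(bottleneck, capacity[parent[v]][v] - flow[parent[v]][v])
            v = parent[v]
        v = sink
        while v != source:  # push flow along the path
            flow[parent[v]][v] += bottleneck
            flow[v][parent[v]] -= bottleneck
            v = parent[v]
    reachable, q = {source}, deque([source])  # residual-reachable side of the cut
    while q:
        u = q.popleft()
        for v in range(n):
            if v not in reachable and capacity[u][v] - flow[u][v] > 0:
                reachable.add(v)
                q.append(v)
    cut = [(u, v) for u in reachable for v in range(n)
           if v not in reachable and capacity[u][v] > 0]
    return sum(flow[source][v] for v in range(n)), cut
```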
Parity: count 1s mod 2. AP-MDM trains only on length-2 inputs (4 samples, 200 params) and hits 100% out-of-distribution accuracy on arbitrary lengths via delete-based elimination. ARM (10K samples) fails to generalize.
// Figure 4b
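My toy reading of the delete trick: 0s never affect parity, and 1s cancel in pairs, so any input can be shrunk down to base cases that a length-2 training set already covers. A sketch of the idea, not of the model:

```python
# Parity by delete-elimination: remove what doesn't matter (0s), cancel
# 1s two at a time, and read the answer off what's left.

def parity_by_deletion(bits):
    bits = [b for b in bits if b == 1]  # delete the 0s
    while len(bits) >= 2:
        del bits[:2]  # delete a matched pair of 1s
    return len(bits)  # 0 or 1: the parity of the input

print(parity_by_deletion([1, 0, 1, 1, 0]))  # 1 (three 1s, so odd)
```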
These show AP-MDM’s edge: sample-efficient, structure-aware, generalizable.
Getting Started: Code and Datasets
The paper’s code (GitHub: chr26195/AP-MDM) adapts MDLM pipelines and needs Python 3.12+ and PyTorch. Install dependencies with pip install -r requirements.txt; for speed, add pip install flash-attn --no-build-isolation.
Datasets are synthetic for control:
Sudoku:
cd dataset/sudoku
python sudoku_generator.py sudoku-100.npy
This yields sudoku-100.pkl.gz (samples) and vocab_cache.pkl (vocabulary). To visualize, run python serve.py and open http://localhost:8001/apmdm.
Parity:
cd dataset/parity
python parity_generator.py
This produces parity_train.pkl.gz (4 samples) and parity_vocab_cache.pkl (a 6-token vocabulary).
Graphs (Max-Flow):
cd dataset/max_flow
python maxflow_solver.py --num_instances 10000 --min_nodes 10 --max_nodes 10 --min_edges 50 --max_edges 50 --output graph.pkl.gz
Tune params for flow guarantees.
Training and evaluation scripts live in train/scripts. An example Sudoku run: python train_apmdm.py --dataset sudoku-100.pkl.gz --epochs 10.
| Task | Samples | Params | Acc. | Why It Works |
|---|---|---|---|---|
| Sudoku | 100 | 1.2M | 99.28% | Remask backtracks |
| Graphs | 100K | 2M | ~100% | Insert/delete structures |
| Parity | 4 | 200 | 100% OOD | Delete eliminates |
Common Questions: Your Guide to Digging Deeper
What if you’re tinkering? Here are quick answers.
How does AP-MDM beat ARM on space?
ARM/MDM hoard all steps; AP-MDM deletes junk, remasks errors—true space O(S(n)), not work.
Can I adapt this for code gen?
Yes—insert for functions, remask for fixes. Dyck proof shows bracket balance.
Training tips for beginners?
Start Sudoku: small data, fast convergence. Use self-supervised losses (Appendix C).
Why parallel in graphs?
BFS layers unmask in parallel; delete cuts post-flow.
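A quick sketch of that layer intuition (my illustration, not the repo's code): all nodes in one BFS frontier can be decided in the same step, so sequential depth tracks the number of layers rather than the number of nodes.

```python
# BFS by whole layers: each frontier is one parallel "unmask" step.

def bfs_layers(adj, start):
    seen, frontier, layers = {start}, [start], []
    while frontier:
        layers.append(frontier)  # the entire layer is processed in one step
        nxt = []
        for u in frontier:
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    nxt.append(v)
        frontier = nxt
    return layers

print(bfs_layers({0: [1, 2], 1: [3], 2: [3], 3: []}, 0))  # [[0], [1, 2], [3]]
```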
Limits?
PSPACE with poly context, but real hardware caps S(n). Future: edit histories in data.
This framework rethinks generation as editable algorithms, not rigid prediction. For coders, biologists, puzzle fans—it’s a toolkit for structured creation. Grab the repo, run a Sudoku solver, see the magic. What’s your first test?
(BibTeX for citation:)
@article{yang2025powerful,
title={On Powerful Ways to Generate: Autoregression, Diffusion, and Beyond},
author={Yang, Chenxiao and Zhou, Cai and Wipf, David and Li, Zhiyuan},
journal={arXiv preprint arXiv:2510.06190},
year={2025}
}

