# CLaRa: Teaching a Language Model to Compress, Retrieve, and Answer in One Breath
How to shrink Wikipedia 128× and still beat full-text baselines—without ever labeling “relevant” documents.
## TL;DR
CLaRa (Continuous Latent Reasoning) unifies retrieval and generation inside a single LLM by:
-
Offline-compressing every document into 32–256 “memory tokens”; -
Learning to retrieve with a differentiable top-k operator; -
Training everything end-to-end with nothing more than next-token prediction loss.
On four open QA data sets the framework matches or outperforms full-text RAG while using 1–2 % of the usual context length.
## Table of Contents
-
The Two Walls Hitting Every RAG Pipeline -
Shared Continuous Space—Our Single-Model Bet -
Stage I: SCP Pre-training—Making Compact Vectors Worth Reading -
Stage II: CLaRa Joint Training—Gradient-Friendly Top-k Selection -
Experiments: Shorter Context, Higher Scores -
A Walk-Through: Multi-Hop Question in One Hop -
Author’s Reflections: Three Regrets and One Surprise -
Action Checklist / Implementation Steps -
One-Page Overview -
FAQ
## 1. The Two Walls Hitting Every RAG Pipeline
Core question: Why do classic “retrieve-then-read” systems plateau even with great retrievers?
-
Optimization wall: The retriever is optimized for cosine similarity, the generator for log-likelihood. No gradient flows between them because top-k is discrete. -
Efficiency wall: The retriever embeds once, but the generator re-ingests raw text every time, blowing up latency and GPU memory.
CLaRa removes both walls by collapsing retrieval and generation into one continuous representation space and one loss function.
## 2. Shared Continuous Space—Our Single-Model Bet
Core question: What single architectural choice lets gradients travel from answer tokens back to the query encoder?
We abandon the text-on-the-wire tradition:
| Component | Input | Output | Role |
|---|---|---|---|
| SCP Compressor | Document text | Memory tokens M | Fixed-size semantic summary |
| Query Reasoner | Question text | Query vector q | Lives in the same embedding space |
| Generator | [q; M_top-k] | Answer tokens | Shares weights with the two modules above |
Because every stage operates on the same transformer backbone, the next-token loss supervises both “what to pick” (retrieval) and “what to write” (generation).
## 3. Stage I: SCP Pre-training—Making Compact Vectors Worth Reading
Core question: How can a few floating-point numbers replace an entire Wikipedia page without catastrophic loss?
### 3.1 Salient-data Factory (fully automatic)
For each Wikipedia article we prompt a local 32-B model to emit:
-
Simple QA – single atomic facts (to keep entities & numbers) -
Complex QA – multi-hop questions (to keep relational arcs) -
Paraphrase – sentence-order-shuffled rewrite (to drop surface cues)
An iterative self-check loop fills coverage gaps for up to ten rounds; failing samples are dropped.
### 3.2 Compression + Reconstruction
We append l learnable “memory tokens” to the document, activate compressor LoRA, and take the final hidden states of those l tokens as the compressed vector M.
A separate generator LoRA is then trained to answer the synthetic QAs or reproduce the paraphrase using only M and a task prefix.
Loss:
ℒ_total = ℒ_CE(answers) + λ·ℒ_MSE(mean_doc_hidden, mean_memory_hidden)
With λ = 0.1 the two representations overlap in t-SNE; without it they drift apart and downstream QA drops ≈ 3–5 F1.
### 3.3 Domain-transfer tip (from our pain)
SCP is domain-agnostic in form but not in statistics—we pre-trained only on Wikipedia. When we plugged in medical journals, recall@5 fell 12 %. The fix: continue SCP for 2 k steps on target-domain paragraphs; scores rebound without rewriting the rest of the stack.
## 4. Stage II: CLaRa Joint Training—Gradient-Friendly Top-k Selection
Core question: Discrete top-k breaks back-prop—how did we sneak gradients through?
### 4.1 Straight-Through Gumbel Estimator
Forward: hard top-k indices (exactly k vectors)
Backward: soft-maximum relaxation (temperature τ)
Pythonic sketch:
scores = cosine(q, M) # [B, D]
soft = softmax(scores / tau) # backward path
hard = one_hot(top_k_indices) # forward path
Z = hard + (soft - soft.detach()) # ST estimator
M_topk = einsum("bkd,bdv->bkv", Z, M) # weighted vectors
Because M is frozen after SCP, we only update q and the generator—stabilizing training and allowing million-scale vector indexes without re-embedding.
### 4.2 The Only Loss in Town
ℒ_CLaRa = − Σ_t log p(a*t | Q, M_top-k, a*{<t})
No relevance labels, no reinforcement tricks—just language modeling. Yet the gradient couples retrieval quality with answer likelihood:
-
If the correct document is missing, the generator loss increases, pushing q closer to that document’s vector. -
If a retrieved document hurts correctness, the generator signal moves q away.
## 5. Experiments: Shorter Context, Higher Scores
Core question: Does the fancy optimizer actually translate into better answers?
### 5.1 Compression Quality (Oracle = gold paragraph always in pool)
| Method | CR | NQ | Hotpot | Musique | 2Wiki | Δ avg |
|---|---|---|---|---|---|---|
| PISCO | 16× | 73.44 | 66.53 | 33.80 | 60.45 | baseline |
| SCP-Mistral | 16× | 75.48 | 70.79 | 43.15 | 66.16 | +5.35 |
| SCP-Mistral | 128× | 69.96 | 62.09 | 30.86 | 59.08 | still +1.13 vs PISCO |
Even at 128× we stay ahead of the best comparable soft-compression model; beyond that, scores erode but latency savings plateau (vector I/O becomes negligible).
### 5.2 End-to-End QA (Normal = top-20 live retrieval from Wikipedia-2021)
| Model | Context | NQ F1 | 2Wiki F1 |
|---|---|---|---|
| DRO-Mistral-7B | 1× | 51.01 | 43.65 |
| CLaRa-Mistral-7B | 16× | 51.41 | 47.18 |
Reflection: We expected parity; beating a strong text-based pipeline while using 6 % of the tokens was the moment we stopped worrying about “information leakage” and started worrying about over-compression aesthetics.
## 6. A Walk-Through: Multi-Hop Question in One Hop
Core question: How does CLaRa handle the classic “nephew-of-X” multi-hop query without explicit chaining?
Question:
“How many yards did the nephew of Ivory Lee Brown get during his 2004 true freshman season?”
Traditional two-hop:
-
Retrieve “Ivory Lee Brown nephew” → Adrian Peterson -
Retrieve “Adrian Peterson 2004 freshman yards” → 1,925
CLaRa single-hop:
-
Query Reasoner encodes the full question; logit-lens decoding already shows tokens “NFL,” “Oklahoma.” -
Top-1 vector segment contains the Peterson NCAA record paragraph. -
Generator reads q + that single segment and outputs “1,925 yards.”
Because the query vector was taught (via next-token loss) to anticipate which evidence will help the generator, it implicitly packs both hops into one retrieval step—no intermediate text explosion.
## 7. Author’s Reflections: Three Regrets and One Surprise
-
MSE alignment looked like a nicety—until we removed it. Overnight, NQ dropped 4 F1. Cosmetic losses sometimes guard the semantic border. -
Instruction-tuning before joint training hurt retrieval. We thought “better at answering” equals “better at finding.” Nope: task-specific tuning overfits queries to localized answer spans and erodes global semantic breadth. -
We burned a month on RL-based retrieval rewards. Sampling 32 documents per step felt heroic; training blew up 3 GPUs and gave +0.3 F1. One-line ST estimator delivered the same gradient soup for 1 % of the carbon.
Surprise: At 128× compression, human evaluators preferred CLaRa answers over full-text RAG 52 % of the time—because shorter context forced the model to focus on salient facts instead of being distracted by redundant prose.
## 8. Action Checklist / Implementation Steps
-
Install dependencies (same env as Mistral-7B inference). -
Run SCP pre-training: python scp_pretrain.py \ --base_model mistral-7b \ --corpus wiki2021.jsonl \ --qa_pairs 3 \ # simple + complex + para --memory_tokens 32 \ --lambda_mse 0.1 \ --max_steps 100000 -
Build vector index: python compress_corpus.py \ --input wiki2021.jsonl \ --output wiki_mem32.faiss \ --batch_size 256 -
Launch CLaRa joint tuning: python clara_train.py \ --train_data qa_pairs.jsonl \ --index wiki_mem32.faiss \ --lr 2e-5 \ --tau 0.3 \ --topk 5 \ --max_epochs 3 -
Serve: answer = clara.answer("Your question here")
## 9. One-Page Overview
-
Problem: RAG pipelines optimize retrieval and generation separately, wasting context length and blocking gradient flow. -
Solution: CLaRa compresses documents into memory-token vectors, retrieves via differentiable top-k, and trains everything with a single language-modeling loss. -
Benefits: 4×–128× shorter context, no relevance labels required, end-to-end supervision, state-of-the-art QA scores. -
Limitations: Compressor currently Wikipedia-only; extreme compression (>128×) loses nuances; larger backbones yet to be explored. -
Next Steps: Domain-adaptive SCP, reasoning over compressed memory, multimodal extension.
## 10. FAQ
Q1. Do I need to re-compress the whole corpus if I add new documents?
A: No—just run the SCP compressor on new pages and append vectors to the FAISS index.
Q2. How much GPU memory is required at inference?
A: With 16× compression, a 7 B model + 200 k-document index fits in 24 GB VRAM.
Q3. Is the method language-specific?
A: The framework is language-agnostic, but SCP pre-training should be repeated on target-language corpora for best recall.
Q4. What happens if the compressor is applied to non-encyclopedic text (dialogs, code)?
A: Out-of-domain text drops recall; continue SCP for a few thousand steps on representative paragraphs to recover performance.
Q5. Can I use a smaller backbone than 7 B?
A: Yes—SCP-Phi4-mini is already provided and beats comparable baselines, but extreme compression (>64×) benefits from larger representational capacity.
Q6. Does CLaRa support multi-modal documents (images, tables)?
A: The current work is text-only; integrating vision encoders into the compressed space is ongoing research.
Q7. How sensitive is τ (temperature)?
A: 0.3–0.5 works robustly for up to 200 k candidates; larger corpora may require linear warmup of τ to reduce gradient noise.
