

# CLaRa: Teaching a Language Model to Compress, Retrieve, and Answer in One Breath

How to shrink Wikipedia 128× and still beat full-text baselines—without ever labeling “relevant” documents.


## TL;DR

CLaRa (Continuous Latent Reasoning) unifies retrieval and generation inside a single LLM by:

  1. Offline-compressing every document into 32–256 “memory tokens”;
  2. Learning to retrieve with a differentiable top-k operator;
  3. Training everything end-to-end with nothing more than next-token prediction loss.

On four open QA datasets the framework matches or outperforms full-text RAG while using 1–2 % of the usual context length.

## Table of Contents

  1. The Two Walls Hitting Every RAG Pipeline
  2. Shared Continuous Space—Our Single-Model Bet
  3. Stage I: SCP Pre-training—Making Compact Vectors Worth Reading
  4. Stage II: CLaRa Joint Training—Gradient-Friendly Top-k Selection
  5. Experiments: Shorter Context, Higher Scores
  6. A Walk-Through: Multi-Hop Question in One Hop
  7. Author’s Reflections: Three Regrets and One Surprise
  8. Action Checklist / Implementation Steps
  9. One-Page Overview
  10. FAQ

## 1. The Two Walls Hitting Every RAG Pipeline

Core question: Why do classic “retrieve-then-read” systems plateau even with great retrievers?

  • Optimization wall: The retriever is optimized for cosine similarity, the generator for log-likelihood. No gradient flows between them because top-k is discrete.
  • Efficiency wall: The retriever embeds once, but the generator re-ingests raw text every time, blowing up latency and GPU memory.

CLaRa removes both walls by collapsing retrieval and generation into one continuous representation space and one loss function.
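
The optimization wall is easy to reproduce in a few lines of PyTorch. The sketch below is ours, not from the paper: hard top-k indexing cuts the retriever out of the backward graph, while a soft weighting keeps it reachable.

    import torch

    q = torch.randn(768, requires_grad=True)                 # retriever-side query embedding
    docs = torch.randn(100, 768)                              # candidate document embeddings
    scores = torch.cosine_similarity(q.unsqueeze(0), docs)    # [100] relevance scores

    # Hard top-k: integer indices have no grad_fn, so the generator loss can never reach q.
    hard_pick = docs[scores.topk(5).indices]
    print(hard_pick.requires_grad)                            # False -> no gradient path to the retriever

    # Soft weighting keeps the graph intact; this is the property CLaRa's estimator exploits.
    soft_pick = torch.softmax(scores, dim=0) @ docs
    print(soft_pick.requires_grad)                            # True -> gradients can flow back to q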


## 2. Shared Continuous Space—Our Single-Model Bet

Core question: What single architectural choice lets gradients travel from answer tokens back to the query encoder?

We abandon the text-on-the-wire tradition:

| Component | Input | Output | Role |
| --- | --- | --- | --- |
| SCP Compressor | Document text | Memory tokens M | Fixed-size semantic summary |
| Query Reasoner | Question text | Query vector q | Lives in the same embedding space |
| Generator | [q; M_top-k] | Answer tokens | Shares weights with the two modules above |

Because every stage operates on the same transformer backbone, the next-token loss supervises both “what to pick” (retrieval) and “what to write” (generation).
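
To make the weight sharing concrete, here is a toy sketch (ours, with a miniature stand-in for the 7 B backbone): the same module plays compressor, query reasoner, and generator. In the real system the roles are switched by activating different LoRA adapters on the frozen backbone rather than by calling a bare encoder.

    import torch
    import torch.nn as nn

    d, n_mem = 64, 32
    backbone = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2
    )
    embed = nn.Embedding(1000, d)
    mem_tokens = nn.Parameter(torch.randn(1, n_mem, d))        # learnable memory tokens

    def compress(doc_ids):                                     # SCP compressor role
        x = torch.cat([embed(doc_ids), mem_tokens], dim=1)
        return backbone(x)[:, -n_mem:, :]                      # M: final hidden states of the memory tokens

    def encode_query(q_ids):                                   # query-reasoner role, same backbone
        return backbone(embed(q_ids)).mean(dim=1)              # q lives in the same space as M

    def generate_step(q_vec, M_topk):                          # generator role, same backbone again
        x = torch.cat([q_vec.unsqueeze(1), M_topk], dim=1)
        return backbone(x)[:, -1, :]                           # feeds an LM head in the full model

    doc, question = torch.randint(0, 1000, (1, 200)), torch.randint(0, 1000, (1, 16))
    M = compress(doc)                                          # [1, 32, 64]
    q = encode_query(question)                                 # [1, 64]
    print(generate_step(q, M).shape)                           # torch.Size([1, 64])

Because all three calls route through the same `backbone`, the Stage-II next-token loss updates one shared set of weights.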


## 3. Stage I: SCP Pre-training—Making Compact Vectors Worth Reading

Core question: How can a few floating-point numbers replace an entire Wikipedia page without catastrophic loss?

### 3.1 Salient-data Factory (fully automatic)

For each Wikipedia article we prompt a local 32-B model to emit:

  • Simple QA – single atomic facts (to keep entities & numbers)
  • Complex QA – multi-hop questions (to keep relational arcs)
  • Paraphrase – sentence-order-shuffled rewrite (to drop surface cues)

An iterative self-check loop fills coverage gaps for up to ten rounds; failing samples are dropped.
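
The loop itself is simple even though we do not reproduce the prompts here. A sketch of our reconstruction, where `generate` and `find_gaps` are hypothetical callables wrapping the 32-B model:

    def build_salient_data(article, generate, find_gaps, max_rounds=10):
        """Iterative self-check loop (our reconstruction, not the authors' released code).

        generate(article, task, focus): hypothetical call that prompts the local 32-B model.
        find_gaps(article, samples):    hypothetical call that lists facts not yet covered.
        """
        tasks = ("simple_qa", "complex_qa", "paraphrase")
        samples = {task: generate(article, task, focus=None) for task in tasks}
        for _ in range(max_rounds):
            gaps = find_gaps(article, samples)          # entities, numbers, relations still missing
            if not gaps:                                # full coverage reached: stop early
                break
            for task in tasks:
                samples[task] += generate(article, task, focus=gaps)
        return samples                                  # samples that fail the final check are dropped upstream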

### 3.2 Compression + Reconstruction

We append l learnable “memory tokens” to the document, activate compressor LoRA, and take the final hidden states of those l tokens as the compressed vector M.

A separate generator LoRA is then trained to answer the synthetic QAs or reproduce the paraphrase using only M and a task prefix.

Loss:
ℒ_total = ℒ_CE(answers) + λ·ℒ_MSE(mean_doc_hidden, mean_memory_hidden)

With λ = 0.1 the two representations overlap in t-SNE; without it they drift apart and downstream QA drops ≈ 3–5 F1.
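
A minimal sketch of this objective in PyTorch, assuming the usual tensor layout (not the released training script):

    import torch.nn.functional as F

    def scp_loss(answer_logits, answer_labels, doc_hidden, mem_hidden, lam=0.1):
        """L_total = L_CE(answers) + lambda * L_MSE(mean doc hidden, mean memory hidden).

        answer_logits: [B, T, V]  generator logits for the synthetic QA / paraphrase targets
        answer_labels: [B, T]     target token ids, -100 on prompt positions
        doc_hidden:    [B, L, d]  hidden states of the raw document tokens
        mem_hidden:    [B, l, d]  hidden states of the l memory tokens (i.e. M)
        """
        ce = F.cross_entropy(answer_logits.flatten(0, 1), answer_labels.flatten(), ignore_index=-100)
        mse = F.mse_loss(doc_hidden.mean(dim=1), mem_hidden.mean(dim=1))   # align the two mean representations
        return ce + lam * mse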

### 3.3 Domain-transfer tip (from our pain)

SCP is domain-agnostic in form but not in statistics—we pre-trained only on Wikipedia. When we plugged in medical journals, recall@5 fell 12 %. The fix: continue SCP for 2 k steps on target-domain paragraphs; scores rebound without rewriting the rest of the stack.


## 4. Stage II: CLaRa Joint Training—Gradient-Friendly Top-k Selection

Core question: Discrete top-k breaks back-prop—how did we sneak gradients through?

### 4.1 Straight-Through Gumbel Estimator

Forward: hard top-k indices (exactly k vectors)
Backward: soft-maximum relaxation (temperature τ)

Pythonic sketch:

    # q: [B, v] query vectors; M: [B, D, v] memory vectors; tau, k as defined above (assumed given)
    import torch
    import torch.nn.functional as F

    scores = F.cosine_similarity(q.unsqueeze(1), M, dim=-1)              # [B, D] query-memory similarities
    soft = F.softmax(scores / tau, dim=-1)                               # [B, D] relaxed weights (backward path)
    hard = F.one_hot(scores.topk(k, dim=-1).indices, M.size(1)).float()  # [B, k, D] exact top-k (forward path)
    Z = hard + soft.unsqueeze(1) - soft.unsqueeze(1).detach()            # straight-through estimator
    M_topk = torch.einsum("bkd,bdv->bkv", Z, M)                          # [B, k, v] selected memory vectors

Because M is frozen after SCP, we only update q and the generator—stabilizing training and allowing million-scale vector indexes without re-embedding.

### 4.2 The Only Loss in Town

ℒ_CLaRa = − Σ_t log p(a*_t | Q, M_top-k, a*_{<t})

No relevance labels, no reinforcement tricks—just language modeling. Yet the gradient couples retrieval quality with answer likelihood:

  • If the correct document is missing, the generator loss increases, pushing q closer to that document’s vector.
  • If a retrieved document hurts correctness, the generator signal moves q away.
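
Putting 4.1 and 4.2 together, a single training step can be sketched like this (assumed callables for the shared-backbone roles; M is the frozen matrix of pooled memory vectors):

    import torch
    import torch.nn.functional as F

    def clara_step(query_reasoner, generator, M, question_ids, answer_ids, answer_labels, k=5, tau=0.3):
        """One joint step: the only supervision is next-token prediction on the gold answer.

        query_reasoner(question_ids) -> [B, d] and generator(q, M_topk, answer_ids) -> [B, T, V]
        are hypothetical wrappers around the shared backbone; M has shape [B, D, d] and stays frozen.
        """
        q = query_reasoner(question_ids)                                      # trainable query vector
        scores = F.cosine_similarity(q.unsqueeze(1), M, dim=-1)               # [B, D]
        soft = F.softmax(scores / tau, dim=-1)
        hard = F.one_hot(scores.topk(k, dim=-1).indices, M.size(1)).float()   # [B, k, D]
        Z = hard + soft.unsqueeze(1) - soft.unsqueeze(1).detach()             # straight-through top-k
        M_topk = torch.einsum("bkd,bdv->bkv", Z, M)                           # [B, k, d]
        logits = generator(q, M_topk, answer_ids)                             # [B, T, V]
        loss = F.cross_entropy(logits.flatten(0, 1), answer_labels.flatten(), ignore_index=-100)
        return loss                                                           # backprop reaches q through Z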

## 5. Experiments: Shorter Context, Higher Scores

Core question: Does the fancy optimizer actually translate into better answers?

### 5.1 Compression Quality (Oracle = gold paragraph always in pool)

| Method | CR | NQ | Hotpot | Musique | 2Wiki | Δ avg |
| --- | --- | --- | --- | --- | --- | --- |
| PISCO | 16× | 73.44 | 66.53 | 33.80 | 60.45 | baseline |
| SCP-Mistral | 16× | 75.48 | 70.79 | 43.15 | 66.16 | +5.35 |
| SCP-Mistral | 128× | 69.96 | 62.09 | 30.86 | 59.08 | still +1.13 vs PISCO |

Even at 128× we stay ahead of the best comparable soft-compression model; beyond that, scores erode but latency savings plateau (vector I/O becomes negligible).

### 5.2 End-to-End QA (Normal = top-20 live retrieval from Wikipedia-2021)

| Model | Context | NQ F1 | 2Wiki F1 |
| --- | --- | --- | --- |
| DRO-Mistral-7B | full text | 51.01 | 43.65 |
| CLaRa-Mistral-7B | 16× compressed | 51.41 | 47.18 |

Reflection: We expected parity; beating a strong text-based pipeline while using 6 % of the tokens was the moment we stopped worrying about “information leakage” and started worrying about over-compression aesthetics.


## 6. A Walk-Through: Multi-Hop Question in One Hop

Core question: How does CLaRa handle the classic “nephew-of-X” multi-hop query without explicit chaining?

Question:
“How many yards did the nephew of Ivory Lee Brown get during his 2004 true freshman season?”

Traditional two-hop:

  1. Retrieve “Ivory Lee Brown nephew” → Adrian Peterson
  2. Retrieve “Adrian Peterson 2004 freshman yards” → 1,925

CLaRa single-hop:

  1. Query Reasoner encodes the full question; logit-lens decoding already shows tokens “NFL,” “Oklahoma.”
  2. Top-1 vector segment contains the Peterson NCAA record paragraph.
  3. Generator reads q + that single segment and outputs “1,925 yards.”

Because the query vector was taught (via next-token loss) to anticipate which evidence will help the generator, it implicitly packs both hops into one retrieval step—no intermediate text explosion.


## 7. Author’s Reflections: Three Regrets and One Surprise

  1. MSE alignment looked like a nicety—until we removed it. Overnight, NQ dropped 4 F1. Cosmetic losses sometimes guard the semantic border.
  2. Instruction-tuning before joint training hurt retrieval. We thought “better at answering” equals “better at finding.” Nope: task-specific tuning overfits queries to localized answer spans and erodes global semantic breadth.
  3. We burned a month on RL-based retrieval rewards. Sampling 32 documents per step felt heroic; training blew up 3 GPUs and gave +0.3 F1. One-line ST estimator delivered the same gradient soup for 1 % of the carbon.

Surprise: At 128× compression, human evaluators preferred CLaRa answers over full-text RAG 52 % of the time—because shorter context forced the model to focus on salient facts instead of being distracted by redundant prose.


## 8. Action Checklist / Implementation Steps

  1. Install dependencies (same env as Mistral-7B inference).
  2. Run SCP pre-training:
    # --qa_pairs 3 = simple QA + complex QA + paraphrase
    python scp_pretrain.py \
       --base_model mistral-7b \
       --corpus wiki2021.jsonl \
       --qa_pairs 3 \
       --memory_tokens 32 \
       --lambda_mse 0.1 \
       --max_steps 100000
    
  3. Build vector index:
    python compress_corpus.py \
       --input wiki2021.jsonl \
       --output wiki_mem32.faiss \
       --batch_size 256
    
  4. Launch CLaRa joint tuning:
    python clara_train.py \
       --train_data qa_pairs.jsonl \
       --index wiki_mem32.faiss \
       --lr 2e-5 \
       --tau 0.3 \
       --topk 5 \
       --max_epochs 3
    
  5. Serve:
    answer = clara.answer("Your question here")
    

## 9. One-Page Overview

  • Problem: RAG pipelines optimize retrieval and generation separately, wasting context length and blocking gradient flow.
  • Solution: CLaRa compresses documents into memory-token vectors, retrieves via differentiable top-k, and trains everything with a single language-modeling loss.
  • Benefits: 4×–128× shorter context, no relevance labels required, end-to-end supervision, state-of-the-art QA scores.
  • Limitations: Compressor currently Wikipedia-only; extreme compression (>128×) loses nuances; larger backbones yet to be explored.
  • Next Steps: Domain-adaptive SCP, reasoning over compressed memory, multimodal extension.

## 10. FAQ

Q1. Do I need to re-compress the whole corpus if I add new documents?
A: No—just run the SCP compressor on new pages and append vectors to the FAISS index.
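
For instance (a sketch with standard faiss calls; `compress` stands in for the SCP compressor and is assumed to return one pooled vector per page):

    import faiss
    import numpy as np

    def append_documents(index_path, new_docs, compress):
        """Add freshly compressed pages to an existing index without re-compressing the corpus."""
        index = faiss.read_index(index_path)
        vecs = np.stack([compress(doc) for doc in new_docs]).astype("float32")   # hypothetical compressor call
        index.add(vecs)                                                          # append; old vectors untouched
        faiss.write_index(index, index_path)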

Q2. How much GPU memory is required at inference?
A: With 16× compression, a 7 B model + 200 k-document index fits in 24 GB VRAM.

Q3. Is the method language-specific?
A: The framework is language-agnostic, but SCP pre-training should be repeated on target-language corpora for best recall.

Q4. What happens if the compressor is applied to non-encyclopedic text (dialogs, code)?
A: Out-of-domain text drops recall; continue SCP for a few thousand steps on representative paragraphs to recover performance.

Q5. Can I use a smaller backbone than 7 B?
A: Yes—SCP-Phi4-mini is already provided and beats comparable baselines, but extreme compression (>64×) benefits from larger representational capacity.

Q6. Does CLaRa support multi-modal documents (images, tables)?
A: The current work is text-only; integrating vision encoders into the compressed space is ongoing research.

Q7. How sensitive is τ (temperature)?
A: 0.3–0.5 works robustly for up to 200 k candidates; larger corpora may require linear warmup of τ to reduce gradient noise.
