CLaRa: How 128x Document Compression Supercharges RAG Without Labels


# CLaRa: Teaching a Language Model to Compress, Retrieve, and Answer in One Breath

How to shrink Wikipedia 128× and still beat full-text baselines, without ever labeling “relevant” documents.

## TL;DR

CLaRa (Continuous Latent Reasoning) unifies retrieval and generation inside a single LLM by:

- Offline-compressing every document into 32–256 “memory tokens”;
- Learning to retrieve with a differentiable top-k operator;
- Training everything end-to-end with nothing more than next-token prediction loss.

On four open QA datasets the framework matches or outperforms full-text RAG while using 1–2% of the usual context length.

## Table of Contents

- The Two Walls Hitting Every RAG …