Using Entity Linking to Fix RAG’s Chronic “Wrong Document” Problem

Have you ever asked an AI tutor a precise question like
“In The Wealth of Nations, how does Adam Smith define the division of labor?”
…only to get back a confident answer that’s completely wrong because the system pulled paragraphs about some random economist named Smith from 2023?

That’s not the language model being dumb.
That’s the retrieval part being blind.

In specialized domains — university lectures, medical textbooks, legal documents, corporate knowledge bases — pure semantic similarity retrieval fails exactly when you need it most: when the same word can refer to ten different things.

We just published a paper (arXiv:2512.05967) that goes a long way toward fixing this, at least for educational content.
We call the system ELERAG: Entity Linking Enhanced Retrieval-Augmented Generation.

Here’s the entire story, explained like you’re a master’s student who wants the truth, not hype.

What Is ELERAG in One Sentence?

Take a normal RAG pipeline, add a Wikidata-powered entity linking brain that knows exactly which “Smith” you’re talking about, then fuse the two signals intelligently. Done.

Why Normal RAG Struggles with University Lectures

Real example from our Italian university courses:

Student question: “What does Smith say about the division of labor?”
A vanilla dense retriever (multilingual-e5-large + FAISS) returns:

  • The correct passage from Adam Smith’s Wealth of Nations ✓
  • A 2022 paper by a different Professor Smith ✗
  • A case study about a company called Smith Ltd ✗

The word “Smith” looks identical in embedding space. The model has no idea there’s a difference.

This ambiguity explosion is everywhere in real course material: same acronym used in three different courses, professors referring to concepts by nickname, cross-chapter references, etc.
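To make the failure concrete, here is a toy illustration: a bag-of-words cosine similarity (a crude stand-in for dense embeddings, with made-up chunk texts, not the actual e5 model) scores the Adam Smith passage and an unrelated Professor Smith paper almost identically, because the surface tokens overlap.

```python
from collections import Counter
from math import sqrt

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words counts (a crude embedding stand-in)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = sqrt(sum(c * c for c in va.values()))
    nb = sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb)

query = "what does smith say about the division of labor"
chunks = {
    "adam": "smith argues the division of labor raises productivity",
    "prof": "smith 2022 studies labor markets and the division of firms",
}
scores = {name: bow_cosine(query, text) for name, text in chunks.items()}
# Both chunks score close together: "smith", "division", "labor" match in both.
```

Real dense embeddings are far better than bag-of-words, but the underlying problem is the same: nothing in the vector tells the retriever which Smith is meant.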

The Full ELERAG Pipeline (With Diagram)

    graph TD
        A[User question] --> B[Two parallel paths]
        B --> C["Path 1: dense retrieval → multilingual-e5-large → FAISS top-30"]
        B --> D["Path 2: entity linking → spaCy NER → Wikidata lookup → entity overlap"]
        C --> E[Two separate ranked lists]
        D --> E
        E --> F["Reciprocal Rank Fusion (RRF) – no hyperparameters!"]
        F --> G["Top 3–5 chunks → GPT-4o"]
        G --> H[Final answer with source citations]

The magic is not adding entity linking (many people tried that).
The magic is how we combine the two signals.

We Tested Three Fusion Strategies — One Crushed the Others

| Method | Idea | Performance on real university data | Compute cost |
|---|---|---|---|
| Weighted sum | dense_score + β × entity_score (β tuned) | Decent | Very low |
| RRF (our winner) | Parameter-free reciprocal rank fusion | Best by far | Extremely low |
| RRF + Cross-Encoder (SOTA) | RRF first, then expensive cross-encoder re-rank | Actually worse | 5–10× higher |

Yes, you read that right: the expensive state-of-the-art cross-encoder lost to a 10-line RRF function on real lecture transcripts.
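The paper's exact code isn't reproduced here, but standard Reciprocal Rank Fusion really does fit in about ten lines. The doc ids below are made up for illustration, and k = 60 is the conventional default constant from the original RRF formulation.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d)).

    Each inner list holds doc ids ordered best-first; no tuning required.
    """
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical outputs of the two paths for the "Smith" question:
dense_path  = ["adam_smith_ch1", "prof_smith_2022", "smith_ltd_case"]
entity_path = ["adam_smith_ch1", "adam_smith_ch3", "prof_smith_2022"]
fused = rrf_fuse([dense_path, entity_path])
# A chunk ranked high by BOTH paths rises to the top of the fused list.
```

Because RRF only looks at ranks, not raw scores, the two paths never need to be calibrated against each other, which is exactly why there are no hyperparameters to tune.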

Head-to-Head Results: University Lectures vs General Wikipedia

| Dataset | Best method | Exact Match (gold chunk ranked #1) | MRR | Winner |
|---|---|---|---|---|
| Italian university courses (our data) | ELERAG + RRF | 56.5% | 0.779 | ELERAG |
| SQuAD-it (Wikipedia) | Cross-Encoder | 77.7% | 0.836 | Cross-Encoder |

This is the “Domain Mismatch” phenomenon in action.

Cross-encoders dominate on Wikipedia because they were trained on huge web corpora that look a lot like Wikipedia.
Switch to spoken university lectures (long sentences, anaphora, informal references) and they collapse. Entity linking doesn't care about writing style: it only cares whether the unique Wikidata Q-ids match.
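That style-agnostic entity signal can be sketched as plain Q-id set overlap. The Q-ids below are placeholders, not real Wikidata identifiers, and the chunk-to-entity mapping is invented for the example.

```python
def entity_overlap(query_qids: set, chunk_qids: set) -> int:
    """Score a chunk by how many Wikidata Q-ids it shares with the query."""
    return len(query_qids & chunk_qids)

# Placeholder Q-ids for illustration only.
query_entities = {"Q_AdamSmith", "Q_DivisionOfLabor"}
chunks = {
    "wealth_of_nations_ch1": {"Q_AdamSmith", "Q_DivisionOfLabor"},
    "prof_smith_2022_paper": {"Q_OtherSmith", "Q_LaborMarket"},
    "smith_ltd_case_study":  {"Q_SmithLtd"},
}
ranked = sorted(chunks,
                key=lambda c: entity_overlap(query_entities, chunks[c]),
                reverse=True)
# The chunk about the right Smith wins, regardless of how the text is written.
```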

How We Built the Entity Linking Module (Copy-Paste Friendly)

  1. Extract entities with spaCy large Italian model
  2. For each entity, query Wikidata public API → candidate list
  3. Disambiguation score (our simple but deadly effective formula):

    HybridScore = 0.9 × similarity + 0.1 × popularity

    where similarity = e5-large cosine similarity between the mention's
                       context and the candidate's label + description
          popularity = 1 / (rank + 1), with rank = the candidate's position
                       in the Wikidata search results
  4. Pick the highest-scoring Wikidata Q-id
  5. Pre-compute and cache for every chunk in the corpus (one-time cost)

Zero training, pure rules + lightweight embeddings.
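Putting steps 3 and 4 together, the disambiguation step looks roughly like this. The candidate similarities and Q-ids below are invented for illustration; in the real pipeline the similarity comes from e5-large and the rank from the Wikidata search response.

```python
def hybrid_score(context_sim: float, search_rank: int) -> float:
    """HybridScore = 0.9 * similarity + 0.1 * popularity, popularity = 1/(rank+1)."""
    return 0.9 * context_sim + 0.1 * (1.0 / (search_rank + 1))

# Hypothetical candidates for the mention "Smith": (q_id, context_sim, rank).
# Similarity values are made up; Q-ids are placeholders, not real identifiers.
candidates = [
    ("Q_AdamSmith",  0.82, 0),
    ("Q_OtherSmith", 0.55, 1),
    ("Q_SmithLtd",   0.30, 2),
]
best = max(candidates, key=lambda c: hybrid_score(c[1], c[2]))
# Step 4: pick the highest-scoring Q-id, then cache it per chunk (step 5).
```

The 0.1 popularity term acts only as a tie-breaker: when context similarity is decisive it barely matters, but for short, uninformative contexts it nudges the choice toward the more prominent entity.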

Concrete Gains on Real Courses (69 questions)

| Metric | Vanilla RAG | ELERAG + RRF | Relative improvement |
|---|---|---|---|
| Exact Match (gold chunk #1) | 52.2% | 56.5% | +8.2% |
| Precision@1 | 65.2% | 69.6% | +6.8% |
| MRR (gold) | 0.652 | 0.779 | +19.5% |
| GPT-4o completeness score | 5.99/10 | 6.10/10 | +1.8% |
| GPT-4o relevance score | 5.45/10 | 5.57/10 | +2.2% |

Small absolute numbers, but a clear perceived quality jump for students.

Frequently Asked Questions

Q: Does this only work for Italian?
No. Wikidata is multilingual. Swap the spaCy NER model and you’re good for Chinese, Spanish, German, etc. We already have a working Chinese version — results are even better because Chinese names have insane homograph problems.

Q: Isn’t entity linking slow?
Real-world latency on a 2-core server: +0.4 seconds compared to vanilla RAG. Totally acceptable.

Q: What if Wikidata doesn’t have the entity?
Graceful degradation to normal dense retrieval. In practice, >90% of proper nouns in university courses are in Wikidata.

Q: Can I combine RRF with a cross-encoder anyway?
You can, but in our experiments it hurt performance. The cross-encoder sometimes kills perfectly correct chunks because they sound “less Wikipedia-like”.

Q: Is the code open source?
Yes: https://github.com/Granataaa/educational-rag-el
Drop your own lecture transcripts and run the demo in minutes.

When Should You Add Entity Linking to Your RAG?

Do it yesterday if you’re building:

  • University / corporate training platforms
  • Medical or legal question-answering systems
  • Internal company knowledge bases (product names, project codenames, people)
  • Any non-English or mixed-language corpus

Bottom Line

On general web text → cross-encoders are still king.
On real-world specialized educational content → entity linking + parameter-free RRF is the quiet killer.

We proved it with real university courses, open-sourced everything, and wrote the paper so you don’t have to rediscover it the hard way.

Paper: https://arxiv.org/abs/2512.05967
Code: https://github.com/Granataaa/educational-rag-el