

FaithLens in Plain English: How an 8-Billion-Parameter Model Outperforms GPT-4.1 on Hallucination Detection

A practitioner’s walk-through of the open-source paper “FaithLens: Detecting and Explaining Faithfulness Hallucination” (arXiv:2512.20182).
No hype, no jargon—just facts, code snippets, and reproducible numbers.


Table of Contents

  1. Why “faithfulness hallucination” matters
  2. What FaithLens does in one sentence
  3. Architecture & training pipeline (SFT → RL)
  4. Data recipe: public sets only, no private APIs
  5. Benchmark results: 12 data sets, one table
  6. Install & inference in < 5 minutes
  7. Re-training on your own corpus
  8. Limitations you should know
  9. FAQ from real users
  10. Take-away checklist

1. Why “faithfulness hallucination” matters

Imagine you ask a RAG system:
“How many times was the coach fired that week?”
The retrieved article says once, yet the model answers twice.
The statement sounds plausible but contradicts the provided context—this is a faithfulness hallucination.

Risks


  • Medical summarisation: wrong dosage → patient safety issue

  • Legal RAG: non-existent clause → compliance risk

  • Finance Q&A: fake figure → trading error

Existing cures fall into two buckets:

  1. Big-model-as-judge (SelfCheckGPT, Chain-of-Verification) – accurate but expensive.
  2. Lightweight classifier – cheap but black-box; only returns 0/1.

FaithLens tries to keep the price near the second bucket while giving explainable answers like the first.


2. What FaithLens does in one sentence

Feed it (document, claim) and it returns:

{
  "label": 0,   // 0 = hallucination, 1 = faithful
  "explanation": "The document mentions only one dismissal; the claim adds a second dismissal not supported by the text."
}

No extra retrieval calls, no external search—one forward pass, one human-readable justification.


3. Architecture & training pipeline (SFT → RL)

| Stage | Goal | Key technique |
|---|---|---|
| Cold-start SFT | Produce labels + explanations | Synthetic data + three-step filter |
| Rule-based RL | Boost accuracy & explanation quality | GRPO policy optimisation + composite reward |

Model size: 8-billion parameters (Llama-3.1-8B-Instruct as backbone).

3.1 Synthetic data generation


  • Prompt DeepSeek-V3.2-Think with (doc, claim) pairs from public corpora.

  • Ask for chain-of-thought → explanation → label.

  • Output format strictly:
<claim> ... </claim>
<explanation> ... </explanation>
<label> 0 or 1 </label>
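
If you post-process raw generations yourself, the tagged format is easy to parse with the standard-library `re` module. The sketch below is illustrative, not part of the FaithLens codebase; the example string and function name are my own.

```python
import re

def parse_faithlens_output(text: str) -> dict:
    """Extract claim, explanation, and label from the tagged output format."""
    fields = {}
    for tag in ("claim", "explanation", "label"):
        # Non-greedy match so multiple tags on one string don't bleed together.
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        fields[tag] = m.group(1).strip() if m else None
    # Convert the label to an int; leave it as None if malformed.
    fields["label"] = int(fields["label"]) if fields["label"] in ("0", "1") else None
    return fields

raw = (
    "<claim> He was fired twice. </claim>\n"
    "<explanation> Only one dismissal is mentioned. </explanation>\n"
    "<label> 0 </label>"
)
print(parse_faithlens_output(raw))
```

Returning `None` for a missing or malformed tag (rather than raising) makes it easy to count and discard bad generations during the filtering stage.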

3.2 Triple filter before SFT

  1. Label correctness – discard if synthetic label ≠ gold label.
  2. Explanation quality – keep only samples that lower perplexity of a small verifier (Llama-3.1-8B) when the explanation is present.
  3. Data diversity – K-Medoids clustering (k=10) to avoid collapsed domains.

Numbers
52 k synthetic → 35 k after label filter → 11.9 k after all filters.
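
The three filters compose as a simple pipeline. Below is a toy sketch of that pipeline; the perplexity scorer and cluster assignment are stand-in callables (the paper uses a Llama-3.1-8B verifier and K-Medoids, neither of which is reproduced here).

```python
def label_filter(samples):
    """Step 1: keep samples whose synthetic label matches the gold label."""
    return [s for s in samples if s["synthetic_label"] == s["gold_label"]]

def explanation_filter(samples, ppl):
    """Step 2: keep samples whose explanation lowers verifier perplexity."""
    return [s for s in samples if ppl(s, with_explanation=True) < ppl(s, with_explanation=False)]

def diversity_filter(samples, cluster_of, per_cluster):
    """Step 3: cap how many samples each cluster contributes."""
    counts, kept = {}, []
    for s in samples:
        c = cluster_of(s)
        if counts.get(c, 0) < per_cluster:
            kept.append(s)
            counts[c] = counts.get(c, 0) + 1
    return kept

# Toy demo: sample 2 fails the label check and is dropped.
samples = [
    {"id": 1, "synthetic_label": 0, "gold_label": 0},
    {"id": 2, "synthetic_label": 1, "gold_label": 0},
]
toy_ppl = lambda s, with_explanation: 10.0 if with_explanation else 12.0
kept = explanation_filter(label_filter(samples), toy_ppl)
```

Ordering matters for cost: the cheap label check runs first so the expensive perplexity pass only scores surviving samples.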

3.3 Reinforcement-learning details


  • Algorithm: GRPO (Group Relative Policy Optimisation) – no extra critic model.

  • Group size G = 7; temperature = 0.6.

  • Reward = pred_correct + exp_useful + format_ok (each 0/1).

  • Updates run for 2 epochs on 16 k CG2C examples.
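
The composite reward and GRPO's group-relative normalisation are both small enough to sketch directly. This is a minimal illustration of the two ideas, not the repo's training code; the exact normalisation details in the paper may differ.

```python
from statistics import mean, pstdev

def composite_reward(pred_label, gold_label, exp_useful, format_ok):
    """Reward = pred_correct + exp_useful + format_ok, each component 0 or 1."""
    return int(pred_label == gold_label) + int(exp_useful) + int(format_ok)

def group_relative_advantages(rewards):
    """GRPO normalises rewards within each sampled group (G = 7 in the paper),
    so no separate critic model is needed to estimate a baseline."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma or 1.0) for r in rewards]

# One rollout got everything right; the other six earned partial credit.
group_rewards = [composite_reward(0, 0, True, True)] + [1] * 6
advantages = group_relative_advantages(group_rewards)
```

Because advantages are centred within the group, only rollouts that beat their siblings get a positive learning signal.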

4. Data recipe: public sets only

| Split | Source | Task type | Samples | Licence |
|---|---|---|---|---|
| SFT | ANLI, C2D, D2C | NLI & synthetic QA | 11.9 k | Apache-2.0 |
| RL | CG2C-MHQA/DOC | Multi-hop claims | 16.7 k | Apache-2.0 |
| Eval | LLM-AggreFact (11 tasks) | Summarisation, RAG, dialogue | 21 k | MIT-style |
| Eval | HoVer | 2-4 hop Wikipedia claims | 4 k | CC-BY-SA |

No private paid API data—anyone can re-create the weights.


5. Benchmark results: 12 data sets, one table

Macro-F1 (%) averaged over 12 tasks; a lower standard deviation means more consistent performance across tasks.

| Model | Mean F1 | Std. | Params | GPU $/1k* |
|---|---|---|---|---|
| GPT-4.1 | 83.0 | 6.5 | ? | 11.4 |
| o3 | 82.1 | 6.0 | ? | 8.8 |
| MiniCheck-7B | 80.7 | 7.5 | 7 B | 0.3 |
| FaithLens-8B | 86.4 | 4.6 | 8 B | 0.1 |

*Cost on NVIDIA A800-SXM4-80G: electricity + amortised hardware.

Explanation quality (GPT-4.1 as judge, 120 samples):

| Dimension | GPT-4o | FaithLens |
|---|---|---|
| Readability /100 | 84.1 | 90.4 |
| Helpfulness /100 | 82.9 | 93.4 |
| Informativeness /100 | 83.7 | 85.4 |

FaithLens scores higher on helpfulness because it quotes the exact sentence that contradicts the claim.


6. Install & inference in < 5 minutes

System: Ubuntu 20.04, CUDA ≥ 11.8, Python 3.9+

# 1. clone
git clone https://github.com/S1s-Z/FaithLens.git
cd FaithLens

# 2. install
pip install -e .          # or: pip install "faithlens @ git+https://github.com/S1s-Z/FaithLens.git@master"

# 3. run quick-start
python quickstart.py

quickstart.py (copy–paste ready):

from faithlens.inference import FaithLensInfer
import json

model = FaithLensInfer(
    model_name="ssz1111/FaithLens",
    device="cuda:0"
)

out = model.infer(
    docs=["Romanian club Ceahlaul dismissed coach Ze Maria after a poor run. He was reinstated the next day, then lost 2-0 on Saturday."],
    claims=["Ze Maria was fired twice in the same week."]
)

print(json.dumps(out, indent=2))

Output:

[
  {
    "claim": "Ze Maria was fired twice in the same week.",
    "label": 0,
    "explanation": "The document records only one dismissal and one reinstatement; a second dismissal is never mentioned, so the claim is hallucinated."
  }
]

Memory: 16 GB GPU RAM (FP16) or 8 GB (INT4).
Speed: ~120 tokens/s on a single A800.


7. Re-training on your own corpus

7.1 Prepare JSONL

Each line:

{"doc": "...", "claim": "...", "cot": "step-by-step reasoning", "explanation": "...", "label": 0}

7.2 SFT script (already in repo)

cd training/sft
pip install -r requirements.txt
bash train_llama8b_instruct.sh   # edit paths inside

Key hyper-parameters:


  • LR = 1 × 10⁻⁵, batch = 16, epochs = 3, DeepSpeed Zero-3, BF16

  • Hardware: 4 × A800 80 GB ≈ 6 hours

7.3 Optional RL boost

cd ../verl
pip install -r requirements.txt
bash rl_training.sh              # needs 7 × A800 80 GB

Without RL you lose ~2.3 F1 points on average—still competitive.


8. Limitations you should know

  1. Text-only: images, audio, tables are out-of-scope.
  2. Sequential generation: CoT → explanation → label adds ~30 % latency vs. pure classifiers.
  3. Binary output: no “partially supported” or severity score—datasets lack granular labels.
  4. Zero-shot non-English: paper tests English only; authors report ~78 % F1 on Chinese but do not guarantee robustness.
  5. Resource-heavy retraining: full pipeline needs multi-GPU; inference is light.

9. FAQ from real users

Q1: Does FaithLens check real-world facts?
A: No. It only checks if the claim is supported by the supplied document. If the document itself is wrong, FaithLens will still mark a matching claim as “faithful”.

Q2: Can I run it on CPU?
A: The INT4 version loads in 8 GB RAM and runs at ~4 tokens/s on a 16-core CPU. Usable for offline audits, not for real-time chat.

Q3: How small can I distil it?
A: Authors tried 3 B and 7 B versions; 3 B keeps 83.4 F1 with 4.9 std. dev.—still above MiniCheck-7B.

Q4: Is the training data bias-free?
A: The corpora come from Wikipedia, CNN, XSum, etc. Domain gaps remain; always validate on your own data.

Q5: Licence?
A: Model weights Apache-2.0, code MIT, synthetic data inherit original licences (mostly Apache/CC). Commercial use allowed, but conduct your own legal review.


10. Take-away checklist


  • FaithLens is open-source and cheaper than GPT-4.1/o3 for hallucination audits.

  • It returns human-readable explanations—great for debugging RAG or summarisers.

  • Reproducible: public data + full scripts + hyper-parameters listed.

  • Inference footprint is modest (8–16 GB GPU).

  • Re-training is heavy; start with the released checkpoint before collecting custom data.

If you need a lightweight, explainable gatekeeper between your generative model and the user, FaithLens is worth a slot in your toolbox.


Citation (BibTeX)

@misc{si2025faithlensdetectingexplainingfaithfulness,
      title={FaithLens: Detecting and Explaining Faithfulness Hallucination}, 
      author={Shuzheng Si and Qingyi Wang and Haozhe Zhao and Yuzhuo Bai and Guanqiao Chen and Kangyang Luo and Gang Chen and Fanchao Qi and Minjia Zhang and Baobao Chang and Maosong Sun},
      year={2025},
      eprint={2512.20182},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.20182}, 
}
