On-Policy Self-Alignment: Using Fine-Grained Knowledge Feedback to Mitigate Hallucinations in LLMs
As large language models (LLMs) continue to evolve, their ability to generate fluent and plausible responses has reached impressive heights. However, a persistent challenge remains: hallucination. Hallucination occurs when these models generate responses that deviate from the boundaries of their knowledge, fabricating facts or providing misleading information. This issue undermines the reliability of LLMs and limits their practical applications.
Recent research has introduced a novel approach called Reinforcement Learning for Hallucination (RLFH), which addresses this critical issue through on-policy self-alignment. This method enables LLMs to actively explore their knowledge boundaries and self-correct generative behavior using fine-grained feedback signals. In this blog post, we will delve into the details of RLFH, exploring how it works, its advantages over previous methods, and the experimental results that demonstrate its effectiveness.
Understanding Hallucination in LLMs
Hallucination in LLMs manifests in various forms, such as providing incorrect answers within the model’s knowledge boundary, attempting to respond to queries beyond its knowledge, or refraining from answering despite having relevant knowledge. The root cause of hallucination lies in the misalignment between the model’s generative process and its internal knowledge.
Existing methods to mitigate hallucination can be broadly categorized into learning-based and editing-based approaches. Learning-based methods detect the model’s knowledge boundaries and fine-tune it with carefully curated feedback data; however, they suffer from distribution shift caused by off-policy data sampling, and their coarse-grained, instance-level feedback cannot precisely pinpoint where a hallucination occurs. Editing-based methods, on the other hand, generate content first and then revise it against external knowledge, but they rely heavily on those external sources and do not address how models utilize their internal knowledge.
The RLFH Approach
RLFH is an innovative on-policy self-alignment approach that leverages fine-grained feedback for hallucination mitigation. It introduces a self-assessment framework where the policy model serves as its own judge. Here’s how it works:
- Response Generation: The policy model generates a response based on the input prompt.
- Fine-Grained Feedback from Policy as Judge: The policy model evaluates the generated response by decomposing it into atomic facts and verifying them against external knowledge sources. This provides fine-grained feedback at the statement level.
- On-Policy Optimization with Token-Level Reward: The statement-level feedback is converted into token-level dense reward signals, which are then used to update the policy model through online reinforcement learning. A minimal sketch of this loop appears right after the list.
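To make the loop concrete, here is a minimal Python sketch of a single RLFH iteration, with the four stages supplied as callables. The function names and signatures are hypothetical placeholders for illustration, not the paper’s actual code or API.

```python
# A minimal sketch of one RLFH iteration, assuming the four stages are
# supplied as callables (hypothetical signatures, not the paper's API).
from typing import Callable, List, Tuple

def rlfh_step(
    generate: Callable[[str], Tuple[str, List[int]]],          # policy generation
    judge: Callable[[str, str], List[dict]],                    # policy-as-judge feedback
    to_token_rewards: Callable[[List[int], List[dict]], List[float]],
    update: Callable[[List[int], List[float]], None],           # online RL step (e.g., PPO)
    prompt: str,
) -> None:
    """One iteration of the generate -> judge -> reward -> update loop."""
    # 1. Response generation: sample a response on-policy from the current model.
    response, token_ids = generate(prompt)

    # 2. Fine-grained feedback: the same policy judges its own output,
    #    decomposing it into atomic statements verified against external references.
    statement_feedback = judge(prompt, response)

    # 3. Convert statement-level feedback into token-level dense rewards and
    #    take an online RL step on the freshly sampled trajectory.
    token_rewards = to_token_rewards(token_ids, statement_feedback)
    update(token_ids, token_rewards)
```

Because the response being judged is the one the current policy just produced, the reward signal stays on-policy, avoiding the distribution-shift problem noted above.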
Key Components of RLFH
Self-Assessment Framework
RLFH’s self-assessment framework is designed to automatically decompose responses into atomic facts and assess their truthfulness and informativeness. This framework allows the model to generate fine-grained knowledge feedback in real-time, providing token-level dense reward signals for online reinforcement learning. By having the policy serve as its own judge, RLFH constructs a self-driven fact assessment framework that enables timely and low-cost reward signal collection for on-policy optimization without human intervention.
Fine-Grained Feedback from Policy as the Judge
The policy model plays a dual role in RLFH: it generates responses and evaluates them. After generating a response, the policy model breaks it down into atomic statements and verifies each statement against external knowledge sources. This process involves statement extraction, factual verification, and informativeness assessment.
Statement Extraction
The policy model partitions the response into sentences and then extracts valid factual statements from each sentence. This hierarchical approach ensures finer granularity and facilitates the conversion from language-form annotation to token-level dense reward.
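As a rough illustration of this hierarchical extraction (not the paper’s actual prompts), the step could look like the sketch below, where `llm` stands for the policy model called in judge mode:

```python
import re
from typing import Callable, List

# Hypothetical extraction prompt; the paper's actual wording may differ.
EXTRACT_PROMPT = (
    "List the atomic factual statements contained in the following sentence, "
    "one per line. If there are none, output nothing.\n\nSentence: {sentence}"
)

def extract_statements(response: str, llm: Callable[[str], str]) -> List[List[str]]:
    """Split a response into sentences, then ask the policy model (acting as
    judge) to extract atomic factual statements from each sentence, keeping
    the per-sentence grouping so statements can later be mapped back to spans."""
    # Naive sentence partitioning; a production system would use a proper segmenter.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]
    statements_per_sentence = []
    for sentence in sentences:
        raw = llm(EXTRACT_PROMPT.format(sentence=sentence))
        statements = [line.strip() for line in raw.splitlines() if line.strip()]
        statements_per_sentence.append(statements)
    return statements_per_sentence
```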
Factual Verification
For each extracted factual statement, the policy model retrieves relevant supporting contexts from the reference document set and conducts statement verification as a reading comprehension task. Statements are classified into categories such as Correct, Hedged Correct, Vague, Hedged Wrong, and Wrong.
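A corresponding verification step might be sketched as follows, treating classification as reading comprehension over retrieved passages. The retrieval function, prompt wording, and fallback behavior are illustrative assumptions:

```python
from typing import Callable, List

# The five truthfulness categories described above.
CATEGORIES = ["Correct", "Hedged Correct", "Vague", "Hedged Wrong", "Wrong"]

# Hypothetical verification prompt; the paper's actual wording may differ.
VERIFY_PROMPT = (
    "Based only on the context below, classify the statement as one of: "
    "Correct, Hedged Correct, Vague, Hedged Wrong, Wrong.\n\n"
    "Context:\n{context}\n\nStatement: {statement}\n\nLabel:"
)

def verify_statement(
    statement: str,
    retrieve: Callable[[str], List[str]],  # returns supporting passages for a statement
    llm: Callable[[str], str],             # the policy model acting as judge
) -> str:
    """Treat verification as reading comprehension over retrieved context."""
    context = "\n".join(retrieve(statement))
    label = llm(VERIFY_PROMPT.format(context=context, statement=statement)).strip()
    # Fall back to "Vague" if the judge's output does not match a known label.
    return label if label in CATEGORIES else "Vague"
```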
Informativeness Assessment
In addition to verifying the truthfulness of statements, the policy model also evaluates their informativeness on a five-point scale. This assessment considers the original query and response, ensuring that the model does not simply reduce information to minimize errors but instead provides responses with appropriate informativeness.
On-Policy Optimization with Token-Level Reward
The fine-grained feedback from the policy-as-judge framework is translated into token-level dense reward signals. These rewards are then used to update the policy model through online reinforcement learning. The dense reward conversion process takes into account both the truthfulness and informativeness of the response, balancing these two objectives to optimize the model’s behavior.
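One plausible way to implement this conversion is sketched below, assuming the alignment of each statement back to its final token in the response has already been computed. The per-category reward values and the trade-off coefficient are illustrative assumptions, not the coefficients used in the paper.

```python
from typing import Dict, List, Tuple

# Illustrative reward values per truthfulness category; the actual
# coefficients used by RLFH are not reproduced here.
TRUTH_REWARD: Dict[str, float] = {
    "Correct": 1.0,
    "Hedged Correct": 0.5,
    "Vague": 0.0,
    "Hedged Wrong": -0.5,
    "Wrong": -1.0,
}

def to_token_rewards(
    num_tokens: int,
    statements: List[Tuple[int, str, int]],  # (last_token_index, category, informativeness 1-5)
    info_weight: float = 0.2,                # assumed truthfulness/informativeness trade-off
) -> List[float]:
    """Convert statement-level judgments into a dense token-level reward.

    Each statement deposits its reward at the final token of its span
    (all other tokens stay at zero), combining a truthfulness score with
    an informativeness term so the policy is not encouraged to simply
    say less in order to avoid errors."""
    rewards = [0.0] * num_tokens
    for last_token_idx, category, informativeness in statements:
        truth = TRUTH_REWARD[category]
        info_bonus = info_weight * (informativeness - 3) / 2  # centre the 1-5 scale
        rewards[last_token_idx] += truth + info_bonus
    return rewards
```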
Experimental Results
Comprehensive evaluations on HotpotQA, SQuADv2, and Biography benchmarks demonstrate RLFH’s effectiveness in hallucination mitigation. The results show significant improvements over both base models and existing hallucination mitigation approaches.
Main Results
RLFH achieves the highest FactScore, a well-established metric for assessing factuality against external knowledge, across all three benchmarks. This consistent improvement substantiates the effectiveness of RLFH in reducing hallucinations.
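For intuition, a simplified FactScore-style computation is sketched below: decompose each response into atomic facts, check each fact against the knowledge source, and average the supported fraction over responses. The official implementation includes further refinements that are omitted here.

```python
from typing import Callable, List

def factscore(
    responses: List[str],
    extract_facts: Callable[[str], List[str]],  # atomic-fact decomposition
    is_supported: Callable[[str], bool],        # check a fact against the knowledge source
) -> float:
    """Simplified FactScore: the average fraction of atomic facts per
    response that are supported by the external knowledge source."""
    per_response_scores = []
    for response in responses:
        facts = extract_facts(response)
        if not facts:
            continue  # skip responses with no extractable facts
        supported = sum(1 for fact in facts if is_supported(fact))
        per_response_scores.append(supported / len(facts))
    return sum(per_response_scores) / max(len(per_response_scores), 1)
```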
Detailed Results
A detailed analysis of 5,000 HotpotQA questions held out from training reveals that RLFH increases the ratio of high-accuracy responses and suppresses errors and unverifiable content. The method also enhances the average informativeness of statements in responses.
Impact of Reward Granularity
Ablation experiments show that statement-level rewards consistently achieve the highest FactScore, highlighting the importance of fine-grained feedback for developing more reliable models.
Impact of Judge Model
Experiments comparing different judge models indicate that the on-policy setting, where the policy model itself serves as the judge, achieves superior performance and eliminates the need for an additional reward model in the training process.
Conclusion
RLFH represents a significant step toward more reliable and self-aware language models. By enabling LLMs to explore their knowledge boundaries and self-correct hallucination behavior through fine-grained feedback, RLFH offers a promising solution to the hallucination problem. While there are limitations to address, such as the broader challenge of generalized hallucination across diverse domains and the need for more comprehensive evaluation frameworks, RLFH provides a robust foundation for future research in this area.
As LLMs become increasingly integrated into society, ensuring their truthfulness and reliability is paramount. RLFH contributes to the development of more trustworthy AI systems, helping to mitigate the spread of misinformation and enhance the practical utility of LLMs in real-world applications.