Building Chinese Reward Models from Scratch: A Practical Guide to CheemsBench and CheemsPreference
Why Do We Need Dedicated Chinese Reward Models?
In the development of large language models (LLMs), reward models (RMs) act as “value referees” that align AI outputs with human preferences. However, current research faces two critical challenges:
- Language Bias: 90% of existing studies focus on English, leaving Chinese applications underserved
- Data Reliability: Synthetic datasets dominate current approaches, failing to capture authentic human preferences
The Cheems project – a collaboration between the Institute of Software (Chinese Academy of Sciences) and Xiaohongshu – introduces the first comprehensive framework for Chinese reward model development. This guide explores its groundbreaking components.
Core Components Explained
CheemsBench: The Gold Standard for Chinese RM Evaluation
Key Features:
- Diverse Data Sources:
  - 1,146 open-source prompts from 8 datasets (HumanEval-XL, GAOKAO-Bench, etc.)
  - 1,346 real-world human instructions capturing practical use cases
- Innovative Evaluation Protocol: five rounds of human triple comparisons, followed by graph-based conflict resolution, ensure consistent rankings.

  ```python
  # Conflict resolution algorithm example (pseudocode):
  # collapse preference cycles, then read off a consistent ranking.
  def resolve_conflicts(responses, annotations):
      G = build_preference_graph(annotations)   # directed edge u -> v: u preferred over v
      while cycles := detect_cycles(G):         # conflicting (cyclic) preferences remain
          merge_nodes(G, cycles)                # collapse each cycle into a single node
      return topological_sort(G)                # acyclic graph yields a consistent ranking
  ```
Performance Metrics:
| Metric Type | Formula | Use Case |
|---|---|---|
| Accuracy (Acc) | ∑(correct pairs) / total pairs | General performance |
| Exact Match (Exact) | ∑(fully ordered samples) / total samples | Complex-scenario robustness |
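Both metrics are straightforward to compute once the reward model has scored every response. The sketch below is an illustration under the assumption that each benchmark sample provides a human ranking (best to worst) plus a scalar score per response; it is not the official CheemsBench evaluation script, and the `accuracy_and_exact` helper is hypothetical.

```python
# Minimal sketch of the two metrics above (not the official evaluation code).
# Assumes each sample is (human_rank, scores): a human ranking of response IDs
# (best to worst) and a dict of scalar scores from the reward model under test.
from itertools import combinations

def accuracy_and_exact(samples):
    correct_pairs = total_pairs = exact_hits = 0
    for human_rank, scores in samples:
        all_pairs_correct = True
        for better, worse in combinations(human_rank, 2):  # every (better, worse) pair in human order
            total_pairs += 1
            if scores[better] > scores[worse]:
                correct_pairs += 1
            else:
                all_pairs_correct = False
        exact_hits += all_pairs_correct                     # counts samples ranked fully correctly
    return correct_pairs / total_pairs, exact_hits / len(samples)

# Example: two prompts with human rankings and RM scores
samples = [
    (["a", "b", "c"], {"a": 2.0, "b": 1.0, "c": 0.5}),  # fully correct ordering
    (["x", "y"],      {"x": 0.1, "y": 0.7}),            # the single pair is flipped
]
print(accuracy_and_exact(samples))  # (0.75, 0.5)
```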
CheemsPreference: The Chinese Value Lexicon (27K Instructions)
Dataset Construction Workflow:
- Instruction Collection:
  - 27,861 authentic user queries
  - Hierarchical taxonomy with 8 categories and 50+ subcategories (Fig. 10)
- Response Generation:
  - Covers open-source models such as Qwen2 and Llama3
  - Includes commercial APIs such as GPT-4 and Claude-3
- Annotation Strategy:

  ```mermaid
  graph TD
      A[Human-labeled Golden Data] --> B[Train Initial RM]
      C[GPT-4o Annotations] --> D[Initial RM Filtering]
      B --> D
      D --> E[Combined Dataset]
  ```
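In code terms, the filtering step in the diagram amounts to keeping a GPT-4o-labeled preference pair only when the initial RM (trained on the human golden data) agrees with it. The snippet below is a hedged sketch of that idea; `initial_rm_score`, the pair format, and the `margin` threshold are illustrative assumptions rather than details from the paper.

```python
# Hedged sketch of the distant-supervision filter shown above.
# `initial_rm_score(prompt, response)` stands in for the RM trained on
# human golden data; the margin value is an illustrative assumption.
def filter_gpt4o_pairs(gpt4o_pairs, initial_rm_score, margin=0.5):
    """Keep (prompt, chosen, rejected) pairs that the initial RM also prefers."""
    kept = []
    for prompt, chosen, rejected in gpt4o_pairs:
        gap = initial_rm_score(prompt, chosen) - initial_rm_score(prompt, rejected)
        if gap >= margin:  # the initial RM agrees with GPT-4o's label
            kept.append((prompt, chosen, rejected))
    return kept

# combined_dataset = human_golden_pairs + filter_gpt4o_pairs(gpt4o_pairs, rm_score)
```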
Technical Innovations:
- Length Debiasing: counteracts the bias toward verbose responses (see the sketch after this list)
- Distant Supervision: a hybrid human-AI annotation pipeline
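One simple way to approximate length debiasing, assuming the goal is to stop "chosen" responses from being systematically longer than "rejected" ones, is to drop pairs where length alone could explain the preference. This is only a sketch of the general idea, not the paper's exact procedure.

```python
# Sketch of a length-debiasing filter (a common approximation,
# not necessarily the paper's exact procedure).
def drop_length_confounded_pairs(pairs, max_ratio=1.5):
    """Discard pairs whose preferred response is much longer than the rejected one."""
    balanced = []
    for prompt, chosen, rejected in pairs:
        if len(chosen) <= max_ratio * len(rejected):  # length alone can't explain the preference
            balanced.append((prompt, chosen, rejected))
    return balanced
```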
Key Experimental Findings
The Chinese Performance Gap
Table 2 reveals that even top-performing models like Skywork-Reward-Gemma-2-27B show clear weaknesses:

- Accuracy drops from 75.4% on standard prompts to 74.8% on real-world instructions
- They excel at math reasoning (82%) but struggle with text comprehension (61%)
Data Quality Dictates Model Ceiling
Table 3 comparisons show:
| Dataset Type | Accuracy |
|---|---|
| Best Chinese Dataset | 72.8% |
| Best English Dataset | 76.8% |
| CheemsPreference | 85.7% |
Practical Implementation Guide
Training High-Performance Chinese RMs
- Data Preparation:
  - Minimum of 3,260 human-annotated samples
  - Recommended AI-to-human data ratio of 5:1
- Model Configuration:
  - Base Model: Qwen2.5-72B-Instruct
  - Regularization: 0.1
  - Learning Rate: 5e-6 (cosine decay)
- Training Techniques:
  - Greedy batch sampling prevents redundant computation
  - Gaussian prior regularization controls score inflation (see the loss sketch after this list)
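Reading "Gaussian prior regularization" as an L2 penalty on the raw reward scores, added to a standard Bradley-Terry pairwise loss, gives the PyTorch-style sketch below. This is an interpretation, not the paper's verified implementation; the 0.1 coefficient simply echoes the regularization value listed above.

```python
import torch
import torch.nn.functional as F

# Pairwise RM loss with a Gaussian (L2) prior on the scores -- a sketch under
# the assumption that "Gaussian prior regularization" penalizes large score
# magnitudes; the paper's exact formulation may differ.
def rm_loss(chosen_scores, rejected_scores, reg_coeff=0.1):
    # Bradley-Terry style ranking term: prefer chosen over rejected
    ranking = -F.logsigmoid(chosen_scores - rejected_scores).mean()
    # Gaussian prior keeps score magnitudes (and hence score inflation) in check
    prior = reg_coeff * (chosen_scores.pow(2) + rejected_scores.pow(2)).mean()
    return ranking + prior

# Example with dummy scores for two preference pairs
chosen = torch.tensor([1.2, 0.8])
rejected = torch.tensor([0.3, 1.0])
print(rm_loss(chosen, rejected))
```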
FAQ: Addressing Common Concerns
Q1: Why Not Use English Reward Models Directly?
Experiments show that English reward models:

- Suffer an average accuracy drop of 12.3% on Chinese prompts
- Have a 41% error rate on culture-specific tasks (e.g., idiom usage)
Q2: Is Human Annotation Essential?
Comparative results:
- Pure AI annotations: 77.8% accuracy
- Human-AI hybrid: 85.7% accuracy
- 23% improvement on complex instruction tasks
Q3: How to Evaluate Custom RMs?
Recommended two-tier testing (a minimal harness sketch follows this list):

- Basic Test: CheemsBench standard prompts
- Stress Test: real-user, out-of-distribution (OOD) instructions
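If both tiers are loaded into the same (human ranking, score dict) format used in the metric sketch earlier, a two-tier report takes only a few lines. The harness below is hypothetical and reuses the `accuracy_and_exact` helper defined above; it is not an official CheemsBench tool.

```python
# Hypothetical two-tier evaluation harness (not an official CheemsBench tool).
# Reuses the accuracy_and_exact() sketch from the metrics section above.
def evaluate_two_tier(standard_samples, realworld_samples):
    tiers = [("Basic test (standard prompts)", standard_samples),
             ("Stress test (real-user OOD)", realworld_samples)]
    for name, samples in tiers:
        acc, exact = accuracy_and_exact(samples)
        print(f"{name}: Acc={acc:.3f}, Exact={exact:.3f}")
```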
Future Directions & Limitations
Three Promising Applications:
- Value alignment for Chinese chatbots
- Cross-cultural preference modeling
- Low-resource language adaptation
Current Constraints:
- Potential cultural bias in annotator demographics
- Limited dialect/language variant coverage
- Challenges in long-text coherence assessment
Technical Note: All data in this guide comes from the paper “Cheems: A Practical Guidance for Building and Evaluating Chinese Reward Models from Scratch”. Implementation details are available in the paper's Appendix F, and the model code is open-sourced through official channels.