Building Chinese Reward Models from Scratch: A Practical Guide to CheemsBench and CheemsPreference

Why Do We Need Dedicated Chinese Reward Models?

In the development of large language models (LLMs), reward models (RMs) act as “value referees” that align AI outputs with human preferences. However, current research faces two critical challenges:

  1. Language Bias: 90% of existing studies focus on English, leaving Chinese applications underserved
  2. Data Reliability: Synthetic datasets dominate current approaches, failing to capture authentic human preferences

The Cheems project – a collaboration between the Institute of Software (Chinese Academy of Sciences) and Xiaohongshu – introduces the first comprehensive framework for Chinese reward model development. This guide explores its groundbreaking components.


Core Components Explained

CheemsBench: The Gold Standard for Chinese RM Evaluation

Key Features:

  • Diverse Data Sources:

    • 1,146 open-source prompts from 8 datasets (Humaneval-XL, GAOKAO-Bench, etc.)
    • 1,346 real-world human instructions capturing practical use cases
  • Innovative Evaluation Protocol:

    # Conflict resolution algorithm example (runnable sketch using networkx)
    import networkx as nx

    def resolve_conflicts(annotations):
        """annotations: iterable of (winner, loser) pairs from the human comparisons."""
        G = nx.DiGraph(annotations)            # edge u -> v means "u is preferred over v"
        C = nx.condensation(G)                 # merge each preference cycle (SCC) into one node
        order = nx.topological_sort(C)         # consistent ranking over the merged nodes
        return [C.nodes[n]["members"] for n in order]  # groups of tied responses, best first
    

    Five rounds of human comparisons over response triples, combined with graph-based conflict resolution, produce consistent rankings

Performance Metrics:

Metric Type          Formula                                     Use Case
Accuracy (Acc)       ∑(correct pairs) / total pairs              General performance
Exact Match (Exact)  ∑(fully ordered samples) / total samples    Robustness in complex scenarios
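
To make the two formulas concrete, here is a minimal per-sample sketch; the names pred_scores and gold_rank are illustrative, and ties in the human ranking are ignored. Dataset-level Acc and Exact are simply averages of these per-sample values.

    # Pairwise Accuracy and Exact Match for one sample, given model scores and a
    # human ranking (rank 1 = best). Assumes no ties in the human ranking.
    from itertools import combinations

    def accuracy(pred_scores, gold_rank):
        """Fraction of response pairs whose predicted order matches the human order."""
        pairs = list(combinations(range(len(gold_rank)), 2))
        correct = sum(
            (pred_scores[i] > pred_scores[j]) == (gold_rank[i] < gold_rank[j])
            for i, j in pairs
        )
        return correct / len(pairs)

    def exact_match(pred_scores, gold_rank):
        """1.0 if the predicted ordering of the sample is fully correct, else 0.0."""
        return float(accuracy(pred_scores, gold_rank) == 1.0)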

CheemsPreference: The Chinese Value Lexicon (27K Instructions)

Dataset Construction Workflow:

  1. Instruction Collection:

    • 27,861 authentic user queries
    • Hierarchical taxonomy with 8 categories/50+ subcategories (Fig 10)
  2. Response Generation:

    • Responses from open-source models such as Qwen2 and Llama3
    • Plus commercial APIs such as GPT-4 and Claude-3
  3. Annotation Strategy:

    graph TD
    A[Human-labeled Golden Data] --> B[Train Initial RM]
    C[GPT-4o Annotations] --> D[Initial RM Filtering]
    B --> D
    D --> E[Combined Dataset]
    

Technical Innovations:

  • Length debiasing: counteracts the tendency to prefer longer, more verbose responses
  • Distant supervision: a hybrid human-AI annotation pipeline (see the sketch below)
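
To make the distant-supervision and length-debias ideas concrete, here is a minimal filtering sketch: GPT-4o-labeled pairs are kept only when an initial RM trained on the human golden data agrees, and pairs whose winner is disproportionately longer are dropped. The helper name initial_rm and the max_len_ratio threshold are illustrative assumptions, not the paper's exact procedure.

    # Keep an AI-labeled pair only if the golden-data RM agrees with the GPT-4o label,
    # and discard pairs that look like pure length preferences.
    def filter_ai_pairs(ai_pairs, initial_rm, max_len_ratio=2.0):
        kept = []
        for prompt, chosen, rejected in ai_pairs:
            # Distant supervision: the initial RM must also prefer the GPT-4o winner
            if initial_rm(prompt, chosen) <= initial_rm(prompt, rejected):
                continue
            # Length debias: skip pairs whose winner is much longer than the loser
            if len(chosen) > max_len_ratio * len(rejected):
                continue
            kept.append((prompt, chosen, rejected))
        return kept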

Key Experimental Findings

The Chinese Performance Gap

Table 2 shows that even a top-performing model like Skywork-Reward-Gemma-2-27B has clear gaps:

  • Its accuracy drops from 75.4% on open-source prompts to 74.8% on real-world human instructions
  • It excels at math reasoning (82%) but struggles with text comprehension (61%)

Data Quality Dictates Model Ceiling

Table 3 comparisons show:

Dataset Type           Accuracy
Best Chinese Dataset   72.8%
Best English Dataset   76.8%
CheemsPreference       85.7%

Practical Implementation Guide

Training High-Performance Chinese RMs

  1. Data Preparation:

    • Minimum 3,260 human-annotated samples
    • Recommended 5:1 AI-human data ratio
  2. Model Configuration:

    - Base Model: Qwen2.5-72B-Instruct
    - Regularization: 0.1
    - Learning Rate: 5e-6 (cosine decay)
    
  3. Training Techniques:

    • Greedy batch sampling reduces redundant computation
    • Gaussian prior regularization keeps reward scores from inflating (see the training-loss sketch below)
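
Putting these pieces together, the sketch below shows one pairwise training loss with a Bradley-Terry term plus a Gaussian prior on the reward scores, implemented as an L2 penalty pulling scores toward zero. The reward_model callable and the lambda_reg weight are illustrative; this is one reading of the regularization idea above, not the paper's exact implementation.

    import torch
    import torch.nn.functional as F

    def pairwise_rm_loss(reward_model, chosen_inputs, rejected_inputs, lambda_reg=0.1):
        # Scalar rewards for the preferred and dispreferred responses, shape (batch,)
        r_chosen = reward_model(chosen_inputs)
        r_rejected = reward_model(rejected_inputs)

        # Bradley-Terry pairwise loss: push r_chosen above r_rejected
        bt_loss = -F.logsigmoid(r_chosen - r_rejected).mean()

        # Gaussian prior on scores: L2 penalty that discourages unbounded score inflation
        prior_loss = lambda_reg * (r_chosen.pow(2) + r_rejected.pow(2)).mean()

        return bt_loss + prior_loss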

FAQ: Addressing Common Concerns

Q1: Why Not Use English Reward Models Directly?

Experiments show that English reward models:

  • Suffer an average accuracy drop of 12.3% in Chinese
  • Make errors on 41% of culture-specific tasks (e.g., idiom usage)

Q2: Is Human Annotation Essential?

Comparative results:

  • Pure AI annotations: 77.8% accuracy
  • Human-AI hybrid: 85.7% accuracy
  • 23% improvement on complex instruction tasks

Q3: How to Evaluate Custom RMs?

Recommended two-tier testing:

  1. Basic Test: CheemsBench standard prompts
  2. Stress Test: real-user, out-of-distribution (OOD) instructions (a minimal harness is sketched below)
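
A small illustrative harness for the two-tier setup, reusing the accuracy helper sketched earlier; the rm callable and the structure of each sample are assumptions for the example, not a prescribed API.

    def evaluate_rm(rm, tiers):
        """tiers: dict mapping tier name -> list of (prompt, responses, human_ranking)."""
        for name, samples in tiers.items():
            accs = [
                accuracy([rm(prompt, resp) for resp in responses], gold_rank)
                for prompt, responses, gold_rank in samples
            ]
            print(f"{name}: Acc = {sum(accs) / len(accs):.3f}")

    # e.g. evaluate_rm(my_rm, {"CheemsBench standard": std_samples, "Real-user OOD": ood_samples})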

Future Directions & Limitations

Three Promising Applications:

  1. Value alignment for Chinese chatbots
  2. Cross-cultural preference modeling
  3. Low-resource language adaptation

Current Constraints:

  • Potential cultural bias in annotator demographics
  • Limited dialect/language variant coverage
  • Challenges in long-text coherence assessment

Technical Note: All data comes from the paper “Cheems: A Practical Guidance for Building and Evaluating Chinese Reward Models from Scratch”. Implementation details available in original Appendix F. Model code is open-sourced through official channels.