Building Chinese Reward Models from Scratch: A Practical Guide to CheemsBench and CheemsPreference

Why Do We Need Dedicated Chinese Reward Models?

In the development of large language models (LLMs), reward models (RMs) act as “value referees” that align AI outputs with human preferences. However, current research faces two critical challenges:

  1. Language Bias: 90% of existing studies focus on English, leaving Chinese applications underserved
  2. Data Reliability: Synthetic datasets dominate current approaches, failing to capture authentic human preferences

The Cheems project – a collaboration between the Institute of Software (Chinese Academy of Sciences) and Xiaohongshu – introduces the first comprehensive framework for Chinese reward model development. This guide explores its groundbreaking components.


Core Components Explained

CheemsBench: The Gold Standard for Chinese RM Evaluation

Key Features:

  • Diverse Data Sources:

    • 1,146 open-source prompts from 8 datasets (Humaneval-XL, GAOKAO-Bench, etc.)
    • 1,346 real-world human instructions capturing practical use cases
  • Innovative Evaluation Protocol:

    # Conflict resolution algorithm example (runnable sketch using networkx)
    import networkx as nx

    def resolve_conflicts(annotations):
        """annotations: iterable of (winner, loser) pairs from the human comparisons."""
        G = nx.DiGraph(annotations)            # edge u -> v means "u is preferred over v"
        C = nx.condensation(G)                 # merge each preference cycle (SCC) into one node
        order = nx.topological_sort(C)         # consistent ranking over the merged nodes
        return [C.nodes[n]["members"] for n in order]  # groups of tied responses, best first
    

    Five rounds of human comparisons over response triples, combined with graph-based conflict resolution, produce consistent rankings

Performance Metrics:

Metric Type          Formula                                     Use Case
Accuracy (Acc)       ∑(correct pairs) / total pairs              General performance
Exact Match (Exact)  ∑(fully ordered samples) / total samples    Robustness in complex scenarios
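
To make the two formulas concrete, here is a minimal per-sample sketch; the names pred_scores and gold_rank are illustrative, and ties in the human ranking are ignored. Dataset-level Acc and Exact are simply averages of these per-sample values.

    # Pairwise Accuracy and Exact Match for one sample, given model scores and a
    # human ranking (rank 1 = best). Assumes no ties in the human ranking.
    from itertools import combinations

    def accuracy(pred_scores, gold_rank):
        """Fraction of response pairs whose predicted order matches the human order."""
        pairs = list(combinations(range(len(gold_rank)), 2))
        correct = sum(
            (pred_scores[i] > pred_scores[j]) == (gold_rank[i] < gold_rank[j])
            for i, j in pairs
        )
        return correct / len(pairs)

    def exact_match(pred_scores, gold_rank):
        """1.0 if the predicted ordering of the sample is fully correct, else 0.0."""
        return float(accuracy(pred_scores, gold_rank) == 1.0)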

CheemsPreference: The Chinese Value Lexicon (27K Instructions)

Dataset Construction Workflow:

  1. Instruction Collection:

    • 27,861 authentic user queries
    • Hierarchical taxonomy with 8 categories/50+ subcategories (Fig 10)
  2. Response Generation:

    • Responses from open-source models such as Qwen2 and Llama3
    • Plus commercial APIs such as GPT-4 and Claude-3
  3. Annotation Strategy:

    graph TD
    A[Human-labeled Golden Data] --> B[Train Initial RM]
    C[GPT-4o Annotations] --> D[Initial RM Filtering]
    B --> D
    D --> E[Combined Dataset]
    

Technical Innovations:

  • Length debiasing: counteracts the tendency to prefer longer, more verbose responses
  • Distant supervision: a hybrid human-AI annotation pipeline (see the sketch below)
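
To make the distant-supervision and length-debias ideas concrete, here is a minimal filtering sketch: GPT-4o-labeled pairs are kept only when an initial RM trained on the human golden data agrees, and pairs whose winner is disproportionately longer are dropped. The helper name initial_rm and the max_len_ratio threshold are illustrative assumptions, not the paper's exact procedure.

    # Keep an AI-labeled pair only if the golden-data RM agrees with the GPT-4o label,
    # and discard pairs that look like pure length preferences.
    def filter_ai_pairs(ai_pairs, initial_rm, max_len_ratio=2.0):
        kept = []
        for prompt, chosen, rejected in ai_pairs:
            # Distant supervision: the initial RM must also prefer the GPT-4o winner
            if initial_rm(prompt, chosen) <= initial_rm(prompt, rejected):
                continue
            # Length debias: skip pairs whose winner is much longer than the loser
            if len(chosen) > max_len_ratio * len(rejected):
                continue
            kept.append((prompt, chosen, rejected))
        return kept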

Key Experimental Findings

The Chinese Performance Gap

Table 2 shows that even a top-performing model like Skywork-Reward-Gemma-2-27B has clear gaps:

  • Its accuracy drops from 75.4% on open-source prompts to 74.8% on real-world human instructions
  • It excels at math reasoning (82%) but struggles with text comprehension (61%)

Data Quality Dictates Model Ceiling

Table 3 comparisons show:

Dataset Type           Accuracy
Best Chinese Dataset   72.8%
Best English Dataset   76.8%
CheemsPreference       85.7%

Practical Implementation Guide

Training High-Performance Chinese RMs

  1. Data Preparation:

    • Minimum 3,260 human-annotated samples
    • Recommended 5:1 AI-human data ratio
  2. Model Configuration:

    - Base Model: Qwen2.5-72B-Instruct
    - Regularization: 0.1
    - Learning Rate: 5e-6 (cosine decay)
    
  3. Training Techniques:

    • Greedy batch sampling reduces redundant computation
    • Gaussian prior regularization keeps reward scores from inflating (see the training-loss sketch below)
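
Putting these pieces together, the sketch below shows one pairwise training loss with a Bradley-Terry term plus a Gaussian prior on the reward scores, implemented as an L2 penalty pulling scores toward zero. The reward_model callable and the lambda_reg weight are illustrative; this is one reading of the regularization idea above, not the paper's exact implementation.

    import torch
    import torch.nn.functional as F

    def pairwise_rm_loss(reward_model, chosen_inputs, rejected_inputs, lambda_reg=0.1):
        # Scalar rewards for the preferred and dispreferred responses, shape (batch,)
        r_chosen = reward_model(chosen_inputs)
        r_rejected = reward_model(rejected_inputs)

        # Bradley-Terry pairwise loss: push r_chosen above r_rejected
        bt_loss = -F.logsigmoid(r_chosen - r_rejected).mean()

        # Gaussian prior on scores: L2 penalty that discourages unbounded score inflation
        prior_loss = lambda_reg * (r_chosen.pow(2) + r_rejected.pow(2)).mean()

        return bt_loss + prior_loss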

FAQ: Addressing Common Concerns

Q1: Why Not Use English Reward Models Directly?

Experiments show that English reward models:

  • Suffer an average accuracy drop of 12.3% in Chinese
  • Make errors on 41% of culture-specific tasks (e.g., idiom usage)

Q2: Is Human Annotation Essential?

Comparative results:

  • Pure AI annotations: 77.8% accuracy
  • Human-AI hybrid: 85.7% accuracy
  • 23% improvement on complex instruction tasks

Q3: How to Evaluate Custom RMs?

Recommended two-tier testing:

  1. Basic Test: CheemsBench standard prompts
  2. Stress Test: real-user, out-of-distribution (OOD) instructions (a minimal harness is sketched below)
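
A small illustrative harness for the two-tier setup, reusing the accuracy helper sketched earlier; the rm callable and the structure of each sample are assumptions for the example, not a prescribed API.

    def evaluate_rm(rm, tiers):
        """tiers: dict mapping tier name -> list of (prompt, responses, human_ranking)."""
        for name, samples in tiers.items():
            accs = [
                accuracy([rm(prompt, resp) for resp in responses], gold_rank)
                for prompt, responses, gold_rank in samples
            ]
            print(f"{name}: Acc = {sum(accs) / len(accs):.3f}")

    # e.g. evaluate_rm(my_rm, {"CheemsBench standard": std_samples, "Real-user OOD": ood_samples})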

Future Directions & Limitations

Three Promising Applications:

  1. Value alignment for Chinese chatbots
  2. Cross-cultural preference modeling
  3. Low-resource language adaptation

Current Constraints:

  • Potential cultural bias in annotator demographics
  • Limited dialect/language variant coverage
  • Challenges in long-text coherence assessment

Technical Note: All data comes from the paper “Cheems: A Practical Guidance for Building and Evaluating Chinese Reward Models from Scratch”. Implementation details available in original Appendix F. Model code is open-sourced through official channels.