

RedOne 2.0: Rethinking Domain-Specific LLM Post-Training for Social Networking Services

Introduction: Why Do Social Networking Services Need Specialized Large Language Models?

Core Question This Section Aims to Answer: What unique challenges do general-purpose large language models face when deployed in social networking services? General-purpose LLMs frequently underperform in social networking environments due to rapidly evolving trends, diverse cultural contexts, and heterogeneous workloads. Social platforms contain constantly changing content: new memes emerge overnight, community norms shift daily, and users communicate in multiple languages across different cultural backgrounds. These factors cause general models to misinterpret community-specific rules, over-enforce or under-enforce policies, and experience performance drift as conventions evolve.

More specifically, social networking services present exceptionally heterogeneous workloads—from real-time content moderation and abuse response to recommendation-driven dialogues, creator assistance, and community operations. Each task carries distinct latency, safety, and tone requirements. Traditional supervised fine-tuning methods often trigger a “seesaw effect” when specializing models: gains on in-distribution tasks come at the expense of out-of-distribution robustness. This problem proves particularly severe for smaller parameter models, which are more susceptible to catastrophic forgetting as new domain patterns overwrite previously learned skills.



The Three-Stage Training Paradigm of RedOne 2.0

Stage 1: Exploratory Learning

Core Question This Section Aims to Answer: How does exploratory learning establish initial domain alignment for social networking services? Exploratory learning immerses the model in diverse social networking data to capture the breadth of task distributions and domain-specific interaction patterns while systematically identifying weaknesses. Rather than committing to narrowly scoped objectives early, this stage allows the model to explore the domain landscape broadly while preserving general competencies.

In practice, the research team curated approximately 750,000 SNS entries covering 75 heterogeneous tasks and all capability types, including post taxonomy, query classification, machine reading comprehension, post view search, and SNS domain translation. To maintain reasoning and general competence, they supplemented this with 50,000 general-domain instances containing rationales, widely recognized as beneficial for preserving model knowledge and supporting structured reasoning.

Author’s Reflection: From this stage’s design, we recognize that initial alignment shouldn’t be rushed. Giving models adequate exploration space to discover which tasks prove most challenging often proves more effective than directly instructing them what to learn. This diagnostic approach provides precise mapping for subsequent targeted repairs.

Reward function design represents a crucial innovation in this stage. Given substantial variation in format and content across the downstream scenarios aligned with RedOne 2.0, the team developed task-specific reward mechanisms (sketched in code after this list):


  • Exact Match: For close-ended problems with determinate answers like classification or multiple-choice questions, focusing on answer consistency through exact match scoring

  • Metrics-Based: For open-ended tasks like translation, defining rewards using task-specific evaluation metrics rather than binary correctness standards

  • Sandbox Simulation: For tasks like code generation, creating execution environments to run generated solutions and evaluate them based on obtained results

  • Pattern Matching: Emphasizing adherence to specified formats over semantic content itself, addressing generative LLM output instability
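The paper does not publish its reward implementations, but a minimal Python sketch of this kind of task-routed reward dispatch might look as follows. The task names, the sandbox details, and the use of sacrebleu for the metrics-based branch are illustrative assumptions, not the team's actual code.

```python
import re
import subprocess
import tempfile

from sacrebleu import sentence_bleu  # graded reward for open-ended tasks


def exact_match_reward(response: str, answer: str) -> float:
    """Close-ended tasks (classification, multiple choice): binary consistency."""
    return 1.0 if response.strip().lower() == answer.strip().lower() else 0.0


def metric_reward(response: str, reference: str) -> float:
    """Open-ended tasks (e.g. translation): metric score instead of binary correctness."""
    return sentence_bleu(response, [reference]).score / 100.0  # normalize to [0, 1]


def sandbox_reward(code: str, test_snippet: str, timeout: float = 5.0) -> float:
    """Code generation: execute the solution plus its tests in an isolated process."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n" + test_snippet)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0


def pattern_reward(response: str, pattern: str) -> float:
    """Format adherence: reward matching the required output template."""
    return 1.0 if re.fullmatch(pattern, response.strip(), re.DOTALL) else 0.0


# Hypothetical routing table: task type -> reward function
REWARD_FNS = {
    "classification": exact_match_reward,
    "translation": metric_reward,
    "code_generation": sandbox_reward,
    "structured_output": pattern_reward,
}


def compute_reward(task_type: str, response: str, target) -> float:
    return REWARD_FNS[task_type](response, target)
```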

Stage 2: Targeted Fine-Tuning

Core Question This Section Aims to Answer: How does targeted fine-tuning specifically address model deficiencies in social networking service tasks? Targeted fine-tuning directly repairs weaknesses identified during the exploratory learning stage, emphasizing defect resolution while preserving previous gains through careful blending of challenging SNS data with filtered general data.

For data preparation, the team constructed a dataset of 1.8 million examples, comprising 1.7 million SNS instances and 100,000 general-domain instances. The SNS portion was drawn from the pre-training corpus, focusing specifically on failure tasks identified through benchmark evaluations in the previous stage.

A key innovation involves soft labels: for each prompt, the team generated eight candidate completions with the first-stage model, scored them using composite quality signals from a judge model, and selected the best to serve as the supervision target. These soft labels not only mitigate catastrophic forgetting of general knowledge during SFT but also reduce the distribution shift between ground-truth labels and the first-stage model's learned distribution, improving learning efficiency on SNS tasks.
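As a rough illustration of this best-of-n construction, the sketch below samples eight completions from the stage-one model and keeps the highest-scoring one as the supervision target. The `policy.generate` and `judge.score` interfaces are hypothetical stand-ins, not the paper's actual APIs.

```python
from dataclasses import dataclass


@dataclass
class SoftLabelExample:
    prompt: str
    target: str   # best stage-one completion, used as the SFT label
    score: float  # composite quality signal from the judge model


def build_soft_label(prompt: str, policy, judge, n: int = 8) -> SoftLabelExample:
    """Sample n candidates from the first-stage model and keep the judge's favorite.

    `policy.generate` draws a sampled completion; `judge.score` returns a
    composite quality score. Both are assumed interfaces for illustration.
    """
    candidates = [policy.generate(prompt, temperature=1.0) for _ in range(n)]
    scored = [(judge.score(prompt, c), c) for c in candidates]
    best_score, best_completion = max(scored, key=lambda pair: pair[0])
    return SoftLabelExample(prompt=prompt, target=best_completion, score=best_score)
```

Because the target comes from the model's own output distribution rather than an external ground truth, fitting it perturbs the stage-one policy less, which is the regularization effect described above.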

During training, optimization aims to close gaps on underperforming SNS tasks while preserving gains from the previous stage, using a plain SFT objective over a mixture of hard SNS examples and a small set of general-domain examples carrying soft labels. This approach yields consistent improvements on previously weak SNS tasks while maintaining first-stage gains across most capability types.

Practical Application Scenario: Imagine a social platform needing to improve its content classification system, where the model underperforms in identifying emerging content types like new advertisement formats or potentially harmful content. The targeted fine-tuning stage can specifically increase training data for these categories while maintaining model capability on traditional content classification through general data, avoiding the pitfall of solving one problem while creating another.

Stage 3: Refinement Learning

Core Question This Section Aims to Answer: How does refinement learning consolidate gains from previous stages? Refinement learning consolidates prior improvements and achieves further performance enhancements by applying reinforcement learning after the previous SFT-based stage, with training again centered on SNS data and emphasizing challenging subsets.

Specifically, the team utilized approximately 400,000 examples drawn from SNS and general sources similar to the previous stage, while increasing the proportion of samples with rationales to 57.18%. This design further preserves the model's reasoning ability and benefits a broad range of downstream tasks. The policy model is initialized from the prior stage to provide a strong starting point, then refined with DAPO.
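For intuition, here is a minimal sketch of the group-relative advantage computation at the heart of DAPO-style training, including its dynamic-sampling rule that discards prompts whose sampled completions all receive identical rewards and therefore carry no learning signal. This is a simplified reconstruction from DAPO's public description, not RedOne 2.0's implementation.

```python
from typing import Optional

import numpy as np


def group_advantages(rewards: np.ndarray, eps: float = 1e-6) -> Optional[np.ndarray]:
    """Group-relative advantages for one prompt's G sampled completions.

    `rewards` holds the scalar reward of each completion sampled for the same
    prompt. DAPO's dynamic sampling drops degenerate groups (all rewards equal),
    since they yield a zero gradient.
    """
    if rewards.std() < eps:
        return None  # dynamic sampling: skip this prompt and resample another
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Example: 8 completions for one prompt, scored by a task-specific reward
rewards = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0])
adv = group_advantages(rewards)  # positive for correct, negative for incorrect
```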

Author’s Reflection: The most surprising discovery in the three-stage training is that RL still provides significant gains even after SFT. Model behavior remains optimizable post-SFT, and RL can further smooth performance across tasks, finding better balance points. This suggests model optimization represents a gradual process without instant solutions.

After this stage, model behavior stabilizes and smooths within the explored solution space, yielding further improvements on both SNS-specific and general tasks. Compared to the previous stage, RL-based refinement delivers better convergence and more robust domain adaptation.

Experimental Validation: How Does RedOne 2.0 Perform?

Core Question This Section Aims to Answer: How does RedOne 2.0 perform across various benchmarks and real-world applications? RedOne 2.0 demonstrates strong, balanced results across multiple evaluation benchmarks, including general capabilities and SNS-specific competencies, outperforming both open-source and proprietary baselines of comparable scale.

Specifically, the 4B-parameter variant achieves the highest average score on General-Bench at 70.8, exceeding larger open models such as Qwen3-8B and GLM-4-9B, and matching or surpassing some proprietary LLMs and models with over 100B parameters. This demonstrates that the proposed three-stage post-training pipeline enhances both general and domain-specific capabilities even at smaller scales.

On SNS-Bench, which evaluates domain-specific understanding and reasoning across eight tasks, RedOne 2.0 continues to lead within its scale group. The 4B variant achieves an average score of 67.57, outperforming all sub-10B baselines and exceeding the previous RedOne-7B model by 0.69 despite fewer parameters. Similarly, the 30B-A3B version achieves 69.04, matching or surpassing much larger models like GPT-4o and GLM-4.5.

On SNS-TransBench, which measures cross-lingual understanding and translation quality between Chinese and English, RedOne 2.0 maintains competitive results across BLEU and chrF++ metrics. Both 4B and 30B-A3B variants achieve top-2 overall averages at 47.67 and 49.54 respectively, outperforming all similarly scaled models. Consistent performance across both translation directions indicates RedOne 2.0’s alignment pipeline preserves linguistic versatility while improving domain adaptation.
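For readers reproducing this kind of evaluation, BLEU and chrF++ can be computed with the sacrebleu library as shown below. The example sentences are placeholders, and the benchmark's exact tokenization settings are not specified in this article.

```python
import sacrebleu

hypotheses = ["The quick brown fox jumps over the lazy dog."]   # model outputs
references = [["A quick brown fox jumped over the lazy dog."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrfpp = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)  # word_order=2 gives chrF++

print(f"BLEU: {bleu.score:.2f}  chrF++: {chrfpp.score:.2f}")
```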

| Model Scale | General-Bench | SNS-Bench | SNS-TransBench |
| --- | --- | --- | --- |
| 4B | 70.80 | 67.57 | 47.67 |
| 8B | 69.27 | 65.82 | 46.72 |
| 30B-A3B | 75.17 | 69.04 | 49.54 |

Performance of RedOne 2.0 Across Different Scales

Practical Application Scenario: A social platform requiring multilingual content processing needs both accurate classification capabilities and high-quality translation performance. RedOne 2.0’s balanced performance means deploying different models for various tasks becomes unnecessary—a single model handles multiple requirements, significantly simplifying system architecture and maintenance costs.

Incremental Performance and Comparative Analysis

Incremental Impact of Three-Stage Training

Core Question This Section Aims to Answer: What does each of RedOne 2.0’s three training stages contribute to final performance? By progressively adding each training stage and evaluating performance changes, we can clearly see each stage’s value.

The RL-based exploratory learning stage establishes a strong foundation, lifting performance to 71.25 on General-Bench, 62.27 on SNS-Bench, and 43.35 on SNS-TransBench, highlighting its effectiveness in consistently enhancing the base model’s overall capability.

The SFT-based targeted fine-tuning stage then addresses the SNS-domain weaknesses exposed in the previous stage, raising scores to 65.67 on SNS-Bench and 47.72 on SNS-TransBench, at the cost of a slight 1.21-point drop on General-Bench.

Finally, the RL-based refinement learning stage balances performance across tasks, increasing the overall average from 61.14 to 62.01 and yielding final scores of 70.80 on General-Bench, 67.57 on SNS-Bench, and 47.67 on SNS-TransBench.

Comparison with Traditional Methods

Core Question This Section Aims to Answer: What advantages does RedOne 2.0 offer over traditional SFT-followed-by-RL approaches? Because RedOne 2.0’s most notable departure from traditional domain-specific post-training is its shift from an SFT-centric to an RL-centric curriculum, the team compared it against a naive SFT-followed-by-RL baseline.

This baseline typically begins with SFT for domain adaptation, followed by RL to align the model with human preferences or downstream objectives. While SFT effectively boosts performance in SNS domains, it often causes a “seesaw” effect, significantly reducing general capability from 69.80 to 63.65. Although subsequent RL attempts to mitigate this issue, overall improvements across the three benchmarks remain limited.

In contrast, RedOne 2.0 refines the process: starting with RL to establish domain priors, followed by SFT for targeted enhancement, and concluding with RL for final optimization. This paradigm effectively avoids the trade-off between general and domain-specific performance and surpasses the naive baseline by 1.00 on General-Bench, 4.54 on SNS-Bench, and 1.72 on SNS-TransBench.
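Schematically, the two curricula compare as in the sketch below. Only the stage ordering is taken from the source; the trainer functions are illustrative stubs standing in for the SFT and DAPO procedures described earlier, not a real API.

```python
from typing import Iterable

# Illustrative stand-ins, not a real training API.
Model = object
Dataset = Iterable


def rl_train(model: Model, data: Dataset) -> Model:
    """Stand-in for the RL procedure (e.g. DAPO) described above."""
    raise NotImplementedError


def sft_train(model: Model, data: Dataset) -> Model:
    """Stand-in for plain supervised fine-tuning."""
    raise NotImplementedError


def naive_sft_then_rl(base: Model, sns: Dataset) -> Model:
    """Conventional baseline: SFT first, then RL. Prone to the seesaw effect."""
    model = sft_train(base, sns)  # strong SNS gains, but general ability drops
    return rl_train(model, sns)   # RL only partially recovers generality


def redone2_rl_sft_rl(base: Model, sns: Dataset, general: Dataset) -> Model:
    """RedOne 2.0 curriculum: explore (RL), repair (SFT), refine (RL)."""
    model = rl_train(base, list(sns) + list(general))  # Stage 1: broad exploration
    model = sft_train(model, sns)                      # Stage 2: targeted repair of weaknesses
    return rl_train(model, sns)                        # Stage 3: consolidate and rebalance
```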

Author’s Reflection: The most striking result from this comparison is the importance of training order. Intuition might suggest starting with SFT to build a foundation and then refining with RL, but the evidence shows that exploring first, then repairing targeted weaknesses, and finally refining works better. This challenges some traditional assumptions in domain adaptation and suggests new directions for future model optimization.

Comparison with Task-Specific Fine-Tuning

Core Question This Section Aims to Answer: What advantages does a unified training framework offer over traditional task-specific fine-tuning? The team also compared the RedOne 2.0 framework against conventional task-specific fine-tuning approaches, with RedOne 2.0 designed for unified optimization across all tasks.

Task-specific fine-tuning produces strong performance on its target objectives. For instance, a Qwen3-4B model fine-tuned specifically for query generation achieves 49.24, and another fine-tuned for hashtag selection reaches 90.12. However, RedOne 2.0 4B, trained concurrently on a mixture of all tasks, delivers robust and highly competitive results across the entire benchmark spectrum.

Notably, it outperforms task-specific fine-tuned Qwen3-4B models on machine reading comprehension by 9.00 points and comment highlight words by 11.87 points. It also maintains strong performance on query correlation at 60.92 and SNS translation at 47.67. These results substantiate that a unified training framework can effectively capture and leverage beneficial inter-task relationships, enabling a single model to achieve comprehensive and superior capability.



Real-World Applications and Case Studies

Online Deployment and Business Impact

Core Question This Section Aims to Answer: How does RedOne 2.0 perform when deployed on actual social networking platforms? The team deployed RedOne 2.0 on a large-scale social networking platform with over 3 million users to recommend personalized re-created post titles in real-time. Each pre-published title routes to the service, which performs semantic analysis and produces an enhanced title preserving original intent while optimizing for engagement.
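A heavily simplified sketch of what such a routing service might look like is shown below. The prompt wording, the `llm.complete` client interface, and the fallback behavior are illustrative assumptions, not the platform's production code.

```python
REWRITE_PROMPT = """You are a title optimization assistant for a social platform.
Rewrite the following post title to improve engagement while strictly preserving
its original intent and all key factual details (places, numbers, product names).

Original title: {title}
Rewritten title:"""


def enhance_title(title: str, llm) -> str:
    """Route one pre-published title through the rewriting model.

    `llm.complete` is an assumed client interface returning the model's text.
    """
    draft = llm.complete(REWRITE_PROMPT.format(title=title), max_tokens=64)
    return draft.strip() or title  # fall back to the original if rewriting fails
```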

Evaluation covered both business impact and content quality. The primary business indicator is Advertiser Value (AdvV), reflecting the value delivered to advertisers through audience quality and engagement. Content quality was measured through human review across four dimensions: vagueness, practicality, authenticity, and interactivity.

Online testing conducted over several weeks across millions of posts showed consistent gains. Advertiser Value increased by 0.43%, a statistically significant improvement at platform scale. Human evaluation reported an 11.9% reduction in vague titles and increases of 7.1% in practical titles, 12.9% in authentic titles, and 25.8% in interactive titles. The strong rise in interactive titles indicates the model learns linguistic patterns that encourage responses like comments and shares.
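As a generic illustration of how a small lift like +0.43% can be statistically significant at platform scale, the sketch below runs a Welch t-test over per-post AdvV samples. The data here is synthetic and the team's actual testing methodology is not described in the source.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic per-post advertiser-value samples for control and treatment arms.
control = rng.normal(loc=1.0000, scale=0.30, size=500_000)
treatment = rng.normal(loc=1.0043, scale=0.30, size=500_000)  # +0.43% mean lift

t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)  # Welch t-test
lift = treatment.mean() / control.mean() - 1.0
print(f"lift = {lift:+.2%}, p = {p_value:.2e}")  # tiny p: significant at this sample size
```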

Practical Application Scenario: A content creator publishes a post about travel tips on a social platform, with a potentially bland original title. RedOne 2.0 can analyze the content and generate a more engaging title, such as transforming “Transportation Card Guide for Osaka and Kyoto” into “Must-See for Kansai Travel! Avoid These Transportation Card Traps,” preserving key information while increasing click appeal.

Case Study: The Good and Bad of Title Rewriting

Core Question This Section Aims to Answer: What are the strengths and limitations of RedOne 2.0 in actual content optimization? By comparing RedOne 2.0’s outputs against baselines, we can qualitatively assess its capabilities.

Excellent Case 1: Dehumidification Content


  • Original Title: “Plum rain season, a great helper for dehumidification and mold prevention”

  • Baseline Model: “Dehumidification essential for the plum rain season, a fresh choice for a dry and comfortable life”

  • RedOne 2.0: “Say goodbye to ‘steamy’ homes! Rescue your plum rain season”

Excellent Case 2: Wedding Photography Content


  • Original Title: “Beijing wedding photo recommendations: 17 lawn wedding photo outdoor spots”

  • Baseline Model: “Dreamy lawn wedding photos, capturing the most beautiful moments”

  • RedOne 2.0: “Escape the studio! 17 stunning lawns capture cinematic-level wedding photos”

Problematic Case: Transportation Card Content


  • Original Title: “Don’t buy the wrong transportation card for Osaka and Kyoto! A lesson learned the hard way!”

  • Baseline Model: “A guide to Japanese transportation cards—stop making these mistakes!”

  • RedOne 2.0: “Avoid these pitfalls for your Kansai trip, check out the guide now”

Author’s Reflection: From these cases, we observe RedOne 2.0 excels at optimizing engagement and stylistic appeal but sometimes over-optimizes at the expense of essential details. Future work should reinforce faithfulness constraints while preserving expressiveness. This indicates that in real deployments, finding a balance between creativity and accuracy is crucial, potentially achievable through reward-function adjustments or additional constraints.

Reflections and Insights

Core Question This Section Aims to Answer: What key lessons have we learned from developing RedOne 2.0? The development process of RedOne 2.0 challenged several traditional assumptions in domain-specific LLM post-training. The most significant insight concerns the importance of training sequence—starting with RL to establish domain priors, then proceeding to targeted SFT, and finally refining with RL again proves more effective than traditional SFT-first approaches.

Another crucial takeaway involves data efficiency. RedOne 2.0 achieved an average performance lift of approximately 8.74 points over the base model using less than half the data required by the original RedOne, demonstrating the superior data efficiency of an RL-centric curriculum. This suggests that in domain adaptation, data quality and usage methodology matter more than raw data volume.

From an architectural perspective, RedOne 2.0 proved that strong performance is achievable even at compact scales through carefully designed training pipelines. The 4B parameter variant surpassed the 7B counterpart by 2.41 on average, indicating that robust performance at compact scales opens new possibilities for deployment in resource-constrained environments.

Unique Insight: The most surprising discovery reveals that unified training frameworks can effectively capture and leverage inter-task relationships, enabling single models to achieve comprehensive, superior capability rather than optimizing separate models for each task. This not only simplifies deployment architecture but also enhances overall system synergy.

Finally, online deployment experience emphasized the importance of balancing engagement optimization with content faithfulness. Models excelled at generating appealing content but required constraints to ensure critical information preservation. This points toward the significance of reward function design in future work, needing to incorporate both creative and accuracy signals.

Conclusion

RedOne 2.0 presents a revolutionary framework for domain-specific large language model post-training in social networking services. Through its progressive, RL-prioritized three-stage pipeline—exploratory learning, targeted fine-tuning, and refinement learning—it effectively addresses the challenges of heterogeneity, dynamism, and cultural diversity in social networking environments.

Unlike conventional approaches, RedOne 2.0 avoids catastrophic forgetting and unstable trade-offs while demonstrating robust data efficiency, stable adaptation, and strong generalization even at compact model scales. Across both benchmark evaluations and real-platform deployments, it proves capable of enhancing domain-specific competencies without sacrificing generality, safety, or usability.

For enterprises and developers seeking to deploy large language models in social networking services, RedOne 2.0 establishes a competitive, cost-effective, and scalable baseline, marking a significant advancement in domain-specific model optimization.

Practical Summary and Actionable Checklist

Key Steps for Implementing RedOne 2.0-Style Training

  1. Data Preparation and Categorization


    • Collect social networking domain data covering at least 75 heterogeneous tasks

    • Prepare high-quality general domain data, particularly samples with rationales

    • Standardize data format into question-and-answer pairs
  2. Three-Stage Training Pipeline


    • Stage 1: Exploratory Learning – Use RL for initial domain alignment and weakness diagnosis

    • Stage 2: Targeted Fine-Tuning – Use SFT for specific defect repair, blending general data to prevent forgetting

    • Stage 3: Refinement Learning – Reapply RL to consolidate improvements and balance different tasks
  3. Reward Function Design


    • Develop specific reward mechanisms for different task types

    • Combine exact matching, metric evaluation, sandbox simulation, and pattern matching

    • Ensure reward signals align with ultimate business objectives
  4. Deployment and Monitoring


    • Gradually deploy in live traffic using A/B testing to verify effectiveness

    • Monitor business metrics and content quality indicators

    • Establish continuous evaluation and iteration mechanisms

One-Page Summary


  • Problem: Large language models in social networking services face heterogeneous workloads, rapidly changing norms, and multilingual cultural diversity challenges, with traditional methods causing “seesaw effects” and catastrophic forgetting


  • Solution: RedOne 2.0’s three-stage training paradigm – exploratory learning (RL), targeted fine-tuning (SFT), and refinement learning (RL)


  • Key Innovations: RL-first approach, task-specific reward functions, soft label regularization, unified multi-task training


  • Core Advantages: High data efficiency (8.74 improvement with half the data), scale-friendly (4B surpasses 7B), balanced performance (simultaneously enhancing general and domain capabilities)


  • Validation Results: Comprehensive leadership across general benchmarks, social networking benchmarks, and translation benchmarks; online deployment increased advertiser value by 0.43% with significant content quality metric improvements


  • Applicable Scenarios: Social platform content moderation, personalized recommendation, multilingual translation, content generation and optimization

Frequently Asked Questions (FAQ)

What advantages does RedOne 2.0 offer over traditional SFT methods?
RedOne 2.0 avoids the “seesaw effect” of traditional SFT, enhancing domain-specific performance without sacrificing general capabilities, while demonstrating higher data efficiency—achieving greater improvements using less than half the data.

How does RedOne 2.0 prevent catastrophic forgetting?
By blending social networking data with general data throughout the three-stage training process, particularly using soft labels as regularizers during targeted fine-tuning to reduce distributional transformation impact.

Which specific social networking scenarios is RedOne 2.0 suitable for?
Applicable to over 75 social networking tasks including content categorization, tag recommendation, query correlation analysis, reading comprehension, entity extraction, gender-sensitive analysis, comment highlighting, and query generation.

How much computational resources does implementing RedOne 2.0 training require?
While specific resources depend on model scale, RedOne 2.0 achieves strong performance even at compact scales (like 4B parameters), making it suitable for resource-constrained environments.

How does RedOne 2.0 perform on multilingual tasks?
On social networking translation benchmarks, RedOne 2.0 achieved leading results in Chinese-English translation tasks, demonstrating effective handling of multilingual content in social networks.

What were the results of RedOne 2.0’s online deployment?
Online testing showed a 0.43% increase in advertiser value, with significant content quality improvements—11.9% reduction in vague titles and 25.8% increase in interactive titles.

How does RedOne 2.0 handle new trends and slang in social networks?
Through broad exposure during exploratory learning and continuous optimization in refinement learning, the model adapts to rapidly changing social network language patterns.

What advantages does a unified training framework offer compared to task-specific fine-tuning?
Unified training captures and leverages inter-task relationships, enabling single models to achieve comprehensive, superior capability while simplifying deployment architecture and enhancing system synergy.
