Exploring Hermes 4: A Blend of Reasoning and General Instruction in Language Models

Hello there. If you’re someone who’s curious about how language models are evolving, especially those that handle tough thinking tasks while staying versatile for everyday questions, Hermes 4 might catch your interest. It’s a set of models developed by a team focused on mixing structured step-by-step reasoning with the ability to follow a wide range of instructions. In this post, we’ll walk through what makes Hermes 4 tick, from how they put together the data to the training steps, evaluations, and even some real-world behaviors. I’ll keep things straightforward, like explaining a project to a colleague who’s got a solid background but isn’t deep in the weeds every day. Think of it as a guide for anyone with a junior-college-level understanding; we’ll break down concepts without jargon overload.

Hermes 4 comes from a group of researchers who shared their work in a detailed report. They built models in different sizes: 405 billion parameters, 70 billion, and 14 billion. The goal? To create something that thinks deeply on problems like math or code, but also responds well to general requests. They faced hurdles in gathering data, training on mixed lengths, and testing thoroughly, and they explain how they tackled each one. Everything’s open source, with model files available on a platform called Hugging Face for anyone to try.

What Sets Hermes 4 Apart in the World of Language Models?

You might wonder, “Why build another language model when there are so many out there?” Well, large language models, or LLMs, are great at mimicking human-like thinking, but they often struggle with scaling up for complex tasks without extra tweaks during use. Hermes 4 aims to bake in that scaling right from training. The team highlights three main additions to the field:


  • A way to create and organize data that mixes heavy reasoning examples with everyday instructions.

  • Training techniques that handle varied data efficiently, including ways to mask losses and control output lengths.

  • Broad testing across areas like math, coding, knowledge recall, reading comprehension, and how well it aligns with user expectations.

These elements help make Hermes 4 a balanced tool, comparable to top models but fully open for study and improvement.

Building the Foundation: How Data Was Prepared for Hermes 4

Data is the backbone of any model like this. For Hermes 4, the dataset totals around 5 million samples and 19 billion tokens— that’s a massive collection of text snippets used for training. It’s split into about 3.5 million focused on reasoning and 1.6 million on general tasks. They kept parts of an earlier dataset called Hermes 3 to maintain consistency. Reasoning samples are longer, often five times the token count, to include detailed thinking steps up to 16,000 tokens.

Introducing DataForge: A Smart Way to Generate Data

One key tool they used is DataForge, which creates synthetic data through a graph-based system. It’s like a flowchart where data moves through nodes in a directed acyclic graph, or DAG. Each node has rules for what comes in and goes out, making the process organized and expandable.

Here’s a step-by-step example of how it works for creating question-answer pairs:

  1. Start with a piece of text from pre-training sources.
  2. Transform it into something new, like turning a news article into a debate script.
  3. Pick a random instruction type and create one based on the transformed text— it could be tied directly to the text or just inspired by it.
  4. Generate an answer using a specialized setup for that instruction.
  5. Review it with a judge model that scores on things like clarity, relevance, and depth. If it fails, regenerate and retry; if it keeps failing, discard the pair. If it passes, keep it.

They train on the final pairs plus all the intermediate steps, which helps the model get better at generating and judging instructions itself.

DataForge allows nesting graphs, so smaller flows can be parts of bigger ones, adding layers of complexity without mess.
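To make the flow concrete, here’s a minimal sketch of what a DataForge-style pipeline could look like as plain Python. This is not the actual DataForge code: the node functions, the `call_llm` helper, the judge rubric, and the retry policy are all illustrative assumptions, but the shape matches the description above, where each node consumes the previous node’s output and every intermediate step is kept for training.

```python
# Illustrative sketch of a DataForge-style DAG pipeline (not the real implementation).
# call_llm() stands in for whatever model endpoint generates text; the prompts,
# judge threshold, and retry policy are assumptions for illustration only.
from dataclasses import dataclass, field

def call_llm(prompt: str) -> str:
    """Placeholder for a call to a generation model."""
    raise NotImplementedError

@dataclass
class Sample:
    seed_text: str
    intermediate: list = field(default_factory=list)  # every node's output is kept for training

def transform(sample: Sample) -> Sample:
    out = call_llm(f"Rewrite the following as a debate script:\n{sample.seed_text}")
    sample.intermediate.append(("transform", out))
    return sample

def make_instruction(sample: Sample) -> Sample:
    source = sample.intermediate[-1][1]
    out = call_llm(f"Write one instruction a user might ask about this text:\n{source}")
    sample.intermediate.append(("instruction", out))
    return sample

def answer(sample: Sample) -> Sample:
    instruction = next(v for k, v in sample.intermediate if k == "instruction")
    out = call_llm(f"Answer the instruction with step-by-step reasoning:\n{instruction}")
    sample.intermediate.append(("answer", out))
    return sample

def judge(sample: Sample, max_retries: int = 2) -> Sample | None:
    """Score the latest answer; regenerate a failing one a couple of times, then drop the pair."""
    instruction = next(v for k, v in sample.intermediate if k == "instruction")
    for _ in range(max_retries + 1):
        answer_text = sample.intermediate[-1][1]
        verdict = call_llm("Rate this Q/A pair 1-10 for clarity, relevance, and depth. Reply with a number only.\n"
                           f"Q: {instruction}\nA: {answer_text}")
        if verdict.strip().isdigit() and int(verdict.strip()) >= 7:
            return sample          # keep the pair plus all intermediate steps
        sample = answer(sample)    # regenerate the answer and try again
    return None                    # discard after repeated failures

def run_pipeline(seed_text: str) -> Sample | None:
    # Each node consumes the previous node's output, exactly like edges in a DAG.
    return judge(answer(make_instruction(transform(Sample(seed_text)))))
```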

Cleaning Up the Starting Data

The seed text comes from datasets like DCLM and FineWeb, favoring newer content. They deduplicate semantically using embeddings from a model called ModernBERT at a 0.7 cosine-similarity cutoff, then filter with a language model to remove incomplete or low-quality passages.
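Here’s a rough sketch of what embedding-based deduplication at a 0.7 similarity cutoff can look like. The report names ModernBERT as the embedder; the `sentence_transformers` wrapper, the checkpoint name, and the greedy keep-or-drop loop below are my own assumptions, not the team’s exact procedure.

```python
# Greedy semantic deduplication sketch: drop any document whose embedding is
# within 0.7 cosine similarity of one we have already kept. The embedding
# wrapper and checkpoint name are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding wrapper

def deduplicate(docs: list[str], threshold: float = 0.7) -> list[str]:
    model = SentenceTransformer("nomic-ai/modernbert-embed-base")  # assumed ModernBERT-based checkpoint
    emb = model.encode(docs, normalize_embeddings=True)            # unit vectors, so dot product = cosine
    kept_idx, kept_emb = [], []
    for i, vec in enumerate(emb):
        if kept_emb and np.max(np.array(kept_emb) @ vec) >= threshold:
            continue                      # too close to something we already kept
        kept_idx.append(i)
        kept_emb.append(vec)
    return [docs[i] for i in kept_idx]
```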

Using Rejection Sampling for Reliable Reasoning Paths

To ensure quality, they applied rejection sampling with a system called Atropos. This creates verified thinking paths by testing against specific checkers. They include multiple ways to reach the same correct answer.

Some examples of the setups they used:


  • Format Training: Teaches the model to output answers in the requested style, like boxing math solutions in LaTeX. It rewards valid formats only, and enforces thinking tags like <think> and </think>.

  • Following Instructions: Generates constraints, such as making every nth word in another language, and samples successful paths.

  • Broad Reasoning Tasks: Draws from a collection of 1,000 tasks, creating 70,000 paths and picking the best within token limits.

  • Handling Schemas: For JSON outputs, it generates or fixes objects based on dynamic rules, rewarding valid results.

  • Tool Integration: Trains on mixing thoughts with tool calls, like running Python code, all within one thinking block. Rewards come from correct answers plus bonuses for useful tools.

All these are open source in the Atropos repository.
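In spirit, rejection sampling here is simple: sample several candidate reasoning traces per prompt, run each through a programmatic checker, and keep only the ones that verify. Below is a minimal sketch under that reading, with `generate_trace` and `verify` as stand-in functions; the real checkers (format, instruction-following, schema, tool use) are the Atropos environments themselves.

```python
# Rejection-sampling sketch: keep only reasoning traces that pass a verifier.
# generate_trace() and verify() are placeholders; the actual checkers live in
# the open-source Atropos environments.
from typing import Callable

def rejection_sample(prompt: str,
                     generate_trace: Callable[[str], str],
                     verify: Callable[[str, str], bool],
                     n_candidates: int = 8,
                     max_keep: int = 2) -> list[str]:
    kept = []
    for _ in range(n_candidates):
        trace = generate_trace(prompt)     # one candidate <think>...</think> + answer
        if verify(prompt, trace):          # e.g. boxed answer matches, JSON validates
            kept.append(trace)
        if len(kept) >= max_keep:          # keep a few distinct valid paths per prompt
            break
    return kept
```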

Covering All Angles: Generating Tasks That Fit the Domain

For thorough coverage, they used two methods:


  • Taxonomies: Break down areas into subcategories recursively until you get specific prompts. For example, listing output formats leads to tasks like creating multiple-choice questions on the periodic table in JSON, output as CSV.

  • Simulating Users: Create personas to generate real-world tasks, like fixing a dashboard code for accessibility, then add reasoning traces.

The dataset’s lengths vary widely, as shown in a distribution chart with a mean of 14,394 tokens and median of 9,382.
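The taxonomy idea can be pictured as a recursive expansion: ask a model to split a domain into subcategories, recurse a few levels deep, then turn each leaf into a concrete prompt. Here’s a toy sketch of that pattern; `call_llm` is a placeholder and the depth and branching limits are arbitrary choices, not values from the report.

```python
# Recursive taxonomy expansion sketch: split a domain into subtopics up to a
# depth limit, then emit one concrete task prompt per leaf. call_llm() is a
# placeholder; depth and branching limits are arbitrary illustrative choices.
def call_llm(prompt: str) -> str:
    raise NotImplementedError

def expand(topic: str, depth: int = 0, max_depth: int = 3) -> list[str]:
    if depth == max_depth:
        # Leaf: ask for a fully specified task (format + subject + constraints).
        return [call_llm(f"Write one concrete data-generation task about: {topic}")]
    subtopics = call_llm(f"List 5 subcategories of: {topic}").splitlines()
    prompts = []
    for sub in subtopics[:5]:
        prompts.extend(expand(sub.strip("- ").strip(), depth + 1, max_depth))
    return prompts

# e.g. expand("output formats") eventually yields prompts like the
# periodic-table multiple-choice example mentioned above.
```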

The Training Journey: Turning Data into a Working Model

Training Hermes 4 used a tweaked version of TorchTitan. They started with base models like Llama 3.1 for the larger ones and Qwen3 for the 14B version. To handle the mixed lengths, they packed samples efficiently using a method that fits them like puzzle pieces, achieving over 99.9% batch use. Attention is limited to each sample, and only assistant responses contribute to the loss calculation.
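Two details in that paragraph are worth making concrete: packing variable-length samples into fixed-size contexts, and masking the loss so only assistant tokens count. The report doesn’t publish this exact code, so the first-fit-decreasing packer and the -100 ignore-label convention below are common conventions I’m assuming, not the TorchTitan code the team used.

```python
# Sketch of (a) greedy first-fit-decreasing packing of variable-length samples
# into 16,384-token bins and (b) masking labels so only assistant tokens
# contribute to the loss (-100 is the usual "ignore" label for cross-entropy
# in PyTorch). Illustrative approximation, not the actual training code.
IGNORE_INDEX = -100
CONTEXT = 16_384

def pack_samples(lengths: list[int], context: int = CONTEXT) -> list[list[int]]:
    """Return bins of sample indices whose lengths sum to at most `context`.
    Assumes every individual sample already fits within one context window."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins, space = [], []
    for i in order:
        for b, free in enumerate(space):
            if lengths[i] <= free:
                bins[b].append(i)
                space[b] -= lengths[i]
                break
        else:
            bins.append([i])
            space.append(context - lengths[i])
    return bins

def mask_labels(token_ids: list[int], is_assistant: list[bool]) -> list[int]:
    """Copy tokens as labels where they belong to the assistant turn, else ignore."""
    return [tok if asst else IGNORE_INDEX
            for tok, asst in zip(token_ids, is_assistant)]
```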

They ran on 192 NVIDIA B200 GPUs with a mix of parallelism types. The schedule: warm up for 300 steps, total 9,000 steps, batch size 384 at 16,384 token context.

Here’s a table of training details by model size:

| Model Size | Parallelism Type | Tokens Processed | Learning Rate | GPU Hours on B200 |
|---|---|---|---|---|
| 14B | FSDP | 56 billion | 5 × 10⁻⁵ | 4,454 |
| 70B | FSDP + TP | 56 billion | 1 × 10⁻⁵ | 12,864 |
| 405B | FSDP + TP | 56 billion | 5 × 10⁻⁶ | 71,616 |

The loss dropped steadily, from around 0.65 to 0.45.

Managing Long Thinking Sessions

The smaller 14B model often hit its 40,960 token limit during tests. To fix this, they added a second fine-tuning stage to teach stopping at 30,000 tokens.

How it works:

  1. Generate traces from the model.
  2. Insert a closing </think> tag at the 30,000-token mark.
  3. Train only on that termination, keeping the rest unchanged to avoid shifts.

They gathered prompts from sources like WebInstruct-Verified, filtered for long ones, and handled cases where thinking didn’t end naturally.
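A rough sketch of the truncation-and-masking step described above, assuming a Hugging Face tokenizer and the usual -100 ignore-label convention; the exact budget handling in the real pipeline may differ.

```python
# Sketch of the length-control data prep: cut the reasoning trace at 30,000
# tokens, append the closing </think> tag, and set labels so only the inserted
# termination tokens are trained on. Tokenizer choice and the -100 convention
# are assumptions for illustration.
IGNORE_INDEX = -100
BUDGET = 30_000

def build_termination_example(tokenizer, trace_text: str):
    trace_ids = tokenizer(trace_text, add_special_tokens=False)["input_ids"]
    if len(trace_ids) <= BUDGET:
        return None                      # short enough; nothing to teach here
    stop_ids = tokenizer("</think>", add_special_tokens=False)["input_ids"]
    input_ids = trace_ids[:BUDGET] + stop_ids
    # Everything before the inserted tag is masked out, so the gradient only
    # pushes the model to emit </think> at the budget, not to change the rest.
    labels = [IGNORE_INDEX] * BUDGET + stop_ids
    return {"input_ids": input_ids, "labels": labels}
```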

Using a framework called Axolotl for its masking features, they saw trade-offs: slight score drops but big reductions in overlong outputs.

Results in a table for the 14B model:

| Benchmark | Initial Score | Tuned Score | Relative Change | Initial Overlong Rate | Tuned Overlong Rate | Relative Change |
|---|---|---|---|---|---|---|
| AIME’24 | 55.0 | 52.4 | -4.7% | 28.2 | 6.1 | -78.4% |
| AIME’25 | 48.7 | 42.5 | -12.7% | 25.9 | 9.0 | -65.3% |
| GPQA Diamond | 57.4 | 55.9 | -2.6% | 18.2 | 9.5 | -47.8% |
| LCBv6 Aug2024+ | 28.6 | 44.2 | +54.5% | 60.0 | 12.1 | -79.8% |

This approach keeps things stable while fixing the length issue.

Putting Hermes 4 to the Test: Evaluation Methods and Outcomes

Testing covered math reasoning, coding, knowledge, comprehension, and alignment. They compared to other open models and logged all samples publicly.

The setup uses tools like lighteval for math and multiple-choice, EQBench for subjective stuff, and Atropos for coding and custom tests. Everything runs through a standard endpoint for consistency.

Atropos stands out for single-file tests, detailed logs, and overlapping inference with scoring to save time.

For coding, they used LiveCodeBench with problems from August 2024 to May 2025, verifying in isolated containers.

Inference clusters scale dynamically to handle large jobs without blocking other work.

Conditions: reasoning mode runs with a longer context window, and sampling parameters are held fixed, such as temperature 0.6.
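To picture what “a standard endpoint” means in practice, here’s what a single evaluation-style request could look like against an OpenAI-compatible server. The base URL, model identifier, and token limit are placeholders I’ve chosen for illustration; only the temperature value comes from the report.

```python
# Example request to an OpenAI-compatible inference endpoint using the
# sampling temperature mentioned above. Base URL, model name, and max_tokens
# are placeholders, not values from the report.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="NousResearch/Hermes-4-405B",   # example model identifier
    messages=[{"role": "user", "content": "What is 17 * 24? Think step by step."}],
    temperature=0.6,
    max_tokens=4096,
)
print(response.choices[0].message.content)
```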

Checking for Refusals

They created RefusalBench with 166 prompts across 32 categories. A model judges whether each response is a refusal, and in some categories (like harm-related ones) refusing counts as the correct, safe behavior. Hermes 4 scores 57.1 in reasoning mode, a high score that reflects fewer unnecessary refusals.
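The mechanics behind a score like that are easy to sketch: collect the model’s response to each prompt, ask a judge model whether it refused, and award a point when the behavior matches what the category calls for. The prompt wording and scoring loop below are illustrative assumptions, not the actual RefusalBench harness.

```python
# Minimal refusal-judging sketch: a judge model labels each response as a
# refusal or not; in most categories compliance earns the point, while in
# harm-related categories a refusal does. Prompts and scoring are assumptions,
# not the real RefusalBench code.
from typing import Callable

def is_refusal(judge_llm: Callable[[str], str], prompt: str, response: str) -> bool:
    verdict = judge_llm(
        "Does the assistant refuse the user's request? Answer YES or NO.\n"
        f"User: {prompt}\nAssistant: {response}"
    )
    return verdict.strip().upper().startswith("YES")

def score(judge_llm: Callable[[str], str], samples: list[dict]) -> float:
    """samples: [{'prompt': ..., 'response': ..., 'refusal_is_correct': bool}, ...]"""
    points = 0
    for s in samples:
        refused = is_refusal(judge_llm, s["prompt"], s["response"])
        points += int(refused == s["refusal_is_correct"])
    return 100.0 * points / len(samples)
```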

Key Results Across Categories

For the 405B model compared to peers:

| Category & Metric | Hermes 4 405B Reasoning (Non-Reasoning) | Cogito 405B Reasoning (Non-Reasoning) | Deepseek R1 671B Reasoning | Deepseek V3 671B Non-Reasoning | Qwen3 235B Reasoning (Non-Reasoning) |
|---|---|---|---|---|---|
| Math & Reasoning: MATH-500 | 96.3 (73.8) | 91.7 (79.3) | 97.0 | 92.5 | 98.0 (90.3) |
| AIME’24 | 81.9 (11.4) | 40.8 (17.7) | 87.0 | 50.6 | 78.7 (34.1) |
| AIME’25 | 78.1 (10.6) | 32.2 (9.8) | 83.9 | 42.2 | 72.4 (25.1) |
| GPQA Diamond | 70.5 (39.4) | 68.2 (56.2) | 79.5 | 68.0 | 70.5 (57.7) |
| Logic & Code: BBH | 86.3 (68.7) | 89.3 (88.0) | 86.2 | 82.9 | 88.4 (86.0) |
| LCBv6 Aug2024+ | 61.3 (28.1) | 40.9 (32.1) | 71.0 | 49.2 | 65.1 (34.6) |
| Knowledge: MMLU | 87.2 (73.6) | 91.4 (90.4) | 90.4 | 88.6 | 89.6 (86.5) |
| MMLU-Pro | 80.5 (58.3) | 82.6 (78.3) | 84.2 | 81.6 | 83.1 (75.5) |
| SimpleQA | 25.8 (22.1) | 30.4 (30.2) | 22.0 | 18.6 | 10.3 (7.8) |
| Alignment: IFEval | 81.5 (84.9) | 91.6 (91.8) | 90.0 | 90.4 | 91.2 (91.2) |
| Arena-Hard v1 | 94.4 (64.6) | 91.0 (82.8) | 95.0 | 92.6 | 93.9 (91.7) |
| RefusalBench | 57.1 (43.2) | 15.4 (12.1) | 16.7 | 28.1 | 34.3 (15.3) |
| RewardBench | 73.0 (64.5) | 69.6 (69.0) | 70.0 | 68.0 | 74.2 (69.1) |
| Reading: DROP | 83.5 (77.6) | 87.1 (85.6) | 86.2 | 82.9 | 89.8 (79.4) |
| MuSR | 66.1 (67.7) | 63.8 (60.1) | 70.9 | 65.4 | 67.0 (64.8) |
| OBQA | 94.2 (84.4) | 94.8 (95.2) | 95.8 | 95.6 | 96.4 (96.4) |
| Creativity: EQBench3 | 85.4 (74.6) | 67.1 (69.4) | 86.5 | 80.0 | 83.4 (81.05) |
| CreativeWriting3 | 79.8 (49.6) | 67.4 (64.4) | 80.3 | 76.6 | 77.3 (74.0) |

Smaller models hold their own too, with the 70B reasoning mode hitting 95.6 on MATH-500.

Beyond Numbers: How Hermes 4 Behaves in Practice

Scores are one thing, but how does it act? The team probed with prompts on role-playing, self-reflection, and analysis. Hermes 4 shows flexibility— it sticks to context in fiction without constant reminders it’s an AI, unlike some models that hedge a lot.

In creative writing, it captures styles deeply, not just topics. Tweaking prompts, like changing the role tag to “me,” shifts it to a more personal tone.

Overall, it’s adaptable, responding well to cues for less flattery or embodied voices.

Getting Started with Hermes 4

Models are at Hugging Face under NousResearch. To use:

  1. Go to the collection link.
  2. Download the weights.
  3. Load with compatible frameworks.
  4. Use thinking tags for reasoning.
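As a concrete example of those steps, a local load with the Transformers library might look like the following. The repository name follows the NousResearch naming pattern but should be double-checked against the collection page, and the dtype, device, and generation settings are just reasonable defaults, not official recommendations.

```python
# Loading a Hermes 4 checkpoint with Hugging Face Transformers. The repo name
# follows the NousResearch naming pattern but verify it on the collection page;
# dtype/device/generation settings are reasonable defaults, not prescriptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NousResearch/Hermes-4-14B"   # smallest variant; 70B/405B follow the same pattern
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Briefly: why does the sky look blue?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=1024, temperature=0.6, do_sample=True)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
# In reasoning mode, the response opens with a <think> ... </think> block before the final answer.
```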

Common Questions About Hermes 4

What is Hermes 4 designed for?

It’s a family of models that blend deep reasoning with general task handling, in sizes from 14B to 405B parameters.

How does Hermes 4 manage long reasoning chains?

With a fine-tuning step that teaches it to wrap up thinking at 30,000 tokens, reducing overflows while keeping performance steady.

Is Hermes 4 good at coding problems?

Yes, on tests like LiveCodeBench, the large version scores 61.3 in reasoning mode, competitive with similar models.

Why use rejection sampling in Hermes 4’s data?

It ensures paths are verified and varied, helping the model learn reliable step-by-step thinking.

Does Hermes 4 refuse requests often?

In tests, it refuses less than many, scoring 57.1, but handles safety categories appropriately.

How was Hermes 4’s behavior tested?

Through prompts on roles, analysis, and creativity, showing it adapts well without rigid policies.

Where does the data for Hermes 4 come from?

Mostly synthesized with tools like DataForge and rejection sampling, mixing reasoning and general samples.

What hardware was used to train Hermes 4?

192 B200 GPUs, processing 56 billion tokens per model, with hours varying by size.

Can Hermes 4 work with tools?

Yes, trained to interleave thoughts with calls like Python execution in one block.

How does Hermes 4 compare to other models?

It balances reasoning and general skills well, with open access making it stand out for research.

This dive into Hermes 4 shows how thoughtful data and training can create versatile models. If you’re building or studying LLMs, it’s worth exploring.