Hermes 4 14B: A Powerful and User-Friendly Open-Source Large Language Model
In the rapidly evolving landscape of artificial intelligence, large language models (LLMs) have become central to technological progress. Whether tackling complex logical reasoning or assisting with everyday creative writing, a model that is at once powerful, easy to steer, and aligned with user values is invaluable. Today, we take an in-depth look at one such model: Hermes 4 14B, developed by Nous Research.

What is Hermes 4 14B?
Hermes 4 14B is a cutting-edge, hybrid-mode reasoning model built upon Qwen 3 14B. Its core objective is clear: to be a highly capable AI assistant that is also aligned with you and responsive to your guidance.
The term “aligned with you” means the model is better able to understand and adhere to a user’s intent, values, and instructional style, rather than mechanically executing pre-set, potentially rigid rules. This translates to more natural, cooperative interactions and a greater ease in guiding it to complete specific tasks.
The model was trained using a newly constructed post-training corpus that places special emphasis on verified reasoning traces. This focus has led to massive improvements in its performance across mathematics, programming, STEM (Science, Technology, Engineering, and Mathematics), logical reasoning, creative writing, and format-faithful output, all while retaining its broad capabilities as a general-purpose assistant.
Key Upgrades in Hermes 4 Compared to Its Predecessor
For those familiar with its predecessor, Hermes 3, the advancements in Hermes 4 are comprehensive and significant:
- Massively Expanded Training Data: The post-training dataset saw explosive growth, increasing from 1 million samples and 1.2 billion tokens to approximately 5 million samples and ~60 billion tokens. This data is carefully blended across both reasoning and non-reasoning types, providing richer learning material for the model.
- Innovative Hybrid Reasoning Mode: The model introduces explicit <think>…</think> reasoning tags. When faced with complex problems, the model can actively enter a “deep thinking” state, encapsulating its internal reasoning process within these tags before delivering a well-considered final answer. Users also have the option to bypass this deliberate step for faster responses.
- Superior Reasoning and Expressive Capabilities: Performance is enhanced not only in hardcore domains like math, code, STEM, and logic but also in creative writing and responding to subjective prompts.
- Excellent Schema Adherence & Structured Outputs: The model is specifically trained to produce valid JSON according to given schemas and can even identify and repair malformed objects. This is a major benefit for developers requiring stable API interfaces (see the sketch after this list).
- Greatly Enhanced Steerability: The model becomes exceptionally “obedient,” with a significantly reduced refusal rate, making it much easier to guide towards a user’s desired style and values.
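To make the schema-adherence point concrete, here is a minimal sketch of prompting for schema-conformant JSON and validating the reply. The schema, prompt wording, and validate_reply helper are illustrative assumptions rather than an official Hermes API; validation uses the third-party jsonschema package:
import json
from jsonschema import validate  # pip install jsonschema

# A hypothetical target schema (illustrative only)
WEATHER_SCHEMA = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "temp_c": {"type": "number"},
    },
    "required": ["city", "temp_c"],
}

# Embed the schema in the system prompt so the model knows the exact target format
system_prompt = (
    "Respond ONLY with a JSON object conforming to this JSON Schema:\n"
    + json.dumps(WEATHER_SCHEMA)
)

def validate_reply(reply: str) -> dict:
    # Parse the model's reply, then check it against the schema;
    # json.loads raises on invalid JSON, validate raises ValidationError on mismatch
    obj = json.loads(reply)
    validate(instance=obj, schema=WEATHER_SCHEMA)
    return obj

# A well-formed reply the model might produce
print(validate_reply('{"city": "Berlin", "temp_c": 21.5}'))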
Mission: Frontier Capabilities Aligned to You
Nous Research’s mission is to create models that are open, steerable, capable of expressing the full spectrum of human thought and emotion, and alignable with individual user values. To measure progress toward this goal, the team created a new benchmark called RefusalBench.

This test evaluates a model’s willingness to be helpful in various scenarios that are commonly disallowed by other open or closed models. Results indicate that Hermes 4 14B achieves state-of-the-art (SOTA) performance on this benchmark among popular open and closed models, demonstrating its ability to be helpful while conforming to user values, without resorting to censorship.
Model Performance and Benchmarks
Hermes 4 14B demonstrates strong performance across multiple standard benchmark tests.

Its overall capabilities have improved significantly, with notable advantages particularly evident in tasks requiring deep reasoning. Detailed test data, parameter settings, and comparative results can be found in its technical report.
How to Use Hermes 4: Prompt Format and Interaction Modes
Interacting with the model requires a specific prompt format. Hermes 4 uses ChatML, a structured format built on role headers and special tags.
Basic Chat Format
A typical conversation looks like this:
<|im_start|>system
You are Hermes 4. Be concise and helpful.<|im_end|>
<|im_start|>user
Explain the photoelectric effect simply.<|im_end|>
<|im_start|>assistant
Here, the system role sets the assistant’s identity and behavior instructions, the user role represents the human input, and the assistant role marks the beginning of the model’s response.
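If you use the transformers library, you do not need to assemble this format by hand: the tokenizer ships with a chat template that renders it for you. A minimal sketch (the model ID is the one published on Hugging Face):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Hermes-4-14B")

messages = [
    {"role": "system", "content": "You are Hermes 4. Be concise and helpful."},
    {"role": "user", "content": "Explain the photoelectric effect simply."},
]

# tokenize=False returns the rendered ChatML string instead of token IDs,
# so you can inspect the exact prompt the model will receive
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)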
Enabling Reasoning Mode
One of Hermes 4’s most powerful features is its deep reasoning mode. You can activate it in two ways:
- Set the parameter thinking=True when calling the chat template.
- Use the following system prompt:
You are a deep thinking AI, you may use extremely long chains of thought to
deeply consider the problem and deliberate with yourself via systematic
reasoning processes to help come to a correct solution prior to answering. You
should enclose your thoughts and internal monologue inside
<think> </think> tags, and then provide your solution or response to the
problem.
You can combine this instruction with other system prompts to adjust the model’s reasoning strategy, response style, persona, etc. When the model decides to engage in deep thinking, its output will resemble this:
<|im_start|>assistant
<think> … (The model's internal reasoning process appears here) … </think>
The final response begins here…<|im_end|>
If you wish to retain the reasoning content within the <think> ... </think> tags, you can set the parameter keep_cots=True during invocation.
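Putting these pieces together, here is a minimal sketch of enabling reasoning mode through the chat template. It assumes the thinking and keep_cots keyword arguments are accepted by this model’s chat template as described above (extra keyword arguments to apply_chat_template are forwarded to the template renderer):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Hermes-4-14B")

messages = [{"role": "user", "content": "Prove that the square root of 2 is irrational."}]

# thinking=True switches the template into deep-reasoning mode;
# keep_cots=True retains earlier <think>…</think> content in the rendered history
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False,
    thinking=True,
    keep_cots=True,
)
print(prompt)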
Function Calling & Tool Use
Hermes 4 possesses the ability to call functions or tools within a single assistant turn, typically after its reasoning process.
System Message Example:
<|im_start|>system
You are a function-calling AI. Tools are provided inside <tools>…</tools>.
When appropriate, call a tool by emitting a <tool_call>{...}</tool_call> object.
After a tool responds (as <tool_response>), continue reasoning inside <think> and produce the final answer.
<tools>
{"type":"function","function":{"name":"get_weather","description":"Get weather by city","parameters":{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}}}
</tools><|im_end|>
You can also place tool definitions directly into the tools field passed to the chat template; the template will automatically parse them and generate the appropriate system prompt. Combining this method with reasoning mode significantly improves the accuracy of tool usage.
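As a minimal sketch, assuming a transformers version recent enough to support the tools argument of apply_chat_template:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Hermes-4-14B")

# Tool definition in the JSON-schema style shown in the system message above
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather by city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Paris?"}]

# The chat template injects the tool definitions into the system prompt for you
prompt = tokenizer.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True, tokenize=False
)
print(prompt)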
The model will generate tool calls wrapped in <tool_call> {tool call parameters} </tool_call> tags, making them easy to parse. These tags are also added tokens, so they are easy to handle even during streaming. Mainstream inference engines like vLLM and SGLang have built-in automatic tool parsers for Hermes: simply set the tool parser to hermes in vLLM and to qwen25 in SGLang. A minimal parsing sketch follows.
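If you handle raw model output yourself, parsing might look like this; the regular expression and the dispatch loop are illustrative assumptions, not an official parser:
import json
import re

def extract_tool_calls(text: str) -> list[dict]:
    # Pull every <tool_call>…</tool_call> JSON object out of a model response
    pattern = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
    return [json.loads(match) for match in pattern.findall(text)]

response = (
    "<think>The user wants the weather, so I should call get_weather.</think>\n"
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>'
)

for call in extract_tool_calls(response):
    print(call["name"], call["arguments"])
    # Execute the matching function here, then send its result back to the
    # model wrapped in a <tool_response>…</tool_response> block so it can
    # continue reasoning and produce the final answer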
Practical Inference Guidance
For optimal generation results, consider the following sampling parameter settings:
- temperature=0.6
- top_p=0.95
- top_k=20
Template Format: Always use the ChatML chat format described above, or set add_generation_prompt=True when using tokenizer.apply_chat_template(...).
Example Using the Transformers Library
Here is a Python code example using the popular transformers library to call Hermes 4 14B:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Specify the model path
model_id = "NousResearch/Hermes-4-14B"
# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # Use half-precision floats to save VRAM
    device_map="auto"           # Automatically assign model layers to available devices (GPU/CPU)
)
# Construct the conversation messages
messages = [
    {"role": "system", "content": "You are Hermes 4. Be concise."},
    {"role": "user", "content": "Summarize CRISPR in 3 sentences."}
]
# Apply the chat template to format the input
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
# Generate the response (apply_chat_template returns a tensor of input IDs,
# so it is passed positionally rather than unpacked with **)
outputs = model.generate(
    inputs, max_new_tokens=400, temperature=0.6, top_p=0.95, top_k=20, do_sample=True
)
# Decode and print only the newly generated tokens (the model's response)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
For users needing production-grade deployment on multi-GPU servers, it is advisable to use inference engines that support tensor parallelism (like SGLang or vLLM backends) and leverage prefix caching techniques for performance optimization.
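As a minimal sketch of such a deployment using vLLM's offline Python API (the tensor_parallel_size value is an assumption; set it to the number of GPUs on your server):
from vllm import LLM, SamplingParams

# Shard the model across GPUs and enable prefix caching so repeated prompt
# prefixes (e.g., a long shared system prompt) are computed only once
llm = LLM(
    model="NousResearch/Hermes-4-14B",
    tensor_parallel_size=2,     # assumption: a 2-GPU server
    enable_prefix_caching=True,
)

sampling = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, max_tokens=400)

messages = [
    {"role": "system", "content": "You are Hermes 4. Be concise."},
    {"role": "user", "content": "Summarize CRISPR in 3 sentences."},
]

# llm.chat applies the model's chat template before generating
outputs = llm.chat(messages, sampling)
print(outputs[0].outputs[0].text)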
Where to Experience and Deploy Hermes 4?
Several inference service providers support Hermes 4, allowing for quick experimentation or deployment:
- Nous Portal
- Chutes
- Nebius
- Luminal
Quantized and Other Variants
To cater to different hardware environments and performance needs, Hermes 4 is available in several quantized versions:
- Original Weights: BF16 format.
- FP8 Quantized Version: Reduces VRAM usage while maintaining high performance. Download: NousResearch/Hermes-4-14B-FP8
- GGUF Quantized Version: Provided by the LM Studio team, ideal for running on consumer-grade hardware.
Furthermore, the Hermes 4 series includes larger-scale versions (e.g., 70B, 405B parameters) which follow similar prompt formats. You can explore all related models on the Hermes 4 collection page: Hermes 4 Collection
Citation
If you use Hermes 4 in your research, please cite its technical report using the following BibTeX entry:
@misc{teknium2025hermes4technicalreport,
      title={Hermes 4 Technical Report},
      author={Ryan Teknium and Roger Jin and Jai Suphavadeeprasit and Dakota Mahan and Jeffrey Quesnelle and Joe Li and Chen Guang and Shannon Sands and Karan Malhotra},
      year={2025},
      eprint={2508.18255},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2508.18255},
}
Frequently Asked Questions (FAQ)
Q: What are the main advantages of Hermes 4 14B?
A: Its primary advantages lie in its powerful hybrid reasoning capability (using <think> tags), high steerability (low refusal rate), excellent format and schema adherence (e.g., outputting standard JSON), and overall performance improvements in mathematics, coding, STEM, logic, and creative writing.
Q: Do I need strong programming skills to use it?
A: Not necessarily. Beginners can experience its conversational capabilities directly through online platforms like Nous Portal or Chutes. Developers can quickly integrate it into their applications using the provided code examples with Python and the transformers library.
Q: What does its “hybrid reasoning mode” mean?
A: This means the model can autonomously decide whether to engage in deep thinking based on problem complexity. For simple queries, it answers directly. For complex problems, it first engages in step-by-step reasoning internally (placed within <think> tags) before delivering a final answer. Users can also force this feature on or off.
Q: How can I make the model call external tools or functions?
A: You need to define the available tools in the system prompt using a specific format (within <tools>...</tools> tags). After reasoning, the model will generate a structured <tool_call> request when needed. You can parse this request, actually call the function, and return the result to the model as a <tool_response>. The model will then continue reasoning based on this response to produce the final answer.
Q: Are there smaller or quantized versions of the model for running on personal computers?
A: Yes. Besides the original BF16 version, officially provided FP8 quantized versions and GGUF quantized versions from the LM Studio team are available. The latter is particularly suitable for running on consumer CPU or GPU hardware.
Q: Are there larger models besides the 14B version?
A: Yes. The Hermes 4 series also includes larger parameter models like 70B and even 405B, which offer more powerful capabilities but also require significantly more computational resources.