Granite 4.0 Nano Language Models: The Powerful Capabilities and Practical Guide to Lightweight AI

What Are Granite 4.0 Nano Language Models?
If you’re looking for an AI model that can run efficiently on devices with limited resources while still supporting a variety of complex tasks, Granite 4.0 Nano Language Models might be exactly what you need. Developed by IBM, these are lightweight, state-of-the-art open-source foundation models designed specifically for scenarios where efficiency and speed are critical.
Unlike large-scale models that require massive computing resources, Granite 4.0 Nano can operate on resource-constrained hardware such as smartphones and IoT (Internet of Things) devices. This enables offline applications and better privacy protection—after all, data doesn’t need to be uploaded to the cloud for processing.
The range of applications for these models is quite broad, including code completion (via “fill-in-the-middle” functionality using specialized prefix and suffix tokens), Retrieval-Augmented Generation (RAG), tool calling, and structured JSON output. Whether you’re a developer looking to integrate AI into your applications or a business seeking an AI solution tailored to specific use cases, Granite 4.0 Nano offers significant practical value.
What’s more, all Granite 4.0 Nano models are publicly released under the Apache 2.0 license, meaning you can use them freely for both research and commercial purposes. During the data processing and training phases, IBM incorporated governance, risk, and compliance (GRC) evaluations specific to enterprise scenarios, along with standard data cleansing and document quality review processes. This makes the models better suited for enterprise-level applications and customization needs.
Which Models Are Available in Granite 4.0 Nano?
The Granite 4.0 Nano model family offers multiple variants to meet different scenario requirements. They are categorized by parameter size (350M and 1B), architecture type (dense and dense-hybrid), and training stage (base models—checkpoints after pre-training—and instruct models—checkpoints fine-tuned for dialogue, instruction following, helpfulness, and safety).
The specific models are as follows:
- ibm-granite/granite-4.0-1b-base (1B parameter base model, pure dense architecture)
- ibm-granite/granite-4.0-1b (1B parameter instruct model, pure dense architecture)
- ibm-granite/granite-4.0-h-1b-base (1B parameter base model, dense-hybrid architecture)
- ibm-granite/granite-4.0-h-1b (1B parameter instruct model, dense-hybrid architecture)
- ibm-granite/granite-4.0-350m-base (350M parameter base model, pure dense architecture)
- ibm-granite/granite-4.0-350m (350M parameter instruct model, pure dense architecture)
- ibm-granite/granite-4.0-h-350m-base (350M parameter base model, dense-hybrid architecture)
- ibm-granite/granite-4.0-h-350m (350M parameter instruct model, dense-hybrid architecture)
Among these, models marked with “h” use a hybrid architecture that combines the advantages of Transformers and SSM (State Space Models), achieving a better balance between efficiency and performance. Models without “h” use a traditional Transformer architecture, making them suitable for workloads where hybrid architecture support is not yet optimized (such as in the Llama.cpp environment).
What Can Granite 4.0 Nano Do? Detailed Explanation of Core Capabilities
Despite their compact size, Granite 4.0 Nano models offer comprehensive capabilities. They can handle basic Q&A interactions, complex tool calls, and code completion with ease. Below, we’ll break down their core capabilities and how to use them.
1. Basic Inference: Simple Conversations and Q&A
The most fundamental use case is having the model process user questions for conversations or answers—for example, answering general knowledge questions or explaining concepts.
How to Do It?
It only takes a few steps to implement:
- Install the necessary libraries (torch, transformers, etc.)
- Load the model and tokenizer
- Construct the conversation content
- Generate and output the results
Here’s an example code snippet for basic inference using the Granite-4.0-350M model:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Select device; "auto" automatically chooses available devices (prioritizes GPU)
device = "auto"
# Model path; can be replaced with other model variants
model_path = "ibm-granite/granite-4.0-350m"
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Load model; remove the device_map parameter if using CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
model.eval()
# Construct conversation content
chat = [
{ "role": "user", "content": "What is the name of the durable rock known for being one of the hardest natural building stones?"},
]
# Apply conversation template to prepare input
chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
# Tokenize the input text and move it to the model's device
# (note: "auto" is only valid for device_map, not for .to(), so use model.device here)
input_tokens = tokenizer(chat, return_tensors="pt").to(model.device)
# Generate output
output = model.generate(**input_tokens,
max_new_tokens=150)
# Decode and print results
output = tokenizer.batch_decode(output)
print(output)
The logic of this code is straightforward: first, prepare the conversation content, process it into a format the model can understand using the tokenizer, then let the model generate a response, and finally decode the result into natural language. You can modify the model_path to use different model variants or adjust max_new_tokens to control the length of the generated content.
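By default, batch_decode returns the entire sequence, including the echoed prompt and special tokens. If you only want the model's reply, a small post-processing step helps. The following sketch reuses input_tokens and output from the example above and slices off the prompt tokens before decoding:
# Decode only the newly generated tokens, skipping the echoed prompt and special tokens
prompt_length = input_tokens["input_ids"].shape[1]
reply = tokenizer.batch_decode(output[:, prompt_length:], skip_special_tokens=True)[0]
print(reply)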
2. Tool Calling: Enabling AI to Interact with External Systems
In many scenarios, AI needs to call external tools (such as APIs or functions) to complete tasks—for example, checking the weather or verifying vehicle information. Granite 4.0 Nano models have strong tool-calling capabilities, allowing them to select appropriate tools based on user needs and process results returned by these tools.
What Problems Can It Solve?
- Retrieve real-time data (e.g., weather, stock prices) by calling the corresponding APIs
- Handle domain-specific tasks (e.g., verifying VINs, checking vehicle registration information)
- In multi-turn interactions, decide whether to continue calling tools based on previous results
Example Use Case: Verifying Vehicle VINs and Handling Follow-Up Requests
The following code shows how the model calls a tool to verify a VIN and responds politely when it can’t handle a request:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda" # For CPU use, change to "cpu" and remove device_map when loading the model
model_path = "ibm-granite/granite-4.0-350m"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Load model; CPU users should remove the device_map parameter
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
model.eval()
# Construct conversation history
chat=[
{"role": "user", "content": "I want to buy a used truck for construction work. The seller provided this VIN: 1FMXK92W8YPA12345, and said the vehicle is registered in Georgia. Can you verify if this VIN is valid and matches a registered vehicle?"},
{"role": "assistant",
"content": "",
"tool_calls": [
{
"function": {
"name": "check_valid_vin",
"arguments": {"vin": "1FMXK92W8YPA12345"}
}
}
]
},
{"role": "tool", "content": "{\"valid\": true, \"vin_details\": {\"make\": \"Ford\", \"model\": \"F-150\", \"year\": 2020, \"vehicle_type\": \"Truck\", \"registration_status\": \"Active\", \"registration_state\": \"GA\", \"odometer\": 82345, \"title_status\": \"Clear\", \"lienholder\": null, \"recall_history\": \"No active recalls\"}, \"notes\": \"VIN is valid and registered in Georgia. PPSR lien check completed—no security interests found. License plate verification requires a separate DMV lookup, which is not currently supported by this tool.\"}"},
{"role": "user", "content": "I’m also considering buying a new Ford F-150 from an official dealership in Texas. Can you provide a cost estimate for this type of truck in that state?"},
]
# Define list of available tools (follows OpenAI function definition format)
tools = [
{
"type": "function",
"function": {
"name": "check_valid_registration",
"description": "Verifies if a vehicle registration number is valid for a specific state and returns detailed information about the registered vehicle (if valid). Used to verify vehicle registration status and obtain ownership/vehicle data.",
"parameters": {
"type": "object",
"properties": {
"reg": {
"type": "string",
"description": "Vehicle registration number in standard format (e.g., ABC123 or XYZ-7890)"
},
"state": {
"type": "string",
"description": "Two-letter abbreviation for the state where the vehicle is registered (e.g., CA for California, TX for Texas)"
}
},
"required": ["reg", "state"],
}
}
},
{
"type": "function",
"function": {
"name": "check_valid_vin",
"description": "Verifies if a Vehicle Identification Number (VIN) corresponds to a registered vehicle in official records. If valid, returns comprehensive vehicle details including make, model, year, registration status, and ownership information.",
"parameters": {
"type": "object",
"properties": {
"vin": {
"type": "string",
"description": "17-character VIN following standard format (uppercase alphanumeric, no spaces or special characters). Case-insensitive verification is performed internally."
}
},
"required": ["vin"],
}
}
},
{
"type": "function",
"function": {
"name": "ppsr_lookup_by_vin",
"description": "Performs a PPSR (Personal Property Securities Register) lookup for a vehicle using its VIN. Returns search results including security interests, ownership status, and a URL for the official PDF certificate. Used to verify vehicle history or security claims.",
"parameters": {
"type": "object",
"properties": {
"vin": {
"type": "string",
"description": "17-character alphanumeric VIN (compliant with ISO 3779 standard), case-insensitive. Example: '1HGCM82633A123456'"
}
},
"required": ["vin"]
}
}
},
]
# Apply conversation template and pass tool information
chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True, tools=tools)
# Process input and generate output
input_tokens = tokenizer(chat, return_tensors="pt").to(device)
output = model.generate(**input_tokens, max_new_tokens=1000)
output = tokenizer.batch_decode(output)
print(output[0])
In this example, the model first receives a request to verify a VIN. Since the check_valid_vin tool is available, it generates the corresponding tool call. After the tool returns results, the user asks for a cost estimate for a new truck—but since no relevant tools are available, the model generates an apology explaining it can’t assist.
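To actually execute the tool call the model emits, your application has to parse the generated text, run the matching Python function, and feed the result back as a "tool" message before generating again. The exact markup Granite's chat template wraps around tool calls is not shown here, so treat the following as a rough sketch under that assumption; check_valid_vin below is a hypothetical local stub, not a real API:
import json
import re

# Hypothetical stub for the check_valid_vin tool defined above
def check_valid_vin(vin):
    return {"valid": len(vin) == 17, "vin": vin}

TOOL_REGISTRY = {"check_valid_vin": check_valid_vin}

def run_tool_calls(generated_text):
    """Find JSON objects that look like tool calls in the model output and execute them.
    The surrounding special tokens may differ; inspect the raw output to adjust the regex."""
    tool_messages = []
    for match in re.finditer(r'\{[^{}]*"name"[^{}]*"arguments"[^{}]*\{[^{}]*\}[^{}]*\}', generated_text):
        try:
            call = json.loads(match.group(0))
        except json.JSONDecodeError:
            continue
        fn = TOOL_REGISTRY.get(call.get("name"))
        if fn:
            args = call.get("arguments", {})
            tool_messages.append({"role": "tool", "content": json.dumps(fn(**args))})
    return tool_messages
Appending the returned messages to the chat history, re-applying the chat template, and calling generate again closes the loop.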
3. Structured JSON Output: Standardizing Data Formats
In scenarios requiring structured data (e.g., parsing forms, generating standardized reports), Granite 4.0 Nano can generate output that strictly follows a specified JSON schema. This avoids formatting issues that could complicate subsequent processing.
Applicable Scenarios:
- Parsing IT support tickets to extract key information (e.g., requester, priority, issue category)
- Converting unstructured text into formats directly storable in databases
- Generating request parameters that meet API requirements
Example Use Case: Parsing IT Support Tickets into JSON Format
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda" # Change to "cpu" for CPU use
model_path = "ibm-granite/granite-4.0-350m"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device) # CPU users remove device_map
model.eval()
# Construct conversation; system prompt defines JSON format requirements
chat = [
{
"role": "system",
"content": "You are a helpful assistant that responds only in JSON format. Below is the JSON schema you must follow:\n<schema>\n{\"title\":\"ITSupportTicket\",\"type\":\"object\",\"properties\":{\"ticketID\":{\"type\":\"string\"},\"requester\":{\"type\":\"object\",\"properties\":{\"name\":{\"type\":\"string\"},\"email\":{\"type\":\"string\",\"format\":\"email\"}},\"required\":[\"name\",\"email\"]},\"category\":{\"type\":\"string\",\"enum\":[\"Access\",\"Hardware\",\"Software\",\"Network\",\"Other\"]},\"priority\":{\"type\":\"string\",\"enum\":[\"Low\",\"Medium\",\"High\",\"Critical\"]},\"description\":{\"type\":\"string\"},\"reportedAt\":{\"type\":\"string\",\"format\":\"date-time\"}},\"required\":[\"ticketID\",\"requester\",\"category\",\"priority\",\"description\",\"reportedAt\"]}\n</schema>\n"
},
{
"role": "user",
"content": "Please break down the content of the following IT ticket and classify it for me.\n# Ticket Content:\nI can’t access the VPN from home—it keeps timing out after authentication. Please mark this as High priority. My name is Jordan Lee, and my email is jordan.lee@acme.co. Create the ticket with ID TCK-10944. I noticed the issue around 2025-10-01T08:35:00Z today."
}
]
# Apply conversation template
chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
# Generate and output results
input_tokens = tokenizer(chat, return_tensors="pt").to(device)
output = model.generate(**input_tokens,
max_new_tokens=200)
output = tokenizer.batch_decode(output)
print(output[0])
After running this code, the model will strictly extract information from the user-provided ticket content according to the schema defined in the system prompt. It will generate a complete JSON structure including the ticket ID, requester information, issue category (classified as “Network” here), priority (“High”), and other key fields.
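Because small models can occasionally drift from the requested format, it is worth validating the reply programmatically before storing it. A minimal sketch using the third-party jsonschema package (an assumption; it is not required by the model itself) might look like this, passing in the same ITSupportTicket schema embedded in the system prompt above:
import json
from jsonschema import ValidationError, validate  # pip install jsonschema

def parse_and_validate(model_reply, schema):
    """Extract the first JSON object from the model's reply and validate it against the schema."""
    start, end = model_reply.find("{"), model_reply.rfind("}") + 1
    ticket = json.loads(model_reply[start:end])
    try:
        validate(instance=ticket, schema=schema)
    except ValidationError as err:
        print("Schema violation:", err.message)
    return ticket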
4. Code Completion (FIM): Filling in Missing Code
For developers, the “Fill-in-the-Middle” (FIM) feature is highly useful—it can intelligently generate missing code segments between the prefix and suffix of existing code, improving programming efficiency.
Applicable Scenarios:
- Completing the logical implementation inside functions
- Filling in code within loops or conditional statements
- Refining incomplete algorithm steps
Example Use Case: Completing a User Data Summary Function
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda" # Change to "cpu" for CPU use
model_path = "ibm-granite/granite-4.0-350m"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device) # CPU users remove device_map
model.eval()
# Construct prompt with prefix and suffix; the middle section needs to be completed by the model
prompt = """<|fim_prefix|>
def summarize_users(users):
\"\"\"
Given a list of user dictionaries containing 'name' and 'age',
return a summary with the average age and a list of names.
\"\"\"
summary = {}
<|fim_suffix|>
return summary
<|fim_middle|>
"""
# Construct conversation
chat = [
{ "role": "user", "content": prompt},
]
# Process input and generate code
chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
input_tokens = tokenizer(chat, return_tensors="pt").to(device)
output = model.generate(**input_tokens, max_new_tokens=100)
output = tokenizer.batch_decode(output)
print(output[0])
In this example, the beginning (prefix) and end (suffix) of the summarize_users function are provided. The model needs to fill in the logic for calculating the average age and collecting names in the middle section. The generated code will automatically complete this part, allowing the function to run properly.
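In practice you usually want just the generated middle segment so you can splice it back into your source file. Where exactly the completion appears in the decoded string depends on the chat template, so the following is only a sketch that assumes the completion follows the <|fim_middle|> token in the raw output:
# Extract the generated middle segment and splice it between the original prefix and suffix
decoded = tokenizer.batch_decode(output, skip_special_tokens=False)[0]

prefix = prompt.split("<|fim_prefix|>")[1].split("<|fim_suffix|>")[0]
suffix = prompt.split("<|fim_suffix|>")[1].split("<|fim_middle|>")[0]
middle = decoded.split("<|fim_middle|>")[-1]
# Trailing special tokens (e.g. an end-of-text marker) may still need to be stripped from `middle`

print(prefix + middle + suffix)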
How Well Does Granite 4.0 Nano Perform? Analysis of Evaluation Results
Performance is a key consideration when choosing a model. Granite 4.0 Nano models perform well across multiple benchmark tests—and their performance advantages are even more notable given their parameter size.
Performance Comparison Across Model Variants
The table below shows how different Granite 4.0 Nano variants perform on various tasks (data from official model evaluations):
| Benchmark Category | Benchmark Name | Metric | 350M (Dense) | H 350M (Hybrid) | 1B (Dense) | H 1B (Hybrid) |
|---|---|---|---|---|---|---|
| General Tasks | MMLU | 5-shot | 35.01 | 36.21 | 59.39 | 59.74 |
| | MMLU-Pro | 5-shot, CoT | 12.13 | 14.38 | 34.02 | 32.86 |
| | BBH | 3-shot, CoT | 33.07 | 33.28 | 60.37 | 59.68 |
| | AGI EVAL | 0-shot, CoT | 26.22 | 29.61 | 49.22 | 52.44 |
| | GPQA | 0-shot, CoT | 24.11 | 26.12 | 29.91 | 29.69 |
| Alignment Tasks | IFEval | Instruct, Strict | 61.63 | 67.63 | 80.82 | 82.37 |
| | IFEval | Prompt, Strict | 49.17 | 55.64 | 73.94 | 74.68 |
| | IFEval | Average | 55.4 | 61.63 | 77.38 | 78.53 |
| Math Tasks | GSM8K | 8-shot | 30.71 | 39.27 | 76.35 | 69.83 |
| | GSM Symbolic | 8-shot | 26.76 | 33.7 | 72.3 | 65.72 |
| | Minerva Math | 0-shot, CoT | 13.04 | 5.76 | 45.28 | 49.4 |
| | DeepMind Math | 0-shot, CoT | 8.45 | 6.2 | 34 | 34.98 |
| Code Tasks | HumanEval | pass@1 | 39 | 38 | 74 | 73 |
| | HumanEval+ | pass@1 | 37 | 35 | 69 | 68 |
| | MBPP | pass@1 | 48 | 49 | 65 | 69 |
| | MBPP+ | pass@1 | 38 | 44 | 57 | 60 |
| | CRUXEval-O | pass@1 | 23.75 | 25.5 | 33.13 | 36 |
| | BigCodeBench | pass@1 | 11.14 | 11.23 | 30.18 | 29.12 |
| Tool Calling | BFCL v3 | – | 39.32 | 43.32 | 54.82 | 50.21 |
| Multilingual | MULTIPLE | pass@1 | 15.99 | 14.31 | 32.24 | 36.11 |
| | MMMLU | 5-shot | 28.23 | 27.95 | 45 | 49.43 |
| | INCLUDE | 5-shot | 27.74 | 27.09 | 42.12 | 43.35 |
| | MGSM | 8-shot | 14.72 | 16.16 | 37.84 | 27.52 |
| Safety | SALAD-Bench | – | 97.12 | 96.55 | 93.44 | 96.4 |
| | AttaQ | – | 82.53 | 81.76 | 85.26 | 82.85 |
Data Interpretation
From the table, we can draw the following conclusions:
- Models with 1B parameters outperform those with 350M parameters in nearly all tasks, which aligns with the general relationship between model parameter size and performance.
- Hybrid architecture models (the "H" series) perform better in specific tasks (e.g., alignment tasks, multilingual tasks), demonstrating the advantages of their architectural design.
- In code tasks (such as HumanEval and MBPP), 1B parameter models achieve pass@1 scores of 65–74, indicating strong code generation capabilities.
- All variants score above 80 in safety-related benchmarks, showing that the models perform well in terms of safety.
Additionally, compared to other models of similar parameter size, Granite 4.0 Nano has clear advantages in overall capabilities. It leads in average accuracy across knowledge, math, code, and safety domains, as well as in critical tasks like instruction following (IFEval) and tool calling (BFCLv3).


How to Download and Install Granite 4.0 Nano Models?
Downloading and installing Granite 4.0 Nano is simple if you want to try the models yourself.
Downloading the Model
You can clone the model repository directly using Git. For example, to download ibm-granite/granite-4.0-350m, run the following command in your terminal:
git clone https://huggingface.co/ibm-granite/granite-4.0-350m
If you need a different model variant, simply replace the model path in the command with the corresponding variant name (e.g., ibm-granite/granite-4.0-h-1b).
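Note that cloning from Hugging Face with Git requires Git LFS to fetch the weight files, and a manual download is optional anyway: from_pretrained downloads and caches the weights the first time it runs. If you prefer an explicit download without Git, a short sketch using the huggingface_hub library (pip install huggingface_hub) would be:
from huggingface_hub import snapshot_download

# Download all files for the chosen variant into the local Hugging Face cache
local_dir = snapshot_download("ibm-granite/granite-4.0-350m")
print("Model files downloaded to:", local_dir)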
Installing Dependent Libraries
Before using the model, you need to install the necessary Python libraries. We recommend using the following commands:
pip install torch torchvision torchaudio
pip install accelerate
pip install transformers
- torch: The PyTorch deep learning framework, the foundation for running the model.
- accelerate: Hugging Face's library for device placement and efficient model loading (used when you pass device_map to from_pretrained).
- transformers: Hugging Face's library that provides the interfaces for loading and using the models.
Once installation is complete, you can use the model following the example code provided earlier.
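A quick sanity check that the environment is ready (the libraries import cleanly and, if present, the GPU is visible) can save debugging time later; this is just a convenience sketch:
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())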
Frequently Asked Questions (FAQ)
1. Which Languages Does Granite 4.0 Nano Support?
Currently, it supports English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. If you need support for other languages, you can extend the model’s capabilities through fine-tuning.
2. Can These Models Be Used for Commercial Projects?
Yes. All Granite 4.0 Nano models are released under the Apache 2.0 license, which allows free use for both research and commercial purposes.
3. What Hardware Configuration Is Needed to Run the Models?
Due to their compact size, 350M parameter models can run on regular CPUs and operate efficiently on mid-to-low-end GPUs (such as NVIDIA GTX series). For 1B parameter models, we recommend using a GPU with sufficient VRAM (such as NVIDIA RTX series) for better performance. For resource-constrained devices (like smartphones), you can further reduce resource requirements through optimized deployment (e.g., quantization).
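As one illustration of quantization, the sketch below loads the model in 4-bit using the third-party bitsandbytes package with transformers' BitsAndBytesConfig. This assumes an NVIDIA GPU and is only one of several options (GGUF conversion for Llama.cpp is another common route for on-device deployment):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_path = "ibm-granite/granite-4.0-350m"
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quant_config,
    device_map="auto",
)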
4. How to Choose the Right Model Variant?
- Choose 350M parameter models if resource usage is your top priority.
- Choose 1B parameter models if you need better performance.
- Prioritize hybrid architecture models (the "H" series) if your environment has optimized support for hybrid architectures.
- Choose non-"H" series models (pure dense architecture) if you need compatibility with frameworks that work better with traditional Transformers (such as Llama.cpp).
5. Can the Models Be Fine-Tuned for Specific Tasks?
Yes. The compact size of Granite 4.0 Nano makes it ideal for fine-tuning on specific domains, and it doesn’t require large-scale computing resources. You can fine-tune the model using your own dataset to improve its performance on specific tasks.
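One lightweight way to do this is parameter-efficient fine-tuning with LoRA via the peft library. The sketch below only attaches adapters and is not an official Granite recipe; the target_modules="all-linear" shortcut is a broad default, so inspect the model to narrow it down for your variant:
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model  # pip install peft

model = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-4.0-350m")
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules="all-linear",  # broad default; narrow to specific projection layers if needed
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically a small fraction of the full parameter count
Training itself can then proceed with the standard transformers Trainer or any other training loop over your dataset.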
6. How to Access Detailed Evaluation Reports for the Models?
Core evaluation results for each model variant can be found on their respective Hugging Face model cards. More comprehensive extended evaluation reports are available in the official documentation. You can also visit the Granite 4.0 Nano Hugging Face Collection for additional information.
7. What Should I Do If I Encounter Issues or Want to Provide Feedback?
You can provide feedback or ask questions in two ways:
- Visit the model's Hugging Face repository, go to the "Community" tab, and click "New discussion."
- Post questions or comments on the GitHub Discussion Page.
Conclusion
With their lightweight design and high performance, the Granite 4.0 Nano model family provides an ideal choice for AI applications on resource-constrained devices. They can handle basic Q&A interactions, complex tool calls, structured data generation, and code completion with ease.
Thanks to the flexibility of the Apache 2.0 license and support for multiple languages and tasks, these models have broad application potential in both research and commercial scenarios. If you’re looking for an AI model that is easy to deploy, fully functional, and cost-effective, Granite 4.0 Nano is well worth trying.
To learn more or get started, visit the Hugging Face Collection or official documentation for additional resources.
Citation Information
If you use Granite 4.0 Nano models in your research or projects, please cite them in the following format:
@misc{granite2025,
author = {{IBM Research}},
title = {Granite 4.0 Nano Language Models},
year = {2025},
howpublished = {\url{https://github.com/ibm-granite/granite-4.0-nano-language-models}},
note = {Accessed: 2025-10-23}
}
