nanoVLM: Building Lightweight Vision-Language Models with PyTorch
An educational framework for training efficient multimodal AI systems.
Introduction: Simplifying Vision-Language Model Development
In the evolving landscape of multimodal AI, nanoVLM emerges as a minimalist PyTorch implementation designed to democratize access to vision-language model (VLM) development. Unlike resource-intensive counterparts, this framework prioritizes:
- Accessibility: ~750 lines of human-readable code
- Modularity: Four decoupled components for easy customization
- Performance: 35.3% accuracy on the MMStar benchmark with 222M parameters
- Hardware Efficiency: Trains on a single H100 GPU in 6 hours
Inspired by the philosophy of nanoGPT, nanoVLM serves as both an educational tool and a practical foundation for lightweight AI applications.
Technical Architecture: Under the Hood
Core Components
| Module | Lines of Code | Key Function |
|---|---|---|
| Vision Backbone | ~150 | Feature extraction via SigLIP-B/16-224 |
| Language Decoder | ~250 | Text generation with SmolLM2-135M |
| Modality Projection | ~50 | Cross-modal alignment layer |
| VLM Integration | ~100 | Multimodal fusion logic |
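As a rough illustration of how these components compose, the forward pass can be thought of as: encode the image with the vision backbone, map the patch features into the language model's embedding space with a projection layer, concatenate them with the text embeddings, and decode. The sketch below is a minimal, hypothetical rendering of that idea; the class and attribute names are illustrative and do not match the repository's actual identifiers.

```python
import torch
import torch.nn as nn

class ModalityProjection(nn.Module):
    """Maps vision-encoder features into the language model's embedding space."""
    def __init__(self, vision_dim: int, lm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, lm_dim)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim)
        return self.proj(image_features)

class TinyVLM(nn.Module):
    """Illustrative composition: vision backbone -> projection -> language decoder."""
    def __init__(self, vision_encoder, language_decoder, vision_dim: int, lm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g. a SigLIP-style ViT
        self.projection = ModalityProjection(vision_dim, lm_dim)
        self.language_decoder = language_decoder  # e.g. a SmolLM2-style decoder

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        image_features = self.vision_encoder(pixel_values)       # (B, P, vision_dim)
        image_tokens = self.projection(image_features)           # (B, P, lm_dim)
        fused = torch.cat([image_tokens, text_embeds], dim=1)    # prepend image tokens
        return self.language_decoder(inputs_embeds=fused)
```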
Key Technical Features
- Pre-trained Foundations: Inherits weights from SigLIP (vision) and SmolLM2 (language)
- Dynamic Projection: Linear transformation for semantic space mapping
- Mixed Precision Training: Automatic FP16 optimization (see the training-step sketch after this list)
- Dataset Compatibility: Native support for HuggingFace Datasets
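The mixed precision point follows the standard PyTorch automatic mixed precision pattern. The loop below is a generic sketch rather than the exact code in train.py; `model`, `optimizer`, `loss_fn`, and `dataloader` are assumed to be defined elsewhere, and the batch keys are placeholders.

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for batch in dataloader:
    optimizer.zero_grad(set_to_none=True)
    # Run the forward pass in FP16 where it is numerically safe
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        logits = model(batch["pixel_values"].cuda(), batch["input_ids"].cuda())
        loss = loss_fn(logits, batch["labels"].cuda())
    scaler.scale(loss).backward()   # scale the loss so FP16 gradients don't underflow
    scaler.step(optimizer)          # unscales gradients, then runs the optimizer step
    scaler.update()                 # adjust the loss scale for the next iteration
```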
Performance Benchmarks
Training Efficiency (H100 GPU)
- Dataset: 1.7M samples from The Cauldron (a loading sketch follows this list)
- Training Time: 6 hours
- Final Loss: Stable convergence at 1.85
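Because nanoVLM works natively with Hugging Face Datasets, pulling a slice of The Cauldron looks like any other `load_dataset` call. The repository id and subset name below are assumptions used to illustrate the API; check the nanoVLM configuration for the exact values it trains on.

```python
from datasets import load_dataset

# Hypothetical subset name ("vqav2") for illustration; The Cauldron is published
# as a collection of task-specific subsets on the Hugging Face Hub.
ds = load_dataset("HuggingFaceM4/the_cauldron", "vqav2", split="train")
print(ds[0].keys())  # typically image(s) plus question/answer-style text fields
```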

Inference Demonstration
Input: Image + “What is this?”
Sample Outputs:
1. A cat sitting on a floor-facing left
2. White-brown feline on platform surface
3. Indoor cat with black-brown fur
4. Blurred-background kitten on mat
5. Brown cat resting on small rug
Step-by-Step Implementation Guide
Environment Setup
Using the uv package manager:
```bash
git clone https://github.com/huggingface/nanoVLM.git
cd nanoVLM
uv init --bare
uv sync --python 3.12
source .venv/bin/activate
uv add torch numpy torchvision pillow datasets huggingface-hub transformers
```
Traditional pip installation:
```bash
pip install torch numpy torchvision pillow datasets huggingface-hub transformers
```
Training Execution
```bash
python train.py
```
Customize training via models/config.py (see the sketch after this list):
- Learning rate scheduling
- Mixed precision modes
- Logging frequency
- Checkpoint intervals
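To make these options concrete, here is a hypothetical config sketch; the real field names live in models/config.py and will differ, so treat the names below purely as illustrations of the kinds of knobs listed above.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    lr: float = 1e-4                       # peak learning rate for the scheduler
    mixed_precision: bool = True           # enable FP16 autocast during training
    log_every_n_steps: int = 50            # how often to record metrics
    checkpoint_every_n_steps: int = 1000   # how often to save model weights

cfg = TrainConfig(lr=5e-5, checkpoint_every_n_steps=500)
```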
Model Inference
Run with pretrained weights:
```bash
python generate.py --model lusxvr/nanoVLM-222M
```
API integration example:
```python
from models import VisionLanguageModel

model = VisionLanguageModel.from_pretrained("lusxvr/nanoVLM-222M")
response = model.generate(image="assets/image.png", prompt="Describe the scene")
```
Practical Applications & Optimization Strategies
Real-World Use Cases
- Automated image captioning
- Visual question answering systems
- Multimodal search engines
- Educational assistant tools
Performance Optimization
- Data: Domain-specific fine-tuning
- Architecture: Cross-attention mechanisms
- Deployment: Model quantization (see the sketch after this list)
- Training: Contrastive learning approaches
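As one example of the quantization strategy above, PyTorch's post-training dynamic quantization can shrink the linear layers to int8. This is a sketch, not nanoVLM's shipped deployment path; `model` is assumed to be an already-loaded VisionLanguageModel, and quantizing a VLM end to end may need extra care (for example, keeping the vision tower in full precision).

```python
import torch
import torch.nn as nn

quantized = torch.ao.quantization.quantize_dynamic(
    model,          # trained model loaded elsewhere (assumption)
    {nn.Linear},    # quantize only nn.Linear modules
    dtype=torch.qint8,  # store weights as int8
)
```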
Citation & Academic Use
```bibtex
@misc{wiedmann2025nanovlm,
  author       = {Luis Wiedmann and Aritra Roy Gosthipaty and Andrés Marafioti},
  title        = {nanoVLM},
  year         = {2025},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/huggingface/nanoVLM}}
}
```
Roadmap & Community Development
The nanoVLM project maintains an active development cycle with planned features:
- Multimodal contrastive learning support
- Quantization-aware training modules
- Mobile deployment solutions
- Interactive tutorial series
Developers are encouraged to contribute through GitHub Issues and participate in quarterly release cycles.
This technical deep dive demonstrates nanoVLM’s potential as both an educational resource and practical toolkit. By balancing simplicity with performance, it lowers the barrier to entry for multimodal AI development while providing a robust foundation for research innovation. The project invites developers to explore its modular architecture and build upon its lightweight framework for specialized applications.