nanoVLM: The Simplest Guide to Training Vision-Language Models in Pure PyTorch What Is a Vision-Language Model (VLM)? What Can It Do? Imagine showing a computer a photo of cats and asking, “How many cats are in this image?” The computer not only understands the image but also answers your question in text. This type of model—capable of processing both visual and textual inputs to generate text outputs—is called a Vision-Language Model (VLM). In nanoVLM, we focus on Visual Question Answering (VQA). Below are common applications of VLMs: Input Type Example Question Example Output Task Type “Describe this image” “Two cats …