Getting Started with GLM-4.5V: A Practical Guide from Model to Desktop Assistant
“I have a Mac, an image, and I want AI to understand it—then help me build slides, record my screen, and chat. Where do I begin?”
This article breaks the official docs into a step-by-step checklist and answers the twenty questions readers ask most often. Every fact comes from the GLM-V repository; nothing has been added from outside sources.
1. What Exactly Is GLM-4.5V?
In plain language, GLM-4.5V is the newest open-source vision-language model from Zhipu. It reads text, images, videos, PDFs, and PowerPoint files, and it currently leads comparable open models on forty-two public benchmarks.
1.1 Quick Capability Map
2. Three Ways to Try It
There are three paths: (A) the hosted demo at chat.z.ai, (B) the macOS desktop assistant, and (C) self-hosting the model with vLLM or SGLang. The rest of the guide walks through each path in detail.
3. Path A: Try the Online Demo (3 Minutes)
- Open chat.z.ai.
- Upload an image, PDF, or video, or just paste text.
- Use the “Thinking Mode” toggle to compare speed vs. depth.
FAQ
Q: Why is it sometimes slow?
A: With Thinking Mode on, the model reasons internally; long prompts can take 10–20 seconds.
4. Path B: Install the Desktop Assistant (5 Minutes, macOS Only)
4.1 Download and Install
- Visit the Hugging Face Space and grab vlm-helper-1.0.6.dmg.
- Double-click the DMG and drag vlm-helper.app into Applications.
- Essential step: open Terminal and run xattr -rd com.apple.quarantine /Applications/vlm-helper.app. This removes macOS’s quarantine flag; skip it and the app will refuse to open the first time.
4.2 First-Look Tour
- Global Hotkeys
  - Screenshot region: Option-Shift-S
  - Screenshot full screen: Option-Shift-F
  - Record region: Option-Shift-R
  - Record full screen: Option-Shift-G
- Floating Window: Keep a miniature chat pane on top of any app.
- Local Storage: All history stays in an on-device SQLite database; no cloud sync.
4.3 Your First Conversation
- Launch the assistant and click “New Chat.”
- Drag any PNG onto the chat pane.
- Type “What colors dominate this image and in what proportions?”
- Within three seconds you’ll receive a text bar chart of color percentages.
5. Path C: Self-Host the Model (30 Minutes)
Commands below have been tested on Ubuntu 22.04 with A100 40 GB GPUs.
Ascend NPU users should follow the separate Modelers community guide.
5.1 Environment Setup
# Core dependencies
pip install -r requirements.txt
# Latest vLLM preview
pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
pip install transformers-v4.55.0-GLM-4.5V-preview
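Before moving on, an optional sanity check (just a sketch that prints versions and GPU visibility; it is not part of the official setup) confirms the environment imports cleanly:
# Optional sanity check: the key packages import and a GPU is visible.
import torch
import transformers
import vllm
print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers", transformers.__version__)
print("vllm", vllm.__version__)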
5.2 Launch with vLLM
vllm serve zai-org/GLM-4.5V \
--tensor-parallel-size 4 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-4.5v \
--allowed-local-media-path / \
--media-io-kwargs '{"video": {"num_frames": -1}}'
What each flag does
- --tensor-parallel-size 4 splits the model across four GPUs; use 1 if you only have one.
- --allowed-local-media-path / allows any local file path.
- num_frames: -1 keeps every frame of the video; lower it to 16 to save VRAM (a request sketch follows this list).
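With --allowed-local-media-path / and the media-io-kwargs setting above, a local video can be passed straight to the server. The sketch below assumes vLLM's OpenAI-compatible multimodal schema (video_url content parts) and the default port 8000; the file path is a placeholder:
import requests
# Sketch: send a local video through the OpenAI-compatible endpoint.
# The file:// URL is only accepted because the server was started with
# --allowed-local-media-path /; frame sampling follows --media-io-kwargs.
payload = {
    "model": "glm-4.5v",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "video_url", "video_url": {"url": "file:///absolute/path/to/clip.mp4"}},
            {"type": "text", "text": "Summarize what happens in this clip."},
        ],
    }],
}
r = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=300)
print(r.json()["choices"][0]["message"]["content"])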
5.3 Launch with SGLang (Alternative)
python3 -m sglang.launch_server --model-path zai-org/GLM-4.5V \
--tp-size 4 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--served-model-name glm-4.5v \
--port 8000 \
--host 0.0.0.0
Speed tips
- Add --attention-backend fa3 --mm-attention-backend fa3 --enable-torch-compile for lower memory use.
- Export SGLANG_VLM_CACHE_SIZE_MB=1024 before launch to cache video frames.
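Whichever launcher you choose, both expose the same OpenAI-compatible API. A quick liveness check (a sketch, assuming port 8000 as configured above) confirms the server has finished loading:
import requests
# The server lists its served model name once the weights are loaded.
models = requests.get("http://localhost:8000/v1/models", timeout=10).json()
print([m["id"] for m in models.get("data", [])])  # expect ["glm-4.5v"]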
5.4 Minimal Python Client
import requests

url = "http://localhost:8000/v1/chat/completions"
# Images are sent as image_url content parts on the OpenAI-compatible endpoint;
# the file:// URL works because the server allows local media paths.
payload = {
    "model": "glm-4.5v",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "file:///absolute/path/to/demo.jpg"}},
            {"type": "text", "text": "What is in this image?"},
        ],
    }],
}
r = requests.post(url, json=payload, timeout=60)
print(r.json()["choices"][0]["message"]["content"])
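Because the server was started with --reasoning-parser glm45, the returned message can carry the model's thinking separately from the final answer. A minimal sketch for reading both parts (the reasoning_content field name follows vLLM's reasoning-parser output and may differ between versions):
import requests
r = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={"model": "glm-4.5v",
          "messages": [{"role": "user", "content": "Introduce yourself in one sentence."}]},
    timeout=60,
)
msg = r.json()["choices"][0]["message"]
# The glm45 parser usually places the chain of thought in reasoning_content;
# fall back gracefully if the field is absent.
print("Thinking:", msg.get("reasoning_content") or "<none>")
print("Answer:", msg["content"])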
6. Fine-Tuning: Make It Speak Your Language
The repository already supports LLaMA-Factory.
Prepare finetune.json in the following format:
[
  {
    "messages": [
      {"role": "user", "content": "<image>Describe this CT scan."},
      {"role": "assistant", "content": "<think>Observing the lower left area...</think><answer>Ground-glass nodule in the left lower lobe...</answer>"}
    ],
    "images": ["ct/001.jpg"]
  }
]
Key notes
- Content inside <think> is ignored in chat history and fine-tuning.
- GLM-4.5V drops the <answer> tags; remove them in your dataset.
Run llamafactory-cli train with the usual LLaMA-Factory workflow.
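If you build the dataset programmatically, a small helper like the hypothetical sketch below keeps records in the shape shown above; following the key notes, it omits the <answer> tags for GLM-4.5V:
import json

# Hypothetical helper (not from the repo): emit records in the format above,
# without <answer> tags, as recommended for GLM-4.5V.
def make_record(image_path, question, thinking, answer):
    return {
        "messages": [
            {"role": "user", "content": f"<image>{question}"},
            {"role": "assistant", "content": f"<think>{thinking}</think>{answer}"},
        ],
        "images": [image_path],
    }

records = [make_record("ct/001.jpg", "Describe this CT scan.",
                       "Observing the lower left area...",
                       "Ground-glass nodule in the left lower lobe...")]
with open("finetune.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)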
7. FAQ: Twenty Questions Everyone Asks
8. Mini Case Study: Produce a Weekly Report in Five Minutes
- Each weekday, take a screenshot of your progress and save it as a PNG.
- On Friday, open the desktop assistant and start a new chat.
- Drag all four screenshots into the window.
- Type: “Write a weekly summary based on these images, bullet each day.”
- Ten seconds later you’ll have a markdown list ready to paste into Slack or email.
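The same workflow also works against a self-hosted server from Path C. A rough sketch (the screenshot paths are placeholders; file:// URLs need --allowed-local-media-path / on the vLLM side):
import requests

# Sketch: ask the Path C server for the weekly summary instead of the desktop app.
shots = [f"file:///absolute/path/to/day{i}.png" for i in range(1, 5)]
content = [{"type": "image_url", "image_url": {"url": u}} for u in shots]
content.append({"type": "text", "text": "Write a weekly summary based on these images, bullet each day."})
r = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={"model": "glm-4.5v", "messages": [{"role": "user", "content": content}]},
    timeout=120,
)
print(r.json()["choices"][0]["message"]["content"])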
9. Troubleshooting Checklist
10. Next Steps
- Just curious? Visit chat.z.ai and upload a file.
- Own a Mac (M-series)? Grab the desktop DMG and be productive in five minutes.
- Need full control? Spin up vLLM or SGLang, then fine-tune with LLaMA-Factory.
GLM-4.5V bundles vision, language, documents, and GUI understanding into one open model. The maintainers have wrapped the heavy lifting into both a one-click desktop app and fully reproducible scripts. Whether you just want a smarter assistant or you’re building the next multimodal product, the path from “zero to working demo” is now a matter of minutes, not weeks.