Getting Started with GLM-4.5V: A Practical Guide from Model to Desktop Assistant
> "I have a Mac, an image, and I want AI to understand it, then help me build slides, record my screen, and chat. Where do I begin?"
This article breaks the official docs into a step-by-step checklist and answers the twenty questions readers ask most often. Every fact comes from the GLM-V repository; nothing has been added from outside sources.
1. What Exactly Is GLM-4.5V?
In plain language, GLM-4.5V is the newest open-source vision-language model from Zhipu. It reads text, images, videos, PDFs, and PowerPoint files, and it currently leads comparable open models on forty-two public benchmarks.
1.1 Quick Capability Map
| Scenario | What It Can Do | Simple Example |
|---|---|---|
| Images | Scene understanding, object grounding, multi-image reasoning | Upload ten vacation photos and ask for a day-by-day itinerary |
| Videos | Shot-by-shot analysis, event tagging | Drop in a recorded meeting and receive a timestamped summary |
| Documents | Extract numbers, answer questions from long PDFs | Feed a 60-page report and ask, "What was the Q3 gross margin?" |
| GUI Tasks | Recognize icons, guide desktop automation | Screenshot Excel and ask, "How do I turn this into a bar chart?" |
| Thinking Mode | Step-by-step internal reasoning before answering | Toggle on for hard questions, off for quick replies |
2. Three Ways to Try It
| Approach | Best For | Audience |
|---|---|---|
| Online Demo | Zero-setup preview | Curious first-timers |
| Desktop Assistant | One-click screenshots and local storage | Mac owners with Apple Silicon (M1/M2/M3) |
| Local Deployment | Air-gapped data or custom fine-tuning | Developers and enterprises |
The rest of the guide walks through each path in detail.
3. Path A: Try the Online Demo (3 Minutes)
1. Open chat.z.ai.
2. Upload an image, PDF, or video, or just paste text.
3. Use the "Thinking Mode" toggle to compare speed vs. depth.
FAQ
Q: Why is it sometimes slow?
A: With Thinking Mode on, the model reasons internally; long prompts can take 10–20 seconds.
4. Path B: Install the Desktop Assistant (5 Minutes, macOS Only)
4.1 Download and Install
1. Visit the Hugging Face Space and grab `vlm-helper-1.0.6.dmg`.
2. Double-click the DMG and drag `vlm-helper.app` into Applications.
3. Essential step: open Terminal and run `xattr -rd com.apple.quarantine /Applications/vlm-helper.app`. This removes macOS's quarantine flag; skip it and the app will refuse to open the first time.
4.2 First-Look Tour
- Global Hotkeys
  - Screenshot region: Option-Shift-S
  - Screenshot full screen: Option-Shift-F
  - Record region: Option-Shift-R
  - Record full screen: Option-Shift-G
- Floating Window: Keep a miniature chat pane on top of any app.
- Local Storage: All history stays in an on-device SQLite database, with no cloud sync (a quick peek at the file is sketched below).
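The database schema isn't documented in this guide, so treat the snippet below as a hypothetical peek only: it opens the SQLite file at the path given in the FAQ (Section 7) and lists whatever tables it finds.

```python
import sqlite3
from pathlib import Path

# Path taken from the FAQ in Section 7; the schema is undocumented, so we only list tables.
db = Path.home() / "Library/Application Support/vlm-helper/db.sqlite3"
with sqlite3.connect(db) as conn:
    tables = conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
print("Tables in the local history DB:", [name for (name,) in tables])
```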
4.3 Your First Conversation
1. Launch the assistant and click "New Chat."
2. Drag any PNG onto the chat pane.
3. Type "What colors dominate this image and in what proportions?"
4. Within three seconds you'll receive a text bar chart of color percentages.
5. Path C: Self-Host the Model (30 Minutes)
> Commands below have been tested on Ubuntu 22.04 with an A100 40 GB. Ascend NPU users should follow the separate Modelers community guide.
5.1 Environment Setup
```bash
# Core dependencies
pip install -r requirements.txt

# Latest vLLM preview
pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
pip install transformers-v4.55.0-GLM-4.5V-preview
```
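Optional but handy: a two-line import check to confirm the preview builds installed cleanly before moving on (nothing here is GLM-specific).

```python
# Verify the freshly installed packages import and report their versions.
import transformers
import vllm

print("transformers:", transformers.__version__)
print("vllm:", vllm.__version__)
```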
5.2 Launch with vLLM
```bash
vllm serve zai-org/GLM-4.5V \
  --tensor-parallel-size 4 \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.5v \
  --allowed-local-media-path / \
  --media-io-kwargs '{"video": {"num_frames": -1}}'
```
What each flag does:

- `--tensor-parallel-size 4` splits the model across four GPUs; use 1 if you only have one.
- `--allowed-local-media-path /` allows any local file path.
- `"num_frames": -1` keeps every frame of the video; lower it to 16 to save VRAM (see the request sketch after this list).
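To see the frame setting in action, here is a sketch of a video request against the server above. It assumes the endpoint follows vLLM's usual OpenAI-compatible multimodal schema with a `video_url` content part; the file path is a placeholder.

```python
import requests

payload = {
    "model": "glm-4.5v",
    "messages": [{
        "role": "user",
        "content": [
            # file:// URLs are accepted because of --allowed-local-media-path /
            {"type": "video_url", "video_url": {"url": "file:///absolute/path/to/meeting.mp4"}},
            {"type": "text", "text": "Give me a timestamped summary of this recording."},
        ],
    }],
}
r = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=300)
print(r.json()["choices"][0]["message"]["content"])
```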
5.3 Launch with SGLang (Alternative)
```bash
python3 -m sglang.launch_server --model-path zai-org/GLM-4.5V \
  --tp-size 4 \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --served-model-name glm-4.5v \
  --port 8000 \
  --host 0.0.0.0
```
Speed tips:

- Add `--attention-backend fa3 --mm-attention-backend fa3 --enable-torch-compile` for lower memory use.
- Export `SGLANG_VLM_CACHE_SIZE_MB=1024` before launch to cache video frames (a launcher sketch follows below).
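If you prefer scripting the launch, the small wrapper below sets the cache variable and runs the exact command from above via `subprocess`; it is a convenience sketch, not an official launcher.

```python
import os
import subprocess

# Same command as above, with the frame cache enabled via the environment.
env = dict(os.environ, SGLANG_VLM_CACHE_SIZE_MB="1024")
cmd = [
    "python3", "-m", "sglang.launch_server",
    "--model-path", "zai-org/GLM-4.5V",
    "--tp-size", "4",
    "--tool-call-parser", "glm45",
    "--reasoning-parser", "glm45",
    "--served-model-name", "glm-4.5v",
    "--port", "8000",
    "--host", "0.0.0.0",
]
subprocess.run(cmd, env=env, check=True)
```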
5.4 Minimal Python Client
```python
import requests

url = "http://localhost:8000/v1/chat/completions"
# Images go inside the message content per the OpenAI-compatible schema;
# file:// URLs work because the server was started with --allowed-local-media-path /.
payload = {
    "model": "glm-4.5v",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "file:///absolute/path/to/demo.jpg"}},
            {"type": "text", "text": "What is in this image?"},
        ],
    }],
}
r = requests.post(url, json=payload, timeout=60)
print(r.json()["choices"][0]["message"]["content"])
```
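As the FAQ in Section 7 notes, you can send several images at once; with this schema that simply means more `image_url` parts in the same `content` list. Reusing `url` and `payload` from the client above (paths are placeholders):

```python
payload["messages"][0]["content"] = [
    {"type": "image_url", "image_url": {"url": "file:///absolute/path/to/day1.png"}},
    {"type": "image_url", "image_url": {"url": "file:///absolute/path/to/day2.png"}},
    {"type": "text", "text": "Compare these two screenshots and list what changed."},
]
print(requests.post(url, json=payload, timeout=60).json()["choices"][0]["message"]["content"])
```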
6. Fine-Tuning: Make It Speak Your Language
The repository already supports LLaMA-Factory.
Prepare `finetune.json` in the following format:
```json
[
  {
    "messages": [
      {"role": "user", "content": "<image>Describe this CT scan."},
      {"role": "assistant", "content": "<think>Observing the lower left area...</think><answer>Ground-glass nodule in the left lower lobe...</answer>"}
    ],
    "images": ["ct/001.jpg"]
  }
]
```
Key notes:

- Content inside `<think>` is ignored in chat history and fine-tuning.
- GLM-4.5V drops the `<answer>` tags; remove them in your dataset.

Run `llamafactory-cli train` with the usual LLaMA-Factory workflow.
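Before kicking off a run, a quick shape check on the dataset can save a failed job. The sketch below is not part of the repository; it only asserts the structure shown above: every `<image>` placeholder has a matching path, and assistant turns carry no leftover `<answer>` tags.

```python
import json
import re

def check_dataset(path="finetune.json"):  # the filename is the one suggested above
    records = json.load(open(path, encoding="utf-8"))
    for i, rec in enumerate(records):
        msgs, images = rec["messages"], rec.get("images", [])
        # Each <image> placeholder in user turns should correspond to one entry in "images".
        placeholders = sum(m["content"].count("<image>") for m in msgs if m["role"] == "user")
        assert placeholders == len(images), f"record {i}: {placeholders} <image> tags vs {len(images)} images"
        # Per the note above, strip <answer> tags from assistant turns for GLM-4.5V.
        for m in msgs:
            if m["role"] == "assistant" and re.search(r"</?answer>", m["content"]):
                print(f"record {i}: remove <answer> tags from the assistant turn")

check_dataset()
```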
7. FAQ: Twenty Questions Everyone Asks
| Question | Short Answer |
|---|---|
| Can I use the desktop app on Windows? | Not yet; macOS Apple Silicon only. |
| Will it run on Intel Macs? | No build is provided; use local deployment instead. |
| How long a video can I feed? | Depends on VRAM; a 10-minute 1080p clip works on an A100 40 GB. |
| Is plain-text Q&A any good? | The team admits it still needs work; focus on multimodal tasks. |
| How do I disable Thinking Mode? | Add `"chat_template_kwargs": {"enable_thinking": false}` to the request body (see the sketch after this table). |
| HTML in the output looks broken. | Run `inference/html_detector.py` to wrap it in markdown and fix escaping. |
| Can I use this commercially? | Check the license in the repository and follow its open-source terms. |
| Why does the model repeat itself? | Over-thinking can occur; disable Thinking Mode or shorten the context. |
| Where is my chat history stored? | In `~/Library/Application Support/vlm-helper/db.sqlite3` on macOS. |
| Can I change the hotkeys? | Not in the current release; future versions may expose them. |
| Does it need internet? | The desktop app calls an online API; local deployment runs offline. |
| How much disk space? | Model weights are ~20 GB FP16, ~10 GB FP8. |
| Is there an iOS or Android version? | No mobile builds yet. |
| Can the model browse the web? | No browsing capability; it only sees what you upload. |
| Is quantization supported? | FP8 weights (`GLM-4.5V-FP8`) are already published. |
| Do I need Docker? | Optional; bare metal works as shown above. |
| Can I feed multiple images at once? | Yes; include several image entries in one message (see Section 5.4). |
| How do I limit GPU usage? | Lower `--tensor-parallel-size` or use the FP8 checkpoint. |
| Why do I get a 400 error? | File paths with spaces or non-ASCII characters cause issues; rename or URL-encode them. |
| Where can I report bugs? | Use the issue tracker in the GitHub repository. |
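For reference, here is a minimal sketch of the Thinking Mode toggle from the table above, aimed at the self-hosted vLLM endpoint from Section 5.2. The field name comes straight from the FAQ row; everything else is ordinary OpenAI-compatible plumbing.

```python
import requests

payload = {
    "model": "glm-4.5v",
    "messages": [{"role": "user", "content": "Summarize GLM-4.5V in one sentence."}],
    # Turning Thinking Mode off trades reasoning depth for faster replies.
    "chat_template_kwargs": {"enable_thinking": False},
}
r = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=60)
print(r.json()["choices"][0]["message"]["content"])
```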
8. Mini Case Study: Produce a Weekly Report in Five Minutes
1. Each weekday, take a screenshot of your progress and save it as a PNG.
2. On Friday, open the desktop assistant and start a new chat.
3. Drag all four screenshots into the window.
4. Type: "Write a weekly summary based on these images, bullet each day."
5. Ten seconds later you'll have a markdown list ready to paste into Slack or email.
9. Troubleshooting Checklist
| Symptom | Likely Cause | Fix |
|---|---|---|
| App won't launch | Forgot the `xattr` command | Re-run the command in Terminal |
| VRAM out of memory | Too many video frames | Reduce `num_frames` to 8 or 16 |
| 400 Bad Request | Spaces or non-ASCII characters in the path | Rename the file or URL-encode it |
| Repetitive answer | Over-long thinking | Disable Thinking Mode or trim history |
10. Next Steps
- Just curious? Visit chat.z.ai and upload a file.
- Own a Mac (M-series)? Grab the desktop DMG and be productive in five minutes.
- Need full control? Spin up vLLM or SGLang, then fine-tune with LLaMA-Factory.
GLM-4.5V bundles vision, language, documents, and GUI understanding into one open model. The maintainers have wrapped the heavy lifting into both a one-click desktop app and fully reproducible scripts. Whether you just want a smarter assistant or you’re building the next multimodal product, the path from “zero to working demo” is now a matter of minutes, not weeks.