GLM-4.5V Unleashed: Transform Your Mac into an AI Vision Powerhouse

Getting Started with GLM-4.5V: A Practical Guide from Model to Desktop Assistant

“I have a Mac, an image, and I want AI to understand it—then help me build slides, record my screen, and chat. Where do I begin?”
This article breaks the official docs into a step-by-step checklist and answers the twenty questions readers ask most often. Every fact comes from the GLM-V repository; nothing has been added from outside sources.


1. What Exactly Is GLM-4.5V?

In plain language, GLM-4.5V is the newest open-source vision-language model from Zhipu. It reads text, images, videos, PDFs, and PowerPoint files, and it currently leads comparable open models on forty-two public benchmarks.

1.1 Quick Capability Map

| Scenario | What It Can Do | Simple Example |
|---|---|---|
| Images | Scene understanding, object grounding, multi-image reasoning | Upload ten vacation photos and ask for a day-by-day itinerary |
| Videos | Shot-by-shot analysis, event tagging | Drop in a recorded meeting and receive a timestamped summary |
| Documents | Extract numbers, answer questions from long PDFs | Feed a 60-page report and ask, “What was the Q3 gross margin?” |
| GUI Tasks | Recognize icons, guide desktop automation | Screenshot Excel and ask, “How do I turn this into a bar chart?” |
| Thinking Mode | Step-by-step internal reasoning before answering | Toggle on for hard questions, off for quick replies |

2. Three Ways to Try It

| Approach | Best For | Audience |
|---|---|---|
| Online Demo | Zero-setup preview | Curious first-timers |
| Desktop Assistant | One-click screenshots and local storage | Mac owners with Apple Silicon (M1/M2/M3) |
| Local Deployment | Air-gapped data or custom fine-tuning | Developers and enterprises |

The rest of the guide walks through each path in detail.


3. Path A: Try the Online Demo (3 Minutes)

  1. Open chat.z.ai.
  2. Upload an image, PDF, or video—or just paste text.
  3. Use the “Thinking Mode” toggle to compare speed vs. depth.

FAQ
Q: Why is it sometimes slow?
A: With Thinking Mode on, the model reasons internally; long prompts can take 10–20 seconds.


4. Path B: Install the Desktop Assistant (5 Minutes, macOS Only)

4.1 Download and Install

  1. Visit the Hugging Face Space and grab vlm-helper-1.0.6.dmg.
  2. Double-click the DMG and drag vlm-helper.app into Applications.
  3. Essential step: open Terminal and run
    xattr -rd com.apple.quarantine /Applications/vlm-helper.app
    

    This removes macOS’s quarantine flag; skip it and the app will refuse to open the first time.

4.2 First-Look Tour

  • Global Hotkeys
    • Screenshot region: Option-Shift-S
    • Screenshot full screen: Option-Shift-F
    • Record region: Option-Shift-R
    • Record full screen: Option-Shift-G
  • Floating Window: Keep a miniature chat pane on top of any app.
  • Local Storage: All history stays in an on-device SQLite database—no cloud sync.
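
Because the history is an ordinary SQLite file, you can inspect it yourself from Python. A minimal sketch, assuming the default path listed in the FAQ (section 7) and making no assumptions about the undocumented schema:

import sqlite3
from pathlib import Path

# Default location per the FAQ; adjust if your install differs.
db = Path.home() / "Library" / "Application Support" / "vlm-helper" / "db.sqlite3"
con = sqlite3.connect(db)
# The schema is undocumented, so just list the tables it contains.
for (name,) in con.execute("SELECT name FROM sqlite_master WHERE type='table'"):
    print(name)
con.close()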

4.3 Your First Conversation

  1. Launch the assistant and click “New Chat.”
  2. Drag any PNG onto the chat pane.
  3. Type “What colors dominate this image and in what proportions?”
  4. Within three seconds you’ll receive a text bar chart of color percentages.

5. Path C: Self-Host the Model (30 Minutes)

Commands below have been tested on Ubuntu 22.04 with an A100 40 GB.
Ascend NPU users should follow the separate Modelers community guide.

5.1 Environment Setup

# Core dependencies
pip install -r requirements.txt
# Latest vLLM preview
pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
pip install transformers-v4.55.0-GLM-4.5V-preview

5.2 Launch with vLLM

vllm serve zai-org/GLM-4.5V \
  --tensor-parallel-size 4 \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.5v \
  --allowed-local-media-path / \
  --media-io-kwargs '{"video": {"num_frames": -1}}'

What each flag does

  • --tensor-parallel-size 4 splits the model across four GPUs; use 1 if you only have one.
  • --allowed-local-media-path / lets the server read media from any local path (needed for the file:// URLs in section 5.4); in production, scope it to a specific directory.
  • num_frames: -1 keeps every frame of the video; lower it to 16 to save VRAM.
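
Once the server reports ready, a quick sanity check against the OpenAI-compatible /v1/models endpoint confirms it is serving under the expected name. A minimal sketch, assuming the default port 8000:

import requests

# Expect "glm-4.5v" in the list, per --served-model-name above.
r = requests.get("http://localhost:8000/v1/models", timeout=10)
r.raise_for_status()
print([m["id"] for m in r.json()["data"]])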

5.3 Launch with SGLang (Alternative)

python3 -m sglang.launch_server --model-path zai-org/GLM-4.5V \
  --tp-size 4 \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --served-model-name glm-4.5v \
  --port 8000 \
  --host 0.0.0.0

Speed tips

  • Add --attention-backend fa3 --mm-attention-backend fa3 --enable-torch-compile for lower memory use.
  • Export SGLANG_VLM_CACHE_SIZE_MB=1024 before launch to cache video frames.

5.4 Minimal Python Client

The server exposes the standard OpenAI-compatible chat endpoint, so images travel inside the message content rather than as a separate field:

import requests

url = "http://localhost:8000/v1/chat/completions"
payload = {
  "model": "glm-4.5v",
  "messages": [{
    "role": "user",
    "content": [
      # file:// URLs work because the server was launched with
      # --allowed-local-media-path
      {"type": "image_url", "image_url": {"url": "file:///absolute/path/to/demo.jpg"}},
      {"type": "text", "text": "What is in this image?"},
    ],
  }],
}
r = requests.post(url, json=payload, timeout=60)
r.raise_for_status()
print(r.json()["choices"][0]["message"]["content"])
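
Thinking Mode can also be switched off per request. A minimal sketch reusing the payload above, based on the chat_template_kwargs switch documented in the FAQ (section 7):

# Skip the internal reasoning pass for faster, shallower replies.
payload["chat_template_kwargs"] = {"enable_thinking": False}
r = requests.post(url, json=payload, timeout=60)
print(r.json()["choices"][0]["message"]["content"])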

6. Fine-Tuning: Make It Speak Your Language

The repository already supports LLaMA-Factory.
Prepare finetune.json in the following format:

[
  {
    "messages": [
      {"role": "user", "content": "<image>Describe this CT scan."},
      {"role": "assistant", "content": "<think>Observing the lower left area...</think>Ground-glass nodule in the left lower lobe..."}
    ],
    "images": ["ct/001.jpg"]
  }
]

Key notes

  • Content inside <think> is ignored in chat history and fine-tuning.
  • GLM-4.5V no longer uses <answer> tags; the sample above already omits them, so strip them from any legacy dataset.

Run llamafactory-cli train with the usual LLaMA-Factory workflow.
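
If you are migrating a dataset that still carries the legacy <answer> tags, a small hypothetical cleanup script can strip them; this sketch assumes the finetune.json layout shown above:

import json
import re

# Hypothetical helper: remove legacy <answer>...</answer> tags from
# assistant turns so the data matches the GLM-4.5V format.
with open("finetune.json") as f:
    data = json.load(f)

for sample in data:
    for msg in sample["messages"]:
        if msg["role"] == "assistant":
            msg["content"] = re.sub(r"</?answer>", "", msg["content"])

with open("finetune_clean.json", "w") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)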


7. FAQ: Twenty Questions Everyone Asks

| Question | Short Answer |
|---|---|
| Can I use the desktop app on Windows? | Not yet—macOS Apple Silicon only. |
| Will it run on Intel Macs? | No build is provided; use local deployment instead. |
| How long a video can I feed? | Depends on VRAM; a 10-minute 1080p clip works on an A100 40 GB. |
| Is plain-text Q&A any good? | The team admits it still needs work; focus on multimodal tasks. |
| How do I disable Thinking Mode? | Add "chat_template_kwargs": {"enable_thinking": False} to the request body. |
| HTML in the output looks broken. | Run inference/html_detector.py to wrap it in markdown and fix escaping. |
| Can I use this commercially? | Check the license in the repository; follow its open-source terms. |
| Why does the model repeat itself? | Over-thinking can occur; disable Thinking Mode or shorten the context. |
| Where is my chat history stored? | In ~/Library/Application Support/vlm-helper/db.sqlite3 on macOS. |
| Can I change the hotkeys? | Not in the current release; future versions may expose them. |
| Does it need internet? | The desktop app calls an online API; local deployment runs offline. |
| How much disk space? | Model weights are ~20 GB FP16, ~10 GB FP8. |
| Is there an iOS or Android version? | No mobile builds yet. |
| Can the model browse the web? | No browsing capability; it only sees what you upload. |
| Is quantization supported? | FP8 weights (GLM-4.5V-FP8) are already published. |
| Do I need Docker? | Optional; bare metal works as shown above. |
| Can I feed multiple images at once? | Yes—include several image entries in one message (see the example below). |
| How do I limit GPU usage? | Lower --tensor-parallel-size or use the FP8 checkpoint. |
| Why do I get a 400 error? | File paths with spaces or non-ASCII characters cause issues; rename or URL-encode them. |
| Where can I report bugs? | Use the issue tracker in the GitHub repository. |
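
For the multi-image case above, the request simply carries several image entries in one message. A minimal sketch following the same OpenAI-compatible format as the client in section 5.4 (the file names are placeholders); it also previews the weekly-report workflow in the next section:

import requests

# Hypothetical multi-image request: several screenshots plus one instruction.
payload = {
  "model": "glm-4.5v",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "image_url", "image_url": {"url": "file:///absolute/path/to/mon.png"}},
      {"type": "image_url", "image_url": {"url": "file:///absolute/path/to/tue.png"}},
      {"type": "text", "text": "Write a weekly summary based on these images, bullet each day."},
    ],
  }],
}
r = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=120)
print(r.json()["choices"][0]["message"]["content"])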

8. Mini Case Study: Produce a Weekly Report in Five Minutes

  1. Each weekday, take a screenshot of your progress and save it as PNG.
  2. On Friday, open the desktop assistant and start a new chat.
  3. Drag all four screenshots into the window.
  4. Type: “Write a weekly summary based on these images, bullet each day.”
  5. Ten seconds later you’ll have a markdown list ready to paste into Slack or email.

9. Troubleshooting Checklist

| Symptom | Likely Cause | Fix |
|---|---|---|
| App won’t launch | Forgot the xattr command | Re-run the command in Terminal |
| VRAM out of memory | Too many video frames | Reduce num_frames to 8 or 16 |
| 400 Bad Request | Spaces or non-ASCII characters in the path | Rename the file or URL-encode it |
| Repetitive answer | Over-long thinking | Disable Thinking Mode or trim history |

10. Next Steps

  • Just curious? Visit chat.z.ai and upload a file.
  • Own a Mac (M-series)? Grab the desktop DMG and be productive in five minutes.
  • Need full control? Spin up vLLM or SGLang, then fine-tune with LLaMA-Factory.

GLM-4.5V bundles vision, language, documents, and GUI understanding into one open model. The maintainers have wrapped the heavy lifting into both a one-click desktop app and fully reproducible scripts. Whether you just want a smarter assistant or you’re building the next multimodal product, the path from “zero to working demo” is now a matter of minutes, not weeks.
