From Cat vs. Dog Showdowns on Your Phone to the Edge AI Revolution: Building High-Accuracy Image Classifiers with Local Visual Language Models
Picture this: You’re lounging on the couch, scrolling through Instagram, and a friend’s post pops up—a fluffy orange tabby cat mid-yawn. Tap once, and your phone instantly chimes in: “Cat, 99.9% confidence.” No cloud ping-pong, no lag, just pure local magic. Sounds like a gimmick? For developers like us, it’s the holy grail of edge AI: running sophisticated image classification right on-device, offline and lightning-fast. I’ve battled my share of bloated cloud APIs and privacy nightmares, but this repo changes the game. Using Liquid AI’s open-source LFM2-VL visual language models (VLMs) and the Leap Edge SDK, we’ll craft a production-ready classifier—from a simple cat-vs-dog demo to iOS deployment. No fluff, just actionable steps to supercharge your edge AI projects. Let’s dive in.
Why Edge AI Image Classification Feels Like the Ultimate “Guess That Pic” Challenge
Remember those childhood “guess the animal” games with blurry sketches? Image classification is the AI upgrade: a cornerstone of computer vision that powers real-time decisions without Wi-Fi. Think self-driving cars dodging pedestrians—no time for server round-trips—or factory lines spotting defects on the fly.
This repo’s secret sauce? VLMs that “translate” images into structured text outputs via prompts. Forget rigid CNN labels; these models chat like experts, spitting out poetic descriptions or precise JSON for agentic workflows. I’m hooked because they unlock local multimodal magic—your phone doesn’t just see; it understands intent. The LFM2-VL lineup (from nimble 450M-param nano to beefy 1.6B) is tailor-made for edge devices: open-weight, efficient, and paired with Leap for seamless iOS integration.
But why build this now? If you’re tackling on-device defect detection in manufacturing or privacy-first medical imaging for skin lesions, this guide is your blueprint. It skips theory overload, focusing on hands-on iteration: from eval pipelines to LoRA fine-tuning that squeezes every drop of performance. Got your Python setup? We’re starting with the cat-vs-dog classic to nail production-grade thinking.
Cat vs. Dog Classification: From 97% Awkward to 100% Flawless—A Detective’s Iteration Tale
Cat-vs-dog might scream “bootcamp exercise,” but this repo uses it to drill real-world rigor: accuracy isn’t luck; it’s unearthed through sample-by-sample sleuthing, prompt tweaks, and fine-tuning. We’ll play detective—no equations, just chasing model “gotchas” step by step.
Step 1: Assemble Your “Detective Kit”—Building the Evaluation Pipeline
Start with evals, or you’ll fly blind on failures. The repo’s evaluate.py script is a Swiss Army knife: feed it a dataset, model, and prompt params; get back accuracy (correct predictions ratio for balanced classes). Clean dir setup: configs/ for YAML tweaks, image-to-json/ for logic, evals/ for CSV reports (base64 images + preds + labels).
Warm up your env: Grab uv (Astral’s zippy Python manager) via their official guide, then uv sync for deps. Dataset? Hugging Face’s microsoft/cats_vs_dogs—balanced and battle-tested.
Leverage Modal for GPU bursts (no home NVIDIA? No sweat). Ditch verbose CLI with Makefile. Baseline config in cats_vs_dogs_v0.yaml:
seed: 23  # For reproducibility
model: "LiquidAI/LFM2-VL-450M"
structured_generation: false
dataset: "microsoft/cats_vs_dogs"
n_samples: 100
split: train
image_column: "image"
label_column: "labels"
label_mapping:
  0: "cat"
  1: "dog"
system_prompt: |
  You are a veterinarian specialized in analyzing pictures of cats and dogs
  You excel at identifying the type of animal from a picture.
user_prompt: |
  What animal in the following list is the one you see in the picture?
  - cat
  - dog
  Provide your answer as a single animal from the list without any additional text.
Shell it: make evaluate CONFIG_FILE=cats_vs_dogs_v0.yaml. Boom—97% accuracy. Solid, but where’s the “wow”? Time to zoom in.
Step 2: Magnifying Glass on the Truth—Sample-Level Breakdowns
Enter notebooks/visualize_evals.ipynb: a Jupyter powerhouse. Fire it up with uv run jupyter notebook notebooks/visualize_evals.ipynb, then eval_report.print(only_misclassified=True) to spotlight the three culprits.

Green for wins, red for wrecks. The notebook lays bare your model’s blind spots in seconds.
Suspect #1? Dataset glitch: “Adopted” watermark screams “neither”—drop it (or add an “other” class for robustness, as the repo wisely suggests). The rest? Edge cases like multi-cat shots or fuzzy pups. VLMs falter here; they’re next-token gamblers, not logic bots.
Option A: Scale up to LFM2-VL-1.6B. Swap to cats_vs_dogs_v1.yaml, eval again: 99%. But plot twist—hallucinations! It tags a pooch “pug.” Creative? Sure. On-task? Nope.
Step 3: Structured Generation—Putting the LM on a Leash
Cue the hero: Outlines’ structured generation, masking tokens to enforce JSON like {"pred_class": "dog"}. Repo impl in inference.py‘s get_structured_model_output. Flip structured_generation: true in cats_vs_dogs_v2.yaml.
Re-eval: 98%, zero hallucinations. Production gold—fixed outputs, easy chaining.

Watch tokens get corralled: deterministic prefix, choice in the middle, locked suffix. Outlines turns poets into precision engineers.
Step 4: LoRA Fine-Tuning—Squeezing the Last 1% Juice
97% to 100%? Supervised fine-tuning via LoRA (Low-Rank Adaptation)—freeze the behemoth params, train tiny adapters. Per Wikipedia, it exploits low-rank update magic, slashing param counts.
Prep data: make train-test-split INPUT_DATASET_NAME=microsoft/cats_vs_dogs OUTPUT_DIR=Paulescu/cats_vs_dogs TRAIN_SIZE=0.9 SEED=42. Lands on HF as Paulescu/cats_vs_dogs.
Hyperparams in finetune_cats_vs_dogs.yaml (lr 5e-4, LoRA r=8, targeting q/k/v proj):
# ... (core sections omitted)
learning_rate: 5e-4
num_train_epochs: 1
batch_size: 1
gradient_accumulation_steps: 16
use_peft: true
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj
TRL + Modal: make fine-tune CONFIG_FILE=finetune_cats_vs_dogs.yaml. Track via WandB—train loss dips, eval confirms smarts. Checkpoint-600 in cats_vs_dogs_v4.yaml: 1.0 accuracy. Zero errors.

Left: Training loss plummets. Right: Eval loss validates real learning, not rote memorization.
Took me three weeks, but that’s the grind—from baseline to bulletproof.
Advanced Tasks: Leveling Up to Car Recognition and Action Detection
Beyond cats: Task 2 (car brand/model/year ID) cranks difficulty—multi-label chaos. Expect JSON structs + fine-tuning for “2023 Tesla Model Y” granularity. Task 3 (human action recognition)? Mid-tier: frame-by-frame for video. Both incoming, but the playbook’s identical: eval-optimize-tune.
iOS Deployment: Getting Your Model “Home” on the Device
Theory meets metal with Leap Edge SDK. Swift snippet’s elegantly simple:
enum AnimalClassification: String, CaseIterable {
    case dog = "dog"
    case cat = "cat"
}
func classify(image: UIImage) async -> AnimalClassification {
    // TODO: Load fine-tuned model via Leap, async infer
    // Real deal: return await leapModel.predict(image)
    return AnimalClassification.allCases.randomElement() ?? .dog
}
Bundle your checkpoint, async invoke—sub-100ms latency. iPhone-friendly, LoRA keeps it lean. Docs are gold; build it, and your app becomes an AI sleuth.
FAQ: Tackling Your Edge AI Image Classification Questions Head-On
Q: How do I handle imbalanced datasets in evals?
A: Repo defaults to accuracy, but for high-stakes like autonomous driving (“danger” class <1%), pivot to F1 or per-class recall. evaluate.py extends easily.
Q: No GPU? How to fine-tune?
A: Modal’s pay-per-use ($0.50/hr entry). Or Colab free tier—the repo’s batch=1 + grad accum works on CPU (just slower).
Q: VLMs vs. Traditional CNNs—Which Wins for Edge AI?
A: VLMs flex with prompts (API-like), CNNs edge on speed. Repo bets VLMs for structured outputs; Hugging Face benchmarks show 20% multimodal edge gains.
Q: Prompt optimization tips?
A: Manual role-play + lists shine here. Auto? DSPy + MIPROv2. Dive deeper: DSPy Prompt Opt Tutorial.
Wrapping Up: Edge AI Isn’t the Endgame—It’s Your Launchpad
From that cringey 97% cat flop to silky 100%, we’ve forged more than a classifier—we’ve honed an “AI detective” mindset. VLMs elevate edge devices from silent spectators to savvy sidekicks: drones spotting threats locally, smart homes guarding privacy. The horizon? Multimodal agents rule, and this repo’s sequel covers it.
Your move: Fork, spin up a cat-dog demo. What’s your killer app? Drop thoughts in comments—I’m substack-bound at paulabartabajo.substack.com. Stay curious; code never sleeps.
Updated October 21, 2025 | Tags: #VisualLanguageModels #EdgeAI #LoRAFineTuning #ImageClassification

