Efficient Coder


llmfit: The One Command That Finds the Right LLM for Your Hardware

Struggling to choose which local large language model to run? llmfit automatically detects your CPU, RAM, and GPU, then scans 206 models to find the ones that will actually run well on your machine. It gives you a composite score and speed estimate for each. No more guessing—one command does it all.

Why You Need llmfit

Large language models (LLMs) are evolving fast. Meta Llama, Mistral, Qwen, DeepSeek… new models drop every week. But when you want to pull one onto your own machine, the first question is always: “Can my computer even handle this?”

Hardware setups vary wildly. You might have a MacBook with 16GB of unified memory, a gaming PC with an RTX 4090 (24GB VRAM), or an older card with only 8GB. Every model has different VRAM and RAM requirements. Quantization versions add another layer of complexity. Manually calculating parameters, looking up quantization tables, and estimating memory usage is tedious and error-prone.

That’s exactly the problem llmfit solves. It’s a terminal tool that:

  • Detects your system hardware (RAM, CPU cores, GPU model and VRAM)
  • Consults a built-in database of 206 models (covering 57 providers)
  • Dynamically evaluates each model against your hardware
  • Ranks them by a composite score—telling you which will run perfectly, which are marginal, and which won’t run at all
  • Estimates inference speed (tokens per second) for your specific setup

It supports multi-GPU systems, Mixture-of-Experts (MoE) architectures, dynamic quantization selection, and works with or without a GPU. Whether you’re a hobbyist, researcher, or someone who just wants to experiment with local LLMs, llmfit takes the guesswork out of the equation.

Installation: Three Commands, All Platforms Covered

llmfit offers multiple installation paths. Choose the one that fits your workflow.

macOS / Linux One-Liner

curl -fsSL https://llmfit.axjns.dev/install.sh | sh

This script downloads the latest pre-built binary from GitHub and installs it to /usr/local/bin (if you have sudo) or ~/.local/bin. To force installation to your user directory without sudo:

curl -fsSL https://llmfit.axjns.dev/install.sh | sh -s -- --local

Using Homebrew (macOS / Linux)

brew tap AlexsJones/llmfit
brew install llmfit

Via Cargo (Windows / macOS / Linux)

If you already have the Rust toolchain installed:

cargo install llmfit

Don’t have Rust? Get it from rustup.

Build from Source

Want the absolute latest version or need to customize the build? Clone the repo and compile it yourself:

git clone https://github.com/AlexsJones/llmfit.git
cd llmfit
cargo build --release
# The binary will be at target/release/llmfit

How to Use llmfit

llmfit gives you two ways to interact with it: a full-screen interactive Terminal UI (TUI) that’s great for exploration, and a classic command-line interface (CLI) that’s perfect for scripts and quick checks.

TUI Mode: Explore Models Interactively

Just run llmfit with no arguments. You’ll see your system specs at the top: CPU, total RAM, GPU name, VRAM, and the detected acceleration backend (CUDA, Metal, ROCm, etc.).

Below that is a scrollable table listing every model in the database. Each row shows:

  • The model’s composite score (0–100)
  • Estimated speed in tokens/second
  • The best quantization for your hardware (e.g., Q4_K_M)
  • Run mode: GPU, MoE (expert offloading), CPU+GPU, or CPU
  • Memory usage (how much RAM/VRAM it will consume)
  • Use case: General, Coding, Reasoning, Chat, Multimodal, Embedding

You can navigate and filter the list using simple keyboard shortcuts.

  • Up / Down or j / k: Move selection up/down
  • /: Enter search mode (filter by name, provider, parameter count, use case)
  • Esc or Enter: Exit search mode
  • Ctrl-U: Clear the current search filter
  • f: Cycle fit filter (All, Runnable, Perfect, Good, Marginal)
  • s: Cycle sort column (Score, Params, Mem%, Context, Release Date, Use Case)
  • t: Cycle through color themes (your choice is saved automatically)
  • p: Open the provider filter popup
  • i: Toggle “installed-first” sorting (requires Ollama integration)
  • d: Download the selected model via Ollama
  • r: Refresh the list of installed models from Ollama
  • 1–9: Toggle visibility of specific model providers
  • Enter: Expand or collapse the detail view for the selected model
  • PgUp / PgDn: Scroll the list by 10 rows
  • g / G: Jump to the top / bottom of the list
  • q: Quit the application

Built-in Color Themes

Press t to cycle through six carefully designed themes. Your preference is saved to ~/.config/llmfit/theme and restored the next time you launch.

  • Default: The original llmfit color scheme
  • Dracula: Dark purple background with pastel accents
  • Solarized: Ethan Schoonover’s Solarized Dark palette
  • Nord: Arctic, cool blue-gray tones
  • Monokai: Monokai Pro warm syntax colors
  • Gruvbox: Retro groove palette with warm earth tones

CLI Mode: For Scripts and Quick Queries

Add the --cli flag or use any subcommand to get a plain table output that’s easy to read or pipe to other tools.

# Show all models ranked by fit (table format)
llmfit --cli

# Only perfectly fitting models, show the top 5
llmfit fit --perfect -n 5

# Display detected system hardware specs
llmfit system

# List all models in the database
llmfit list

# Search for models by name or size
llmfit search "llama 8b"

# Show detailed information about a specific model
llmfit info "Mistral-7B"

# Get the top 5 recommendations as JSON (great for agents or scripts)
llmfit recommend --json --limit 5

# Filter recommendations by use case (e.g., coding)
llmfit recommend --json --use-case coding --limit 3

Manually Override GPU Memory

GPU VRAM detection can fail in some environments—virtual machines, systems with broken nvidia-smi drivers, or passthrough setups. Use the --memory flag to set it manually.

# Override with 32 GB of VRAM
llmfit --memory=32G

# Megabytes also work (32000 MiB ≈ 31.25 GiB)
llmfit --memory=32000M

# Works with any subcommand
llmfit --memory=24G --cli
llmfit --memory=24G fit --perfect -n 5
llmfit --memory=24G info "Llama-3.1-70B"
llmfit --memory=24G recommend --json

Accepted suffixes: G/GB/GiB (gigabytes), M/MB/MiB (megabytes), T/TB/TiB (terabytes). Case-insensitive. If no GPU was detected, this flag creates a synthetic GPU entry, allowing models to be scored for GPU inference.
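The suffix handling described above can be sketched as a small parser. This is a hypothetical Python re-implementation for illustration (llmfit itself is written in Rust), and it assumes the binary (1024-based) interpretation of each unit, which may differ from llmfit’s exact convention:

```python
import re

# Assumed 1024-based multipliers for each accepted suffix.
UNITS = {"m": 1024**2, "mb": 1024**2, "mib": 1024**2,
         "g": 1024**3, "gb": 1024**3, "gib": 1024**3,
         "t": 1024**4, "tb": 1024**4, "tib": 1024**4}

def parse_memory(value: str) -> int:
    """Parse strings like '32G', '32000M', or '1TiB' into bytes (case-insensitive)."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([a-zA-Z]+)", value.strip())
    if not match:
        raise ValueError(f"unrecognized memory value: {value!r}")
    number, unit = match.groups()
    try:
        return int(float(number) * UNITS[unit.lower()])
    except KeyError:
        raise ValueError(f"unknown suffix: {unit!r}") from None
```

For example, `parse_memory("32G")` and `parse_memory("32gb")` resolve to the same byte count, matching the case-insensitive behavior of the flag.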

JSON Output for Automation

Add --json to any subcommand to get machine-readable output. This is perfect for integrating llmfit with other tools, dashboards, or AI agents.

llmfit --json system     # Hardware specs as JSON
llmfit --json fit -n 10  # Top 10 fits as JSON
llmfit recommend --json  # Top 5 recommendations (JSON is default for recommend)

How llmfit Works Under the Hood

llmfit’s core logic follows three steps: hardware detection → dynamic quantization selection → multi-dimensional scoring and ranking.

1. Hardware Detection: What Does Your Machine Have?

When you run llmfit, it first probes your system using the sysinfo library to get total and available RAM, and counts your CPU cores. Then it looks for GPUs:

  • NVIDIA GPUs: Calls nvidia-smi to get exact dedicated VRAM. Supports multi-GPU setups—it aggregates VRAM across all detected cards. If nvidia-smi fails, it attempts to estimate VRAM based on the GPU model name.
  • AMD GPUs: Detected via rocm-smi. VRAM reporting may be unavailable, so manual override is recommended if accuracy is critical.
  • Intel Arc (discrete): Reads VRAM from sysfs (/sys/class/drm/card?/device/mem_info_vram_total).
  • Intel integrated graphics: Identified via lspci. These use shared system memory, so VRAM is effectively your system RAM.
  • Apple Silicon: Runs system_profiler to get unified memory size. Here, VRAM = system RAM.

llmfit also identifies the acceleration backend: CUDA, Metal, ROCm, SYCL, or plain CPU (ARM/x86). This is crucial for speed estimation later.

If automatic detection gives you incorrect values, remember you can always override with --memory.

2. The Model Database: 206 Models from 57 Providers

llmfit ships with a curated list of models sourced from HuggingFace. It covers the most popular and useful models today:

  • Meta: Llama 2, Llama 3, Llama 3.1, Llama 3.2, Code Llama
  • Mistral: Mistral 7B, Mixtral 8x7B, Mixtral 8x22B
  • Qwen: Qwen2, Qwen2.5, Qwen2.5-Coder, Qwen2.5-VL
  • Google: Gemma, Gemma 2
  • Microsoft: Phi-2, Phi-3, Phi-3.5
  • DeepSeek: DeepSeek-V2, DeepSeek-V3, DeepSeek-R1
  • Others: IBM Granite, Allen Institute OLMo, xAI Grok, Cohere, BigCode, 01.ai, Upstage, TII Falcon, and more.

Each model record includes its parameter count, context length, architecture type (dense or MoE), use case category, and provider. The database is embedded in the binary at compile time via include_str!, so it’s always available.

The list is generated by scripts/scrape_hf_models.py, a standalone Python script that uses only the standard library (no pip dependencies). It queries the HuggingFace REST API and automatically detects MoE architectures by looking for config keys like num_local_experts and num_experts_per_tok.
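That MoE check reduces to inspecting a model’s config for the keys mentioned above. A minimal sketch (the trimmed config dicts are illustrative, not complete HuggingFace configs):

```python
def is_moe(config: dict) -> bool:
    """Detect a Mixture-of-Experts architecture from a HuggingFace config dict."""
    # Keys used by Mixtral-style configs; other MoE families may use different names.
    return any(key in config for key in ("num_local_experts", "num_experts_per_tok"))

# Example: a trimmed Mixtral-style config versus a dense Llama-style one.
mixtral_cfg = {"num_local_experts": 8, "num_experts_per_tok": 2, "hidden_size": 4096}
llama_cfg = {"hidden_size": 4096, "num_hidden_layers": 32}
```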

To refresh the model database (for example, after adding a new model):

# Automated update (recommended)
make update-models

# Or run the script directly
./scripts/update_models.sh

# Or manually
python3 scripts/scrape_hf_models.py
cargo build --release

The update script backs up your existing data, validates the new JSON, and rebuilds the binary.

3. Dynamic Quantization: From Q8_0 Down to Q2_K

Model parameters are just numbers. What really determines memory usage is the quantization level. llmfit doesn’t assume a fixed quantization. Instead, it dynamically tries to find the highest quality quantization that fits your hardware.

It walks down a hierarchy: starting with Q8_0 (best quality, largest size), then Q6_K, Q5_K, Q4_K, Q3_K, and finally Q2_K (most compressed, lowest quality). For each level, it checks:

  1. Can the model (at this quantization) fit entirely in available VRAM (or RAM, if no GPU)?
  2. If yes, that’s the chosen quantization.
  3. If no, it moves to the next lower quantization.
  4. If even Q2_K doesn’t fit, it tries again with half the context window.
  5. If still no fit, the model is marked “Too Tight” (unrunnable).

Special Handling for MoE (Mixture of Experts)

Models like Mixtral 8x7B or DeepSeek-V3 have a Mixture-of-Experts architecture. They have a large total parameter count, but only a subset of “experts” are active for each token. This means the actual VRAM required is much lower than the total parameters would suggest.

For example:

  • Mixtral 8x7B: Total parameters 46.7B → naïve estimate ~23.9 GB VRAM.
  • With expert offloading: Only ~12.9B parameters active per token → VRAM requirement drops to ~6.6 GB.

llmfit automatically detects MoE models and applies an expert offloading strategy: active experts stay in VRAM, inactive ones are swapped to system RAM. This makes many large MoE models runnable on hardware that would otherwise be insufficient.
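The Mixtral arithmetic above follows directly from counting only resident parameters. In this sketch, the ~4.1 effective bits per weight is an assumption chosen to reproduce the article’s numbers, not a figure taken from llmfit’s source:

```python
def moe_vram_gb(resident_params_b: float, bits_per_weight: float = 4.1) -> float:
    """VRAM needed for the parameters kept resident on the GPU."""
    return resident_params_b * bits_per_weight / 8

# Mixtral 8x7B: naive dense estimate vs. expert-offloading estimate.
naive = moe_vram_gb(46.7)    # all 46.7B parameters resident -> ~23.9 GB
offload = moe_vram_gb(12.9)  # only ~12.9B active parameters -> ~6.6 GB
```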

4. Multi-Dimensional Scoring: Quality, Speed, Fit, Context

Each model gets a score from 0–100 in four dimensions:

  • Quality: Parameter count, model family reputation, quantization penalty, and how well it matches the intended use case
  • Speed: Estimated tokens per second based on your backend, the model’s parameter count, and the chosen quantization
  • Fit: Memory utilization efficiency; the sweet spot is using 50–80% of available memory
  • Context: The model’s context window capability compared to what’s typical for its use case

These four scores are combined into a composite score using weighted averages. The weights vary depending on the model’s use case category:

  • General: quality 0.30, speed 0.30, fit 0.20, context 0.20
  • Chat: quality 0.25, speed 0.35, fit 0.20, context 0.20
  • Coding: quality 0.35, speed 0.30, fit 0.20, context 0.15
  • Reasoning: quality 0.55, speed 0.15, fit 0.15, context 0.15
  • Multimodal: quality 0.40, speed 0.25, fit 0.15, context 0.20
  • Embedding: quality 0.25, speed 0.25, fit 0.25, context 0.25
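The weighting is a plain weighted average over the four dimension scores. A sketch using the weights listed above:

```python
# Per-use-case weights for (quality, speed, fit, context), as listed above.
WEIGHTS = {
    "general":    (0.30, 0.30, 0.20, 0.20),
    "chat":       (0.25, 0.35, 0.20, 0.20),
    "coding":     (0.35, 0.30, 0.20, 0.15),
    "reasoning":  (0.55, 0.15, 0.15, 0.15),
    "multimodal": (0.40, 0.25, 0.15, 0.20),
    "embedding":  (0.25, 0.25, 0.25, 0.25),
}

def composite_score(use_case: str, quality: float, speed: float,
                    fit: float, context: float) -> float:
    """Weighted average of the four 0-100 dimension scores."""
    wq, ws, wf, wc = WEIGHTS[use_case]
    return wq * quality + ws * speed + wf * fit + wc * context
```

Note how a reasoning model with a perfect quality score but zero elsewhere still earns 55 points, reflecting how heavily that category favors quality over speed.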

Finally, models are ranked by composite score. Models that are unrunnable (“Too Tight”) are always placed at the bottom, below any runnable model.

How Speed Is Estimated

Estimated tokens per second is calculated using a backend-specific constant divided by the model’s parameter count (in billions), then multiplied by a quantization speed multiplier.

Estimated speed = (K / params_in_billions) × quantization_speed_multiplier

The backend constant K is based on empirical testing:

  • CUDA (NVIDIA): 220
  • Metal (Apple): 160
  • ROCm (AMD): 180
  • SYCL (Intel): 100
  • CPU (ARM): 90
  • CPU (x86): 70

Quantization speed multipliers increase as precision decreases (lower precision computes faster). Additionally, if the model uses CPU offloading (some layers on GPU, some on CPU) or runs entirely on CPU, the speed is multiplied by a penalty factor:

  • CPU offload: 0.5×
  • CPU only: 0.3×

MoE models get an additional 0.8× multiplier due to the overhead of switching between experts.
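Putting the formula, the backend constants, and the penalty factors together gives a sketch like the following. The quantization multipliers here are assumed illustrative values (faster as precision drops), since llmfit’s exact figures are internal:

```python
BACKEND_K = {"cuda": 220, "metal": 160, "rocm": 180,
             "sycl": 100, "cpu_arm": 90, "cpu_x86": 70}

# Assumed multipliers: lower precision computes faster.
QUANT_SPEEDUP = {"Q8_0": 0.9, "Q6_K": 1.0, "Q5_K": 1.05,
                 "Q4_K": 1.1, "Q3_K": 1.15, "Q2_K": 1.2}

def estimate_tps(backend: str, params_b: float, quant: str,
                 cpu_offload: bool = False, cpu_only: bool = False,
                 moe: bool = False) -> float:
    """tokens/sec = (K / params_in_billions) * quant multiplier * penalties."""
    tps = BACKEND_K[backend] / params_b * QUANT_SPEEDUP[quant]
    if cpu_only:
        tps *= 0.3      # everything runs in system RAM
    elif cpu_offload:
        tps *= 0.5      # layers split across GPU and CPU
    if moe:
        tps *= 0.8      # expert-switching overhead
    return tps
```

For example, a 7B model on CUDA at Q6_K estimates around 31 tokens/second, dropping to roughly a third of that if forced onto the CPU.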

Run Modes and Fit Levels

Based on memory usage, llmfit assigns each model a run mode:

  • GPU: The model fits entirely in VRAM. This is the fastest mode.
  • MoE: Mixture-of-Experts with expert offloading. Active experts are in VRAM, inactive ones in RAM. Good balance of speed and memory.
  • CPU+GPU: VRAM is insufficient, so the model is partially offloaded to system RAM. Slower than pure GPU, but still benefits from some acceleration.
  • CPU: No GPU available, or VRAM is far too small. The model runs entirely in system RAM. Slowest, but it runs.

And a fit level:

  • Perfect: The model meets recommended memory requirements and runs on GPU. This is the ideal.
  • Good: Fits with some headroom, or is in MoE/CPU+GPU mode at its best achievable performance.
  • Marginal: A tight fit, or CPU-only mode (CPU-only models are capped at Marginal, even if they fit perfectly).
  • Too Tight: Not enough VRAM or system RAM to run the model at any quantization or context size.
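The fit rules above can be restated as a small decision function. This is a simplified sketch under assumed inputs (llmfit’s real logic considers memory headroom in more detail):

```python
def fit_level(run_mode: str, meets_recommended: bool, tight: bool) -> str:
    """Map a run mode plus memory signals to a fit level (simplified)."""
    if run_mode == "none":
        return "Too Tight"
    if run_mode == "cpu":
        return "Marginal"  # CPU-only is capped at Marginal even when it fits
    if run_mode == "gpu" and meets_recommended:
        return "Perfect"
    return "Marginal" if tight else "Good"
```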

Seamless Integration with Ollama

If you use Ollama to manage and run local models, llmfit integrates with it automatically.

What You Get

  • Installation detection: llmfit queries Ollama’s API (GET /api/tags) at startup. Models you already have installed are marked with a green check mark (✓) in the “Inst” column of the TUI. The status bar shows Ollama: ✓ (N installed).
  • One-key download: Select a model in the TUI and press d. llmfit sends a pull request to Ollama (POST /api/pull). A progress indicator animates in real-time as the model downloads. Once complete, the model is immediately available for use.
  • Refresh: Press r to re-query Ollama and update the installed status.

If Ollama isn’t running, the d, i, and r keybindings are hidden from the status bar and disabled. The TUI still works perfectly—you just can’t see install status or pull models.

Remote Ollama Instances

By default, llmfit connects to http://localhost:11434. If you run Ollama on another machine or a different port, set the OLLAMA_HOST environment variable:

# Connect to Ollama on a specific IP and port
OLLAMA_HOST="http://192.168.1.100:11434" llmfit

# Works with CLI commands too
OLLAMA_HOST="http://ollama-server:666" llmfit --cli

This is useful when:

  • You have a powerful GPU server running Ollama, and you want to use llmfit on your laptop to select models.
  • Ollama runs in a Docker container with a custom port.
  • You’re using a reverse proxy or load balancer in front of Ollama.

Model Name Mapping

HuggingFace model IDs (e.g., Qwen/Qwen2.5-Coder-14B-Instruct) don’t always match Ollama’s tags (e.g., qwen2.5-coder:14b). llmfit maintains an internal mapping table to ensure accurate installation detection and pulls. For example, qwen2.5-coder:14b maps to the specialized Coder model, not the base qwen2.5:14b.

Platform and GPU Support Details

  • Linux (NVIDIA): nvidia-smi; exact dedicated VRAM, multi-GPU aggregated
  • Linux (AMD): rocm-smi; VRAM may be unknown, manual override recommended
  • Linux (Intel Arc discrete): sysfs; exact dedicated VRAM
  • Linux (Intel integrated): lspci; shared system memory
  • macOS (Apple Silicon): system_profiler; unified memory (VRAM = system RAM)
  • macOS (Intel): same as Linux (if an NVIDIA card is present)
  • Windows: nvidia-smi (if installed); exact dedicated VRAM

If autodetection fails or reports incorrect values, use the --memory flag to override.

Integrating with OpenClaw: Let an AI Agent Recommend Models

llmfit ships as a skill for OpenClaw, an open-source AI agent framework. This lets your OpenClaw agent recommend hardware-appropriate local models and even configure providers like Ollama, vLLM, or LM Studio for you.

Installing the Skill

# From the llmfit repository
./scripts/install-openclaw-skill.sh

# Or manually
cp -r skills/llmfit-advisor ~/.openclaw/skills/

Once installed, you can ask your OpenClaw agent things like:

  • “What local models can I run on this machine?”
  • “Recommend a coding model that fits my hardware.”
  • “Set up Ollama with the best models for my GPU.”

The agent will call llmfit recommend --json under the hood, interpret the results, and guide you through configuring your openclaw.json with optimal model choices.

For full details, see skills/llmfit-advisor/SKILL.md in the repository.

How to Contribute a New Model

If a model you care about isn’t in the database, contributions are welcome. Here’s the process:

  1. Edit the scraper script: Open scripts/scrape_hf_models.py. Add the model’s HuggingFace repo ID (e.g., meta-llama/Llama-3.1-8B) to the TARGET_MODELS list.
  2. Handle gated models: If the model requires login to access metadata, add a fallback entry to the FALLBACKS list with the parameter count and context length.
  3. Run the update:
    make update-models
    # or: ./scripts/update_models.sh
    

    This backs up your existing data, runs the scraper, validates the new JSON, and rebuilds the binary.

  4. Verify:
    ./target/release/llmfit list | grep "Your Model Name"
    
  5. Update MODELS.md (if you want the markdown list to reflect the change). There’s a generator script in the commit history.
  6. Open a pull request on GitHub.

Comparison with Other Tools

If you’re looking for a more hands-on, measurement-based approach, check out llm-checker. It’s a Node.js CLI tool that actually runs models on your machine via Ollama and measures real-world performance, rather than estimating from specs.

llm-checker’s strengths:

  • Gives you real speed and memory usage numbers.
  • Great for final validation after you’ve narrowed down candidates.

llm-checker’s limitations:

  • Requires Ollama to be installed and running.
  • Doesn’t support MoE architectures—it treats all models as dense, so memory estimates for models like Mixtral will be overestimated.

llmfit and llm-checker complement each other: llmfit is for quick, broad-spectrum filtering; llm-checker is for deep-dive testing on a shortlist.

Frequently Asked Questions

Q: Does llmfit work on Windows?

A: Yes. You can install it via Cargo, or run it in WSL. GPU detection requires nvidia-smi to be installed and in your PATH. If not, only CPU and RAM detection will work, but that’s often enough for evaluation.

Q: I have an AMD GPU, but rocm-smi doesn’t show VRAM. What do I do?

A: Use the --memory flag to manually specify your VRAM, e.g., llmfit --memory=16G. You can also run rocm-smi separately to check if your drivers are correctly installed.

Q: Why does llmfit mark a model as “Too Tight” when I think it should run?

A: llmfit tries quantization from Q8_0 down to Q2_K and checks full context first, then half context. If no combination fits, it’s “Too Tight.” You might have a different expectation of acceptable quantization or context length. If you’re sure it can run, you can try llm-checker for a real-world test.

Q: How do I update to the latest model database?

A: Run make update-models. This scrapes fresh data from HuggingFace and rebuilds the binary. If you just want to see new models without rebuilding, you can run the scraper manually (python3 scripts/scrape_hf_models.py) but the binary won’t change until you rebuild.

Q: Does llmfit send my hardware information anywhere?

A: No. All detection happens locally. The only network requests are to the Ollama API (if enabled) and they stay on your local machine (or the host you specify with OLLAMA_HOST). No data is uploaded to any remote server.

Q: Can I use llmfit without Ollama?

A: Absolutely. The TUI and CLI work perfectly with no Ollama dependency. You just won’t see installed status or be able to download models directly. The core functionality—hardware detection, model scoring, and ranking—doesn’t require Ollama at all.

Final Thoughts

From trial-and-error to informed decision-making, llmfit fills a crucial gap in the local LLM workflow. It’s more than just a hardware checker—it’s an intelligent recommendation engine that combines dynamic quantization, MoE awareness, and multi-dimensional scoring. Whether you’re an AI hobbyist, a researcher, or someone who wants to run LLMs locally for privacy and control, llmfit saves you hours of guesswork.

One command. No more guessing. Just the right model for your hardware.
