
Osaurus vs Ollama: The Ultimate Apple Silicon LLM Server Showdown

Osaurus: A Feather-Light, Apple-Silicon-Only LLM Server That Runs Rings Around Ollama

Last updated: 26 Aug 2025

If you own an Apple-silicon Mac and want a truly local, offline chatbot that weighs less than a PDF, let me introduce Osaurus: a 7 MB, open-source, Swift-native LLM server built on Apple’s MLX framework. It claims to be 20 % faster than Ollama, speaks the OpenAI REST API fluently, and runs entirely on your laptop without a single cloud call.

Below you’ll find everything you need—no fluff, no hype—to decide whether Osaurus deserves a spot in your toolkit.


Table of contents

  1. What exactly is Osaurus?
  2. Compatibility checklist: will it run on my Mac?
  3. Installation in three minutes (drag-and-drop or source)
  4. First launch: a walk-through of the model manager
  5. Talking to the server: curl, Python, and JavaScript examples
  6. Benchmarks: time-to-first-token, throughput, and reliability
  7. Power-user features: KV-cache reuse, tool calling, chat templates
  8. FAQ: storage paths, Intel Macs, Whisper, and more
  9. When to choose Osaurus—and when not to

1. What exactly is Osaurus?

In plain English, Osaurus is a tiny, native macOS app that turns your M-series Mac into an OpenAI-compatible inference endpoint. Think of it as Ollama’s younger, lighter cousin that only works on Apple Silicon and only runs locally.

Key differences at a glance

| Aspect | Osaurus | Ollama |
| --- | --- | --- |
| Core engine | Apple MLX | llama.cpp |
| Binary size | ~7 MB | ~200 MB |
| Supported hardware | Apple Silicon only (M1, M2, M3…) | Cross-platform |
| API surface | OpenAI /v1/models and /v1/chat/completions | Same |
| Streaming | Server-Sent Events | Same |
| Tool calls | OpenAI-style tools + tool_choice | Experimental |

You’ll notice no Windows build, no Intel build, and no Docker image—Osaurus sticks to Apple’s ecosystem like glue, trading reach for raw speed.


2. Compatibility checklist: will it run on my Mac?

Ask yourself three quick questions:

| Question | Required answer |
| --- | --- |
| macOS version? | 15.5 or newer |
| Chip? | Any Apple Silicon (M1, M2, M3, Pro, Max, Ultra) |
| Xcode? | Only if you compile from source: Xcode 16.4+ |

If any answer is “no,” stop here—Osaurus won’t work. Otherwise, read on.


3. Installation in three minutes

Option A: ready-made DMG (no coding)

  1. Visit the GitHub Releases page.
  2. Download Osaurus.dmg.
  3. Drag the app into /Applications.
  4. On first launch, macOS warns that the app comes from an unverified developer. Go to System Settings → Privacy & Security, click Open Anyway, and you’re done.

Option B: build from source (developers)

git clone https://github.com/dinoki-ai/osaurus.git
cd osaurus
open osaurus.xcodeproj

Choose the osaurus target, press ⌘R, and Xcode compiles the SwiftUI frontend plus the embedded SwiftNIO server in one shot.


4. First launch: a walk-through of the model manager

The UI is deliberately minimal:

| UI element | Purpose |
| --- | --- |
| Toggle switch | Start or stop the HTTP server (default port 8080) |
| Gear icon | Change the port or set a custom model directory |
| Model list | Browse curated models from Hugging Face mlx-community; shows size and download progress |
| Resource chart | Real-time CPU and RAM usage |

Default storage: ~/Documents/MLXModels.
Override: set the environment variable before launching, e.g. export OSU_MODELS_DIR=/Volumes/SSD/mlx.


5. Talking to the server: curl, Python, and JavaScript examples

Osaurus exposes three endpoints you already know:

| Endpoint | Method | Description |
| --- | --- | --- |
| /v1/models | GET | List available models |
| /v1/chat/completions | POST | Chat (streaming or non-streaming) |
| /health | GET | JSON health check |
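Before wiring anything up, you can poll /health to confirm the server is listening. Here is a minimal Python sketch (assuming the requests package is installed; the exact shape of the health payload isn’t documented here, so it is simply printed):

import requests

# Ping the local Osaurus server; assumes the default port 8080.
resp = requests.get("http://127.0.0.1:8080/health", timeout=5)
resp.raise_for_status()   # a non-2xx status means the server is not healthy
print(resp.json())        # Osaurus returns a JSON health payload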

5.1 List models with curl

curl -s http://127.0.0.1:8080/v1/models | jq

Sample JSON:

{
  "object": "list",
  "data": [
    {
      "id": "llama-3.2-3b-instruct-4bit",
      "object": "model",
      "created": 1724588800,
      "owned_by": "mlx-community"
    }
  ]
}
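The same listing works through the official OpenAI Python SDK; a quick sketch (the api_key value is a placeholder, matching the Python example in section 5.4):

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="osaurus")

# Iterate the model list returned by /v1/models.
for model in client.models.list():
    print(model.id)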

5.2 Non-streaming chat

curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-3.2-3b-instruct-4bit",
        "messages": [{"role": "user", "content": "Write a haiku about dinosaurs"}],
        "max_tokens": 200
      }'

Response:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "Ancient thunder lizard,\nshadows stretch across time,\nfossils whisper tales."
      },
      "finish_reason": "stop"
    }
  ]
}

5.3 Streaming chat (SSE)

curl -N http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-3.2-3b-instruct-4bit",
        "messages": [{"role": "user", "content": "Explain quantum tunneling in one paragraph"}],
        "stream": true
      }'

The terminal will print deltas exactly like ChatGPT’s web UI.

5.4 Python snippet with the official OpenAI SDK

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="osaurus")

response = client.chat.completions.create(
    model="llama-3.2-3b-instruct-4bit",
    messages=[{"role": "user", "content": "Hello from Python!"}],
)
print(response.choices[0].message.content)
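Streaming works through the same SDK by passing stream=True; a minimal sketch that prints tokens as they arrive:

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="osaurus")

stream = client.chat.completions.create(
    model="llama-3.2-3b-instruct-4bit",
    messages=[{"role": "user", "content": "Explain quantum tunneling in one paragraph"}],
    stream=True,
)
for chunk in stream:
    # Each SSE chunk carries a delta; content can be empty on the final chunk.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()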

5.5 JavaScript (browser or Node)

const response = await fetch("http://127.0.0.1:8080/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "llama-3.2-3b-instruct-4bit",
    messages: [{ role: "user", content: "Hello from JavaScript!" }],
  }),
});
const data = await response.json();
console.log(data.choices[0].message.content);

6. Benchmarks: time-to-first-token, throughput, and reliability

The maintainers ran 20 iterations on the same Apple-silicon hardware (exact model not disclosed) and averaged the results:

| Server | Model | TTFT (ms) | Total (ms) | Chars/sec | Success |
| --- | --- | --- | --- | --- | --- |
| Osaurus | llama-3.2-3b-instruct-4bit | 191 | 1,461 | 521 | 100 % |
| Ollama | llama3.2 | 59 | 1,667 | 439 | 100 % |
| LM Studio | llama-3.2-3b-instruct | 56 | 1,205 | 605 | 100 % |

What it means for you:

  • TTFT (time-to-first-token) is noticeably higher on Osaurus (191 ms vs 59 ms for Ollama), likely because MLX’s memory mapping warms up more slowly than llama.cpp.
  • Throughput (chars/sec) is higher on Osaurus than on Ollama once generation starts (521 vs 439), though LM Studio tops the table at 605.
  • All three servers achieved 100 % success over the test suite, so stability is not a differentiator.

7. Power-user features

7.1 KV-cache reuse via session_id

Multi-turn conversations can be accelerated by reusing the KV cache. Just keep the same session_id across requests.

Example:

curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-3.2-3b-instruct-4bit",
        "session_id": "trip-planning",
        "messages": [
          {"role": "user", "content": "Suggest a two-day itinerary in Kyoto"}
        ]
      }'

Rules:

  • The cache is opportunistically reused only if no other request is using that session_id.
  • There is no manual eviction; unused sessions expire automatically after a short idle window.
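From Python, the non-standard session_id field can ride along via the OpenAI SDK’s extra_body parameter, which merges extra keys into the request JSON; a sketch:

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="osaurus")

# Keep the same session_id across turns so Osaurus can reuse the KV cache.
response = client.chat.completions.create(
    model="llama-3.2-3b-instruct-4bit",
    messages=[{"role": "user", "content": "Suggest a two-day itinerary in Kyoto"}],
    extra_body={"session_id": "trip-planning"},
)
print(response.choices[0].message.content)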

7.2 Tool / function calling (OpenAI-compatible)

Osaurus supports the exact JSON schema that OpenAI clients expect.

Step 1: declare tools

"tools": [
  {
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get current weather by city name",
      "parameters": {
        "type": "object",
        "properties": { "city": { "type": "string" } },
        "required": ["city"]
      }
    }
  }
]

Step 2: send request

curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-3.2-3b-instruct-4bit",
        "messages": [
          {"role": "system", "content": "You can call functions to answer concisely."},
          {"role": "user", "content": "What is the weather in Tokyo?"}
        ],
        "tools": [...],
        "tool_choice": "auto"
      }'

Step 3: execute the returned tool call on your side, then continue the conversation with a role: tool message.

Osaurus gracefully handles malformed JSON, code fences, and extra whitespace, so you don’t need to sanitize model output.
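Put together in Python, the round trip might look like the sketch below; the weather result is a stand-in you would replace with a real lookup, and everything else follows the standard OpenAI tool-calling flow:

import json
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="osaurus")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather by city name",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [
    {"role": "system", "content": "You can call functions to answer concisely."},
    {"role": "user", "content": "What is the weather in Tokyo?"},
]

# Step 2: let the model decide whether to call the tool.
first = client.chat.completions.create(
    model="llama-3.2-3b-instruct-4bit",
    messages=messages,
    tools=tools,
    tool_choice="auto",
)
call = first.choices[0].message.tool_calls[0]  # assumes the model chose to call get_weather
args = json.loads(call.function.arguments)

# Step 3: execute the tool yourself (placeholder result here), then continue the conversation.
result = {"city": args["city"], "temp_c": 21}
messages.append(first.choices[0].message)
messages.append({"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)})

final = client.chat.completions.create(
    model="llama-3.2-3b-instruct-4bit",
    messages=messages,
)
print(final.choices[0].message.content)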

7.3 Chat templates (Jinja)

When a model’s tokenizer_config.json contains a chat_template, Osaurus renders it with:

  • messages array
  • add_generation_prompt: true
  • bos_token and eos_token if defined

If the template is missing or fails to render, a simple transcript fallback is used:

User: ...
Assistant: ...

System messages are automatically prepended so you never need to manually format prompts.
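If you want to see what a given model’s chat_template actually produces, the Hugging Face tokenizer can render it the same way; a sketch for inspection only (the model id is an example, and Osaurus performs this step internally):

from transformers import AutoTokenizer

# Example model id; any repo whose tokenizer_config.json ships a chat_template works.
tok = AutoTokenizer.from_pretrained("mlx-community/Llama-3.2-3B-Instruct-4bit")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Write a haiku about dinosaurs"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)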


8. FAQ: quick answers to common questions

Q: Where are the models stored?
A: Default is ~/Documents/MLXModels. Override with the environment variable OSU_MODELS_DIR.

Q: Can I run this on an Intel Mac?
A: No. Osaurus relies on Apple’s MLX, which only supports Apple Silicon.

Q: Is there Whisper support?
A: Not yet. The /transcribe endpoints are placeholders for a future release.

Q: How do I expose the server to my LAN?
A: Bind address is hard-coded to 127.0.0.1. Place a reverse proxy (e.g., Nginx) in front if you need external access.

Q: Does it work with LangChain or LlamaIndex?
A: Yes—any library that speaks the OpenAI REST API works out of the box. Just point the base_url to http://127.0.0.1:8080/v1.
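For example, LangChain’s ChatOpenAI wrapper only needs base_url pointed at the local server; a minimal sketch, assuming the langchain-openai package is installed:

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="llama-3.2-3b-instruct-4bit",
    base_url="http://127.0.0.1:8080/v1",
    api_key="osaurus",  # placeholder, matching the Python example earlier in the article
)
print(llm.invoke("Hello from LangChain!").content)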


9. When to choose Osaurus—and when not to

| Your situation | Recommendation |
| --- | --- |
| Apple Silicon Mac only and you want the smallest footprint | Use Osaurus |
| Need Windows, Linux, or Intel Mac support | Stick with Ollama |
| Curious about Apple MLX performance | Benchmark Osaurus against your current stack |
| Need enterprise-grade multi-user hosting | Wait for future releases or use cloud services |

Osaurus is not trying to be everything for everyone. It is a laser-focused tool that squeezes every last drop of performance out of Apple Silicon while staying compatible with the OpenAI ecosystem you already know.

If that matches your hardware and your goals, download the latest release and give it a spin—you might find that 7 MB is all you need.
