
Osaurus vs Ollama: The Ultimate Apple Silicon LLM Server Showdown

Osaurus: A Feather-Light, Apple-Silicon-Only LLM Server That Runs Rings Around Ollama

Last updated: 26 Aug 2025

If you own an Apple-silicon Mac and want a truly local, offline chatbot that weighs less than a PDF, let me introduce Osaurus: a 7 MB, open-source, Swift-native LLM server built on Apple’s MLX framework. It claims to be 20 % faster than Ollama, speaks the OpenAI REST API fluently, and runs entirely on your laptop without a single cloud call.

Below you’ll find everything you need—no fluff, no hype—to decide whether Osaurus deserves a spot in your toolkit.


Table of contents

  1. What exactly is Osaurus?
  2. Compatibility checklist: will it run on my Mac?
  3. Installation in three minutes (drag-and-drop or source)
  4. First launch: a walk-through of the model manager
  5. Talking to the server: curl, Python, and JavaScript examples
  6. Benchmarks: time-to-first-token, throughput, and reliability
  7. Power-user features: KV-cache reuse, tool calling, chat templates
  8. FAQ: storage paths, Intel Macs, Whisper, and more
  9. When to choose Osaurus—and when not to

1. What exactly is Osaurus?

In plain English, Osaurus is a tiny, native macOS app that turns your M-series Mac into an OpenAI-compatible inference endpoint. Think of it as Ollama’s younger, lighter cousin that only works on Apple Silicon and only runs locally.

Key differences at a glance

| Aspect | Osaurus | Ollama |
| --- | --- | --- |
| Core engine | Apple MLX | llama.cpp |
| Binary size | ~7 MB | ~200 MB |
| Supported hardware | Apple Silicon only (M1, M2, M3…) | Cross-platform |
| API surface | OpenAI /v1/models and /v1/chat/completions | Same |
| Streaming | Server-Sent Events | Same |
| Tool calls | OpenAI-style tools + tool_choice | Experimental |

You’ll notice no Windows build, no Intel build, and no Docker image—Osaurus sticks to Apple’s ecosystem like glue, trading reach for raw speed.


2. Compatibility checklist: will it run on my Mac?

Ask yourself three quick questions:

| Question | Required answer |
| --- | --- |
| macOS version? | 15.5 or newer |
| Chip? | Any Apple Silicon (M1, M2, M3, Pro, Max, Ultra) |
| Xcode? | Only if you compile from source: Xcode 16.4+ |

If any answer is “no,” stop here—Osaurus won’t work. Otherwise, read on.


3. Installation in three minutes

Option A: ready-made DMG (no coding)

  1. Visit the GitHub Releases page.
  2. Download Osaurus.dmg.
  3. Drag the app into /Applications.
  4. On first launch, macOS warns that the app comes from an unverified developer. Go to System Settings → Privacy & Security, click Open Anyway, and you’re done.

Option B: build from source (developers)

git clone https://github.com/dinoki-ai/osaurus.git
cd osaurus
open osaurus.xcodeproj

Choose the osaurus target, press ⌘R, and Xcode compiles the SwiftUI frontend plus the embedded SwiftNIO server in one shot.


4. First launch: a walk-through of the model manager

The UI is deliberately minimal:

| UI element | Purpose |
| --- | --- |
| Toggle switch | Start or stop the HTTP server (default port 8080) |
| Gear icon | Change the port or set a custom model directory |
| Model list | Browse curated models from Hugging Face mlx-community; shows size and download progress |
| Resource chart | Real-time CPU and RAM usage |

Default storage: ~/Documents/MLXModels.
Override: set the environment variable before launching, e.g. export OSU_MODELS_DIR=/Volumes/SSD/mlx.


5. Talking to the server: curl, Python, and JavaScript examples

Osaurus exposes three endpoints you already know:

| Endpoint | Method | Description |
| --- | --- | --- |
| /v1/models | GET | List available models |
| /v1/chat/completions | POST | Chat (streaming or non-streaming) |
| /health | GET | JSON health check |
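Before wiring anything up, you can poll /health to confirm the server is listening. Here is a minimal Python sketch (assuming the requests package is installed; the exact shape of the health payload isn’t documented here, so it is simply printed):

import requests

# Ping the local Osaurus server; assumes the default port 8080.
resp = requests.get("http://127.0.0.1:8080/health", timeout=5)
resp.raise_for_status()   # a non-2xx status means the server is not healthy
print(resp.json())        # Osaurus returns a JSON health payload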

5.1 List models with curl

curl -s http://127.0.0.1:8080/v1/models | jq

Sample JSON:

{
  "object": "list",
  "data": [
    {
      "id": "llama-3.2-3b-instruct-4bit",
      "object": "model",
      "created": 1724588800,
      "owned_by": "mlx-community"
    }
  ]
}
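The same listing works through the official OpenAI Python SDK; a quick sketch (the api_key value is a placeholder, matching the Python example in section 5.4):

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="osaurus")

# Iterate the model list returned by /v1/models.
for model in client.models.list():
    print(model.id)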

5.2 Non-streaming chat

curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-3.2-3b-instruct-4bit",
        "messages": [{"role": "user", "content": "Write a haiku about dinosaurs"}],
        "max_tokens": 200
      }'

Response:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "Ancient thunder lizard,\nshadows stretch across time,\nfossils whisper tales."
      },
      "finish_reason": "stop"
    }
  ]
}

5.3 Streaming chat (SSE)

curl -N http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-3.2-3b-instruct-4bit",
        "messages": [{"role": "user", "content": "Explain quantum tunneling in one paragraph"}],
        "stream": true
      }'

The terminal will print deltas exactly like ChatGPT’s web UI.

5.4 Python snippet with the official OpenAI SDK

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="osaurus")

response = client.chat.completions.create(
    model="llama-3.2-3b-instruct-4bit",
    messages=[{"role": "user", "content": "Hello from Python!"}],
)
print(response.choices[0].message.content)
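Streaming works through the same SDK by passing stream=True; a minimal sketch that prints tokens as they arrive:

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="osaurus")

stream = client.chat.completions.create(
    model="llama-3.2-3b-instruct-4bit",
    messages=[{"role": "user", "content": "Explain quantum tunneling in one paragraph"}],
    stream=True,
)
for chunk in stream:
    # Each SSE chunk carries a delta; content can be empty on the final chunk.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()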

5.5 JavaScript (browser or Node)

const response = await fetch("http://127.0.0.1:8080/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "llama-3.2-3b-instruct-4bit",
    messages: [{ role: "user", content: "Hello from JavaScript!" }],
  }),
});
const data = await response.json();
console.log(data.choices[0].message.content);

6. Benchmarks: time-to-first-token, throughput, and reliability

The maintainers ran 20 iterations on the same Apple-silicon hardware (exact model not disclosed) and averaged the results:

| Server | Model | TTFT (ms) | Total (ms) | Chars/sec | Success |
| --- | --- | --- | --- | --- | --- |
| Osaurus | llama-3.2-3b-instruct-4bit | 191 | 1,461 | 521 | 100 % |
| Ollama | llama3.2 | 59 | 1,667 | 439 | 100 % |
| LM Studio | llama-3.2-3b-instruct | 56 | 1,205 | 605 | 100 % |

What it means for you:

  • TTFT (time-to-first-token) is noticeably higher on Osaurus (191 ms vs 59 ms for Ollama), likely because MLX’s memory mapping warms up more slowly than llama.cpp.
  • Throughput (chars/sec) is higher on Osaurus than on Ollama once generation starts (521 vs 439), though LM Studio tops the table at 605.
  • All three servers achieved 100 % success over the test suite, so stability is not a differentiator.

7. Power-user features

7.1 KV-cache reuse via session_id

Multi-turn conversations can be accelerated by reusing the KV cache. Just keep the same session_id across requests.

Example:

curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-3.2-3b-instruct-4bit",
        "session_id": "trip-planning",
        "messages": [
          {"role": "user", "content": "Suggest a two-day itinerary in Kyoto"}
        ]
      }'

Rules:

  • The cache is opportunistically reused only if no other request is using that session_id.
  • There is no manual eviction; unused sessions expire automatically after a short idle window.
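From Python, the non-standard session_id field can ride along via the OpenAI SDK’s extra_body parameter, which merges extra keys into the request JSON; a sketch:

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="osaurus")

# Keep the same session_id across turns so Osaurus can reuse the KV cache.
response = client.chat.completions.create(
    model="llama-3.2-3b-instruct-4bit",
    messages=[{"role": "user", "content": "Suggest a two-day itinerary in Kyoto"}],
    extra_body={"session_id": "trip-planning"},
)
print(response.choices[0].message.content)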

7.2 Tool / function calling (OpenAI-compatible)

Osaurus supports the exact JSON schema that OpenAI clients expect.

Step 1: declare tools

"tools": [
  {
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get current weather by city name",
      "parameters": {
        "type": "object",
        "properties": { "city": { "type": "string" } },
        "required": ["city"]
      }
    }
  }
]

Step 2: send request

curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-3.2-3b-instruct-4bit",
        "messages": [
          {"role": "system", "content": "You can call functions to answer concisely."},
          {"role": "user", "content": "What is the weather in Tokyo?"}
        ],
        "tools": [...],
        "tool_choice": "auto"
      }'

Step 3: execute the returned tool call on your side, then continue the conversation with a role: tool message.

Osaurus gracefully handles malformed JSON, code fences, and extra whitespace, so you don’t need to sanitize model output.
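Put together in Python, the round trip might look like the sketch below; the weather result is a stand-in you would replace with a real lookup, and everything else follows the standard OpenAI tool-calling flow:

import json
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="osaurus")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather by city name",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [
    {"role": "system", "content": "You can call functions to answer concisely."},
    {"role": "user", "content": "What is the weather in Tokyo?"},
]

# Step 2: let the model decide whether to call the tool.
first = client.chat.completions.create(
    model="llama-3.2-3b-instruct-4bit",
    messages=messages,
    tools=tools,
    tool_choice="auto",
)
call = first.choices[0].message.tool_calls[0]  # assumes the model chose to call get_weather
args = json.loads(call.function.arguments)

# Step 3: execute the tool yourself (placeholder result here), then continue the conversation.
result = {"city": args["city"], "temp_c": 21}
messages.append(first.choices[0].message)
messages.append({"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)})

final = client.chat.completions.create(
    model="llama-3.2-3b-instruct-4bit",
    messages=messages,
)
print(final.choices[0].message.content)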

7.3 Chat templates (Jinja)

When a model’s tokenizer_config.json contains a chat_template, Osaurus renders it with:

  • messages array
  • add_generation_prompt: true
  • bos_token and eos_token if defined

If the template is missing or fails to render, a simple transcript fallback is used:

User: ...
Assistant: ...

System messages are automatically prepended so you never need to manually format prompts.
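If you want to see what a given model’s chat_template actually produces, the Hugging Face tokenizer can render it the same way; a sketch for inspection only (the model id is an example, and Osaurus performs this step internally):

from transformers import AutoTokenizer

# Example model id; any repo whose tokenizer_config.json ships a chat_template works.
tok = AutoTokenizer.from_pretrained("mlx-community/Llama-3.2-3B-Instruct-4bit")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Write a haiku about dinosaurs"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)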


8. FAQ: quick answers to common questions

Q: Where are the models stored?
A: Default is ~/Documents/MLXModels. Override with the environment variable OSU_MODELS_DIR.

Q: Can I run this on an Intel Mac?
A: No. Osaurus relies on Apple’s MLX, which only supports Apple Silicon.

Q: Is there Whisper support?
A: Not yet. The /transcribe endpoints are placeholders for a future release.

Q: How do I expose the server to my LAN?
A: Bind address is hard-coded to 127.0.0.1. Place a reverse proxy (e.g., Nginx) in front if you need external access.

Q: Does it work with LangChain or LlamaIndex?
A: Yes—any library that speaks the OpenAI REST API works out of the box. Just point the base_url to http://127.0.0.1:8080/v1.
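For example, LangChain’s ChatOpenAI wrapper only needs base_url pointed at the local server; a minimal sketch, assuming the langchain-openai package is installed:

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="llama-3.2-3b-instruct-4bit",
    base_url="http://127.0.0.1:8080/v1",
    api_key="osaurus",  # placeholder, matching the Python example earlier in the article
)
print(llm.invoke("Hello from LangChain!").content)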


9. When to choose Osaurus—and when not to

| Your situation | Recommendation |
| --- | --- |
| Apple Silicon Mac only and you want the smallest footprint | Use Osaurus |
| Need Windows, Linux, or Intel Mac support | Stick with Ollama |
| Curious about Apple MLX performance | Benchmark Osaurus against your current stack |
| Need enterprise-grade multi-user hosting | Wait for future releases or use cloud services |

Osaurus is not trying to be everything for everyone. It is a laser-focused tool that squeezes every last drop of performance out of Apple Silicon while staying compatible with the OpenAI ecosystem you already know.

If that matches your hardware and your goals, download the latest release and give it a spin—you might find that 7 MB is all you need.
