Osaurus: A Feather-Light, Apple-Silicon-Only LLM Server That Runs Rings Around Ollama
Last updated: 26 Aug 2025
If you own an Apple-silicon Mac and want a truly local, offline chatbot that weighs less than a PDF, let me introduce Osaurus: a 7 MB, open-source, Swift-native LLM server built on Apple’s MLX framework. It claims to be 20 % faster than Ollama, speaks the OpenAI REST API fluently, and runs entirely on your laptop without a single cloud call.
Below you’ll find everything you need—no fluff, no hype—to decide whether Osaurus deserves a spot in your toolkit.
Table of contents
- What exactly is Osaurus?
- Compatibility checklist: will it run on my Mac?
- Installation in three minutes (drag-and-drop or source)
- First launch: a walk-through of the model manager
- Talking to the server: curl, Python, and JavaScript examples
- Benchmarks: time-to-first-token, throughput, and reliability
- Power-user features: KV-cache reuse, tool calling, chat templates
- FAQ: storage paths, Intel Macs, Whisper, and more
- When to choose Osaurus—and when not to
1. What exactly is Osaurus?
In plain English, Osaurus is a tiny, native macOS app that turns your M-series Mac into an OpenAI-compatible inference endpoint. Think of it as Ollama’s younger, lighter cousin that only works on Apple Silicon and only runs locally.
Key differences at a glance
Aspect | Osaurus | Ollama |
---|---|---|
Core engine | Apple MLX | llama.cpp |
Binary size | ~7 MB | ~200 MB |
Supported hardware | Apple Silicon only (M1, M2, M3…) | Cross-platform |
API surface | OpenAI `/v1/models` and `/v1/chat/completions` | Same |
Streaming | Server-Sent Events | Same |
Tool calls | OpenAI-style `tools` + `tool_choice` | Experimental |
You’ll notice no Windows build, no Intel build, and no Docker image—Osaurus sticks to Apple’s ecosystem like glue, trading reach for raw speed.
2. Compatibility checklist: will it run on my Mac?
Ask yourself three quick questions:
Question | Required answer |
---|---|
macOS version? | 15.5 or newer |
Chip? | Any Apple Silicon (M1, M2, M3, Pro, Max, Ultra) |
Xcode? | Only if you compile from source: Xcode 16.4+ |
If any of those requirements isn’t met, stop here: Osaurus won’t run. Otherwise, read on.
3. Installation in three minutes
Option A: ready-made DMG (no coding)
- Visit the GitHub Releases page.
- Download `Osaurus.dmg`.
- Drag the app into `/Applications`.
- First launch? macOS will warn you the developer is “unverified.” Go to System Settings → Privacy & Security → Allow and you’re done.
Option B: build from source (developers)
git clone https://github.com/dinoki-ai/osaurus.git
cd osaurus
open osaurus.xcodeproj
Choose the `osaurus` target, press ⌘R, and Xcode compiles the SwiftUI frontend plus the embedded SwiftNIO server in one shot.
4. First launch: a walk-through of the model manager
The UI is deliberately minimal:
UI element | Purpose |
---|---|
Toggle switch | Start or stop the HTTP server (default port 8080) |
Gear icon | Change port or set a custom model directory |
Model list | Browse curated models from Hugging Face `mlx-community`; shows size and download progress |
Resource chart | Real-time CPU and RAM usage |
Default storage: `~/Documents/MLXModels`.
Override: launch with `export OSU_MODELS_DIR=/Volumes/SSD/mlx`.
5. Talking to the server: curl, Python, and JavaScript examples
Osaurus exposes three endpoints you already know:
Endpoint | Method | Description |
---|---|---|
`/v1/models` | GET | List available models |
`/v1/chat/completions` | POST | Chat (stream or non-stream) |
`/health` | GET | JSON health check |
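Before wiring up a client, you can poll `/health` to confirm the server is running. A minimal check from Python (a sketch; the exact fields in the JSON response aren’t documented here):

```python
import json
from urllib.request import urlopen

# Osaurus listens on 127.0.0.1:8080 by default.
with urlopen("http://127.0.0.1:8080/health", timeout=5) as resp:
    print(json.load(resp))  # whatever JSON the server reports
```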
5.1 List models with curl
curl -s http://127.0.0.1:8080/v1/models | jq
Sample JSON:
{
"object": "list",
"data": [
{
"id": "llama-3.2-3b-instruct-4bit",
"object": "model",
"created": 1724588800,
"owned_by": "mlx-community"
}
]
}
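Because `/v1/models` follows the OpenAI schema, the official Python SDK can list the same models; the `api_key` value is arbitrary since everything stays local:

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="osaurus")

# Print the id of every model Osaurus has available locally.
for model in client.models.list():
    print(model.id)
```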
5.2 Non-streaming chat
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b-instruct-4bit",
"messages": [{"role": "user", "content": "Write a haiku about dinosaurs"}],
"max_tokens": 200
}'
Response:
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"choices": [
{
"message": {
"role": "assistant",
"content": "Ancient thunder lizard,\nshadows stretch across time,\nfossils whisper tales."
},
"finish_reason": "stop"
}
]
}
5.3 Streaming chat (SSE)
curl -N http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b-instruct-4bit",
"messages": [{"role": "user", "content": "Explain quantum tunneling in one paragraph"}],
"stream": true
}'
The terminal will print deltas exactly like ChatGPT’s web UI.
5.4 Python snippet with the official OpenAI SDK
from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="osaurus")
response = client.chat.completions.create(
model="llama-3.2-3b-instruct-4bit",
messages=[{"role": "user", "content": "Hello from Python!"}],
)
print(response.choices[0].message.content)
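Streaming works through the same SDK: pass `stream=True` and iterate over the chunks as they arrive, mirroring the `curl -N` example above:

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="osaurus")

stream = client.chat.completions.create(
    model="llama-3.2-3b-instruct-4bit",
    messages=[{"role": "user", "content": "Stream a limerick about fossils"}],
    stream=True,
)

# Each chunk carries a small delta; print tokens as soon as they appear.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```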
5.5 JavaScript (browser or Node)
const response = await fetch("http://127.0.0.1:8080/v1/chat/completions", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
model: "llama-3.2-3b-instruct-4bit",
messages: [{ role: "user", content: "Hello from JavaScript!" }],
}),
});
const data = await response.json();
console.log(data.choices[0].message.content);
6. Benchmarks: time-to-first-token, throughput, and reliability
The maintainers ran 20 iterations on the same Apple-silicon hardware (exact model not disclosed) and averaged the results:
Server | Model | TTFT (ms) | Total (ms) | Chars/sec | Success |
---|---|---|---|---|---|
Osaurus | llama-3.2-3b-instruct-4bit | 191 | 1 461 | 521 | 100 % |
Ollama | llama3.2 | 59 | 1 667 | 439 | 100 % |
LM Studio | llama-3.2-3b-instruct | 56 | 1 205 | 605 | 100 % |
What it means for you:
- TTFT (time-to-first-token) is noticeably higher on Osaurus (191 ms vs ~56–59 ms), likely because MLX warms up its memory mapping more slowly than llama.cpp.
- Once generation starts, Osaurus delivers more characters per second than Ollama (521 vs 439), though LM Studio is faster still at 605.
- All three servers achieved 100 % success over the test suite, so stability is not a differentiator.
7. Power-user features
7.1 KV-cache reuse via session_id
Multi-turn conversations can be accelerated by reusing the KV cache: just keep the same `session_id` across requests.
Example:
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b-instruct-4bit",
"session_id": "trip-planning",
"messages": [
{"role": "user", "content": "Suggest a two-day itinerary in Kyoto"}
]
}'
Rules:
- The cache is opportunistically reused only if no other request is using that `session_id`.
- There is no manual eviction; unused sessions expire automatically after a short idle window.
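From Python, the Osaurus-specific `session_id` field can ride along via the OpenAI SDK’s `extra_body` parameter. A sketch of a two-turn session (the itinerary prompts are just examples):

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="osaurus")

def ask(messages):
    # extra_body merges additional fields into the request JSON,
    # which is how the non-standard session_id gets sent.
    reply = client.chat.completions.create(
        model="llama-3.2-3b-instruct-4bit",
        messages=messages,
        extra_body={"session_id": "trip-planning"},
    )
    return reply.choices[0].message.content

history = [{"role": "user", "content": "Suggest a two-day itinerary in Kyoto"}]
first = ask(history)

# The second turn reuses the same session_id, so the shared prefix can hit the KV cache.
history += [
    {"role": "assistant", "content": first},
    {"role": "user", "content": "Swap day two for Nara instead"},
]
print(ask(history))
```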
7.2 Tool / function calling (OpenAI-compatible)
Osaurus supports the exact JSON schema that OpenAI clients expect.
Step 1: declare tools
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather by city name",
"parameters": {
"type": "object",
"properties": { "city": { "type": "string" } },
"required": ["city"]
}
}
}
]
Step 2: send request
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b-instruct-4bit",
"messages": [
{"role": "system", "content": "You can call functions to answer concisely."},
{"role": "user", "content": "What is the weather in Tokyo?"}
],
"tools": [...],
"tool_choice": "auto"
}'
Step 3: execute the returned tool call on your side, then continue the conversation with a `role: tool` message.
Osaurus gracefully handles malformed JSON, code fences, and extra whitespace, so you don’t need to sanitize model output.
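Put together, the round trip looks roughly like this in Python (a sketch that assumes the model actually returns a tool call; `get_weather` and its stubbed result are placeholders for your real function):

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="osaurus")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather by city name",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [
    {"role": "system", "content": "You can call functions to answer concisely."},
    {"role": "user", "content": "What is the weather in Tokyo?"},
]

# Step 2: the model decides whether to call the tool.
first = client.chat.completions.create(
    model="llama-3.2-3b-instruct-4bit",
    messages=messages,
    tools=tools,
    tool_choice="auto",
)

tool_call = first.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)

# Step 3: run the tool yourself (stubbed here), then feed the result back
# as a role "tool" message tied to the original call id.
weather = {"city": args["city"], "forecast": "sunny", "temp_c": 24}
messages += [
    first.choices[0].message,
    {"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(weather)},
]

final = client.chat.completions.create(
    model="llama-3.2-3b-instruct-4bit",
    messages=messages,
    tools=tools,
)
print(final.choices[0].message.content)
```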
7.3 Chat templates (Jinja)
When a model’s `tokenizer_config.json` contains a `chat_template`, Osaurus renders it with:
- the `messages` array
- `add_generation_prompt: true`
- `bos_token` and `eos_token`, if defined
If the template is missing or fails to render, a simple transcript fallback is used:
User: ...
Assistant: ...
System messages are automatically prepended so you never need to manually format prompts.
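To see what a rendered template looks like, you can apply the same `chat_template` outside Osaurus with Hugging Face’s `transformers` (an illustration of the Jinja mechanics only; the repo id is just an example of an MLX-community model):

```python
from transformers import AutoTokenizer

# Any model repo that ships a chat_template in tokenizer_config.json works here.
tok = AutoTokenizer.from_pretrained("mlx-community/Llama-3.2-3B-Instruct-4bit")

prompt = tok.apply_chat_template(
    [
        {"role": "system", "content": "You are terse."},
        {"role": "user", "content": "Name one Jurassic dinosaur."},
    ],
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt)  # the exact string the model is conditioned on
```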
8. FAQ: quick answers to common questions
Q: Where are the models stored?
A: Default is `~/Documents/MLXModels`. Override with the environment variable `OSU_MODELS_DIR`.
Q: Can I run this on an Intel Mac?
A: No. Osaurus relies on Apple’s MLX, which only supports Apple Silicon.
Q: Is there Whisper support?
A: Not yet. The `/transcribe` endpoints are placeholders for a future release.
Q: How do I expose the server to my LAN?
A: Bind address is hard-coded to 127.0.0.1. Place a reverse proxy (e.g., Nginx) in front if you need external access.
Q: Does it work with LangChain or LlamaIndex?
A: Yes—any library that speaks the OpenAI REST API works out of the box. Just point the `base_url` to `http://127.0.0.1:8080/v1`.
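For instance, with LangChain’s OpenAI chat wrapper you only need to override `base_url` (a sketch assuming the `langchain-openai` package is installed):

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="llama-3.2-3b-instruct-4bit",
    base_url="http://127.0.0.1:8080/v1",
    api_key="osaurus",  # ignored locally, but the client requires a value
)

print(llm.invoke("Name three dinosaurs that lived in what is now Asia.").content)
```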
9. When to choose Osaurus—and when not to
Your situation | Recommendation |
---|---|
Apple Silicon Mac only and you want the smallest footprint | Use Osaurus |
Need Windows, Linux, or Intel Mac support | Stick with Ollama |
Curious about Apple MLX performance | Benchmark Osaurus against your current stack |
Need enterprise-grade multi-user hosting | Wait for future releases or use cloud services |
Osaurus is not trying to be everything for everyone. It is a laser-focused tool that squeezes every last drop of performance out of Apple Silicon while staying compatible with the OpenAI ecosystem you already know.
If that matches your hardware and your goals, download the latest release and give it a spin—you might find that 7 MB is all you need.