Cactus Compute: A Cross‑Platform SDK for Local AI Inference
How can mobile and desktop applications harness the power of large-scale AI models without sacrificing offline capability or draining device resources? Cactus Compute is a unified, open-source SDK that lets developers integrate local large language models (LLMs), visual-language models (VLMs), embedding generators, and text-to-speech (TTS) engines directly into Flutter, React Native, or native C/C++ apps. By supporting any GGUF-formatted model (such as Qwen, Gemma, Llama, or DeepSeek) and offering precision options from FP32 down to 2-bit quantization, Cactus Compute strikes a balance between performance and footprint. It also provides cloud fallback modes to seamlessly offload heavy tasks when necessary.
In this article, you will learn:
- Why local-first AI inference matters
- How to install and initialize Cactus Compute in Flutter, React Native, and C/C++
- Code examples for text completion, embeddings, and visual-language tasks
- Strategies for intelligent cloud fallback
- Real-world performance data across flagship devices
- Recommended models and community resources
- Common pitfalls, troubleshooting tips, and best practices
- How to contribute to the project and join the community
Whether you’re building a chat assistant, an on‑device image analyzer, or a voice‑driven interface, this guide will help you deliver responsive, reliable AI features—online or offline—while keeping your app’s size and battery usage in check.
Table of Contents
- Why Local AI Inference Matters
- Key Features of Cactus Compute
- Installation & Initialization
  - Flutter
  - React Native
  - Native C/C++
- Core Use Cases & Code Examples
  - Text Completion
  - Embedding Generation
  - Visual-Language Inference
- Cloud Fallback Strategies
- Performance Benchmarks
- Recommended Models & Resources
- Troubleshooting & FAQs
- How to Contribute
- Conclusion & Future Outlook
Why Local AI Inference Matters
Building AI‑powered apps traditionally meant sending data to remote servers, waiting for responses, and depending on a stable internet connection. While cloud inference offers access to massive models, it introduces latency, privacy concerns, and unpredictable costs. Mobile and edge scenarios—from AR assistants to offline translators—demand:
- Low latency: Immediate feedback without round-trip network delays.
- Offline functionality: Core features remain available without internet.
- Privacy: User data never leaves the device unless explicitly allowed.
- Cost control: No per-request cloud billing.
Cactus Compute addresses these needs by enabling on-device inference for models of various sizes and precisions. By supporting quantization (reducing model weights to fewer bits), it minimizes memory and compute requirements; for example, a 1-billion-parameter model that needs roughly 4 GB of weights at FP32 shrinks to well under 1 GB at 4-bit precision. Developers still have the freedom to offload heavy tasks to the cloud via built-in fallback modes, ensuring reliability under all conditions.
This local‑first approach empowers applications to deliver AI experiences that feel native, snappy, and secure. From a product perspective, it unlocks new use cases—offline chatbots, personal voice assistants, on‑device content understanding—that were previously infeasible or cost‑prohibitive.
Key Features of Cactus Compute
Cactus Compute combines a rich feature set tailored for cross‑platform AI development:
- Cross-Platform Support
  Native SDKs for Flutter and React Native cover iOS and Android. A pure C/C++ backend ensures compatibility with any platform that supports C or C++, including desktop, embedded Linux, and IoT devices.

- Universal GGUF Model Loading
  Any model exported in the GGUF format can be loaded. This includes:
  - LLMs (e.g., Qwen, Llama series)
  - Visual-Language Models (VLMs) (e.g., vision-enabled chat)
  - Embedding Models (for semantic search or downstream classification)
  - TTS Engines (for speech output)

- Flexible Quantization
  Choose precision from full FP32 down to 2-bit. Lower-bit quantization reduces model size and inference cost at the expense of minor accuracy trade-offs. Typical mobile devices handle Q4 (4-bit) models at dozens of tokens per second.

- Intelligent Fallback Modes
  Four built-in modes let you control when inference happens locally vs. in the cloud:
  - local: always on-device
  - localfirst: try local, then cloud on failure
  - remotefirst: cloud first, then local if needed
  - remote: always cloud

- MCP Tool-Calls
  Preconfigured templates for common tasks (e.g., setting reminders, image search, auto-reply) simplify integration and consistency across features.

- Jinja2 Template Engine
  Construct dynamic chat templates with Jinja2, enabling conditional logic, loops, and context variables in prompts (a short template sketch follows this list).

- Rich Documentation & Examples
  Official guides and sample apps for Flutter, React Native, and C++ help you get started in minutes. An active Discord community offers real-time support.
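To make the Jinja2 point concrete, here is a minimal sketch of a chat template stored as a string. The variable names (`messages`, `role`, `content`, `add_generation_prompt`) follow common Hugging Face chat-template conventions and are assumptions here, not a Cactus-specific schema; the control tokens are placeholders for whatever your model expects.

```typescript
// A minimal Jinja2-style chat template kept as a string constant.
// It loops over the conversation, branches on the speaker role, and
// optionally appends an assistant header to prompt the next reply.
const chatTemplate = `
{%- for message in messages -%}
  {%- if message.role == 'system' -%}
<|system|>{{ message.content }}</s>
  {%- elif message.role == 'user' -%}
<|user|>{{ message.content }}</s>
  {%- else -%}
<|assistant|>{{ message.content }}</s>
  {%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt -%}
<|assistant|>
{%- endif -%}
`;
```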
Installation & Initialization
Flutter
1. Add the package:

flutter pub add cactus

2. Initialize in code:

import 'package:cactus/cactus.dart';
import 'package:flutter/widgets.dart';

Future<void> main() async {
  // Ensure Flutter bindings are initialized
  WidgetsFlutterBinding.ensureInitialized();

  // Initialize the LLM
  final lm = await CactusLM.init(
    modelUrl: 'huggingface/gguf/your-model.gguf',
    contextSize: 2048, // Maximum token context length
  );

  // Ready to use lm.completion() or lm.embedding()
}
React Native
1. Install the package:

npm install cactus-react-native && npx pod-install

2. Initialize in JavaScript/TypeScript:

import { CactusLM } from 'cactus-react-native';

async function initializeModel() {
  const { lm, error } = await CactusLM.init({
    model: '/path/to/your-model.gguf',
    n_ctx: 2048,
  });
  if (error) {
    console.error('Failed to initialize model:', error);
  } else {
    // Use lm.completion() or lm.embedding()
  }
}

initializeModel();
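In practice the .gguf file has to be on the device before `CactusLM.init` can load it. Below is a minimal sketch of downloading a model into persistent storage on first launch; it uses the third-party react-native-fs package (not part of Cactus), and the model URL is a placeholder you would replace with your own.

```typescript
import RNFS from 'react-native-fs';
import { CactusLM } from 'cactus-react-native';

// Download the GGUF file once into the app's document directory, then reuse it.
async function ensureModel(modelUrl: string, fileName: string): Promise<string> {
  const localPath = `${RNFS.DocumentDirectoryPath}/${fileName}`;
  if (!(await RNFS.exists(localPath))) {
    // downloadFile returns { jobId, promise }; awaiting the promise completes the transfer.
    await RNFS.downloadFile({ fromUrl: modelUrl, toFile: localPath }).promise;
  }
  return localPath;
}

async function initFromDownload() {
  // Placeholder URL: point this at the GGUF model you actually ship.
  const path = await ensureModel('https://example.com/your-model.gguf', 'your-model.gguf');
  const { lm, error } = await CactusLM.init({ model: path, n_ctx: 2048 });
  if (error) {
    console.error('Failed to initialize model:', error);
    return null;
  }
  return lm;
}
```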
Native C/C++
1. Clone the repo and build:

git clone https://github.com/cactus-compute/cactus.git
cd cactus
chmod +x scripts/*.sh
cd cpp
./build.sh  # Compiles libraries and example executables

2. Run examples:

# Language model demo
./cactus_llm

# Visual-language model demo
./cactus_vlm

# Embedding demo
./cactus_embed

# Text-to-speech demo
./cactus_tts
Core Use Cases & Code Examples
Below are step‑by‑step snippets showing how to perform common AI tasks on‑device.
Text Completion
Flutter
final messages = [
ChatMessage(role: 'user', content: 'Hello, world!'),
];
final response = await lm.completion(
messages,
maxTokens: 100,
temperature: 0.7,
);
print('Reply: ${response.choices.first.text}');
React Native
const messages = [{ role: 'user', content: 'What is the weather today?' }];
const params = { n_predict: 100, temperature: 0.7 };
const response = await lm.completion(messages, params);
console.log('AI says:', response.choices[0].text);
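A chat assistant is mostly this same call in a loop, with the conversation history carried forward. Here is a minimal sketch reusing the message shape and parameters from the example above; the `CactusLM` type annotation is assumed to be exported by the package.

```typescript
import { CactusLM } from 'cactus-react-native';

type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string };

// Keep the full history so the model sees earlier turns as context.
const history: ChatMessage[] = [];

async function ask(lm: CactusLM, userText: string): Promise<string> {
  history.push({ role: 'user', content: userText });
  // Same completion call as above: the message list plus generation parameters.
  const response = await lm.completion(history, { n_predict: 100, temperature: 0.7 });
  const reply = response.choices[0].text;
  history.push({ role: 'assistant', content: reply });
  return reply;
}
```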
Embedding Generation
Generate vector representations for downstream tasks like semantic search.
Flutter
final embeddingResult = await lm.embedding('Your text here');
print('Vector length: ${embeddingResult.embeddings.length}');
React Native
const text = 'Sample text to embed';
const result = await lm.embedding(text, { normalize: true });
console.log('Embedding vector:', result.embeddings);
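Embeddings become useful once you compare them. Here is a minimal sketch of semantic search scoring with cosine similarity; it assumes `result.embeddings` is a flat numeric array, as in the example above, and uses `CactusLM` only as a type annotation.

```typescript
import { CactusLM } from 'cactus-react-native';

// Cosine similarity between two equal-length embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank candidate documents against a query by embedding both sides.
async function search(lm: CactusLM, query: string, docs: string[]) {
  const queryVec = (await lm.embedding(query, { normalize: true })).embeddings;
  const scored: { doc: string; score: number }[] = [];
  for (const doc of docs) {
    const docVec = (await lm.embedding(doc, { normalize: true })).embeddings;
    scored.push({ doc, score: cosineSimilarity(queryVec, docVec) });
  }
  return scored.sort((x, y) => y.score - x.score);
}
```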
Visual‑Language Inference
Run image‑aware prompts by supplying local image paths.
Flutter
final vlm = await CactusVLM.init(
modelUrl: 'huggingface/gguf/vision-model.gguf',
mmprojUrl: 'huggingface/gguf/mmproj.gguf',
);
final resp = await vlm.completion(
[ChatMessage(role: 'user', content: 'Describe this image')],
imagePaths: ['/absolute/path/to/photo.jpg'],
maxTokens: 200,
);
print('Description: ${resp.choices.first.text}');
Cloud Fallback Strategies
Even with efficient on‑device inference, some tasks or larger models may require cloud resources. Cactus Compute’s fallback modes let you fine‑tune where processing occurs:
- local: All inferences are on-device. Ensures privacy and offline support.
- localfirst: Attempts on-device, then falls back to the cloud if there's an error or resource constraint.
- remotefirst: Sends inference requests to the cloud by default, using local only as a backup. Useful when device performance is unpredictable.
- remote: Forces all inferences to the cloud, which is ideal for very large models or minimal app size.
Flutter Example
final embed = await lm.embedding(
'Fallback mode test',
mode: 'localfirst',
);
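If you want explicit control rather than the built-in modes, the localfirst idea can also be expressed directly in application code: try on-device inference first and call a remote endpoint only on failure. This is a minimal sketch; `callCloudCompletion` and its endpoint are hypothetical placeholders for your own backend, and only the `lm.completion` call comes from the earlier examples.

```typescript
import { CactusLM } from 'cactus-react-native';

type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string };

// Hypothetical cloud helper -- swap in your own hosted inference endpoint.
async function callCloudCompletion(messages: ChatMessage[]): Promise<string> {
  const res = await fetch('https://example.com/v1/chat/completions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages }),
  });
  const data = await res.json();
  return data.choices[0].text;
}

// "localfirst" by hand: on-device inference first, cloud only if it fails.
async function completeWithFallback(lm: CactusLM, messages: ChatMessage[]): Promise<string> {
  try {
    const response = await lm.completion(messages, { n_predict: 100, temperature: 0.7 });
    return response.choices[0].text;
  } catch (err) {
    console.warn('Local inference failed, falling back to cloud:', err);
    return callCloudCompletion(messages);
  }
}
```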
Performance Benchmarks
The table below shows inference speeds (tokens/sec) for Gemma3 1B Q4 and Qwen3 4B Q4 models on popular flagship devices:
| Device                   | Gemma3 1B Q4 (tokens/s) | Qwen3 4B Q4 (tokens/s) |
|--------------------------|-------------------------|------------------------|
| iPhone 16 Pro Max        | 54                      | 18                     |
| iPhone 16 Pro            | 54                      | 18                     |
| iPhone 16                | 49                      | 16                     |
| iPhone 15 Pro Max        | 45                      | 15                     |
| OnePlus 13 5G            | 43                      | 14                     |
| Samsung Galaxy S24 Ultra | 42                      | 14                     |
| Galaxy S25 Ultra         | 29                      | 9                      |
| Xiaomi Poco F6 5G        | 22                      | 6                      |
Quantized to Q4 precision, these models offer real‑time responsiveness on modern devices, making them suitable for chatbots, voice assistants, and interactive agents.
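To translate the table into user-perceived latency, divide the expected reply length by the device's decode speed. This ignores prompt-processing time, so treat the result as a lower bound.

```typescript
// Rough generation-time estimate: reply length divided by decode throughput.
function estimateLatencySeconds(replyTokens: number, tokensPerSecond: number): number {
  return replyTokens / tokensPerSecond;
}

// Using figures from the table: a 100-token reply from Gemma3 1B Q4.
console.log(estimateLatencySeconds(100, 54).toFixed(1)); // ~1.9 s on iPhone 16 Pro Max
console.log(estimateLatencySeconds(100, 22).toFixed(1)); // ~4.5 s on Xiaomi Poco F6 5G
```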
Recommended Models & Resources
- Official HuggingFace Repository
  Browse curated models on the Cactus Compute organization page:
  https://huggingface.co/Cactus-Compute?sort_models=alphabetical#models

- Discord Community
  Join fellow developers and AI enthusiasts for Q&A, feature requests, and optimization tips:
  https://discord.gg/bNurx3AXTJ

- Documentation & Examples
  - Flutter: github.com/cactus-compute/cactus/blob/main/flutter
  - React Native: github.com/cactus-compute/cactus/blob/main/react
  - C++: github.com/cactus-compute/cactus/blob/main/cpp
Troubleshooting & FAQs
- Model loading is slow
  Verify the model file path and integrity. Copy the model to persistent storage before loading.

- Out-of-Memory (OOM) errors
  Use lower-bit quantization (e.g., Q4 or Q2), or switch to remotefirst mode for large models.

- Dependency conflicts
  Ensure your Flutter or React Native versions match the examples'. Clear caches (flutter clean, npm cache clean).

- Custom feature extension
  Leverage MCP tool-calls and Jinja2 templates to build reminders, search modules, and more without reinventing prompt logic.
How to Contribute
Contributions help the entire community build better AI apps. To get started:
1. Fork the repository and create a feature branch:

git checkout -b feature/your-feature

2. Implement your changes, write tests if applicable, and ensure existing examples still run.

3. Open a pull request with a clear description of your enhancement or fix.
Before large features, please open an issue to discuss scope and avoid duplicate efforts.
Conclusion & Future Outlook
Cactus Compute bridges the gap between powerful AI models and resource‑constrained devices. By offering cross‑platform SDKs, flexible quantization, intelligent fallback modes, and a growing model ecosystem, it enables developers to build responsive, private, and reliable AI experiences—on mobile, desktop, or embedded platforms. As device hardware continues to improve and model innovations emerge, on‑device AI will become ever more capable. We invite you to explore Cactus Compute in your next project, contribute improvements, and help shape the future of local AI inference.
Join a global community of developers bringing AI to every application.