Cross‑Platform Development

Cactus Compute: A Cross‑Platform SDK for Local AI Inference

How can mobile and desktop applications harness the power of large‑scale AI models without sacrificing offline capability or draining device resources? Cactus Compute is a unified, open‑source SDK that lets developers integrate local Large Language Models (LLMs), Visual‑Language Models (VLMs), Embedding Models, and Text‑to‑Speech (TTS) engines directly into Flutter, React Native, or native C/C++ apps. It supports any GGUF‑formatted model (such as Qwen, Gemma, Llama, or DeepSeek) and offers precision options from FP32 down to 2‑bit quantization, striking a balance between performance and footprint. It also provides cloud fallback modes that seamlessly offload heavy tasks when necessary.

In this article, you will learn:

  • Why local‑first AI inference matters
  • How to install and initialize Cactus Compute in Flutter, React Native, and C/C++
  • Code examples for text completion, embeddings, and visual‑language tasks
  • Strategies for intelligent cloud fallback
  • Real‑world performance data across flagship devices
  • Recommended models and community resources
  • Common pitfalls, troubleshooting tips, and best practices
  • How to contribute to the project and join the community

Whether you’re building a chat assistant, an on‑device image analyzer, or a voice‑driven interface, this guide will help you deliver responsive, reliable AI features—online or offline—while keeping your app’s size and battery usage in check.


Table of Contents

  1. Why Local AI Inference Matters
  2. Key Features of Cactus Compute
  3. Installation & Initialization
    • Flutter
    • React Native
    • Native C/C++
  4. Core Use Cases & Code Examples
    • Text Completion
    • Embedding Generation
    • Visual-Language Inference
  5. Cloud Fallback Strategies
  6. Performance Benchmarks
  7. Recommended Models & Resources
  8. Troubleshooting & FAQs
  9. How to Contribute
  10. Conclusion & Future Outlook


Why Local AI Inference Matters

Building AI‑powered apps traditionally meant sending data to remote servers, waiting for responses, and depending on a stable internet connection. While cloud inference offers access to massive models, it introduces latency, privacy concerns, and unpredictable costs. Mobile and edge scenarios—from AR assistants to offline translators—demand:

  • Low latency: Immediate feedback without round‑trip network delays.
  • Offline functionality: Core features remain available without internet.
  • Privacy: User data never leaves the device unless explicitly allowed.
  • Cost control: No per‑request cloud billing.

Cactus Compute addresses these needs by enabling on‑device inference for models of various sizes and precisions. By supporting quantization (reducing model weights to fewer bits), it minimizes memory and compute requirements. Developers still have the freedom to offload to the cloud for heavy tasks via built‑in fallback modes, ensuring reliability under all conditions.

This local‑first approach empowers applications to deliver AI experiences that feel native, snappy, and secure. From a product perspective, it unlocks new use cases—offline chatbots, personal voice assistants, on‑device content understanding—that were previously infeasible or cost‑prohibitive.


Key Features of Cactus Compute

Cactus Compute combines a rich feature set tailored for cross‑platform AI development:

  • Cross‑Platform Support
    Native SDKs for Flutter and React Native cover iOS and Android. A pure C/C++ backend ensures compatibility with any platform that supports C or C++, including desktop, embedded Linux, and IoT devices.

  • Universal GGUF Model Loading
    Any model exported in the GGUF format can be loaded. This includes:

    • LLMs (e.g., Qwen, Llama series)
    • Visual‑Language Models (VLMs) (e.g., vision‑enabled chat)
    • Embedding Models (for semantic search or downstream classification)
    • TTS Engines (for speech output)
  • Flexible Quantization
    Choose precision from full FP32 down to 2-bit. Lower-bit quantization reduces model size and inference cost at the expense of minor accuracy trade-offs; for example, a 1-billion-parameter model that needs roughly 4 GB of weights in FP32 drops to well under 1 GB at Q4. Typical mobile devices handle Q4 (4-bit) models at dozens of tokens per second.

  • Intelligent Fallback Modes
    Four built‑in modes let you control when inference happens locally vs. in the cloud:

    • local: Always on‑device
    • localfirst: Try local, then cloud on failure
    • remotefirst: Cloud first, then local if needed
    • remote: Always cloud
  • MCP Tool‑Calls
    Preconfigured templates for common tasks (e.g., setting reminders, image search, auto‑reply) simplify integration and consistency across features.

  • Jinja2 Template Engine
    Construct dynamic chat templates with Jinja2, enabling conditional logic, loops, and context variables in prompts (an illustrative template follows this feature list).

  • Rich Documentation & Examples
    Official guides and sample apps for Flutter, React Native, and C++ help you get started in minutes. An active Discord community offers real‑time support.
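
To make the template idea concrete, here is a generic ChatML-style Jinja2 chat template, embedded in a Dart raw string purely for illustration. This is a sketch, not the template Cactus ships: the correct template for a given model comes with that model (typically in its GGUF metadata).

// Illustrative ChatML-style Jinja2 chat template (not Cactus-specific).
// `messages` is the conversation history; `add_generation_prompt` asks the
// model to open an assistant turn.
const chatMlTemplate = r'''
{% for message in messages %}<|im_start|>{{ message['role'] }}
{{ message['content'] }}<|im_end|>
{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant
{% endif %}''';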


Installation & Initialization

Flutter

  1. Add the package

    flutter pub add cactus
    
  2. Initialize in code

    import 'package:flutter/widgets.dart';
    import 'package:cactus/cactus.dart';
    
    Future<void> main() async {
      // Ensure Flutter bindings are initialized
      WidgetsFlutterBinding.ensureInitialized();
    
      // Initialize the LLM
      final lm = await CactusLM.init(
        modelUrl: 'huggingface/gguf/your-model.gguf',
        contextSize: 2048, // Maximum token context length
      );
    
      // Ready to use lm.completion() or lm.embedding()
    }
    

React Native

  1. Install the package

    npm install cactus-react-native && npx pod-install
    
  2. Initialize in JavaScript/TypeScript

    import { CactusLM } from 'cactus-react-native';
    
    async function initializeModel() {
      const { lm, error } = await CactusLM.init({
        model: '/path/to/your-model.gguf',
        n_ctx: 2048,
      });
    
      if (error) {
        console.error('Failed to initialize model:', error);
      } else {
        // Use lm.completion() or lm.embedding()
      }
    }
    
    initializeModel();
    

Native C/C++

  1. Clone the repo and build

    git clone https://github.com/cactus-compute/cactus.git
    cd cactus
    chmod +x scripts/*.sh
    cd cpp
    ./build.sh  # Compiles libraries and example executables
    
  2. Run examples

    # Language model demo
    ./cactus_llm
    
    # Visual-language model demo
    ./cactus_vlm
    
    # Embedding demo
    ./cactus_embed
    
    # Text-to-speech demo
    ./cactus_tts
    

Core Use Cases & Code Examples

Below are step‑by‑step snippets showing how to perform common AI tasks on‑device.

Text Completion

Flutter

final messages = [
  ChatMessage(role: 'user', content: 'Hello, world!'),
];
final response = await lm.completion(
  messages,
  maxTokens: 100,
  temperature: 0.7,
);
print('Reply: ${response.choices.first.text}');
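
For multi-turn conversations, keep appending both the user and assistant messages to the list you pass into completion(), so the model sees the prior turns. The sketch below reuses the field names from the snippet above; the exact response shape may vary between SDK versions, so treat it as illustrative.

// Minimal multi-turn loop (sketch): the model only sees what is in `history`.
final history = <ChatMessage>[
  ChatMessage(role: 'system', content: 'You are a concise assistant.'),
];

Future<String> ask(String userText) async {
  history.add(ChatMessage(role: 'user', content: userText));
  final res = await lm.completion(history, maxTokens: 150, temperature: 0.7);
  final reply = res.choices.first.text;
  // Store the assistant turn so the next call has full context.
  history.add(ChatMessage(role: 'assistant', content: reply));
  return reply;
}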

React Native

const messages = [{ role: 'user', content: 'What is the weather today?' }];
const params = { n_predict: 100, temperature: 0.7 };
const response = await lm.completion(messages, params);
console.log('AI says:', response.choices[0].text);

Embedding Generation

Generate vector representations for downstream tasks like semantic search.

Flutter

final embeddingResult = await lm.embedding('Your text here');
print('Vector length: ${embeddingResult.embeddings.length}');
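
Once you have vectors, semantic search reduces to comparing them. The helper below is plain Dart and uses no Cactus-specific API beyond the embedding call shown above; it assumes the embeddings field is a plain list of numbers, as in that snippet.

import 'dart:math';

/// Cosine similarity between two embedding vectors of equal length.
double cosineSimilarity(List<double> a, List<double> b) {
  var dot = 0.0, normA = 0.0, normB = 0.0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (sqrt(normA) * sqrt(normB));
}

// Usage sketch: compare a query with a candidate document.
// final query = (await lm.embedding('battery saving tips')).embeddings;
// final doc = (await lm.embedding('How to reduce power drain')).embeddings;
// print(cosineSimilarity(query, doc)); // values near 1.0 mean "more similar"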

React Native

const text = 'Sample text to embed';
const result = await lm.embedding(text, { normalize: true });
console.log('Embedding vector:', result.embeddings);

Visual‑Language Inference

Run image‑aware prompts by supplying local image paths.

Flutter

final vlm = await CactusVLM.init(
  modelUrl: 'huggingface/gguf/vision-model.gguf',
  mmprojUrl: 'huggingface/gguf/mmproj.gguf',
);

final resp = await vlm.completion(
  [ChatMessage(role: 'user', content: 'Describe this image')],
  imagePaths: ['/absolute/path/to/photo.jpg'],
  maxTokens: 200,
);
print('Description: ${resp.choices.first.text}');

Cloud Fallback Strategies

Even with efficient on‑device inference, some tasks or larger models may require cloud resources. Cactus Compute’s fallback modes let you fine‑tune where processing occurs:

  • local: All inferences are on‑device. Ensures privacy and offline support.
  • localfirst: Attempts on‑device, then falls back to cloud if there’s an error or resource constraint.
  • remotefirst: Sends inference requests to the cloud by default, using local only as a backup. Useful when device performance is unpredictable.
  • remote: Forces all inferences to the cloud, which is ideal for very large models or minimal app size.

Flutter Example

final embed = await lm.embedding(
  'Fallback mode test',
  mode: 'localfirst',
);
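
Because the mode is a per-call parameter, you can also choose it at runtime. In the sketch below, preferOffline stands in for a hypothetical app setting; the mode strings are the four listed above.

// Sketch: pick a fallback mode per request.
final mode = preferOffline ? 'local' : 'localfirst';

try {
  final result = await lm.embedding('Runtime mode selection', mode: mode);
  print('Vector length: ${result.embeddings.length}');
} catch (e) {
  // With 'local', a failure never reaches the cloud; surface it to the user.
  print('Embedding failed: $e');
}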

Performance Benchmarks

Device Performance

The table below shows inference speeds (tokens/sec) for Gemma3 1B Q4 and Qwen3 4B Q4 models on popular flagship devices:

Device                        Gemma3 1B Q4    Qwen3 4B Q4
iPhone 16 Pro Max             54              18
iPhone 16 Pro                 54              18
iPhone 16                     49              16
iPhone 15 Pro Max             45              15
OnePlus 13 5G                 43              14
Samsung Galaxy S24 Ultra      42              14
Galaxy S25 Ultra              29              9
Xiaomi Poco F6 5G             22              6

Quantized to Q4 precision, these models offer real‑time responsiveness on modern devices, making them suitable for chatbots, voice assistants, and interactive agents.


Recommended Models & Resources

  • Official HuggingFace Repository
    Browse curated models on the Cactus Compute organization page:
    https://huggingface.co/Cactus-Compute?sort_models=alphabetical#models

  • Discord Community
    Join fellow developers and AI enthusiasts for Q&A, feature requests, and optimization tips:
    https://discord.gg/bNurx3AXTJ

  • Documentation & Examples

    • Flutter: github.com/cactus-compute/cactus/blob/main/flutter
    • React Native: github.com/cactus-compute/cactus/blob/main/react
    • C++: github.com/cactus-compute/cactus/blob/main/cpp

Troubleshooting & FAQs

  1. Model loading is slow

    • Verify the model file path and integrity. Copy the model to persistent storage before loading (a copy-to-storage sketch follows this list).
  2. Out‑of‑Memory (OOM) errors

    • Use lower‑bit quantization (e.g., Q4 or Q2). Switch to remotefirst mode for large models.
  3. Dependency conflicts

    • Ensure your Flutter or React Native versions match those used in the examples. Clear caches (flutter clean, npm cache clean --force).
  4. Custom feature extension

    • Leverage MCP tool‑calls and Jinja2 templates to build reminders, search modules, and more without reinventing prompt logic.
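
For point 1 above, a common pattern is to copy the model into the app's documents directory once and then pass that absolute path to init. Below is a minimal Flutter sketch; it assumes the path_provider package has been added and the model is bundled as an app asset (the asset path is hypothetical).

import 'dart:io';
import 'package:flutter/services.dart' show rootBundle;
import 'package:path_provider/path_provider.dart';

/// Copies a bundled model into persistent storage (once) and returns its path.
Future<String> ensureLocalModel() async {
  final docs = await getApplicationDocumentsDirectory();
  final target = File('${docs.path}/your-model.gguf');
  if (!await target.exists()) {
    final data = await rootBundle.load('assets/models/your-model.gguf');
    final bytes = data.buffer.asUint8List(data.offsetInBytes, data.lengthInBytes);
    await target.writeAsBytes(bytes, flush: true);
  }
  return target.path;
}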

How to Contribute

Contributions help the entire community build better AI apps. To get started:

  1. Fork the repository and create a feature branch:

    git checkout -b feature/your-feature
    
  2. Implement your changes, write tests if applicable, and ensure existing examples still run.

  3. Open a pull request with a clear description of your enhancement or fix.

Before large features, please open an issue to discuss scope and avoid duplicate efforts.


Conclusion & Future Outlook

Cactus Compute bridges the gap between powerful AI models and resource‑constrained devices. By offering cross‑platform SDKs, flexible quantization, intelligent fallback modes, and a growing model ecosystem, it enables developers to build responsive, private, and reliable AI experiences—on mobile, desktop, or embedded platforms. As device hardware continues to improve and model innovations emerge, on‑device AI will become ever more capable. We invite you to explore Cactus Compute in your next project, contribute improvements, and help shape the future of local AI inference.

Join Us
Join a global community of developers bringing AI to every application.