Cactus is a lightweight, high-performance framework for running AI models on mobile devices, with simple, consistent APIs across C/C++, Dart/Flutter, and TypeScript/React Native. Cactus currently leverages GGML backends to support any GGUF model already compatible with llama.cpp.
- Text completion and chat completion
- Vision Language Models
- Streaming token generation
- Embedding generation
- Text-to-speech model support (early stages)
- JSON mode with schema validation
- Chat templates with Jinja2 support
- Low memory footprint
- Battery-efficient inference
- Background processing
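A chat template (mentioned above) is what turns a structured message list into the single prompt string the model was trained on. The sketch below hardcodes a ChatML-style layout purely for illustration — in practice Cactus applies the Jinja2 template stored in the GGUF metadata, so you rarely write this by hand:

```typescript
// Illustrative only: a hardcoded ChatML-style renderer, standing in for
// what a real chat template does. The delimiter tokens vary per model.
type ChatMessage = { role: string; content: string };

function renderChatML(messages: ChatMessage[]): string {
  const parts = messages.map(
    (m) => `<|im_start|>${m.role}\n${m.content}<|im_end|>`
  );
  // Leave the assistant turn open so the model continues from here.
  return parts.join("\n") + "\n<|im_start|>assistant\n";
}

const prompt = renderChatML([
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Hello, how are you?" },
]);
```

This is why the APIs below accept `messages` rather than a raw prompt: the framework handles the model-specific formatting for you.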
- Cloud APIs are increasingly expensive, especially at scale
- Private and local; data never leaves the device
- Low-latency and fault-tolerant; users do not need an internet connection
- Small models excel at most tasks; big APIs are often only better at enterprise tasks like coding
- Freedom to use any GGUF model, unlike Apple Foundations and Google AI Core
- React-Native and Flutter APIs, no need for separate Swift and Android setups
- iOS xcframework and JNILibs if working in a native setup
- Neat and tiny C++ build for custom hardware
- Update `pubspec.yaml`: Add `cactus` to your project's dependencies. Ensure you have `flutter: sdk: flutter` (usually present by default).

  ```yaml
  dependencies:
    flutter:
      sdk: flutter
    cactus: ^0.1.0
  ```
- Install dependencies: Execute the following command in your project terminal:

  ```shell
  flutter pub get
  ```
- Basic Flutter Text Completion

  ```dart
  import 'package:cactus/cactus.dart';

  Future<String> basicCompletion() async {
    // Initialize context
    final context = await CactusContext.init(CactusInitParams(
      modelPath: '/path/to/model.gguf',
      contextSize: 2048,
      threads: 4,
    ));

    // Generate response
    final result = await context.completion(CactusCompletionParams(
      messages: [
        ChatMessage(role: 'user', content: 'Hello, how are you?')
      ],
      maxPredictedTokens: 100,
      temperature: 0.7,
    ));

    context.free();
    return result.text;
  }
  ```
To learn more, see the Flutter Docs. It covers chat design, embeddings, multimodal models, text-to-speech, and more.
- Install the `cactus-react-native` package:

  Using npm:

  ```shell
  npm install cactus-react-native
  ```

  Or using yarn:

  ```shell
  yarn add cactus-react-native
  ```
- Install iOS Pods (if not using Expo): For native iOS projects, ensure you link the native dependencies. Navigate to your `ios` directory and run:

  ```shell
  npx pod-install
  ```
- Basic React-Native Text Completion

  ```typescript
  import { initLlama } from 'cactus-react-native';

  // Initialize a context
  const context = await initLlama({
    model: '/path/to/your/model.gguf',
    n_ctx: 2048,
    n_threads: 4,
  });

  // Generate text
  const result = await context.completion({
    messages: [
      { role: 'user', content: 'Hello, how are you?' }
    ],
    n_predict: 100,
    temperature: 0.7,
  });

  console.log(result.text);
  ```
To learn more, see the React Docs. It covers chat design, embeddings, multimodal models, text-to-speech, and more.
The Cactus backend is written in C/C++ and runs directly on any ARM/x86 hardware — phones, smart TVs, watches, speakers, cameras, laptops, Raspberry Pi boards, etc.
- Setup: You need `CMake 3.14+` installed. Install it with `brew install cmake` (on macOS) or your standard package manager on Linux.

- Build from Source

  ```shell
  git clone https://github.com/your-org/cactus.git
  cd cactus
  mkdir build && cd build
  cmake .. -DCMAKE_BUILD_TYPE=Release
  make -j$(nproc)
  ```
- CMake Integration: Add to your `CMakeLists.txt`:

  ```cmake
  # Add Cactus as subdirectory
  add_subdirectory(cactus)

  # Link to your target
  target_link_libraries(your_target cactus)
  target_include_directories(your_target PRIVATE cactus)

  # Requires C++17 or higher
  ```
- Basic Text Completion

  ```cpp
  #include "cactus/cactus.h"
  #include <iostream>

  int main() {
      cactus::cactus_context context;

      // Configure parameters
      common_params params;
      params.model.path = "model.gguf";
      params.n_ctx = 2048;
      params.n_threads = 4;
      params.n_gpu_layers = 99; // Use GPU acceleration

      // Load model
      if (!context.loadModel(params)) {
          std::cerr << "Failed to load model" << std::endl;
          return 1;
      }

      // Set prompt
      context.params.prompt = "Hello, how are you?";
      context.params.n_predict = 100;

      // Initialize sampling
      if (!context.initSampling()) {
          std::cerr << "Failed to initialize sampling" << std::endl;
          return 1;
      }

      // Generate response
      context.beginCompletion();
      context.loadPrompt();

      while (context.has_next_token && !context.is_interrupted) {
          auto token_output = context.doCompletion();
          if (token_output.tok == -1) break;
      }

      std::cout << "Response: " << context.generated_text << std::endl;
      return 0;
  }
  ```
To learn more, see the C++ Docs. It covers chat design, embeddings, multimodal models, text-to-speech, and more.
First, clone the repo with `git clone https://github.com/cactus-compute/cactus.git`, cd into it, and make all scripts executable with `chmod +x scripts/*.sh`.
- Flutter
  - Build the Android JNILibs with `scripts/build-flutter-android.sh`.
  - Build the Flutter plugin with `scripts/build-flutter-android.sh`.
  - Navigate to the example app with `cd examples/flutter`.
  - Open your simulator via Xcode or Android Studio (see the walkthrough if you have not done this before).
  - Always start the app with this combo: `flutter clean && flutter pub get && flutter run`.
  - Play with the app, and make changes to either the example app or the plugin as desired.
- React Native
  - Build the Android JNILibs with `scripts/build-react-android.sh`.
  - Build the React Native package with `scripts/build-react-android.sh`.
  - Navigate to the example app with `cd examples/react`.
  - Set up your simulator via Xcode or Android Studio (see the walkthrough if you have not done this before).
  - Always start the app with this combo: `yarn && yarn ios` or `yarn && yarn android`.
  - Play with the app, and make changes to either the example app or the package as desired.
  - For now, if changes are made in the package, manually copy the files/folders into `examples/react/node_modules/cactus-react-native`.
- C/C++
  - Navigate to the example app with `cd examples/cpp`.
  - There are multiple main files: `main_vlm`, `main_llm`, `main_embed`, `main_tts`.
  - Build both the libraries and executables with `build.sh`.
  - Run one of the executables: `./cactus_vlm`, `./cactus_llm`, `./cactus_embed`, `./cactus_tts`.
  - Try different models and make changes as desired.
- Contributing
  - To contribute a bug fix, create a branch after making your changes with `git checkout -b <branch-name>` and submit a PR.
  - To contribute a feature, please raise an issue first so it can be discussed and so you avoid duplicating someone else's work.
  - Join our Discord.
| Device | Gemma-3 1B Q8 (toks/sec) | Qwen-2.5 1.5B Q8 (toks/sec) |
|---|---|---|
| iPhone 16 Pro Max | 46 | 37 |
| iPhone 16 Pro | 46 | 37 |
| iPhone 16 | 42 | 36 |
| iPhone 15 Pro Max | 39 | 31 |
| iPhone 15 Pro | 39 | 31 |
| iPhone 14 Pro Max | 38 | 29 |
| OnePlus 13 5G | 37 | - |
| Samsung Galaxy S24 Ultra | 36 | - |
| iPhone 15 | 36 | 25 |
| OnePlus Open | 33 | - |
| Samsung Galaxy S23 5G | 32 | - |
| Samsung Galaxy S24 | 31 | - |
| iPhone 13 Pro | 30 | - |
| OnePlus 12 | 30 | - |
| Galaxy S25 Ultra | 25 | - |
| OnePlus 11 | 23 | - |
| iPhone 13 mini | 22 | - |
| Redmi K70 Ultra | 21 | - |
| Xiaomi 13 | 21 | - |
| Samsung Galaxy S24+ | 19 | - |
| Samsung Galaxy Z Fold 4 | 19 | - |
| Xiaomi Poco F6 5G | 19 | - |
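The throughput numbers above map directly onto perceived latency: the time to stream a full reply is simply output length divided by tokens per second (ignoring prompt-processing time). A quick back-of-the-envelope helper, using figures from the table:

```typescript
// Seconds to stream a reply of `outputTokens` tokens at a given decode
// throughput. Prompt processing adds extra time on top of this.
function secondsForReply(outputTokens: number, toksPerSec: number): number {
  return outputTokens / toksPerSec;
}

// A 100-token reply from Gemma-3 1B Q8 on an iPhone 16 Pro Max (46 toks/sec)
const fast = secondsForReply(100, 46); // ≈ 2.2 s
// The same reply on a Xiaomi 13 (21 toks/sec)
const slow = secondsForReply(100, 21); // ≈ 4.8 s
```

Even the slowest devices in the table stay well under typical cloud round-trip-plus-queueing latency for short replies, which is what makes on-device streaming feel responsive.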
We created a demo chat app that we use for benchmarking.
You can run models up to 10B at Q4 on most devices, but we do not recommend this for production due to file size, speed, battery drain, and heat. We generally recommend the following.
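The file-size concern can be sanity-checked with simple arithmetic: a quantized model's footprint is roughly parameter count times bits per weight, divided by eight (real GGUF files run slightly larger because of per-block scales and metadata). The effective bits-per-weight figures below are rough assumptions for illustration:

```typescript
// Rough on-disk size estimate for a quantized model, in gigabytes.
// Ignores metadata and per-block quantization scales, so treat the
// result as a lower bound.
function approxModelSizeGB(params: number, bitsPerWeight: number): number {
  return (params * bitsPerWeight) / 8 / 1e9;
}

// A 10B model at Q4 (~4.5 effective bits/weight) is already several GB,
// which is a lot to ship in or download for a mobile app.
const tenB_Q4 = approxModelSizeGB(10e9, 4.5); // ≈ 5.6 GB
// A 1B model at Q6 (~6.5 effective bits/weight) stays under a gigabyte.
const oneB_Q6 = approxModelSizeGB(1e9, 6.5); // ≈ 0.8 GB
```

This is the main reason the recommendations below cluster around the 300m–2B range at Q6.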
- Language Generation: `SmolLM2-360m`, `Qwen-3-600m-Q6`, `Gemma-3-1B-Q6`, `Qwen-3-1.7B-Q6`
- Multimodal Language Generation: `Smol-VLM-500m-Q6`, `Gemma-3n-2B-Q6`
- Embeddings: `nomic-v2-moe-300m-Q6`, `jina-v3-570m-Q6`
- Text-To-Speech: `OuteTTS-0.2-500m-Q6`
Gemma-3n-2B-Q6 is a great omni model and beats GPT-4.1 across many metrics. It is multimodal (vision, audio) and, with clever prompt engineering, can also be used for embedding text, images, and audio, as well as zero-shot classification and more. We are trying hard to get the weights.