Inference Engines
OMM supports two inference engines. llama.cpp is the default and runs on a broad range of hardware. rvllm adds features for multi-GPU and high-throughput workloads.
llama.cpp (default)
Integrated through the llama-cpp-2 Rust FFI bindings. Loads models in GGUF format and offers broad hardware compatibility.
| Feature | Support |
|---|---|
| GGUF format | Yes |
| CUDA / Metal / Vulkan / ROCm | Yes |
| Flash Attention | Yes |
| Memory-mapped loading | Yes (near-instant start) |
| Multi-GPU | Limited |
| Continuous batching | No |
rvllm (advanced)
A high-performance engine for production workloads. Loads models in SafeTensors format and schedules requests with continuous batching.
| Feature | Support |
|---|---|
| GGUF format | No (SafeTensors only) |
| Tensor parallelism | Yes (multi-GPU) |
| Speculative decoding | Yes |
| Continuous batching | Yes |
| KV cache quantization | Yes |
| CUPTI autotuning | Yes |
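
The format rows in the two tables imply a simple selection rule: GGUF models go to llama.cpp and SafeTensors models go to rvllm. The sketch below illustrates that rule in Rust. The `Engine` enum and both helper functions are hypothetical names introduced for illustration, not part of OMM's API. It also assumes the model is a single file whose extension identifies its format, which may not hold for sharded SafeTensors checkpoints.

```rust
use std::path::Path;

/// The two engines described in this section (hypothetical type, not OMM's API).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Engine {
    LlamaCpp,
    Rvllm,
}

/// Hypothetical helper: pick an engine from the model file's extension.
/// Returns `None` for formats that neither table lists.
fn engine_for_model(path: &Path) -> Option<Engine> {
    // Compare case-insensitively so `.GGUF` and `.gguf` are treated alike.
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "gguf" => Some(Engine::LlamaCpp),
        "safetensors" => Some(Engine::Rvllm),
        _ => None,
    }
}

/// Hypothetical helper mirroring the "Continuous batching" rows of the tables.
fn supports_continuous_batching(engine: Engine) -> bool {
    matches!(engine, Engine::Rvllm)
}
```

Returning `Option<Engine>` rather than falling back to the default keeps an unsupported format visible to the caller instead of silently handing it to llama.cpp.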