Inference Engines

OMM supports two inference engines. llama.cpp is the default and works on all hardware. rvllm provides advanced features for multi-GPU and high-throughput scenarios.

llama.cpp (default)

Bound through the llama-cpp-2 Rust FFI crate. Loads models in GGUF format and runs on a broad range of hardware.

Feature                        Support
GGUF format                    Yes
CUDA / Metal / Vulkan / ROCm   Yes
Flash Attention                Yes
Memory-mapped loading          Yes (instant start)
Multi-GPU                      Limited
Continuous batching            No

rvllm (advanced)

A high-performance engine for production workloads. It loads models in SafeTensors format and adds scheduling features such as continuous batching.

Feature                  Support
GGUF format              No (SafeTensors only)
Tensor parallelism       Yes (multi-GPU)
Speculative decoding     Yes
Continuous batching      Yes
KV cache quantization    Yes
CUPTI autotuning         Yes

Engine Selection

# Auto-detect (default)
omm serve --engine auto
# Force llama.cpp
omm serve --engine llama
# Force rvllm
omm serve --engine rvllm
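
If you prefer to pin the engine explicitly rather than rely on auto-detection, a small wrapper can choose the flag from the model's file format. The sketch below is illustrative only: `pick_engine` is a hypothetical helper, not part of OMM, and it does not describe how `--engine auto` decides internally. It relies only on the format support listed above (GGUF for llama.cpp, SafeTensors for rvllm) and assumes the format can be read from the file extension.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of OMM): map a model path to a value
# accepted by `omm serve --engine`, based on the file extension.
pick_engine() {
  case "$1" in
    *.gguf)        echo "llama" ;;  # GGUF is loaded by llama.cpp
    *.safetensors) echo "rvllm" ;;  # rvllm accepts SafeTensors only
    *)             echo "auto"  ;;  # unknown format: leave the choice to OMM
  esac
}

# Example use (the model path is an assumption for illustration):
#   omm serve --engine "$(pick_engine ./models/example.gguf)"
```

Anything the helper does not recognize falls back to `auto`, so it never forces an engine onto a format that engine may not support.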