Inference Engines

OMM supports two inference engines. llama.cpp is the default and works on all hardware. rvllm provides advanced features for multi-GPU and high-throughput scenarios.

llama.cpp (default)

OMM binds to llama.cpp through the llama-cpp-2 Rust FFI crate. This engine loads models in GGUF format and runs on a broad range of hardware.

Feature	Support
GGUF format	Yes
CUDA / Metal / Vulkan / ROCm	Yes
Flash Attention	Yes
Memory-mapped loading	Yes (instant start)
Multi-GPU	Limited
Continuous batching	No

rvllm (advanced)

rvllm is a high-performance engine for production workloads. It loads models in SafeTensors format only and adds advanced scheduling features such as continuous batching.

Feature	Support
GGUF format	No (SafeTensors only)
Tensor parallelism	Yes (multi-GPU)
Speculative decoding	Yes
Continuous batching	Yes
KV cache quantization	Yes
CUPTI autotuning	Yes

Engine Selection

# Auto-detect (default)
omm serve --engine auto

# Force llama.cpp
omm serve --engine llama

# Force rvllm
omm serve --engine rvllm
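
When forcing an engine, the format tables above suggest a simple rule of thumb: GGUF models go to llama.cpp, SafeTensors models go to rvllm. The sketch below illustrates that rule with a hypothetical helper, `pick_engine`, which maps a model file name to a value for `--engine`. The helper and the example file names are assumptions for illustration; this is not how OMM's `auto` mode is documented to work, and the command is only printed, not run.

```shell
# Hypothetical helper: choose an --engine value from the model file's extension.
# The extension-to-engine mapping is an assumption derived from the format
# support tables; it is not OMM's actual auto-detection logic.
pick_engine() {
  case "$1" in
    *.gguf)        echo "llama" ;;  # GGUF is supported by llama.cpp only
    *.safetensors) echo "rvllm" ;;  # rvllm is SafeTensors only
    *)             echo "auto"  ;;  # unknown format: defer to OMM's default
  esac
}

# Example: build the command line for a GGUF model (file name is illustrative).
engine="$(pick_engine "model.gguf")"
echo "omm serve --engine ${engine}"
```

Falling back to `auto` for unrecognized extensions keeps the helper from overriding OMM's own default behavior when the format is unclear.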
