GPU Setup

OMM supports GPU acceleration through four backends: CUDA (NVIDIA), Metal (Apple), Vulkan (cross-vendor), and ROCm (AMD). The default auto-detection selects the best available backend for your hardware.

Auto-Detection

OMM probes your system on startup and selects a backend. Run omm doctor to see what it detected:

Terminal
bash
$ omm doctor
OMM System Check
─────────────────────────
OS: Linux 6.17 (x86_64)
CPU: AMD Ryzen 9 7950X (32 threads)
RAM: 64 GB available
GPU: NVIDIA RTX 4090 (24 GB VRAM)
CUDA: 12.8 detected
Backend: CUDA (auto-detected)
Status: Ready ✓
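For scripting, the selected backend can be pulled out of the report. This is a minimal sketch, assuming the output keeps the "Backend: NAME (...)" line layout shown above. The helper name omm_backend is hypothetical and not part of OMM.

```shell
# Hypothetical helper: read an `omm doctor` report on stdin and print
# the backend name. Assumes a line of the form "Backend: NAME (...)".
omm_backend() {
    awk -F': *' '/^Backend:/ { split($2, parts, " "); print parts[1]; exit }'
}

# Usage (requires omm on PATH):
#   omm doctor | omm_backend
```

If the report format differs on your version, adjust the pattern rather than relying on this sketch.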

CUDA (NVIDIA)

The most mature and performant backend. Supports all NVIDIA GPUs with compute capability 6.0+ (Pascal and newer).

Requirements

  • NVIDIA GPU with CUDA Compute 6.0+
  • NVIDIA driver 525+
  • CUDA toolkit 12.x (bundled with OMM on most platforms)

Configuration

config.toml
toml
[engine.llama]
gpu_layers = 35 # Layers to offload to GPU (35 fully offloads a typical 7B model)
flash_attention = true # Faster attention computation

[engine.rvllm]
tensor_parallel = 2 # Number of GPUs to use
kv_cache_dtype = "fp8" # Reduce VRAM usage

Note
Set gpu_layers = -1 to offload all layers to GPU. This gives maximum speed but requires enough VRAM to hold the entire model.

Metal (Apple)

Built-in support for Apple Silicon (M1, M2, M3, M4) and AMD GPUs on macOS. Metal acceleration is enabled automatically on Apple hardware.

Chip                  Memory       Example models
M1 / M2               8-16 GB      Llama 3.2 3B, Phi-4 3.8B
M1 Pro / M2 Pro       16-32 GB     Qwen 2.5 7B, Mistral 7B
M1 Max / M2 Max       32-64 GB     Qwen 2.5 14B, Gemma 3 27B
M3 Max / M4 Max       36-128 GB    Llama 3.3 70B (Q2_K)
M2 Ultra / M4 Ultra   128-192 GB   Llama 3.3 70B (Q4_K_M)

Vulkan (Cross-Vendor)

Vulkan provides GPU acceleration on hardware without native CUDA or Metal support. Works with Intel, AMD, and some ARM GPUs.

Terminal
bash
# Force Vulkan backend
omm serve --vulkan
# Or in config
# engine.llama.backend = "vulkan"

Vulkan support varies by GPU driver. Run omm doctor to check compatibility.

ROCm (AMD)

AMD GPU support via ROCm 6.x. Supports RDNA3 and later consumer GPUs (RX 7900 XTX, RX 7900 XT) as well as CDNA-based Instinct accelerators (MI250, MI300).

Terminal
bash
# Force ROCm backend
omm serve --rocm
# Set VRAM fraction
omm serve --rocm --gpu-memory-fraction 0.9

Memory Management

GPU Layers

The gpu_layers parameter controls how many model layers are offloaded to the GPU. Offloading more layers speeds up inference but uses more VRAM. Any layers that are not offloaded run on the CPU.

gpu_layers     Effect
0              CPU-only inference (slowest)
10-20          Partial offload (mixed speed)
35 (typical)   Full offload for 7B models
-1             All layers to GPU (fastest, needs enough VRAM)
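When a model does not fit entirely in VRAM, a starting value for gpu_layers can be estimated from the model size and the free VRAM. The sketch below is a rough heuristic, not OMM's own logic. It assumes every layer takes an equal share of the model file and that a fixed amount of VRAM is held back for the KV cache and buffers. The function name and its arguments are hypothetical.

```shell
# Rough heuristic (assumption: all layers are the same size).
# Arguments, all positive integers:
#   $1 model size in MiB      $2 number of layers in the model
#   $3 free VRAM in MiB       $4 VRAM to keep in reserve, in MiB
estimate_gpu_layers() {
    model_mib=$1; n_layers=$2; free_mib=$3; reserve_mib=$4
    usable=$(( free_mib - reserve_mib ))
    if [ "$usable" -le 0 ]; then
        echo 0
        return
    fi
    # Multiply before dividing so integer math does not truncate early.
    fit=$(( usable * n_layers / model_mib ))
    # Never suggest more layers than the model has.
    if [ "$fit" -gt "$n_layers" ]; then
        fit=$n_layers
    fi
    echo "$fit"
}

# Usage: a 4096 MiB model with 32 layers, 3072 MiB free, 1024 MiB reserved
#   estimate_gpu_layers 4096 32 3072 1024
```

On NVIDIA hardware, free VRAM in MiB can be read with nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits. Treat the result as a first guess and lower it if you still run out of VRAM.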

Multi-GPU

With rvllm, distribute models across multiple GPUs using tensor parallelism:

Terminal
bash
# Split a 70B model across 2 GPUs
omm serve --engine rvllm --tensor-parallel 2 --model llama3.3:70b
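To judge whether a split is likely to fit, it helps to estimate each GPU's share of the weights. This sketch assumes the weights are divided evenly across GPUs. It is not rvllm's actual allocator and leaves out the KV cache and activations. The function name is hypothetical.

```shell
# Rough estimate (assumption: weights split evenly across GPUs).
# Arguments: $1 model size in MiB, $2 tensor-parallel degree.
per_gpu_weights_mib() {
    model_mib=$1; tensor_parallel=$2
    # Ceiling division so the estimate never rounds down.
    echo $(( (model_mib + tensor_parallel - 1) / tensor_parallel ))
}

# Usage: a hypothetical 40960 MiB model split across 2 GPUs
#   per_gpu_weights_mib 40960 2
```

Compare the result with each GPU's VRAM and leave headroom for the KV cache.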

Troubleshooting

  • Out of VRAM: Reduce gpu_layers, use a smaller quantization (Q4 instead of Q8), or switch to a smaller model
  • Slow inference: Verify GPU is actually being used with omm doctor. Check that gpu_layers is not 0
  • CUDA not found: Install NVIDIA drivers and CUDA toolkit. On Linux, ensure nvidia-smi works
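
The driver check in the last item can be scripted. This is a minimal sketch: check_tool is a hypothetical helper, not an OMM command, and it only reports whether a command is on PATH.

```shell
# Hypothetical helper: report whether a command is available on PATH.
# Returns 0 if found, 1 otherwise.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found on PATH"
        return 1
    fi
}

# Usage: confirm the driver tools exist, then print GPU name and driver version
#   check_tool nvidia-smi && nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
```

If nvidia-smi is missing or exits with an error, fix the driver installation before looking at OMM's configuration.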