GPU Setup
OMM supports GPU acceleration through four backends: CUDA (NVIDIA), Metal (Apple), Vulkan (cross-vendor), and ROCm (AMD). The default auto-detection selects the best available backend for your hardware.
Auto-Detection
OMM probes your system on startup and selects the optimal backend.
CUDA (NVIDIA)
The most mature and performant backend. Supports all NVIDIA GPUs with compute capability 6.0+ (Pascal and newer).
Requirements
- NVIDIA GPU with compute capability 6.0+
- NVIDIA driver 525+
- CUDA toolkit 12.x (bundled with OMM on most platforms)
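The driver and compute capability requirements above can be checked from a script. The following is a minimal sketch, not part of OMM: it assumes `nvidia-smi` is on the PATH and that the installed driver supports the `compute_cap` query field (available in recent driver releases). The function names are hypothetical.

```python
import subprocess

MIN_DRIVER_MAJOR = 525    # minimum driver from the requirements list
MIN_COMPUTE_CAP = (6, 0)  # Pascal and newer


def parse_gpu_line(line):
    """Parse one 'driver_version, compute_cap' CSV line from nvidia-smi."""
    driver, cap = [field.strip() for field in line.split(",")]
    driver_major = int(driver.split(".")[0])
    cap_major, cap_minor = (int(part) for part in cap.split("."))
    return driver_major, (cap_major, cap_minor)


def meets_cuda_requirements(line):
    """Return True if the GPU described by the line satisfies both minimums."""
    driver_major, cap = parse_gpu_line(line)
    return driver_major >= MIN_DRIVER_MAJOR and cap >= MIN_COMPUTE_CAP


def query_gpus():
    """Return one CSV line per GPU, or an empty list if nvidia-smi is unusable."""
    try:
        result = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=driver_version,compute_cap",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return [line for line in result.stdout.splitlines() if line.strip()]
```

Calling `meets_cuda_requirements(line)` on each entry returned by `query_gpus()` gives a quick pass/fail per GPU. An empty list usually means the driver is missing or too old to report `compute_cap`.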
Configuration
Set gpu_layers = -1 to offload all layers to GPU. This gives maximum speed but requires enough VRAM to hold the entire model.
Metal (Apple)
Built-in support for Apple Silicon (M1, M2, M3, M4) and AMD GPUs on macOS. Metal acceleration is enabled automatically on Apple hardware.
Vulkan (Cross-Vendor)
Vulkan provides GPU acceleration on hardware without native CUDA or Metal support. Works with Intel, AMD, and some ARM GPUs.
Vulkan support varies by GPU driver. Run omm doctor to check compatibility.
ROCm (AMD)
AMD GPU support via ROCm 6.x. Supports RDNA3 and later consumer architectures (RX 7900 XTX, RX 7900 XT) as well as CDNA data-center accelerators (MI250, MI300).
Memory Management
GPU Layers
The gpu_layers parameter controls how many model layers are offloaded to GPU. Offloading more layers speeds up inference but uses more VRAM. Any layers that are not offloaded run on CPU.
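A rough starting value for gpu_layers can be estimated from the model size and the free VRAM. The sketch below is illustrative only and is not how OMM itself decides: it assumes all layers are the same size, and the `reserve_bytes` headroom for context buffers and scratch space is a guessed default. The function name is hypothetical.

```python
GIB = 1024 ** 3


def layers_that_fit(model_bytes, n_layers, free_vram_bytes, reserve_bytes=1 * GIB):
    """Estimate how many layers fit in VRAM, assuming equal-sized layers."""
    if model_bytes <= 0 or n_layers <= 0:
        raise ValueError("model_bytes and n_layers must be positive")
    per_layer = model_bytes / n_layers
    # Keep some VRAM back for buffers that are not part of the model weights.
    usable = max(0, free_vram_bytes - reserve_bytes)
    return min(n_layers, int(usable // per_layer))
```

For example, an 8 GiB model with 32 layers and 6 GiB of free VRAM yields 20 under these assumptions. Treat the result as a first guess and lower it if you still run out of VRAM.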
Multi-GPU
With rvllm, distribute models across multiple GPUs using tensor parallelism.
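To illustrate the idea behind tensor parallelism, here is a conceptual sketch in plain Python. It does not use rvllm or any GPU and does not reflect rvllm's actual implementation. It shows only the column-wise split: each device holds a slice of a weight matrix, computes its part of the output, and the parts are concatenated. All function names are hypothetical.

```python
def matmul(x, w):
    """Multiply a vector x (length k) by a matrix w (k rows, n columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, n_shards):
    """Split w column-wise into n_shards contiguous shards, one per device."""
    n_cols = len(w[0])
    bounds = [n_cols * s // n_shards for s in range(n_shards + 1)]
    return [[row[bounds[s]:bounds[s + 1]] for row in w] for s in range(n_shards)]


def tensor_parallel_matmul(x, w, n_shards):
    """Compute x @ w by letting each shard produce its slice of the output."""
    outputs = []
    for shard in split_columns(w, n_shards):
        # On real hardware each shard's matmul would run on a different GPU.
        outputs.extend(matmul(x, shard))
    return outputs
```

The sharded result matches the unsharded product, while each device only needs to store its own slice of the weights. This is why tensor parallelism lets a model that is too large for a single GPU run across several.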
Troubleshooting
- Out of VRAM: Reduce gpu_layers, use a smaller quantization (Q4 instead of Q8), or switch to a smaller model
- Slow inference: Verify GPU is actually being used with omm doctor. Check that gpu_layers is not 0
- CUDA not found: Install NVIDIA drivers and CUDA toolkit. On Linux, ensure nvidia-smi works