GGUF and SafeTensors
OMM supports two model formats. GGUF is the standard for llama.cpp inference with quantized weights. SafeTensors is the full-precision format used by rvllm for multi-GPU and high-concurrency workloads.
GGUF
GGUF (GPT-Generated Unified Format) is a binary format optimized for llama.cpp. It packages model weights, tokenizer, and metadata into a single file with built-in quantization.
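To make the single-file layout concrete, here is a minimal sketch that reads the fixed-size header at the start of a GGUF file. It assumes a little-endian file of GGUF version 2 or later, where the header is the 4-byte magic `GGUF`, a 32-bit version, a 64-bit tensor count, and a 64-bit metadata key/value count. The function name `read_gguf_header` and the synthetic header bytes are illustrative and not part of OMM or llama.cpp.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4 (magic) + 4 (version) + 8 (tensor count) + 8 (metadata count)


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed header of a little-endian GGUF v2+ file (assumed layout)."""
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least 24 bytes to read a GGUF header")
    magic, version, tensor_count, kv_count = struct.unpack_from("<4sIQQ", data, 0)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic was {magic!r}")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header for illustration only: version 3, 291 tensors, 24 metadata entries.
example_header = struct.pack("<4sIQQ", GGUF_MAGIC, 3, 291, 24)
info = read_gguf_header(example_header)
print(info)
```

For a real file, pass the first 24 bytes, for example `read_gguf_header(open(path, "rb").read(24))` with `path` pointing at a local `.gguf` file. The metadata entries (tokenizer, architecture, and so on) and the tensor descriptors follow this header in the same file.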
Advantages
- Quantized: Weights are stored pre-quantized in the file, reducing download size, disk usage, and memory usage
- Single file: Everything packaged together (weights, tokenizer, config)
- Memory-mapped: Near-instant model loading via mmap
- Universal: Runs on CPU, CUDA, Metal, Vulkan, and ROCm
- Huge ecosystem: 50,000+ models available on HuggingFace
Quantization Variants
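As a rough guide to how much a variant saves, weight size can be estimated from the parameter count and the average bits per weight. The sketch below is an approximation under stated assumptions: it counts weights only (no KV cache or runtime buffers), and the table holds just the variants whose cost follows directly from their block layout. `Q8_0` stores 32 8-bit values plus a 16-bit scale per block (8.5 bits per weight), and `Q4_0` stores 32 4-bit values plus a 16-bit scale (4.5 bits per weight). K-quants such as Q4_K_M and Q5_K_M mix block types, so their average varies by model; pass a measured value for those. The function name is hypothetical.

```python
# Bits per weight derived from block layout; K-quants are intentionally omitted
# because their average depends on the model architecture.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,  # 32 x int8 + one fp16 scale = 34 bytes per 32 weights
    "Q4_0": 4.5,  # 32 x 4-bit + one fp16 scale = 18 bytes per 32 weights
}


def estimate_weight_gib(n_params: int, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


# Illustrative parameter count, not tied to any specific model.
n_params = 7_000_000_000
for name, bpw in BITS_PER_WEIGHT.items():
    print(f"{name:5s} ~{estimate_weight_gib(n_params, bpw):.1f} GiB")
```

Real files come out somewhat larger or smaller than this estimate, because some tensors are kept at higher precision and the file also carries metadata.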
SafeTensors
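SafeTensors is a file format developed by HuggingFace for storing model tensors. It is secure because loading a file only reads data and never executes code, unlike pickle-based checkpoints, and it is fast because tensors can be read directly from their byte offsets. OMM uses SafeTensors with the rvllm engine for full-precision inference and advanced features.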
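The on-disk layout is simple enough to inspect by hand, which is part of why the format is safe to load. A SafeTensors file starts with an 8-byte little-endian unsigned integer giving the length of a JSON header; that header maps each tensor name to its `dtype`, `shape`, and `data_offsets`, plus an optional `__metadata__` entry. The sketch below parses that header with the standard library only. The helper names and the tiny synthetic file are illustrative and not part of OMM or rvllm.

```python
import json
import struct


def read_safetensors_header(data: bytes) -> dict:
    """Return the JSON header of a SafeTensors byte string."""
    (header_len,) = struct.unpack_from("<Q", data, 0)
    return json.loads(data[8 : 8 + header_len].decode("utf-8"))


def list_tensors(header: dict) -> dict:
    """Map tensor name -> (dtype, shape), skipping the optional metadata entry."""
    return {
        name: (entry["dtype"], entry["shape"])
        for name, entry in header.items()
        if name != "__metadata__"
    }


# Synthetic file for illustration: one 2x2 F16 tensor (8 bytes of payload).
example_header = json.dumps(
    {"w": {"dtype": "F16", "shape": [2, 2], "data_offsets": [0, 8]}}
).encode("utf-8")
example_file = struct.pack("<Q", len(example_header)) + example_header + bytes(8)

tensors = list_tensors(read_safetensors_header(example_file))
print(tensors)
```

In practice the `safetensors` library handles this parsing; the sketch only shows that reading the file involves no code execution, just a length, a JSON document, and raw bytes.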
Advantages
- Full precision: No quality loss from quantization
- Tensor parallelism: Split model across multiple GPUs
- Continuous batching: Handle concurrent requests efficiently
- Speculative decoding: Use a draft model for faster generation
- KV cache quantization: Reduce memory without reducing model quality
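To see why KV cache quantization matters, it helps to estimate the cache size. For a standard transformer, the cache holds one key and one value vector per layer, per KV head, per token. The sketch below applies that formula; the model shape is a hypothetical example, the cache element widths that rvllm actually supports are not stated here, and the small overhead of quantization scales is ignored.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    bytes_per_elem: float,
) -> float:
    """Approximate KV cache size: keys and values for every layer, head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Hypothetical model shape, chosen for illustration only.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192, batch_size=1)

fp16_cache = kv_cache_bytes(**shape, bytes_per_elem=2)  # 16-bit cache entries
int8_cache = kv_cache_bytes(**shape, bytes_per_elem=1)  # 8-bit cache entries

print(f"16-bit cache: {fp16_cache / 2**30:.2f} GiB")
print(f" 8-bit cache: {int8_cache / 2**30:.2f} GiB")
```

The cache grows linearly with both sequence length and batch size, so halving the element width frees memory for longer contexts or more concurrent requests while the model weights stay at full precision.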
Format Comparison
Choosing a Format
- To minimize disk and memory usage, use GGUF with llama.cpp and choose Q4_K_M or Q5_K_M quantization.
- To run a model across multiple GPUs, use SafeTensors with rvllm and enable tensor parallelism.
- To avoid quantization loss, use GGUF F16 or SafeTensors. Both run unquantized weights.
- To serve many concurrent requests, use SafeTensors with rvllm. Continuous batching gives the best throughput.