GGUF and SafeTensors

OMM supports two model formats. GGUF is the standard for llama.cpp inference with quantized weights. SafeTensors is the full-precision format used by rvllm for advanced workloads.

GGUF

GGUF (GPT-Generated Unified Format) is a binary format optimized for llama.cpp. It packages model weights, tokenizer, and metadata into a single file with built-in quantization.

Advantages

Quantized: Models are distributed already quantized, which reduces download size, disk usage, and memory usage
Single file: Everything packaged together (weights, tokenizer, config)
Memory-mapped: Near-instant model loading via mmap
Universal: Runs on CPU, CUDA, Metal, Vulkan, and ROCm
Huge ecosystem: 50,000+ models available on HuggingFace
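The single-file and memory-mapped points above can be illustrated by peeking at a GGUF file's fixed header through mmap. This is a minimal sketch, not OMM's or llama.cpp's loader: the function name `read_gguf_header` is hypothetical, and it assumes a little-endian GGUF file of version 2 or later, where the magic bytes are followed by a 32-bit version and two 64-bit counts.

```python
import mmap
import struct

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count).

    Hypothetical helper for illustration; assumes GGUF v2+ and
    little-endian byte order.
    """
    with open(path, "rb") as f:
        # Map the file read-only. The OS pages data in lazily, so only
        # the bytes we actually touch are read from disk.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            if mm[:4] != GGUF_MAGIC:
                raise ValueError("not a GGUF file")
            (version,) = struct.unpack_from("<I", mm, 4)
            if version < 2:
                raise ValueError("GGUF v1 is not handled by this sketch")
            tensor_count, kv_count = struct.unpack_from("<QQ", mm, 8)
            return version, tensor_count, kv_count
```

Because only the first 20 bytes are touched, the call returns quickly even for a multi-gigabyte model file, which is the same property that makes mmap-based loading fast.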

Quantization Variants

Variant    Bits      Size vs. F16    Use case
Q2_K       2-bit     ~25%            Testing, very low RAM
Q3_K_M     3-bit     ~30%            Minimal hardware
Q4_K_M     4-bit     ~40%            Recommended for most users
Q5_K_M     5-bit     ~50%            High quality, modest RAM
Q6_K       6-bit     ~60%            Near full precision
Q8_0       8-bit     ~75%            Near-lossless quality
F16        16-bit    100%            Full precision (no quantization)

Note

Q4_K_M is the best starting point for most users. It provides a strong balance between model quality and resource usage. Move to Q5_K_M or Q6_K if you have extra RAM and want higher fidelity.
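To make the trade-off concrete, the approximate percentages listed under Quantization Variants can be turned into a rough size estimate. The sketch below is an illustration only: the function names are hypothetical, it assumes an F16 model takes about 2 bytes per parameter, and it ignores KV cache and runtime overhead, so real memory use will be higher.

```python
# Approximate size of each variant relative to F16, ordered from
# highest to lowest fidelity. Values come from the variants listed above.
SIZE_RATIO = {
    "F16": 1.00,
    "Q8_0": 0.75,
    "Q6_K": 0.60,
    "Q5_K_M": 0.50,
    "Q4_K_M": 0.40,
    "Q3_K_M": 0.30,
    "Q2_K": 0.25,
}


def estimate_size_gb(params_billions, variant):
    """Rough weight size in GB (assumption: 2 bytes per parameter at F16)."""
    f16_gb = params_billions * 2.0
    return f16_gb * SIZE_RATIO[variant]


def largest_variant_that_fits(params_billions, budget_gb):
    """Return the highest-fidelity variant whose weights fit, or None."""
    for variant in SIZE_RATIO:  # dicts preserve insertion order
        if estimate_size_gb(params_billions, variant) <= budget_gb:
            return variant
    return None
```

For example, `estimate_size_gb(7, "Q4_K_M")` puts a 7-billion-parameter model at roughly 5.6 GB of weights, against about 14 GB at F16.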

SafeTensors

SafeTensors is a secure, fast file format developed by HuggingFace for storing model tensors. OMM uses SafeTensors with the rvllm engine for full-precision inference and advanced features.
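A SafeTensors file starts with an 8-byte little-endian length, followed by a JSON header of that length describing each tensor, and then the raw tensor bytes. The sketch below reads only that header using the standard library. The function name `read_safetensors_header` is hypothetical, and this is not how rvllm or OMM necessarily loads models.

```python
import json
import struct


def read_safetensors_header(path):
    """Return {tensor_name: {"dtype", "shape", "data_offsets"}}.

    Hypothetical helper for illustration; reads the header only and
    never touches the tensor data.
    """
    with open(path, "rb") as f:
        # First 8 bytes: unsigned 64-bit little-endian header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form string map, not a tensor.
    header.pop("__metadata__", None)
    return header
```

Because the header is plain JSON rather than a pickled object, inspecting a file this way does not execute any code from the file, which is the security property the format is named for.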

Advantages

Full precision: No quality loss from quantization
Tensor parallelism: Split model across multiple GPUs
Continuous batching: Handle concurrent requests efficiently
Speculative decoding: Use a draft model for faster generation
KV cache quantization: Reduce memory without reducing model quality

Format Comparison

Feature                GGUF                  SafeTensors
Engine                 llama.cpp             rvllm
Quantization           Built-in (Q2-Q8)      None (full precision)
File size              Small (quantized)     Large (full weights)
Loading time           Instant (mmap)        Slower (tensor load)
Multi-GPU              Limited               Full tensor parallelism
Continuous batching    No                    Yes
Hardware               CPU + GPU             GPU only (CUDA required)
Minimum RAM            2 GB                  16 GB + GPU VRAM

Choosing a Format

Running on a laptop or single GPU with limited VRAM

Use GGUF with llama.cpp. Choose Q4_K_M or Q5_K_M quantization.

Running on a server with multiple GPUs

Use SafeTensors with rvllm. Enable tensor parallelism across your GPUs.

Need maximum quality regardless of resources

Use GGUF F16 or SafeTensors. Both provide full-precision inference.

Serving concurrent users

Use SafeTensors with rvllm continuous batching for best throughput.
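The guidance above can be condensed into a small decision helper. This is a sketch under stated assumptions, not part of OMM: the `Deployment` fields and the `choose_format` function are hypothetical names, and the rules encode only the recommendations on this page plus the requirement that rvllm needs a CUDA GPU.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    """Hypothetical description of the target machine and workload."""
    cuda_gpus: int = 0              # number of CUDA-capable GPUs
    concurrent_users: bool = False  # serving many requests at once
    max_quality: bool = False       # quality matters more than resources


def choose_format(d):
    """Return (format, engine, quantization or None)."""
    # rvllm is GPU only, so SafeTensors is considered only with CUDA.
    if d.cuda_gpus >= 2:
        return ("SafeTensors", "rvllm", None)  # tensor parallelism
    if d.concurrent_users and d.cuda_gpus >= 1:
        return ("SafeTensors", "rvllm", None)  # continuous batching
    if d.max_quality:
        return ("GGUF", "llama.cpp", "F16")    # no quantization
    return ("GGUF", "llama.cpp", "Q4_K_M")     # recommended default


print(choose_format(Deployment(cuda_gpus=0)))
```

A laptop with no CUDA GPU falls through to the GGUF default, while any multi-GPU server is routed to SafeTensors with rvllm.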
