GGUF and SafeTensors

OMM supports two model formats. GGUF is the standard for llama.cpp inference with quantized weights. SafeTensors is the full-precision format used by rvllm for advanced workloads.

GGUF

GGUF (GPT-Generated Unified Format) is a binary format optimized for llama.cpp. It packages model weights, tokenizer, and metadata into a single file with built-in quantization.

Advantages

  • Quantized: Weights are stored in quantized form, reducing download size, disk usage, and memory usage
  • Single file: Everything packaged together (weights, tokenizer, config)
  • Memory-mapped: Near-instant model loading via mmap
  • Universal: Runs on CPU, CUDA, Metal, Vulkan, and ROCm
  • Huge ecosystem: 50,000+ models available on HuggingFace
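
As a rough illustration of the single-file, memory-mapped design, the sketch below maps a GGUF file and reads only its fixed-size header. It assumes the GGUF version 2 or later layout (a 4-byte magic "GGUF", a 32-bit version, then 64-bit tensor and metadata counts, all little-endian). The function name read_gguf_header is hypothetical and is not part of OMM or llama.cpp.

```python
import mmap
import struct


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) for a GGUF file.

    Assumes the GGUF v2+ header layout; v1 files used 32-bit counts
    and are not handled here.
    """
    with open(path, "rb") as f:
        # Map the file read-only. Pages are loaded lazily by the OS,
        # so only the bytes we touch are actually read from disk.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            if mm[:4] != b"GGUF":
                raise ValueError("not a GGUF file (bad magic)")
            # "<IQQ": little-endian uint32 version, uint64 tensor count,
            # uint64 metadata key/value count, starting after the magic.
            version, tensor_count, kv_count = struct.unpack_from("<IQQ", mm, 4)
            return version, tensor_count, kv_count
```

Calling read_gguf_header on a multi-gigabyte model should return almost immediately, because mapping the file does not copy the weights into memory.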

Quantization Variants

  Variant   Bits     Size vs. F16   Use case
  Q2_K      2-bit    ~25%           Testing, very low RAM
  Q3_K_M    3-bit    ~30%           Minimal hardware
  Q4_K_M    4-bit    ~40%           Recommended for most users
  Q5_K_M    5-bit    ~50%           High quality, modest RAM
  Q6_K      6-bit    ~60%           Near full precision
  Q8_0      8-bit    ~75%           Near-lossless quality
  F16       16-bit   100%           Full precision (no quantization)

Note
Q4_K_M is the best starting point for most users. It provides a strong balance between model quality and resource usage. Move to Q5_K_M or Q6_K if you have extra RAM and want higher fidelity.
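
To make the trade-off concrete, here is a minimal sketch that estimates file size per variant and picks the highest-quality variant that fits a size budget. It assumes the percentages listed for each variant are file size relative to F16, and that F16 stores 2 bytes per parameter. The function names are hypothetical, and the figures are ballpark estimates of file size only: runtime memory is higher once the KV cache and buffers are included.

```python
# Approximate size of each variant relative to F16, using the
# percentages listed for the quantization variants (assumed to be
# file size relative to F16).
SIZE_VS_F16 = {
    "Q2_K": 0.25,
    "Q3_K_M": 0.30,
    "Q4_K_M": 0.40,
    "Q5_K_M": 0.50,
    "Q6_K": 0.60,
    "Q8_0": 0.75,
    "F16": 1.00,
}


def estimate_size_gb(params_billions, variant):
    """Estimate file size in GB for a model with the given parameter count."""
    f16_gb = params_billions * 2.0  # 2 bytes per parameter at F16
    return f16_gb * SIZE_VS_F16[variant]


def largest_variant_that_fits(params_billions, budget_gb):
    """Return the highest-precision variant within budget_gb, or None."""
    # Walk from largest (highest fidelity) to smallest.
    for variant in sorted(SIZE_VS_F16, key=SIZE_VS_F16.get, reverse=True):
        if estimate_size_gb(params_billions, variant) <= budget_gb:
            return variant
    return None


if __name__ == "__main__":
    print(f"{estimate_size_gb(7, 'Q4_K_M'):.1f} GB")  # prints "5.6 GB"
    print(largest_variant_that_fits(7, 6.0))          # prints "Q4_K_M"
```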

SafeTensors

SafeTensors is a secure, fast file format developed by HuggingFace for storing model tensors. OMM uses SafeTensors with the rvllm engine for full-precision inference and advanced features.
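
Part of what makes the format safe to load is that a SafeTensors file holds only a JSON header and raw tensor bytes, with no pickled Python objects to execute. The sketch below reads that header using only the standard library: an 8-byte little-endian length, followed by that many bytes of JSON describing each tensor's dtype, shape, and byte offsets. The helper names are hypothetical, and this is not how rvllm loads models.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the JSON header of a SafeTensors file as a dict."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian uint64 giving the header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        # Next header_len bytes: UTF-8 JSON mapping tensor names to
        # {"dtype": ..., "shape": [...], "data_offsets": [start, end]}.
        return json.loads(f.read(header_len))


def tensor_nbytes(header):
    """Map each tensor name to its size in bytes, from data_offsets."""
    sizes = {}
    for name, info in header.items():
        if name == "__metadata__":  # optional free-form metadata entry
            continue
        start, end = info["data_offsets"]
        sizes[name] = end - start
    return sizes
```

Summing the values returned by tensor_nbytes gives the total weight size without reading any tensor data.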

Advantages

  • Full precision: No quality loss from quantization
  • Tensor parallelism: Split model across multiple GPUs
  • Continuous batching: Handle concurrent requests efficiently
  • Speculative decoding: Use a draft model for faster generation
  • KV cache quantization: Reduce memory without reducing model quality

Format Comparison

  Feature               GGUF                SafeTensors
  Engine                llama.cpp           rvllm
  Quantization          Built-in (Q2-Q8)    None (full precision)
  File size             Small (quantized)   Large (full weights)
  Loading time          Instant (mmap)      Slower (tensor load)
  Multi-GPU             Limited             Full tensor parallelism
  Continuous batching   No                  Yes
  Hardware              CPU + GPU           GPU only (CUDA required)
  Minimum RAM           2 GB                16 GB + GPU VRAM

Choosing a Format

  1. Running on a laptop or single GPU with limited VRAM

     Use GGUF with llama.cpp. Choose Q4_K_M or Q5_K_M quantization.

  2. Running on a server with multiple GPUs

     Use SafeTensors with rvllm. Enable tensor parallelism across your GPUs.

  3. Need maximum quality regardless of resources

     Use GGUF F16 or SafeTensors. Both provide full-precision inference.

  4. Serving concurrent users

     Use SafeTensors with rvllm continuous batching for best throughput.
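
The four cases above can be sketched as a small decision helper. This is only an illustration: the function, its parameters, and the order in which the rules are checked are assumptions, not part of OMM. It also assumes that rvllm is available only when a CUDA GPU is present.

```python
from dataclasses import dataclass


@dataclass
class Recommendation:
    format: str
    engine: str
    note: str


def choose_format(gpu_count, has_cuda, concurrent_users=False, max_quality=False):
    """Suggest a format and engine following the four cases above.

    The priority order (multi-GPU or concurrency first, then quality,
    then the quantized default) is an assumption made for this sketch.
    """
    rvllm_ok = has_cuda and gpu_count >= 1  # rvllm is GPU only (CUDA)

    if rvllm_ok and gpu_count > 1:
        return Recommendation("SafeTensors", "rvllm", "tensor parallelism")
    if rvllm_ok and concurrent_users:
        return Recommendation("SafeTensors", "rvllm", "continuous batching")
    if max_quality:
        if rvllm_ok:
            return Recommendation("SafeTensors", "rvllm", "full precision")
        return Recommendation("GGUF", "llama.cpp", "F16")
    # Laptop or single GPU with limited VRAM.
    return Recommendation("GGUF", "llama.cpp", "Q4_K_M or Q5_K_M")


if __name__ == "__main__":
    print(choose_format(gpu_count=0, has_cuda=False))
    print(choose_format(gpu_count=4, has_cuda=True))
```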
