Ollama Optimization: Speed Up Local LLM Inference
Advanced tips for faster Ollama performance: GPU offloading, context-length tuning, and model selection strategies.
Table of Contents
1. Getting Started with Ollama
2. GPU Offloading Configuration
3. Context Length Optimization
4. Model Selection for Speed
5. Advanced Performance Tips
1. Getting Started with Ollama
Ollama provides the easiest way to run local LLMs with automatic GPU detection, model management, and an OpenAI-compatible API. These tips help you squeeze maximum performance from your hardware.
2. GPU Offloading Configuration
By default, Ollama automatically offloads model layers to the GPU. You can control this behavior with the OLLAMA_NUM_GPU environment variable.
Set OLLAMA_NUM_GPU=999 to force full GPU offload (if VRAM allows).
Set OLLAMA_NUM_GPU=0 to force CPU-only inference.
For a partial offload, set it to a specific layer count based on your available VRAM (see the sketch below).
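A minimal sketch of setting this variable from Python before starting the server, assuming the ollama binary is on your PATH and that your Ollama build honors OLLAMA_NUM_GPU as described above; in practice you would usually set it in your shell or service configuration and restart the existing server instead:

```python
import os
import subprocess

# Copy the current environment and request full GPU offload,
# assuming the server honors OLLAMA_NUM_GPU as described above.
env = os.environ.copy()
env["OLLAMA_NUM_GPU"] = "999"  # "0" forces CPU-only; a layer count gives a partial offload

# Start the Ollama server with the modified environment.
server = subprocess.Popen(["ollama", "serve"], env=env)
```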
3. Context Length Optimization
The default context window is often 2048 or 4096 tokens; longer contexts use more VRAM.
Set context in Modelfile: PARAMETER num_ctx 8192
Or via API: "options": {"num_ctx": 8192}
Rule of thumb: each additional 4K of context adds roughly 1 GB of VRAM for a 7B model.
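As a concrete illustration of the API form above, here is a short sketch that passes num_ctx per request to the local Ollama REST endpoint (default port 11434). The model tag llama3:8b is just an example, and the requests library is assumed to be installed:

```python
import requests

# Request a completion with an 8K context window,
# matching the "options" example above.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3:8b",          # example model tag; substitute your own
        "prompt": "Summarize the benefits of a larger context window.",
        "stream": False,
        "options": {"num_ctx": 8192},  # per-request context length
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```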
4. Model Selection for Speed
Smaller models are faster: Phi-3 and Llama 3 8B offer a strong balance of speed and quality.
Quantization matters: Q4_K_M is typically the fastest choice, while Q4_K_S is slightly smaller.
Consider specialized models: CodeLlama for code, Llama 3 for general tasks.
MoE models (such as Mixtral) need more VRAM but can be faster per token, since only a subset of experts is active for each token.
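A rough way to compare candidates on your own hardware is to time a short generation and compute tokens per second from the eval_count and eval_duration fields that Ollama returns. This is a hedged sketch, assuming the example model tags below have already been pulled:

```python
import requests

CANDIDATES = ["phi3", "llama3:8b", "codellama:7b"]  # example tags; use models you have pulled
PROMPT = "Explain what quantization does to a language model in two sentences."

for model in CANDIDATES:
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": PROMPT, "stream": False},
        timeout=600,
    )
    r.raise_for_status()
    data = r.json()
    # eval_count is the number of generated tokens; eval_duration is in nanoseconds.
    tokens_per_s = data["eval_count"] / (data["eval_duration"] / 1e9)
    print(f"{model}: {tokens_per_s:.1f} tokens/s")
```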
5. Advanced Performance Tips
Keep models loaded: OLLAMA_KEEP_ALIVE=-1 keeps models resident and prevents unloading between requests.
Use streaming: tokens appear as they are generated, which reduces the perceived time to first token (see the first sketch after this list).
Batch requests: sending multiple concurrent requests can improve overall throughput (see the second sketch after this list).
Monitor with ollama ps: check which models are loaded and how much VRAM they are using.
Update regularly: new Ollama releases frequently include performance improvements.
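To make the keep-alive and streaming tips concrete, here is a sketch that streams tokens from /api/generate and passes keep_alive: -1 as a per-request field (recent Ollama versions accept this alongside the OLLAMA_KEEP_ALIVE environment variable; treat exact behavior as version-dependent, and note the model tag is an example):

```python
import json
import requests

# Stream a generation so tokens print as soon as they are produced,
# and ask the server to keep the model loaded indefinitely (-1),
# mirroring OLLAMA_KEEP_ALIVE=-1 from the list above.
with requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3:8b",   # example model tag
        "prompt": "Write a haiku about GPUs.",
        "stream": True,
        "keep_alive": -1,
    },
    stream=True,
    timeout=300,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            print()
            break
```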
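And a sketch of simple request batching via client-side concurrency: several prompts sent in parallel from a thread pool, so the server can schedule them together. Actual throughput gains depend on your Ollama version, its concurrency settings, and available VRAM; the model tag and prompts are examples:

```python
import requests
from concurrent.futures import ThreadPoolExecutor

PROMPTS = [
    "Give one tip for reducing LLM latency.",
    "Name a common GGUF quantization level.",
    "What does num_ctx control in Ollama?",
]

def generate(prompt: str) -> str:
    # Plain non-streaming call; the server decides how to schedule concurrent requests.
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3:8b", "prompt": prompt, "stream": False},  # example model tag
        timeout=600,
    )
    r.raise_for_status()
    return r.json()["response"]

# Send all prompts at once from a small thread pool and print short previews.
with ThreadPoolExecutor(max_workers=len(PROMPTS)) as pool:
    for prompt, answer in zip(PROMPTS, pool.map(generate, PROMPTS)):
        print(f"- {prompt}\n  {answer.strip()[:80]}")
```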