
Ollama Optimization: Speed Up Local LLM Inference

Advanced tips for faster Ollama performance: GPU offloading, context length tuning, and model selection strategies.

By HardwareHQ Team · 6 min read · December 15, 2024

1. Getting Started with Ollama

Ollama provides the easiest way to run local LLMs with automatic GPU detection, model management, and an OpenAI-compatible API. These tips help you squeeze maximum performance from your hardware.

2. GPU Offloading Configuration

By default, Ollama automatically offloads as many model layers as will fit to the GPU. You can control this with the OLLAMA_NUM_GPU environment variable.

Set OLLAMA_NUM_GPU=999 to force full GPU offload (if VRAM allows).

Set OLLAMA_NUM_GPU=0 to force CPU-only inference.

For a partial offload, set it to a specific layer count based on your available VRAM.
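
If you would rather experiment per request than restart the server, a layer count can also be passed through the API's options object. A minimal sketch in Python, assuming a local Ollama server on the default port, a model tag that is already pulled, and that your Ollama version accepts the num_gpu option:

    import requests

    # Minimal sketch: send one request with an explicit GPU layer count.
    # 999 effectively means "offload every layer"; 0 forces CPU-only.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3:8b",          # assumed to be pulled already
            "prompt": "Explain KV caching in one sentence.",
            "stream": False,
            "options": {"num_gpu": 999},   # or 0, or a partial layer count
        },
        timeout=300,
    )
    resp.raise_for_status()
    print(resp.json()["response"])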

3. Context Length Optimization

The default context window is often 2048 or 4096 tokens. Longer contexts consume more VRAM.

Set the context length in a Modelfile: PARAMETER num_ctx 8192

Or via API: "options": {"num_ctx": 8192}

Rule of thumb: each additional 4K of context adds roughly 1 GB of VRAM for 7B models.
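
To connect the rule of thumb to an actual request, here is a minimal sketch that estimates the extra VRAM for a larger window and then sends a request with num_ctx raised to 8192. The model tag is a placeholder, and the ~1 GB per 4K figure is the approximation above, not a measured value:

    import requests

    def extra_vram_gb(num_ctx: int, tokens_per_gb: int = 4096) -> float:
        """Rough extra-VRAM estimate for a 7B model, per the rule of thumb above."""
        return num_ctx / tokens_per_gb

    print(f"~{extra_vram_gb(8192):.1f} GB additional VRAM for an 8K context")

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3:8b",            # placeholder model tag
            "prompt": "Summarize the discussion so far.",
            "stream": False,
            "options": {"num_ctx": 8192},    # enlarged context window
        },
        timeout=300,
    )
    print(resp.json()["response"])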

4. Model Selection for Speed

Smaller models are faster: Phi-3 and Llama 3 8B offer a strong speed/quality balance (see the timing sketch after this list).

Quantization matters: Q4_K_M is typically the fastest choice, while Q4_K_S is slightly smaller.

Consider specialized models: CodeLlama for code, Llama 3 for general tasks.

Mixture-of-experts (MoE) models such as Mixtral need more VRAM but can be faster per token, since only a fraction of their parameters is active for each token.
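
The quickest way to settle the speed question is to time a few candidates on your own hardware. A minimal sketch, assuming the listed model tags are already pulled and that your Ollama version reports the documented eval_count and eval_duration response fields:

    import requests

    CANDIDATES = ["phi3:mini", "llama3:8b", "codellama:7b"]  # assumed to be pulled
    PROMPT = "Write a haiku about GPUs."

    for model in CANDIDATES:
        r = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": PROMPT, "stream": False},
            timeout=600,
        )
        data = r.json()
        # eval_duration is reported in nanoseconds; guard against division by zero.
        tokens_per_s = data["eval_count"] / max(data["eval_duration"], 1) * 1e9
        print(f"{model:>15}: {tokens_per_s:6.1f} tokens/s")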

5. Advanced Performance Tips

Keep models loaded: OLLAMA_KEEP_ALIVE=-1 prevents unloading between requests.

Use streaming: tokens appear as they are generated, which reduces the perceived time to first token (see the sketch after this list).

Batch requests: Multiple concurrent requests can improve throughput.

Monitor with ollama ps: Check which models are loaded and VRAM usage.

Update regularly: new Ollama releases frequently include performance improvements.
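
Here is a minimal sketch combining two of the tips above: it streams tokens as they arrive and pins the model in memory with a per-request keep_alive of -1, which has the same effect as the OLLAMA_KEEP_ALIVE setting. The model tag is a placeholder:

    import json
    import requests

    # Stream a response token by token and keep the model loaded afterwards.
    with requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3:8b",   # placeholder model tag
            "prompt": "List three ways to reduce LLM latency.",
            "stream": True,         # the server sends one JSON object per line
            "keep_alive": -1,       # keep the model in memory indefinitely
        },
        stream=True,
        timeout=600,
    ) as resp:
        for line in resp.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            print(chunk.get("response", ""), end="", flush=True)
            if chunk.get("done"):
                print()  # the final chunk also carries timing statistics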
