Want to run large models on your laptop? Quantization makes models smaller and faster while keeping most of their quality. This guide explains how.
Coming from Software Engineering? Quantization is like image compression (JPEG vs. PNG) or video encoding (bitrate settings) applied to neural network weights. You're trading precision for size/speed — going from 32-bit floats to 8-bit or 4-bit integers, just like going from lossless to lossy compression. The quality-vs-size tradeoff curves behave similarly: the first rounds of compression are nearly free, but aggressive compression eventually degrades output quality noticeably.
A model is, under the hood, just a giant list of numbers (called weights or parameters): a 7B model has 7 billion of them, each stored as a 32-bit float by default. Quantization stores each number in fewer bits, the same way you would downcast a double to a float, or a float to an int8, to save memory.
This is a two-part day. Part 1 covers quantization formats (GGUF / AWQ / GPTQ) and choosing one for your hardware. Part 2 ("Swapping OpenAI for Local Models") shows how to point existing code at a local model with minimal changes. Part 1 is the concepts; Part 2 is the migration recipe.
What is Quantization?
This expands on the quantization teaser from Day 074 — here we go deeper on formats and choosing one.
Quantization reduces precision of model weights:
- FP32: 32 bits per parameter (full precision)
- FP16: 16 bits per parameter (half precision)
- INT8: 8 bits per parameter (1/4 size)
- INT4: 4 bits per parameter (1/8 size)
Fewer bits means each number is rounded more coarsely — like storing 3.14159 as 3.14, or as just 3. Do that across billions of numbers and the model's answers drift slightly. That rounding error is the whole quality-vs-size tradeoff.
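To make the rounding concrete, here is a minimal sketch of symmetric quantization in plain Python. The helper names (quantize, dequantize, max_rounding_error) and the sample weights are made up for illustration, and the sketch assumes at least one non-zero weight. Real formats use per-block scales and more elaborate schemes, so treat this as the idea only.

```python
def quantize(values, bits):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1          # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) for v in values], scale


def dequantize(ints, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in ints]


def max_rounding_error(values, bits):
    """Largest gap between an original value and its quantized round trip."""
    ints, scale = quantize(values, bits)
    restored = dequantize(ints, scale)
    return max(abs(v - r) for v, r in zip(values, restored))


weights = [0.8, -0.33, 0.05, 0.61, -0.9]   # made-up sample weights

for bits in (8, 4):
    ints, _ = quantize(weights, bits)
    print(f"{bits}-bit: ints={ints} worst error={max_rounding_error(weights, bits):.4f}")
```

Fewer bits means fewer available integer levels, so the worst-case rounding error grows as the bit-width shrinks.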
Why Quantize?
| Aspect | Original (FP32) | Quantized (4-bit) |
|---|---|---|
| Size | 28 GB | 4 GB |
| RAM needed | 32+ GB | 6 GB |
| Speed | Slow | Fast |
| Quality | Best | Very close for most tasks (measure your own prompts) |
Trade-offs:
- Smaller = fits on consumer hardware
- Faster = better inference speed
- Slight quality loss = usually acceptable
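The sizes in the table follow from simple arithmetic: parameters times bits per parameter, divided by 8 to get bytes. A small sketch, where weight_size_gb is a hypothetical helper and 1 GB is taken as 1e9 bytes:

```python
def weight_size_gb(n_params: float, bits_per_param: float) -> float:
    """Raw weight storage in GB (1 GB = 1e9 bytes); ignores metadata and runtime overhead."""
    return n_params * bits_per_param / 8 / 1e9


for label, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: {weight_size_gb(7e9, bits):.1f} GB")
```

For a 7B model this gives 28 GB at FP32 and 3.5 GB at a flat 4 bits. Real 4-bit files tend to come out somewhat larger than that, since they also store rounding scales and may keep some tensors at higher precision.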
GGUF Format
GGUF (GPT-Generated Unified Format) is the standard for llama.cpp:
Understanding GGUF Naming
llama-2-7b-chat.Q4_K_M.gguf
                │  │ │
                │  │ └─ M = Medium (size variant)
                │  └─── K = K-quant method
                └────── Q4 = 4-bit quantization
The K (K-quant) just means a newer, smarter rounding scheme than the original method — you rarely need to care beyond picking a row from the table below. The S/M/L suffix trades a little more size for a little more quality.
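If you want to read these labels programmatically, a regular expression is enough for the common cases. This is a sketch only: parse_quant_tag and QuantTag are hypothetical names, and the pattern covers tags shaped like Q4_K_M, Q6_K, and Q8_0, not every naming variant in the wild.

```python
import re
from typing import NamedTuple, Optional


class QuantTag(NamedTuple):
    bits: int
    method: str             # "K" for K-quants, "0" or "1" for the older schemes
    variant: Optional[str]  # "S", "M", "L", or None when the tag has no suffix


_TAG = re.compile(r"Q(\d)_(K|0|1)(?:_([SML]))?", re.IGNORECASE)


def parse_quant_tag(name: str) -> Optional[QuantTag]:
    """Extract the quantization tag from a GGUF file name or an Ollama tag."""
    match = _TAG.search(name)
    if match is None:
        return None
    bits, method, variant = match.groups()
    return QuantTag(int(bits), method.upper(), variant.upper() if variant else None)


print(parse_quant_tag("llama-2-7b-chat.Q4_K_M.gguf"))
print(parse_quant_tag("model.Q8_0.gguf"))
```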
Quantization Levels
| Name | Bits | Size (7B) | Quality | Use Case |
|---|---|---|---|---|
| Q2_K | 2-bit | ~2.5 GB | Low | Extreme constraints |
| Q3_K_S | 3-bit | ~3 GB | Fair | Very limited RAM |
| Q4_K_S | 4-bit | ~4 GB | Good | Balanced |
| Q4_K_M | 4-bit | ~4.5 GB | Better | Recommended |
| Q5_K_S | 5-bit | ~5 GB | Great | Quality focus |
| Q5_K_M | 5-bit | ~5.5 GB | Excellent | Best balance |
| Q6_K | 6-bit | ~6 GB | Near FP16 | Quality priority |
| Q8_0 | 8-bit | ~7.5 GB | ~FP16 | Maximum quality |
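One way to use this table is to pick the largest level that fits in the RAM you can spare. The sketch below copies the approximate 7B sizes from the table; pick_quant and its headroom_gb default are assumptions for illustration, not measured values.

```python
# Approximate 7B file sizes in GB, copied from the table above (smallest to largest).
QUANT_SIZES_GB = [
    ("Q2_K", 2.5), ("Q3_K_S", 3.0), ("Q4_K_S", 4.0), ("Q4_K_M", 4.5),
    ("Q5_K_S", 5.0), ("Q5_K_M", 5.5), ("Q6_K", 6.0), ("Q8_0", 7.5),
]


def pick_quant(free_ram_gb: float, headroom_gb: float = 1.5):
    """Return the largest quant whose file fits in RAM after reserving headroom.

    headroom_gb is an assumed allowance for the context cache and runtime
    buffers; it is a guess to tune, not a measured figure.
    """
    budget = free_ram_gb - headroom_gb
    fitting = [name for name, size in QUANT_SIZES_GB if size <= budget]
    return fitting[-1] if fitting else None


for ram in (3, 6, 8, 16):
    print(f"{ram} GB free -> {pick_quant(ram)}")
```

Larger context windows need more headroom, so raise headroom_gb if you increase n_ctx.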
Using GGUF with Ollama
# Ollama automatically uses GGUF
ollama pull llama3.1:8b-instruct-q4_K_M
# Or specify quantization
ollama run llama3.1:8b-instruct-q5_K_M
Using GGUF with llama.cpp
# Install llama.cpp (current checkouts build with CMake; the old `make` build was removed)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
# Download GGUF model
wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf
# Run inference (the CLI binary was renamed from `main` to `llama-cli`; CMake puts it in build/bin)
./build/bin/llama-cli -m llama-2-7b-chat.Q4_K_M.gguf -p "Hello, how are you?"
Using with Python
# script_id: day_075_quantization_and_swapping_models/gguf_llama_cpp
from llama_cpp import Llama
# Load quantized model
llm = Llama(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",
    n_ctx=2048,   # Context window
    n_threads=4   # CPU threads
)
# Generate
output = llm(
    "What is machine learning?",
    max_tokens=256,
    temperature=0.7
)
print(output["choices"][0]["text"])
Heads up: the AWQ and GPTQ examples below need an NVIDIA GPU; .to("cuda") will fail on a CPU-only or Apple-Silicon laptop. On a laptop, stick with the GGUF / Ollama path above, which runs on CPU.
AWQ Quantization
AWQ (Activation-aware Weight Quantization) preserves important weights:
Not all of a model's numbers matter equally — a small fraction carry most of the weight on output quality. AWQ runs a sample of real prompts through the model to spot which numbers are most-used, keeps those at higher precision, and rounds the rest aggressively. Result: better quality at the same average bit-width.
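The idea can be illustrated with a toy calculation. This is not the AutoAWQ algorithm: the weights and activation magnitudes are made up, weighted_error is a hypothetical helper, and one shared scale is used throughout so the two cases are directly comparable. As far as we know, production AWQ protects important channels by rescaling them before rounding rather than by storing mixed precision, so read this purely as intuition.

```python
def fake_quantize(w, scale):
    """Round a weight to the nearest multiple of scale (simulated quantization)."""
    return round(w / scale) * scale


def weighted_error(weights, activations, protected=frozenset(), bits=4):
    """Sum of |activation| * |rounding error|, skipping protected indices."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax   # one shared scale, for simplicity
    total = 0.0
    for i, (w, x) in enumerate(zip(weights, activations)):
        restored = w if i in protected else fake_quantize(w, scale)
        total += abs(x) * abs(w - restored)
    return total


weights = [0.31, -0.74, 0.05, 0.52]     # made-up weights
activations = [8.0, 0.1, 0.2, 0.3]      # made-up average input magnitudes

# "Calibration": find the input channel with the largest typical activation.
salient = max(range(len(weights)), key=lambda i: abs(activations[i]))

uniform = weighted_error(weights, activations)
aware = weighted_error(weights, activations, protected={salient})
print(f"uniform 4-bit error: {uniform:.4f}")
print(f"salient weight kept: {aware:.4f}")
```

Because the heavily used channel multiplies its rounding error by a large activation, protecting that single weight removes most of the output error in this example.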
Why AWQ?
- Smarter quantization than uniform methods
- Better quality at same bit-width
- GPU optimized (faster than GGUF on GPU)
Using AWQ Models
# Install dependencies
pip install autoawq transformers
# Or with vLLM
pip install vllm
Note: the standalone autoawq package is in maintenance mode and no longer actively developed. The code below still runs, but for new work prefer loading/serving AWQ checkpoints via vLLM (shown next), which is the maintained path. Verify the current state at the autoawq and vLLM repos before relying on it.
# script_id: day_075_quantization_and_swapping_models/awq_load_generate
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
# Load AWQ model
model_path = "TheBloke/Llama-2-7B-Chat-AWQ"
model = AutoAWQForCausalLM.from_quantized(
    model_path,
    fuse_layers=True,
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Generate
prompt = "What is quantum computing?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,   # temperature only takes effect when sampling is enabled
    temperature=0.7
)
print(tokenizer.decode(outputs[0]))
AWQ with vLLM
# script_id: day_075_quantization_and_swapping_models/awq_with_vllm
from vllm import LLM, SamplingParams
# Load AWQ model with vLLM
llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-AWQ",
    quantization="awq",
    dtype="half"
)
# Generate
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=256
)
outputs = llm.generate(["What is AI?"], sampling_params)
print(outputs[0].outputs[0].text)
GPTQ Quantization
Another popular GPU-focused quantization:
GPTQ is an older, very widely supported GPU quantization method. Quality is similar to AWQ — AWQ is usually a touch better and faster — but GPTQ has been around longer, so more tools and model repos ship GPTQ builds. That broader support is the "compatibility" the decision tree means.
# script_id: day_075_quantization_and_swapping_models/gptq_usage
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load GPTQ model
model_name = "TheBloke/Llama-2-7B-Chat-GPTQ"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Generate
inputs = tokenizer("Hello!", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))
Choosing the Right Format
| Format | Best For | Hardware |
|---|---|---|
| GGUF | CPU or mixed | Any |
| AWQ | GPU inference | NVIDIA GPU |
| GPTQ | GPU inference | NVIDIA GPU |
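The table reduces to a two-question decision. A minimal sketch, where choose_format and its awq_build_available flag are hypothetical names, and the AWQ-over-GPTQ preference follows the comparison given in the GPTQ section:

```python
def choose_format(has_nvidia_gpu: bool, awq_build_available: bool = True) -> str:
    """Pick a quantization format following the table above."""
    if not has_nvidia_gpu:
        return "GGUF"   # CPU-only or Apple Silicon: llama.cpp / Ollama path
    # On an NVIDIA GPU, prefer AWQ and fall back to GPTQ when no AWQ build exists.
    return "AWQ" if awq_build_available else "GPTQ"


print(choose_format(has_nvidia_gpu=False))
print(choose_format(has_nvidia_gpu=True))
print(choose_format(has_nvidia_gpu=True, awq_build_available=False))
```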
Quantizing Your Own Models
Convert to GGUF
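# Clone and build llama.cpp (CMake build; binaries land in build/bin)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release
pip install -r requirements.txt  # Python dependencies for the convert script
# Convert HuggingFace model to GGUF
# (the script was renamed from convert.py to convert_hf_to_gguf.py)
python convert_hf_to_gguf.py /path/to/model --outtype f16 --outfile model-f16.gguf
# Quantize to desired level (binary renamed from `quantize` to `llama-quantize`)
./build/bin/llama-quantize model-f16.gguf model-q4_K_M.gguf Q4_K_M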
Create AWQ Model
# script_id: day_075_quantization_and_swapping_models/create_awq_model
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = "meta-llama/Llama-3.1-8B-Instruct"
quant_path = "llama-3.1-8b-awq"
# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Quantize
quant_config = {
    "zero_point": True,    # allow an offset so the rounded range need not be centered on 0
    "q_group_size": 128,   # how many numbers share one rounding scale; 128 is the common default
    "w_bit": 4             # target precision: 4-bit
}
model.quantize(tokenizer, quant_config=quant_config)
# Save
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
These defaults match most published AWQ models — change them only if you re-quantize and measure. Use any current AWQ/GPTQ repo for your model — these IDs are examples; verify the repo exists on Hugging Face before pulling.
Quality Comparison
# script_id: day_075_quantization_and_swapping_models/compare_quantizations
# fragment: illustrative comparison helper / not standalone-runnable
from difflib import SequenceMatcher

def compare_quantizations(prompt: str, original_model, quantized_model):
    """Compare output quality between models."""
    # Original
    original_output = original_model.generate(prompt)
    # Quantized
    quantized_output = quantized_model.generate(prompt)
    print("Original:", original_output[:200])
    print("Quantized:", quantized_output[:200])
    # Simple similarity check
    similarity = SequenceMatcher(None, original_output, quantized_output).ratio()
    print(f"Similarity: {similarity:.2%}")
Checkpoint
Load the Q4_K_M GGUF with Llama(model_path=...) from the Python example above, generate a few sentences, and note the process RAM use versus the original FP16 file size. Expect roughly a quarter to a third, since Q4_K_M averages a little over 4 bits per weight and the runtime adds its own buffers. If the load fails, check the file path points to the quantized .gguf and that llama-cpp-python is installed.
Summary
Quick Reference
# Ollama with quantization
ollama pull llama3.1:8b-instruct-q4_K_M
# GGUF quantization levels
Q4_K_M # Balanced (recommended)
Q5_K_M # Better quality
Q8_0 # Maximum quality
# script_id: day_075_quantization_and_swapping_models/quantization_quick_ref
# GGUF with llama-cpp-python
llm = Llama(model_path="model.Q4_K_M.gguf")
# AWQ with AutoAWQ
model = AutoAWQForCausalLM.from_quantized("model-awq")
# GPTQ with transformers
model = AutoModelForCausalLM.from_pretrained("model-gptq")
Exercises
- Read the labels. Given the tags Q4_K_M, Q5_K_M, and Q8_0, rank them by file size and by expected quality, and explain the tradeoff in one sentence each.
- Pick a format for hardware. You have (a) a CPU-only laptop and (b) a single 24GB GPU. Choose GGUF/AWQ/GPTQ for each and justify it.
- Quantize your own. Take a Hugging Face model and produce a Q4_K_M GGUF using llama.cpp (convert_hf_to_gguf.py then llama-quantize). Note the size before/after.
- Measure the cost of compression. Run the same 5 prompts through the full-precision and the Q4 version; record any quality regressions you can spot.
Solutions (approaches)
- Size: Q4_K_M < Q5_K_M < Q8_0; quality is the reverse. Lower bits = smaller + faster but more rounding error.
- CPU-only → GGUF (built for CPU via llama.cpp); 24GB GPU → AWQ or GPTQ (GPU-optimized, activation-aware AWQ usually edges quality).
- python convert_hf_to_gguf.py ./model --outfile model.f16.gguf then llama-quantize model.f16.gguf model.Q4_K_M.gguf Q4_K_M; compare the two files with ls -lh.
- Loop the prompts through both, diff outputs; watch for degraded reasoning/formatting on the quantized one.
Part 2 — Swapping OpenAI for Local Models
You've built with OpenAI. Now run locally for free! This part shows you how to swap in local models with minimal code changes.
Why Swap to Local?
Benefits:
- Cost savings: No per-token charges
- Privacy: Data never leaves your machine
- Offline: Works without internet
- Latency: No network round trip (actual generation speed depends on your hardware)
The OpenAI-Compatible Interface
Most local tools support OpenAI's API format:
# script_id: day_075_quantization_and_swapping_models/openai_compatible_interface
# OpenAI original
from openai import OpenAI
client = OpenAI()
# Ollama (same interface!)
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required but not used
)
# The rest of your code stays the same!
response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Hello!"}]
)
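Under the SDK, the swap works because both servers accept the same JSON body at a /chat/completions path. The sketch below builds, but does not send, the two HTTP requests using only the standard library. build_chat_request is a hypothetical helper and the API key is a placeholder.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str, messages: list):
    """Build (but do not send) a chat-completions request."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


messages = [{"role": "user", "content": "Hello!"}]
hosted = build_chat_request("https://api.openai.com/v1", "sk-placeholder", "gpt-4o-mini", messages)
local = build_chat_request("http://localhost:11434/v1", "ollama", "llama3.1:8b", messages)

print(hosted.full_url)
print(local.full_url)
# To send it for real: urllib.request.urlopen(local), with the Ollama server running.
```

Only the host, the key, and the model name differ between the two requests; the message payload is identical.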
Method 1: Environment Variable Switch
# script_id: day_075_quantization_and_swapping_models/env_variable_switch
import os
from openai import OpenAI
def get_client():
    """Get appropriate client based on environment."""
    provider = os.environ.get("LLM_PROVIDER", "openai")
    if provider == "ollama":
        return OpenAI(
            base_url="http://localhost:11434/v1",
            api_key="ollama"
        )
    elif provider == "local":
        return OpenAI(
            base_url="http://localhost:8000/v1",
            api_key="local"
        )
    else:
        return OpenAI()  # Default to OpenAI

# Usage
client = get_client()
Set environment:
# Use OpenAI
export LLM_PROVIDER=openai
# Use Ollama
export LLM_PROVIDER=ollama
# Use local vLLM
export LLM_PROVIDER=local
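Note that get_client() falls through to OpenAI for any value it does not recognise, so a typo such as LLM_PROVIDER=olama would quietly send traffic to the hosted API. One possible guard, sketched with a hypothetical resolve_provider helper and an assumed list of known providers, is to validate the variable before building the client:

```python
import os

KNOWN_PROVIDERS = {"openai", "ollama", "local"}   # assumed to match get_client()


def resolve_provider(env=os.environ) -> str:
    """Read LLM_PROVIDER, normalise it, and reject values we do not recognise."""
    raw = env.get("LLM_PROVIDER", "openai")
    provider = raw.strip().lower()
    if provider not in KNOWN_PROVIDERS:
        raise ValueError(
            f"Unknown LLM_PROVIDER={raw!r}; expected one of {sorted(KNOWN_PROVIDERS)}"
        )
    return provider


# Usage: call resolve_provider() at startup and branch on the returned name.
print(resolve_provider({"LLM_PROVIDER": "Ollama"}))
```

Passing the environment as an argument keeps the function easy to test without touching real environment variables.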
Method 2: Configuration-Based
# script_id: day_075_quantization_and_swapping_models/config_based_client
from dataclasses import dataclass
from openai import OpenAI
from typing import Optional

@dataclass
class LLMConfig:
    provider: str = "openai"
    model: str = "gpt-4o-mini"
    base_url: Optional[str] = None
    api_key: Optional[str] = None

# Preset configurations
CONFIGS = {
    "openai": LLMConfig(
        provider="openai",
        model="gpt-4o-mini"
    ),
    "ollama-llama3.1": LLMConfig(
        provider="ollama",
        model="llama3.1:8b",
        base_url="http://localhost:11434/v1",
        api_key="ollama"
    ),
    "ollama-mistral": LLMConfig(
        provider="ollama",
        model="mistral",
        base_url="http://localhost:11434/v1",
        api_key="ollama"
    ),
    "local-vllm": LLMConfig(
        provider="vllm",
        model="meta-llama/Llama-2-7b-chat-hf",
        base_url="http://localhost:8000/v1",
        api_key="token"
    )
}

class LLMClient:
    """Unified LLM client supporting multiple providers."""

    def __init__(self, config_name: str = "openai"):
        self.config = CONFIGS[config_name]
        self._setup_client()

    def _setup_client(self):
        if self.config.base_url:
            self.client = OpenAI(
                base_url=self.config.base_url,
                api_key=self.config.api_key
            )
        else:
            self.client = OpenAI()

    def chat(self, messages: list, **kwargs) -> str:
        response = self.client.chat.completions.create(
            model=self.config.model,
            messages=messages,
            **kwargs
        )
        return response.choices[0].message.content

# Usage
# Easy to switch!
client = LLMClient("ollama-llama3.1")
response = client.chat([{"role": "user", "content": "Hello!"}])
Method 3: Drop-In Replacement
Create a wrapper that works like OpenAI:
# script_id: day_075_quantization_and_swapping_models/universal_llm
from openai import OpenAI
import os

class UniversalLLM:
    """Drop-in replacement for OpenAI client."""

    def __init__(self, provider: str = None):
        provider = provider or os.environ.get("LLM_PROVIDER", "openai")
        self.provider = provider
        self.client = self._create_client()
        self.model_map = self._get_model_map()

    def _create_client(self) -> OpenAI:
        configs = {
            "openai": {"base_url": None, "api_key": None},
            "ollama": {"base_url": "http://localhost:11434/v1", "api_key": "ollama"},
            "vllm": {"base_url": "http://localhost:8000/v1", "api_key": "token"},
        }
        config = configs.get(self.provider, configs["openai"])
        if config["base_url"]:
            return OpenAI(base_url=config["base_url"], api_key=config["api_key"])
        return OpenAI()

    def _get_model_map(self) -> dict:
        """Map OpenAI model names to local equivalents."""
        return {
            "openai": {
                "gpt-4o-mini": "gpt-4o-mini",
                "gpt-4o": "gpt-4o",
            },
            "ollama": {
                "gpt-4o-mini": "llama3.1:8b",
                "gpt-4o": "mixtral",
            },
            "vllm": {
                "gpt-4o-mini": "meta-llama/Llama-3.1-8B-Instruct",
                "gpt-4o": "meta-llama/Llama-2-70b-chat-hf",
            }
        }

    def _map_model(self, model: str) -> str:
        """Map requested model to provider's model."""
        return self.model_map.get(self.provider, {}).get(model, model)

    def chat_completion(self, model: str, messages: list, **kwargs) -> str:
        """Create chat completion - same interface as OpenAI."""
        actual_model = self._map_model(model)
        response = self.client.chat.completions.create(
            model=actual_model,
            messages=messages,
            **kwargs
        )
        return response.choices[0].message.content

# Usage - exactly like OpenAI!
llm = UniversalLLM("ollama")
# This works regardless of provider
response = llm.chat_completion(
    model="gpt-4o-mini",  # Automatically mapped to llama3.1:8b
    messages=[{"role": "user", "content": "Hello!"}]
)
LangChain Provider Swapping
# script_id: day_075_quantization_and_swapping_models/langchain_swap
from langchain_openai import ChatOpenAI
from langchain_ollama import ChatOllama
def get_langchain_llm(provider: str = "openai"):
    """Get LangChain LLM based on provider."""
    if provider == "ollama":
        return ChatOllama(model="llama3.1:8b")
    elif provider == "local":
        return ChatOpenAI(
            base_url="http://localhost:8000/v1",
            api_key="not-needed",
            model="local-model"
        )
    else:
        return ChatOpenAI()

# Usage
llm = get_langchain_llm("ollama")
response = llm.invoke("Hello!")
Model Quality Mapping
Choose appropriate local models:
# script_id: day_075_quantization_and_swapping_models/model_quality_map
MODEL_QUALITY_MAP = {
    # OpenAI -> Local equivalents by capability
    "gpt-4o-mini": {
        "ollama": "llama3.1:8b",
        "alternatives": ["mistral:7b", "neural-chat"],
        "notes": "Good general purpose"
    },
    "gpt-4o": {
        "ollama": "mixtral:8x7b",
        "alternatives": ["llama3:70b", "deepseek-coder:33b"],
        "notes": "Requires more resources"
    }
}

def suggest_local_model(openai_model: str) -> dict:
    """Suggest local model equivalent."""
    return MODEL_QUALITY_MAP.get(openai_model, {
        "ollama": "llama3.1:8b",
        "notes": "Default fallback"
    })

# Usage
suggestion = suggest_local_model("gpt-4o-mini")
print(f"Use: {suggestion['ollama']}")
Handling Differences
Local models may behave differently:
# script_id: day_075_quantization_and_swapping_models/adaptive_llm
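from openai import OpenAI

class AdaptiveLLM:
    """LLM client that adapts to provider differences."""

    # (base_url, api_key, model) per local provider, reusing the values from
    # the earlier examples; adjust them to your own setup.
    ENDPOINTS = {
        "ollama": ("http://localhost:11434/v1", "ollama", "llama3.1:8b"),
        "vllm": ("http://localhost:8000/v1", "token", "meta-llama/Llama-3.1-8B-Instruct"),
    }

    def __init__(self, provider: str):
        self.provider = provider
        self.client = self._create_client()

    def _create_client(self) -> OpenAI:
        """Build a client for the local endpoint, or the default OpenAI client."""
        if self.provider in self.ENDPOINTS:
            base_url, api_key, _ = self.ENDPOINTS[self.provider]
            return OpenAI(base_url=base_url, api_key=api_key)
        return OpenAI()

    def _get_model(self) -> str:
        """Model name to request from the active provider."""
        if self.provider in self.ENDPOINTS:
            return self.ENDPOINTS[self.provider][2]
        return "gpt-4o-mini"

    def generate(self, prompt: str, **kwargs) -> str:
        # Adapt parameters for local models
        if self.provider in ["ollama", "vllm"]:
            # Local models may need different defaults
            kwargs.setdefault("temperature", 0.7)  # Often need higher temp
            kwargs.setdefault("max_tokens", 512)   # Limit for speed
            # Some features not supported
            kwargs.pop("response_format", None)  # JSON mode may not work
            kwargs.pop("functions", None)        # Function calling limited
        messages = [{"role": "user", "content": prompt}]
        try:
            response = self.client.chat.completions.create(
                model=self._get_model(),
                messages=messages,
                **kwargs
            )
            return response.choices[0].message.content
        except Exception as e:
            # Fallback behavior
            print(f"Error with {self.provider}: {e}")
            return self._fallback_generate(prompt)

    def _fallback_generate(self, prompt: str) -> str:
        """Fallback to simpler generation if features fail."""
        response = self.client.chat.completions.create(
            model=self._get_model(),
            messages=[{"role": "user", "content": prompt}],
            max_tokens=256
        )
        return response.choices[0].message.content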
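The parameter adaptation inside generate() can also be pulled out into a pure function, which makes it testable without a running server. This is a sketch: adapt_kwargs is a hypothetical helper, and the unsupported-parameter list simply mirrors the class above rather than any provider's documented limits.

```python
UNSUPPORTED_LOCALLY = ("response_format", "functions")   # assumed, mirrors the class above
LOCAL_DEFAULTS = {"temperature": 0.7, "max_tokens": 512}


def adapt_kwargs(provider: str, kwargs: dict) -> dict:
    """Return a new kwargs dict adjusted for the provider; the input is not mutated."""
    if provider not in ("ollama", "vllm"):
        return dict(kwargs)
    adapted = {k: v for k, v in kwargs.items() if k not in UNSUPPORTED_LOCALLY}
    for key, value in LOCAL_DEFAULTS.items():
        adapted.setdefault(key, value)   # caller-supplied values win over defaults
    return adapted


requested = {"temperature": 0.2, "response_format": {"type": "json_object"}}
print(adapt_kwargs("ollama", requested))
print(adapt_kwargs("openai", requested))
```

Returning a copy means the caller's dict still holds the original options if the request later falls back to a hosted provider.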
Testing Your Swap
# script_id: day_075_quantization_and_swapping_models/universal_llm
def test_provider_swap():
    """Test that local model produces reasonable output."""
    providers = ["openai", "ollama"]
    test_prompt = "What is 2 + 2? Reply with just the number."
    results = {}
    for provider in providers:
        try:
            client = UniversalLLM(provider)
            response = client.chat_completion(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": test_prompt}]
            )
            results[provider] = {
                "success": True,
                "response": response,
                "contains_4": "4" in response
            }
        except Exception as e:
            results[provider] = {
                "success": False,
                "error": str(e)
            }
    # Compare results
    print("Provider Test Results:")
    for provider, result in results.items():
        status = "✅" if result.get("success") else "❌"
        print(f"  {status} {provider}: {result}")

test_provider_swap()
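The "4" in response check is loose: a reply of 14 or 40 would also pass. A slightly stricter heuristic, sketched here with a hypothetical answer_is helper, is to compare the last integer in the reply against the expected value. It is still a heuristic and will misjudge replies that mention other numbers after the answer.

```python
import re


def answer_is(response: str, expected: int) -> bool:
    """True when the last integer mentioned in the response equals expected."""
    numbers = re.findall(r"-?\d+", response)
    return bool(numbers) and int(numbers[-1]) == expected


for reply in ["4", "The answer is 4.", "2 + 2 = 4", "14", ""]:
    print(repr(reply), answer_is(reply, 4))
```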
Checkpoint
Start ollama serve, pull the model, and run test_provider_swap(). Confirm the ollama path returns a response containing 4. If it errors, check Ollama is running on :11434 and the model is pulled.
Summary
Quick Reference
# script_id: day_075_quantization_and_swapping_models/swap_quick_ref
# fragment: illustrative cheat-sheet / not standalone-runnable
# Quick swap using base_url
client = OpenAI(
base_url="http://localhost:11434/v1", # Ollama
api_key="ollama"
)
# Model mapping
openai_to_local = {
"gpt-4o-mini": "llama3.1:8b",
"gpt-4o": "mixtral"
}
# Environment-based
export LLM_PROVIDER=ollama
Exercises
- One-line provider switch. Take a working OpenAI call and make it hit a local Ollama server by changing only the base_url and api_key. No other code should change.
- Config-driven selection. Add an LLM_PROVIDER env var that picks between openai and ollama at startup, mapping a friendly model name to the right concrete model per provider.
- Build a model map. Write a dict that maps each OpenAI model you use to its closest local equivalent, and a helper that resolves the name based on the active provider.
- Add a fallback. Wrap the local call so that if the local server is down, it transparently falls back to OpenAI (and logs which path it took).
Solutions (approaches)
- OpenAI(base_url="http://localhost:11434/v1", api_key="ollama"): the request body stays identical.
- Read os.getenv("LLM_PROVIDER", "openai"); branch the client constructor and look the model up in a per-provider table.
- {"gpt-4o-mini": "llama3.1:8b", "gpt-4o": "mixtral"}; resolve(name, provider) returns the mapped value when provider is local, else name.
- try: local call; except (ConnectionError, APIError): fall back to the OpenAI client and logging.warning("fell back to openai").
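The fallback approach can be sketched independently of any SDK by wrapping two prompt-to-text callables. Everything here is illustrative: with_fallback, the logger name, and the stub functions are hypothetical, and the default error types are the built-in ones. With the openai package you would likely pass its connection error class (for example openai.APIConnectionError) through the errors argument instead.

```python
import logging
from typing import Callable

logger = logging.getLogger("llm_fallback")


def with_fallback(primary: Callable[[str], str], fallback: Callable[[str], str],
                  errors: tuple = (ConnectionError, TimeoutError)) -> Callable[[str], str]:
    """Wrap two prompt->text callables so the second runs when the first fails."""
    def call(prompt: str) -> str:
        try:
            result = primary(prompt)
            logger.info("answered by primary")
            return result
        except errors as exc:
            logger.warning("primary failed (%s); fell back", exc)
            return fallback(prompt)
    return call


# Stubs standing in for a local client and a hosted client.
def broken_local(prompt: str) -> str:
    raise ConnectionError("local server is down")


def hosted_stub(prompt: str) -> str:
    return f"hosted reply to: {prompt}"


ask = with_fallback(broken_local, hosted_stub)
print(ask("Hello!"))
```

Only the listed error types trigger the fallback, so programming errors in the primary path still surface instead of being masked.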
What's Next?
Next up: vLLM for Production Inference — PagedAttention, continuous batching, and serving open-weight models at scale.