Want to run large models on your laptop? Quantization makes models smaller and faster while largely preserving output quality. This guide explains how.
Coming from Software Engineering? Quantization is like image compression (JPEG vs. PNG) or video encoding (bitrate settings) applied to neural network weights. You're trading precision for size/speed — going from 32-bit floats to 8-bit or 4-bit integers, just like going from lossless to lossy compression. The quality-vs-size tradeoff curves behave similarly: the first rounds of compression are nearly free, but aggressive compression eventually degrades output quality noticeably.
What is Quantization?
Quantization reduces the precision of model weights:
- FP32: 32 bits per parameter (full precision)
- FP16: 16 bits per parameter (half precision)
- INT8: 8 bits per parameter (1/4 size)
- INT4: 4 bits per parameter (1/8 size)
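At a given parameter count, the weight footprint follows directly from bits per parameter: roughly parameters × bits ÷ 8 bytes. A quick back-of-the-envelope helper (pure arithmetic, no framework needed; the 7B count is illustrative):

```python
def weight_footprint_gb(n_params: float, bits: int) -> float:
    """Approximate weight memory in GB: params * bits per param / 8 bits per byte."""
    return n_params * bits / 8 / 1e9

# A 7B-parameter model at different precisions
for bits in (32, 16, 8, 4):
    print(f"{bits:2d}-bit: {weight_footprint_gb(7e9, bits):.1f} GB")
# 32-bit: 28.0 GB, 16-bit: 14.0 GB, 8-bit: 7.0 GB, 4-bit: 3.5 GB
```

Real GGUF files land a bit above these numbers because some tensors stay at higher precision and metadata adds overhead.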
Why Quantize?
| Aspect | Original (FP32) | Quantized (4-bit) |
|---|---|---|
| Size | 28 GB | 4 GB |
| RAM needed | 32+ GB | 6 GB |
| Speed | Slow | Fast |
| Quality | Best | ~95% of original |
Trade-offs:
- Smaller = fits on consumer hardware
- Faster = better inference speed
- Slight quality loss = usually acceptable
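To see why the quality loss is small, here is a toy sketch of symmetric linear quantization in plain Python. Real schemes like llama.cpp's K-quants use block-wise scales and are considerably more sophisticated; this only illustrates the round-trip idea:

```python
def quantize_int8(weights):
    """Symmetric linear quantization: one shared scale maps floats onto [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate floats by multiplying the integers back by the scale."""
    return [x * scale for x in q]

weights = [0.031, -0.012, 0.007, -0.024, 0.018]  # toy weight values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"ints: {q}, max round-trip error: {max_err:.6f}")
```

The rounding error is bounded by half the scale, which is tiny relative to the weights themselves — that is the intuition behind "~95% of original quality".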
GGUF Format
GGUF (GPT-Generated Unified Format) is the standard for llama.cpp:
Understanding GGUF Naming
llama-2-7b-chat.Q4_K_M.gguf
                │  │ │
                │  │ └─ M = Medium (size variant)
                │  └─── K = K-quant method
                └────── Q4 = 4-bit quantization
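Since this naming is a community convention rather than part of the GGUF spec, a small hypothetical helper can split such filenames into their parts:

```python
import re

def parse_gguf_name(filename: str) -> dict:
    """Split a conventionally-named GGUF file into model and quantization parts.
    Relies on the common community naming pattern, not on anything in the GGUF spec."""
    m = re.match(r"(?P<model>.+)\.(?P<quant>[QIF]\w+)\.gguf$", filename)
    if not m:
        raise ValueError(f"unrecognized GGUF filename: {filename}")
    return m.groupdict()

print(parse_gguf_name("llama-2-7b-chat.Q4_K_M.gguf"))
# {'model': 'llama-2-7b-chat', 'quant': 'Q4_K_M'}
```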
Quantization Levels
| Name | Bits | Size (7B) | Quality | Use Case |
|---|---|---|---|---|
| Q2_K | 2-bit | ~2.5 GB | Low | Extreme constraints |
| Q3_K_S | 3-bit | ~3 GB | Fair | Very limited RAM |
| Q4_K_S | 4-bit | ~4 GB | Good | Balanced |
| Q4_K_M | 4-bit | ~4.5 GB | Better | Recommended |
| Q5_K_S | 5-bit | ~5 GB | Great | Quality focus |
| Q5_K_M | 5-bit | ~5.5 GB | Excellent | Best balance |
| Q6_K | 6-bit | ~6 GB | Near FP16 | Quality priority |
| Q8_0 | 8-bit | ~7.5 GB | ~FP16 | Maximum quality |
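One way to use this table programmatically is to pick the highest-quality level whose file fits in your RAM. The ~1.5 GB allowance for context and runtime buffers below is an assumption for illustration, not a measured figure:

```python
# Approximate 7B file sizes in GB, taken from the table above
QUANT_SIZES_GB = {
    "Q2_K": 2.5, "Q3_K_S": 3.0, "Q4_K_S": 4.0, "Q4_K_M": 4.5,
    "Q5_K_S": 5.0, "Q5_K_M": 5.5, "Q6_K": 6.0, "Q8_0": 7.5,
}

def best_quant_for_ram(ram_gb: float, overhead_gb: float = 1.5) -> str:
    """Pick the highest-quality level whose file plus overhead fits in RAM."""
    fitting = [(size, name) for name, size in QUANT_SIZES_GB.items()
               if size + overhead_gb <= ram_gb]
    if not fitting:
        raise ValueError("Not enough RAM for any level of a 7B model")
    return max(fitting)[1]  # largest file that fits = highest quality

print(best_quant_for_ram(8))  # → Q6_K under these assumptions
```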
Using GGUF with Ollama
# Ollama automatically uses GGUF
ollama pull llama3.1:8b-instruct-q4_K_M
# Or specify a higher-quality quantization
ollama run llama3.1:8b-instruct-q5_K_M
Using GGUF with llama.cpp
# Install llama.cpp (recent versions build with CMake; the old `make` path was removed)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release
# Download a GGUF model
wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf
# Run inference (the binary was renamed from `main` to `llama-cli`)
./build/bin/llama-cli -m llama-2-7b-chat.Q4_K_M.gguf -p "Hello, how are you?"
Using with Python
# script_id: day_075_quantization_and_swapping_models/gguf_llama_cpp
from llama_cpp import Llama
# Load quantized model
llm = Llama(
model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",
n_ctx=2048, # Context window
n_threads=4 # CPU threads
)
# Generate
output = llm(
"What is machine learning?",
max_tokens=256,
temperature=0.7
)
print(output["choices"][0]["text"])
AWQ Quantization
AWQ (Activation-aware Weight Quantization) uses activation statistics to protect the weights that matter most to outputs:
Why AWQ?
- Smarter quantization than uniform methods
- Better quality at same bit-width
- GPU optimized (faster than GGUF on GPU)
Using AWQ Models
# Install dependencies
pip install autoawq transformers
# Or with vLLM
pip install vllm
# script_id: day_075_quantization_and_swapping_models/awq_load_generate
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
# Load AWQ model
model_path = "TheBloke/Llama-2-7B-Chat-AWQ"
model = AutoAWQForCausalLM.from_quantized(
model_path,
fuse_layers=True,
trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Generate
prompt = "What is quantum computing?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(
**inputs,
max_new_tokens=256,
temperature=0.7
)
print(tokenizer.decode(outputs[0]))
AWQ with vLLM
# script_id: day_075_quantization_and_swapping_models/awq_with_vllm
from vllm import LLM, SamplingParams
# Load AWQ model with vLLM
llm = LLM(
model="TheBloke/Llama-2-7B-Chat-AWQ",
quantization="awq",
dtype="half"
)
# Generate
sampling_params = SamplingParams(
temperature=0.7,
max_tokens=256
)
outputs = llm.generate(["What is AI?"], sampling_params)
print(outputs[0].outputs[0].text)
GPTQ Quantization
GPTQ is another popular GPU-focused quantization method:
# script_id: day_075_quantization_and_swapping_models/gptq_usage
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load GPTQ model (transformers needs the GPTQ kernels: pip install optimum auto-gptq)
model_name = "TheBloke/Llama-2-7B-Chat-GPTQ"
model = AutoModelForCausalLM.from_pretrained(
model_name,
device_map="auto",
trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Generate
inputs = tokenizer("Hello!", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))
Choosing the Right Format
| Format | Best For | Hardware |
|---|---|---|
| GGUF | CPU or mixed | Any |
| AWQ | GPU inference | NVIDIA GPU |
| GPTQ | GPU inference | NVIDIA GPU |
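The table above reduces to a simple rule of thumb. `pick_format` here is an illustrative helper, not an official API from any of these projects:

```python
def pick_format(has_nvidia_gpu: bool, needs_cpu_fallback: bool = False) -> str:
    """Rule-of-thumb format choice based on the comparison table above."""
    if not has_nvidia_gpu or needs_cpu_fallback:
        return "GGUF"  # llama.cpp runs anywhere, including CPU-only machines
    return "AWQ"       # GPU-only serving; GPTQ is a comparable alternative

print(pick_format(has_nvidia_gpu=False))  # GGUF
print(pick_format(has_nvidia_gpu=True))   # AWQ
```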
Quantizing Your Own Models
Convert to GGUF
# Clone and build llama.cpp (see "Using GGUF with llama.cpp" above)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Convert a HuggingFace model to GGUF (newer releases renamed convert.py to convert_hf_to_gguf.py)
python convert_hf_to_gguf.py /path/to/model --outfile model-f16.gguf --outtype f16
# Quantize to the desired level (the quantize binary is now llama-quantize)
./build/bin/llama-quantize model-f16.gguf model-q4_K_M.gguf Q4_K_M
Create AWQ Model
# script_id: day_075_quantization_and_swapping_models/create_awq_model
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = "meta-llama/Llama-2-7b-hf"
quant_path = "llama-2-7b-awq"
# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Quantize
quant_config = {
"zero_point": True,
"q_group_size": 128,
"w_bit": 4
}
model.quantize(tokenizer, quant_config=quant_config)
# Save
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
Quality Comparison
# script_id: day_075_quantization_and_swapping_models/compare_quantizations
def compare_quantizations(prompt: str, original_model, quantized_model):
"""Compare output quality between models."""
# Original
original_output = original_model.generate(prompt)
# Quantized
quantized_output = quantized_model.generate(prompt)
print("Original:", original_output[:200])
print("Quantized:", quantized_output[:200])
# Simple similarity check
from difflib import SequenceMatcher
similarity = SequenceMatcher(None, original_output, quantized_output).ratio()
print(f"Similarity: {similarity:.2%}")
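To exercise this comparison without loading two real models, stub objects exposing the same `.generate(prompt)` interface are enough (the reply strings are purely illustrative):

```python
from difflib import SequenceMatcher

class StubModel:
    """Stand-in exposing the .generate(prompt) interface the comparison expects."""
    def __init__(self, reply: str):
        self.reply = reply
    def generate(self, prompt: str) -> str:
        return self.reply

original = StubModel("Machine learning is a field of AI that learns from data.")
quantized = StubModel("Machine learning is a field of AI learning from data.")

similarity = SequenceMatcher(
    None, original.generate("q"), quantized.generate("q")
).ratio()
print(f"Similarity: {similarity:.2%}")  # high, but below 100%
```

Note that character-level similarity is only a rough proxy; two answers can differ in wording yet be equally correct, so perplexity or task benchmarks are better for serious evaluation.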
Summary
Quick Reference
# Ollama with quantization
ollama pull llama3.1:8b-instruct-q4_K_M
# GGUF quantization levels
Q4_K_M # Balanced (recommended)
Q5_K_M # Better quality
Q8_0 # Maximum quality
# script_id: day_075_quantization_and_swapping_models/quantization_quick_ref
# GGUF with llama-cpp-python
llm = Llama(model_path="model.Q4_K_M.gguf")
# AWQ with AutoAWQ
model = AutoAWQForCausalLM.from_quantized("model-awq")
# GPTQ with transformers
model = AutoModelForCausalLM.from_pretrained("model-gptq")
What's Next?
Now let's learn how to swap OpenAI for local models in your existing code!
Swapping OpenAI for Local Models
You've built with OpenAI. Now run locally for free! This guide shows you how to swap in local models with minimal code changes.
Why Swap to Local?
Benefits:
- Cost savings: No per-token charges
- Privacy: Data never leaves your machine
- Offline: Works without internet
- Speed: No network latency
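The cost argument is easy to quantify. A rough calculator, using a hypothetical per-million-token price (check your provider's current pricing; local inference still costs electricity and hardware):

```python
def monthly_api_cost(tokens_per_day: int, price_per_million: float) -> float:
    """Rough monthly spend on a hosted API at a flat per-token price."""
    return tokens_per_day * 30 * price_per_million / 1_000_000

# Example: 2M tokens/day at a hypothetical $0.60 per million tokens
print(f"${monthly_api_cost(2_000_000, 0.60):.2f}/month")
```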
The OpenAI-Compatible Interface
Most local tools support OpenAI's API format:
# script_id: day_075_quantization_and_swapping_models/openai_compatible_interface
# OpenAI original
from openai import OpenAI
client = OpenAI()
# Ollama (same interface!)
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # Required but not used
)
# The rest of your code stays the same!
response = client.chat.completions.create(
model="llama3.3",
messages=[{"role": "user", "content": "Hello!"}]
)
Method 1: Environment Variable Switch
# script_id: day_075_quantization_and_swapping_models/env_variable_switch
import os
from openai import OpenAI
def get_client():
"""Get appropriate client based on environment."""
provider = os.environ.get("LLM_PROVIDER", "openai")
if provider == "ollama":
return OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama"
)
elif provider == "local":
return OpenAI(
base_url="http://localhost:8000/v1",
api_key="local"
)
else:
return OpenAI() # Default to OpenAI
# Usage
client = get_client()
Set environment:
# Use OpenAI
export LLM_PROVIDER=openai
# Use Ollama
export LLM_PROVIDER=ollama
# Use local vLLM
export LLM_PROVIDER=local
Method 2: Configuration-Based
# script_id: day_075_quantization_and_swapping_models/config_based_client
from dataclasses import dataclass
from openai import OpenAI
from typing import Optional
@dataclass
class LLMConfig:
provider: str = "openai"
model: str = "gpt-4o-mini"
base_url: Optional[str] = None
api_key: Optional[str] = None
# Preset configurations
CONFIGS = {
"openai": LLMConfig(
provider="openai",
model="gpt-4o-mini"
),
"ollama-llama3.3": LLMConfig(
provider="ollama",
model="llama3.3",
base_url="http://localhost:11434/v1",
api_key="ollama"
),
"ollama-mistral": LLMConfig(
provider="ollama",
model="mistral",
base_url="http://localhost:11434/v1",
api_key="ollama"
),
"local-vllm": LLMConfig(
provider="vllm",
model="meta-llama/Llama-2-7b-chat-hf",
base_url="http://localhost:8000/v1",
api_key="token"
)
}
class LLMClient:
"""Unified LLM client supporting multiple providers."""
def __init__(self, config_name: str = "openai"):
self.config = CONFIGS[config_name]
self._setup_client()
def _setup_client(self):
if self.config.base_url:
self.client = OpenAI(
base_url=self.config.base_url,
api_key=self.config.api_key
)
else:
self.client = OpenAI()
def chat(self, messages: list, **kwargs) -> str:
response = self.client.chat.completions.create(
model=self.config.model,
messages=messages,
**kwargs
)
return response.choices[0].message.content
# Usage
# Easy to switch!
client = LLMClient("ollama-llama3.3")
response = client.chat([{"role": "user", "content": "Hello!"}])
Method 3: Drop-In Replacement
Create a wrapper that works like OpenAI:
# script_id: day_075_quantization_and_swapping_models/universal_llm
from openai import OpenAI
import os
class UniversalLLM:
"""Drop-in replacement for OpenAI client."""
def __init__(self, provider: str = None):
provider = provider or os.environ.get("LLM_PROVIDER", "openai")
self.provider = provider
self.client = self._create_client()
self.model_map = self._get_model_map()
def _create_client(self) -> OpenAI:
configs = {
"openai": {"base_url": None, "api_key": None},
"ollama": {"base_url": "http://localhost:11434/v1", "api_key": "ollama"},
"vllm": {"base_url": "http://localhost:8000/v1", "api_key": "token"},
}
config = configs.get(self.provider, configs["openai"])
if config["base_url"]:
return OpenAI(base_url=config["base_url"], api_key=config["api_key"])
return OpenAI()
def _get_model_map(self) -> dict:
"""Map OpenAI model names to local equivalents."""
return {
"openai": {
"gpt-4o-mini": "gpt-4o-mini",
"gpt-4": "gpt-4",
},
"ollama": {
"gpt-4o-mini": "llama3.3",
"gpt-4": "mixtral",
},
"vllm": {
"gpt-4o-mini": "meta-llama/Llama-2-7b-chat-hf",
"gpt-4": "meta-llama/Llama-2-70b-chat-hf",
}
}
def _map_model(self, model: str) -> str:
"""Map requested model to provider's model."""
return self.model_map.get(self.provider, {}).get(model, model)
def chat_completion(self, model: str, messages: list, **kwargs) -> str:
"""Create chat completion - same interface as OpenAI."""
actual_model = self._map_model(model)
response = self.client.chat.completions.create(
model=actual_model,
messages=messages,
**kwargs
)
return response.choices[0].message.content
# Usage - exactly like OpenAI!
llm = UniversalLLM("ollama")
# This works regardless of provider
response = llm.chat_completion(
model="gpt-4o-mini", # Automatically mapped to llama3.3
messages=[{"role": "user", "content": "Hello!"}]
)
LangChain Provider Swapping
# script_id: day_075_quantization_and_swapping_models/langchain_swap
from langchain_openai import ChatOpenAI
from langchain_ollama import ChatOllama
def get_langchain_llm(provider: str = "openai"):
"""Get LangChain LLM based on provider."""
if provider == "ollama":
return ChatOllama(model="llama3.3")
elif provider == "local":
return ChatOpenAI(
base_url="http://localhost:8000/v1",
api_key="not-needed",
model="local-model"
)
else:
return ChatOpenAI()
# Usage
llm = get_langchain_llm("ollama")
response = llm.invoke("Hello!")
Model Quality Mapping
Choose appropriate local models:
# script_id: day_075_quantization_and_swapping_models/model_quality_map
MODEL_QUALITY_MAP = {
# OpenAI -> Local equivalents by capability
"gpt-4o-mini": {
"ollama": "llama3.1:8b",
"alternatives": ["mistral:7b", "neural-chat"],
"notes": "Good general purpose"
},
"gpt-4": {
"ollama": "mixtral:8x7b",
"alternatives": ["llama3.3:70b", "deepseek-coder:33b"],
"notes": "Requires more resources"
},
"gpt-4-turbo": {
"ollama": "mixtral:8x7b",
"alternatives": ["llama3:70b"],
"notes": "Best local option"
}
}
def suggest_local_model(openai_model: str) -> dict:
"""Suggest local model equivalent."""
return MODEL_QUALITY_MAP.get(openai_model, {
"ollama": "llama3.1:8b",
"notes": "Default fallback"
})
# Usage
suggestion = suggest_local_model("gpt-4o-mini")
print(f"Use: {suggestion['ollama']}")
Handling Differences
Local models may behave differently:
# script_id: day_075_quantization_and_swapping_models/adaptive_llm
from openai import OpenAI

class AdaptiveLLM:
    """LLM client that adapts to provider differences."""

    def __init__(self, provider: str):
        self.provider = provider
        self.client = self._create_client()

    def _create_client(self) -> OpenAI:
        """Build an OpenAI-compatible client for the chosen provider
        (a helper the sketch referenced but did not define)."""
        base_urls = {
            "ollama": "http://localhost:11434/v1",
            "vllm": "http://localhost:8000/v1",
        }
        if self.provider in base_urls:
            return OpenAI(base_url=base_urls[self.provider], api_key=self.provider)
        return OpenAI()

    def _get_model(self) -> str:
        """Default model name per provider."""
        return {"ollama": "llama3.3", "vllm": "meta-llama/Llama-2-7b-chat-hf"}.get(
            self.provider, "gpt-4o-mini")
def generate(self, prompt: str, **kwargs) -> str:
# Adapt parameters for local models
if self.provider in ["ollama", "vllm"]:
# Local models may need different defaults
kwargs.setdefault("temperature", 0.7) # Often need higher temp
kwargs.setdefault("max_tokens", 512) # Limit for speed
# Some features not supported
kwargs.pop("response_format", None) # JSON mode may not work
kwargs.pop("functions", None) # Function calling limited
messages = [{"role": "user", "content": prompt}]
try:
response = self.client.chat.completions.create(
model=self._get_model(),
messages=messages,
**kwargs
)
return response.choices[0].message.content
except Exception as e:
# Fallback behavior
print(f"Error with {self.provider}: {e}")
return self._fallback_generate(prompt)
def _fallback_generate(self, prompt: str) -> str:
"""Fallback to simpler generation if features fail."""
response = self.client.chat.completions.create(
model=self._get_model(),
messages=[{"role": "user", "content": prompt}],
max_tokens=256
)
return response.choices[0].message.content
Testing Your Swap
# script_id: day_075_quantization_and_swapping_models/universal_llm
def test_provider_swap():
"""Test that local model produces reasonable output."""
providers = ["openai", "ollama"]
test_prompt = "What is 2 + 2? Reply with just the number."
results = {}
for provider in providers:
try:
client = UniversalLLM(provider)
response = client.chat_completion(
model="gpt-4o-mini",
messages=[{"role": "user", "content": test_prompt}]
)
results[provider] = {
"success": True,
"response": response,
"contains_4": "4" in response
}
except Exception as e:
results[provider] = {
"success": False,
"error": str(e)
}
# Compare results
print("Provider Test Results:")
for provider, result in results.items():
status = "✅" if result.get("success") else "❌"
print(f" {status} {provider}: {result}")
test_provider_swap()
Summary
Quick Reference
# script_id: day_075_quantization_and_swapping_models/swap_quick_ref
# Quick swap using base_url
client = OpenAI(
base_url="http://localhost:11434/v1", # Ollama
api_key="ollama"
)
# Model mapping
openai_to_local = {
"gpt-4o-mini": "llama3.3",
"gpt-4": "mixtral"
}
# Environment-based switch (set in the shell):
#   export LLM_PROVIDER=ollama
What's Next?
Now let's learn how to wrap agents in APIs using FastAPI for production deployment!