akashnotes — Structured Learning for Engineers

Want to run large models on your laptop? Quantization makes models smaller and faster while keeping most of their quality. This guide explains how.

Coming from Software Engineering? Quantization is like image compression (JPEG vs. PNG) or video encoding (bitrate settings) applied to neural network weights. You're trading precision for size/speed — going from 32-bit floats to 8-bit or 4-bit integers, just like going from lossless to lossy compression. The quality-vs-size tradeoff curves behave similarly: the first rounds of compression are nearly free, but aggressive compression eventually degrades output quality noticeably.

A model is, under the hood, just a giant list of numbers (called weights or parameters) — a 7B model has 7 billion of them, each stored as a 32-bit float by default. Quantization stores each number in fewer bits, the same way you would downcast a double to a float, or a float to an int8, to save memory.

This is a two-part day. Part 1 covers quantization formats (GGUF / AWQ / GPTQ) and choosing one for your hardware. Part 2 ("Swapping OpenAI for Local Models") shows how to point existing code at a local model with minimal changes. Part 1 is the concepts; Part 2 is the migration recipe.

What is Quantization?

This expands on the quantization teaser from Day 074 — here we go deeper on formats and choosing one.

Quantization reduces the precision of model weights:

FP32: 32 bits per parameter (full precision)
FP16: 16 bits per parameter (half precision, 1/2 the size of FP32)
INT8: 8 bits per parameter (1/4 the size of FP32)
INT4: 4 bits per parameter (1/8 the size of FP32)

Fewer bits means each number is rounded more coarsely — like storing 3.14159 as 3.14, or as just 3. Do that across billions of numbers and the model's answers drift slightly. That rounding error is the whole quality-vs-size tradeoff.
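The rounding described above can be sketched in a few lines. This is a minimal illustration of symmetric round-to-nearest quantization with one shared scale, not the exact scheme any particular format uses; the function names and sample weights are made up for the example.

```python
def quantize(weights, bits):
    """Map floats to signed integers of the given bit-width using one shared scale."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) for w in weights], scale


def dequantize(q_values, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in q_values]


def max_error(weights, bits):
    """Largest absolute rounding error after a quantize/dequantize round trip."""
    q_values, scale = quantize(weights, bits)
    restored = dequantize(q_values, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))


weights = [0.82, -0.41, 0.07, -1.00, 0.33]   # made-up sample weights
for bits in (8, 4):
    print(f"{bits}-bit max rounding error: {max_error(weights, bits):.4f}")
```

With fewer bits the step between representable values grows, so the worst-case error grows with it. Real formats reduce this by using many small groups of weights, each with its own scale.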

Why Quantize?

Aspect (7B model)	Original (FP32)	Quantized (4-bit)
Size	28 GB	4 GB
RAM needed	32+ GB	6 GB
Speed	Slow	Fast
Quality	Best	Very close for most tasks (measure your own prompts)
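The sizes in the table follow from simple arithmetic: parameters times bits per parameter, divided by 8 to get bytes. A small sketch, assuming 1 GB = 1e9 bytes and a hypothetical helper name:

```python
def model_size_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {model_size_gb(7e9, bits):.1f} GB")
```

For a 7B model this prints 28.0, 14.0, 7.0, and 3.5 GB. Real 4-bit files come out somewhat larger than 3.5 GB because they also store per-group scales and keep some tensors at higher precision, and the RAM figure adds the context cache and runtime overhead on top of the weights.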

Trade-offs:

Smaller = fits on consumer hardware
Faster = fewer bytes to read per generated token, so usually more tokens per second
Slight quality loss = usually acceptable, but worth measuring on your own prompts

GGUF Format

GGUF (GPT-Generated Unified Format) is the standard model file format for llama.cpp and the tools built on it, such as Ollama.

Understanding GGUF Naming

llama-2-7b-chat.Q4_K_M.gguf
                │  │ │
                │  │ └─ M = Medium (size variant)
                │  └─── K = K-quant method
                └────── Q4 = 4-bit quantization

The K (K-quant) just means a newer, smarter rounding scheme than the original method — you rarely need to care beyond picking a row from the table below. The S/M/L suffix trades a little more size for a little more quality.
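If you handle many model files, the naming scheme is regular enough to parse. The sketch below is a hypothetical helper (not part of llama.cpp or Ollama) and only covers tags shaped like the ones in this guide, such as Q4_K_M, Q6_K, and Q8_0; other tag families exist and would need their own pattern.

```python
import re

# Matches a trailing ".<TAG>.gguf" where TAG looks like Q4_K_M, Q5_K_S, Q6_K, or Q8_0.
_QUANT_TAG = re.compile(r"\.(Q(\d)_(K|\d)(?:_([SML]))?)\.gguf$", re.IGNORECASE)


def parse_gguf_name(filename: str) -> dict:
    """Split a GGUF filename's quantization tag into its parts."""
    match = _QUANT_TAG.search(filename)
    if match is None:
        raise ValueError(f"no recognizable quantization tag in {filename!r}")
    tag, bits, scheme, variant = match.groups()
    return {
        "tag": tag.upper(),
        "bits": int(bits),
        "k_quant": scheme.upper() == "K",              # True for the K-quant family
        "variant": variant.upper() if variant else None,  # S, M, L, or None
    }


print(parse_gguf_name("llama-2-7b-chat.Q4_K_M.gguf"))
print(parse_gguf_name("llama-2-7b-chat.Q8_0.gguf"))
```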

Quantization Levels

Name	Bits	Size (7B)	Quality	Use Case
Q2_K	2-bit	~2.5 GB	Low	Extreme constraints
Q3_K_S	3-bit	~3 GB	Fair	Very limited RAM
Q4_K_S	4-bit	~4 GB	Good	Balanced
Q4_K_M	4-bit	~4.5 GB	Better	Recommended
Q5_K_S	5-bit	~5 GB	Great	Quality focus
Q5_K_M	5-bit	~5.5 GB	Excellent	Best balance
Q6_K	6-bit	~6 GB	Near FP16	Quality priority
Q8_0	8-bit	~7.5 GB	~FP16	Maximum quality
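One way to use the table is to pick the largest level that fits your free memory. The sketch below copies the approximate 7B sizes from the table; the helper name and the 1.5 GB allowance for the context cache and runtime are assumptions, so adjust them for your setup.

```python
# Approximate 7B file sizes in GB, copied from the table above.
SIZES_7B_GB = {
    "Q2_K": 2.5, "Q3_K_S": 3.0, "Q4_K_S": 4.0, "Q4_K_M": 4.5,
    "Q5_K_S": 5.0, "Q5_K_M": 5.5, "Q6_K": 6.0, "Q8_0": 7.5,
}


def largest_fit(free_ram_gb: float, overhead_gb: float = 1.5):
    """Return the biggest quantization level that fits, or None if nothing does.

    overhead_gb is a rough, assumed allowance for context cache and runtime.
    """
    budget = free_ram_gb - overhead_gb
    fitting = [name for name, size in SIZES_7B_GB.items() if size <= budget]
    return max(fitting, key=SIZES_7B_GB.get) if fitting else None


for ram in (3, 6, 8, 16):
    print(f"{ram} GB free -> {largest_fit(ram)}")
```

Bigger is not always the right choice: if speed matters more than the last bit of quality, a smaller level than the one that fits may serve you better.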

Using GGUF with Ollama

# Ollama models are GGUF under the hood; the tag selects the quantization
ollama pull llama3.1:8b-instruct-q4_K_M

# Run a different quantization by changing the tag
ollama run llama3.1:8b-instruct-q5_K_M

Using GGUF with llama.cpp

# Install llama.cpp (recent versions build with CMake; older checkouts used `make`)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release

# Download GGUF model
wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf

# Run inference (the CLI binary was renamed from `main` to `llama-cli`)
./build/bin/llama-cli -m llama-2-7b-chat.Q4_K_M.gguf -p "Hello, how are you?"

Using with Python

# script_id: day_075_quantization_and_swapping_models/gguf_llama_cpp
from llama_cpp import Llama

# Load quantized model
llm = Llama(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",
    n_ctx=2048,  # Context window
    n_threads=4  # CPU threads
)

# Generate
output = llm(
    "What is machine learning?",
    max_tokens=256,
    temperature=0.7
)

print(output["choices"][0]["text"])

Heads up: the AWQ and GPTQ examples below need an NVIDIA GPU — .to("cuda") will fail on a CPU-only or Apple-Silicon laptop. On a laptop, stick with the GGUF / Ollama path above, which runs on CPU.

AWQ Quantization

AWQ (Activation-aware Weight Quantization) protects the weights that matter most.

Not all of a model's numbers matter equally: a small fraction carry most of the weight on output quality. AWQ runs a small calibration sample of real text through the model to see which weights get multiplied by the largest activations, then rescales those weights before rounding so they lose less precision. Every weight is still stored at the same bit-width. Result: better quality at the same bit-width.
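A toy sketch of the activation-aware idea: if a weight meets large inputs, scale it up before rounding and scale its input down by the same factor, so the product is unchanged but the weight effectively gets finer rounding steps. This is a simplified illustration with made-up numbers and hypothetical function names, not the AutoAWQ implementation, which searches for the scales automatically from calibration data.

```python
def quantize_row(weights, bits=4):
    """Symmetric round-to-nearest with one shared scale; returns dequantized floats."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]


def dot(ws, xs):
    return sum(w * x for w, x in zip(ws, xs))


weights = [0.22, 1.0, -0.74, 0.51]
activations = [10.0, 0.1, 0.1, 0.1]      # channel 0 sees much larger inputs
exact = dot(weights, activations)

# Plain rounding: every weight is treated the same.
plain_error = abs(dot(quantize_row(weights), activations) - exact)

# Activation-aware: scale the salient channel's weight by s, and its input by 1/s.
s = [2.0, 1.0, 1.0, 1.0]                 # hand-picked for this toy example
scaled_w = [w * si for w, si in zip(weights, s)]
scaled_x = [x / si for x, si in zip(activations, s)]
aware_error = abs(dot(quantize_row(scaled_w), scaled_x) - exact)

print(f"plain rounding output error:     {plain_error:.4f}")
print(f"activation-aware output error:   {aware_error:.4f}")
```

In this example the output error drops because the rounding error on the first weight is the one amplified by the large input. The gain is not guaranteed for every choice of numbers, which is why real implementations tune the scales against calibration data.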

Why AWQ?

Smarter quantization than uniform methods
Better quality at same bit-width
GPU optimized (often faster than GGUF on an NVIDIA GPU; benchmark on your own hardware)

Using AWQ Models

# Install dependencies
pip install autoawq transformers

# Or with vLLM
pip install vllm

Note: the standalone autoawq package is in maintenance mode and no longer actively developed. The code below still runs, but for new work prefer loading/serving AWQ checkpoints via vLLM (shown next), which is the maintained path. Verify the current state at the autoawq and vLLM repos before relying on it.

# script_id: day_075_quantization_and_swapping_models/awq_load_generate
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Load AWQ model
model_path = "TheBloke/Llama-2-7B-Chat-AWQ"

model = AutoAWQForCausalLM.from_quantized(
    model_path,
    fuse_layers=True,
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

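# Generate
prompt = "What is quantum computing?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,   # sampling must be on for temperature to take effect
    temperature=0.7
)

# skip_special_tokens drops markers such as <s> and </s> from the printed text
print(tokenizer.decode(outputs[0], skip_special_tokens=True))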

AWQ with vLLM

# script_id: day_075_quantization_and_swapping_models/awq_with_vllm
from vllm import LLM, SamplingParams

# Load AWQ model with vLLM
llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-AWQ",
    quantization="awq",
    dtype="half"
)

# Generate
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=256
)

outputs = llm.generate(["What is AI?"], sampling_params)
print(outputs[0].outputs[0].text)

GPTQ Quantization

Another popular GPU-focused quantization method.

GPTQ is an older, very widely supported GPU quantization method. Quality is similar to AWQ (AWQ is usually a touch better and faster), but GPTQ has been around longer, so more tools and model repos ship GPTQ builds. That broader compatibility is the main reason to pick GPTQ over AWQ.

# script_id: day_075_quantization_and_swapping_models/gptq_usage
from transformers import AutoModelForCausalLM, AutoTokenizer

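# Load GPTQ model
# (transformers needs a GPTQ backend installed for this, e.g. optimum plus
#  auto-gptq or gptqmodel; check the transformers docs for the current requirement)
model_name = "TheBloke/Llama-2-7B-Chat-GPTQ"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generate (move inputs to whichever device device_map placed the model on)
inputs = tokenizer("Hello!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))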

Choosing the Right Format

Format	Best For	Hardware
GGUF	CPU or mixed	Any
AWQ	GPU inference	NVIDIA GPU
GPTQ	GPU inference	NVIDIA GPU
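The table can be written as a small decision function. This is only a sketch of the guidance in this section; the function and parameter names are hypothetical, and the AWQ-versus-GPTQ split reflects the rule of thumb above rather than a hard requirement.

```python
def choose_format(has_nvidia_gpu: bool, need_widest_tool_support: bool = False) -> str:
    """Pick a quantization format following the table above."""
    if not has_nvidia_gpu:
        return "GGUF"            # CPU, Apple Silicon, or mixed CPU/GPU setups
    if need_widest_tool_support:
        return "GPTQ"            # older format, shipped by more tools and repos
    return "AWQ"                 # usually slightly better quality and speed


print(choose_format(has_nvidia_gpu=False))
print(choose_format(has_nvidia_gpu=True))
print(choose_format(has_nvidia_gpu=True, need_widest_tool_support=True))
```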

Quantizing Your Own Models

Convert to GGUF

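# Clone llama.cpp and install the converter's Python dependencies
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt

# Convert HuggingFace model to GGUF
# (the script was renamed from convert.py to convert_hf_to_gguf.py)
python convert_hf_to_gguf.py /path/to/model --outfile model-f16.gguf --outtype f16

# Build the tools (recent versions use CMake), then quantize to the desired level
# (binary renamed from `quantize` to `llama-quantize`)
cmake -B build && cmake --build build --config Release
./build/bin/llama-quantize model-f16.gguf model-q4_K_M.gguf Q4_K_M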

Create AWQ Model

# script_id: day_075_quantization_and_swapping_models/create_awq_model
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-3.1-8B-Instruct"
quant_path = "llama-3.1-8b-awq"

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize
quant_config = {
    "zero_point": True,    # allow an offset so the rounded range need not be centered on 0
    "q_group_size": 128,   # how many numbers share one rounding scale; 128 is the common default
    "w_bit": 4             # target precision: 4-bit
}

model.quantize(tokenizer, quant_config=quant_config)

# Save
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

These defaults match most published AWQ models — change them only if you re-quantize and measure. Use any current AWQ/GPTQ repo for your model — these IDs are examples; verify the repo exists on Hugging Face before pulling.
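To see what the group size and zero point buy you, here is a toy sketch of group-wise 4-bit rounding. It is a simplified illustration in plain Python, not AutoAWQ's implementation; the function names and sample weights are made up, and the groups are 4 and 8 values wide instead of 128 so the effect is visible by eye.

```python
def quantize_group(values, bits=4):
    """Round one group with its own scale and zero point; returns dequantized floats."""
    qmax = 2 ** bits - 1                          # 15 for 4-bit
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax or 1.0               # fall back to 1.0 for a constant group
    zero_point = round(-lo / scale)               # the integer that represents 0.0
    restored = []
    for v in values:
        q = min(max(round(v / scale) + zero_point, 0), qmax)
        restored.append((q - zero_point) * scale)
    return restored


def quantize_grouped(values, group_size, bits=4):
    """Apply quantize_group to consecutive chunks of group_size values."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(quantize_group(values[start:start + group_size], bits))
    return out


def total_error(values, group_size):
    restored = quantize_grouped(values, group_size)
    return sum(abs(v - r) for v, r in zip(values, restored))


# Four small weights followed by four large ones.
weights = [-0.04, 0.01, 0.03, -0.02, -0.9, 0.5, 1.2, -0.3]
print(f"one scale for all 8 values: total error {total_error(weights, 8):.4f}")
print(f"one scale per 4 values:     total error {total_error(weights, 4):.4f}")
```

With a single scale, the large weights set the step size and the small ones all round to zero. Giving each group its own scale keeps the small weights distinguishable, at the cost of storing one extra scale and zero point per group.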

Quality Comparison

# script_id: day_075_quantization_and_swapping_models/compare_quantizations
# fragment: illustrative comparison helper / not standalone-runnable
from difflib import SequenceMatcher

def compare_quantizations(prompt: str, original_model, quantized_model) -> float:
    """Compare output text between two models and return a 0-1 similarity ratio.

    Assumes each model object exposes generate(prompt) -> str.
    """

    # Original
    original_output = original_model.generate(prompt)

    # Quantized
    quantized_output = quantized_model.generate(prompt)

    print("Original:", original_output[:200])
    print("Quantized:", quantized_output[:200])

    # Character-level similarity: a rough signal, not a quality metric
    similarity = SequenceMatcher(None, original_output, quantized_output).ratio()
    print(f"Similarity: {similarity:.2%}")
    return similarity
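A single prompt tells you little, so in practice you would loop over several. The sketch below takes two plain callables (prompt in, text out) so it runs without any model loaded; the stand-in generators and the helper name are made up, and you would swap in real calls to your full-precision and quantized models.

```python
from difflib import SequenceMatcher
from statistics import mean


def similarity_report(prompts, generate_a, generate_b):
    """Return (prompt, similarity) pairs for two text-generation callables."""
    report = []
    for prompt in prompts:
        a, b = generate_a(prompt), generate_b(prompt)
        report.append((prompt, SequenceMatcher(None, a, b).ratio()))
    return report


# Stand-in generators so the sketch runs on its own; replace with real model calls.
def fake_original(prompt):
    return f"Answer to '{prompt}': forty-two."


def fake_quantized(prompt):
    return f"Answer to '{prompt}': forty two!"


report = similarity_report(["What is ML?", "Define RAM."], fake_original, fake_quantized)
for prompt, score in report:
    print(f"{score:.2%}  {prompt}")
print(f"mean similarity: {mean(score for _, score in report):.2%}")
```

Character similarity is a crude signal: two correct answers can be worded very differently. Generating with temperature 0 on both sides removes sampling noise, and reading the lowest-scoring pairs by hand is usually more informative than the average.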

Checkpoint

Load the Q4_K_M GGUF with Llama(model_path=...) from the Python example above, generate a few sentences, and compare the process RAM use with the original FP16 file size. Expect roughly a quarter to a third, since Q4_K_M averages a little over 4 bits per weight and the context cache adds to it. If the load fails, check the file path points to the quantized .gguf and that llama-cpp-python is installed.

Summary

Quick Reference

# Ollama with quantization
ollama pull llama3.1:8b-instruct-q4_K_M

# GGUF quantization levels
Q4_K_M  # Balanced (recommended)
Q5_K_M  # Better quality
Q8_0    # Maximum quality

# script_id: day_075_quantization_and_swapping_models/quantization_quick_ref
# GGUF with llama-cpp-python
llm = Llama(model_path="model.Q4_K_M.gguf")

# AWQ with AutoAWQ
model = AutoAWQForCausalLM.from_quantized("model-awq")

# GPTQ with transformers
model = AutoModelForCausalLM.from_pretrained("model-gptq")

Exercises

Read the labels. Given the tags Q4_K_M, Q5_K_M, and Q8_0, rank them by file size and by expected quality, and explain the tradeoff in one sentence each.
Pick a format for hardware. You have (a) a CPU-only laptop and (b) a single 24GB GPU. Choose GGUF/AWQ/GPTQ for each and justify it.
Quantize your own. Take a Hugging Face model and produce a Q4_K_M GGUF using llama.cpp (convert_hf_to_gguf.py then llama-quantize). Note the size before/after.
Measure the cost of compression. Run the same 5 prompts through the full-precision and the Q4 version; record any quality regressions you can spot.

Solutions (approaches)

Size: Q4_K_M < Q5_K_M < Q8_0; quality is the reverse. Lower bits = smaller + faster but more rounding error.
CPU-only → GGUF (built for CPU via llama.cpp); 24GB GPU → AWQ or GPTQ (both GPU-optimized; AWQ usually has a slight quality edge, GPTQ has wider tool support).
python convert_hf_to_gguf.py ./model --outfile model.f16.gguf then llama-quantize model.f16.gguf model.Q4_K_M.gguf Q4_K_M; compare ls -lh.
Loop the prompts through both, diff outputs; watch for degraded reasoning/formatting on the quantized one.

Part 2 — Swapping OpenAI for Local Models

You've built with OpenAI. Now run locally for free! This part shows you how to swap in local models with minimal code changes.

Why Swap to Local?

Benefits:

Cost savings: No per-token charges
Privacy: Data never leaves your machine
Offline: Works without internet
Latency: No network round trip (generation speed then depends on your hardware)

The OpenAI-Compatible Interface

Most local tools support OpenAI's API format:

# script_id: day_075_quantization_and_swapping_models/openai_compatible_interface
# OpenAI original
from openai import OpenAI
client = OpenAI()

# Ollama (same interface!)
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required but not used
)

# The rest of your code stays the same!
response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Hello!"}]
)

Method 1: Environment Variable Switch

# script_id: day_075_quantization_and_swapping_models/env_variable_switch
import os
from openai import OpenAI

def get_client():
    """Get appropriate client based on environment."""

    provider = os.environ.get("LLM_PROVIDER", "openai")

    if provider == "ollama":
        return OpenAI(
            base_url="http://localhost:11434/v1",
            api_key="ollama"
        )
    elif provider == "local":
        return OpenAI(
            base_url="http://localhost:8000/v1",
            api_key="local"
        )
    else:
        return OpenAI()  # Default to OpenAI

# Usage
client = get_client()

Set environment:

# Use OpenAI
export LLM_PROVIDER=openai

# Use Ollama
export LLM_PROVIDER=ollama

# Use local vLLM
export LLM_PROVIDER=local
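One thing to watch with the switch above: get_client falls back to OpenAI for any unrecognized value, so a typo such as LLM_PROVIDER=olama would silently send requests to the hosted API. A stricter variant could validate the value first. This is a sketch with hypothetical names; it returns constructor arguments rather than a client so it can be checked without any server running.

```python
import os

# Keyword arguments for the OpenAI client constructor, per provider.
PROVIDER_SETTINGS = {
    "openai": {},  # empty: use the SDK defaults
    "ollama": {"base_url": "http://localhost:11434/v1", "api_key": "ollama"},
    "local": {"base_url": "http://localhost:8000/v1", "api_key": "local"},
}


def resolve_provider_settings(env=os.environ) -> dict:
    """Return client kwargs for LLM_PROVIDER, raising on unknown values."""
    provider = env.get("LLM_PROVIDER", "openai").strip().lower()
    if provider not in PROVIDER_SETTINGS:
        known = ", ".join(sorted(PROVIDER_SETTINGS))
        raise ValueError(f"unknown LLM_PROVIDER {provider!r}; expected one of: {known}")
    return dict(PROVIDER_SETTINGS[provider])


# Passing a plain dict instead of os.environ makes the function easy to test.
print(resolve_provider_settings({"LLM_PROVIDER": "ollama"}))
```

You would then build the client with OpenAI(**resolve_provider_settings()). Failing loudly matters most when privacy is the reason for running locally.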

Method 2: Configuration-Based

# script_id: day_075_quantization_and_swapping_models/config_based_client
from dataclasses import dataclass
from openai import OpenAI
from typing import Optional

@dataclass
class LLMConfig:
    provider: str = "openai"
    model: str = "gpt-4o-mini"
    base_url: Optional[str] = None
    api_key: Optional[str] = None

# Preset configurations
CONFIGS = {
    "openai": LLMConfig(
        provider="openai",
        model="gpt-4o-mini"
    ),
    "ollama-llama3.1": LLMConfig(
        provider="ollama",
        model="llama3.1:8b",
        base_url="http://localhost:11434/v1",
        api_key="ollama"
    ),
    "ollama-mistral": LLMConfig(
        provider="ollama",
        model="mistral",
        base_url="http://localhost:11434/v1",
        api_key="ollama"
    ),
    "local-vllm": LLMConfig(
        provider="vllm",
        model="meta-llama/Llama-2-7b-chat-hf",
        base_url="http://localhost:8000/v1",
        api_key="token"
    )
}

class LLMClient:
    """Unified LLM client supporting multiple providers."""

    def __init__(self, config_name: str = "openai"):
        self.config = CONFIGS[config_name]
        self._setup_client()

    def _setup_client(self):
        if self.config.base_url:
            self.client = OpenAI(
                base_url=self.config.base_url,
                api_key=self.config.api_key
            )
        else:
            self.client = OpenAI()

    def chat(self, messages: list, **kwargs) -> str:
        response = self.client.chat.completions.create(
            model=self.config.model,
            messages=messages,
            **kwargs
        )
        return response.choices[0].message.content

# Usage
# Easy to switch!
client = LLMClient("ollama-llama3.1")
response = client.chat([{"role": "user", "content": "Hello!"}])

Method 3: Drop-In Replacement

Create a wrapper that works like OpenAI:

# script_id: day_075_quantization_and_swapping_models/universal_llm
from openai import OpenAI
import os

class UniversalLLM:
    """Drop-in replacement for OpenAI client."""

    def __init__(self, provider: str = None):
        provider = provider or os.environ.get("LLM_PROVIDER", "openai")

        self.provider = provider
        self.client = self._create_client()
        self.model_map = self._get_model_map()

    def _create_client(self) -> OpenAI:
        configs = {
            "openai": {"base_url": None, "api_key": None},
            "ollama": {"base_url": "http://localhost:11434/v1", "api_key": "ollama"},
            "vllm": {"base_url": "http://localhost:8000/v1", "api_key": "token"},
        }

        config = configs.get(self.provider, configs["openai"])

        if config["base_url"]:
            return OpenAI(base_url=config["base_url"], api_key=config["api_key"])
        return OpenAI()

    def _get_model_map(self) -> dict:
        """Map OpenAI model names to local equivalents."""
        return {
            "openai": {
                "gpt-4o-mini": "gpt-4o-mini",
                "gpt-4o": "gpt-4o",
            },
            "ollama": {
                "gpt-4o-mini": "llama3.1:8b",
                "gpt-4o": "mixtral",
            },
            "vllm": {
                "gpt-4o-mini": "meta-llama/Llama-3.1-8B-Instruct",
                "gpt-4o": "meta-llama/Llama-2-70b-chat-hf",
            }
        }

    def _map_model(self, model: str) -> str:
        """Map requested model to provider's model."""
        return self.model_map.get(self.provider, {}).get(model, model)

    def chat_completion(self, model: str, messages: list, **kwargs) -> str:
        """Create chat completion - same interface as OpenAI."""
        actual_model = self._map_model(model)

        response = self.client.chat.completions.create(
            model=actual_model,
            messages=messages,
            **kwargs
        )

        return response.choices[0].message.content

# Usage - exactly like OpenAI!
llm = UniversalLLM("ollama")

# This works regardless of provider
response = llm.chat_completion(
    model="gpt-4o-mini",  # Automatically mapped to llama3.1:8b
    messages=[{"role": "user", "content": "Hello!"}]
)

LangChain Provider Swapping

# script_id: day_075_quantization_and_swapping_models/langchain_swap
from langchain_openai import ChatOpenAI
from langchain_ollama import ChatOllama

def get_langchain_llm(provider: str = "openai"):
    """Get LangChain LLM based on provider."""

    if provider == "ollama":
        return ChatOllama(model="llama3.1:8b")
    elif provider == "local":
        return ChatOpenAI(
            base_url="http://localhost:8000/v1",
            api_key="not-needed",
            model="local-model"
        )
    else:
        return ChatOpenAI()

# Usage
llm = get_langchain_llm("ollama")
response = llm.invoke("Hello!")

Model Quality Mapping

Choose appropriate local models:

# script_id: day_075_quantization_and_swapping_models/model_quality_map
MODEL_QUALITY_MAP = {
    # OpenAI -> Local equivalents by capability
    "gpt-4o-mini": {
        "ollama": "llama3.1:8b",
        "alternatives": ["mistral:7b", "neural-chat"],
        "notes": "Good general purpose"
    },
    "gpt-4o": {
        "ollama": "mixtral:8x7b",
        "alternatives": ["llama3:70b", "deepseek-coder:33b"],
        "notes": "Requires more resources"
    }
}

def suggest_local_model(openai_model: str) -> dict:
    """Suggest local model equivalent."""
    return MODEL_QUALITY_MAP.get(openai_model, {
        "ollama": "llama3.1:8b",
        "notes": "Default fallback"
    })

# Usage
suggestion = suggest_local_model("gpt-4o-mini")
print(f"Use: {suggestion['ollama']}")

Handling Differences

Local models may behave differently:

# script_id: day_075_quantization_and_swapping_models/adaptive_llm
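from openai import OpenAI

class AdaptiveLLM:
    """LLM client that adapts to provider differences."""

    # Illustrative endpoints and default models; adjust to your setup
    PROVIDERS = {
        "openai": {"base_url": None, "api_key": None, "model": "gpt-4o-mini"},
        "ollama": {"base_url": "http://localhost:11434/v1", "api_key": "ollama", "model": "llama3.1:8b"},
        "vllm": {"base_url": "http://localhost:8000/v1", "api_key": "token", "model": "meta-llama/Llama-3.1-8B-Instruct"},
    }

    def __init__(self, provider: str):
        self.provider = provider
        self.client = self._create_client()

    def _create_client(self) -> OpenAI:
        """Build an OpenAI-compatible client for the selected provider."""
        cfg = self.PROVIDERS[self.provider]
        if cfg["base_url"]:
            return OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"])
        return OpenAI()

    def _get_model(self) -> str:
        """Return the model name to request from the selected provider."""
        return self.PROVIDERS[self.provider]["model"]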

    def generate(self, prompt: str, **kwargs) -> str:
        # Adapt parameters for local models
        if self.provider in ["ollama", "vllm"]:
            # Local models may need different defaults
            kwargs.setdefault("temperature", 0.7)  # Explicit default; tune per model
            kwargs.setdefault("max_tokens", 512)  # Limit for speed

            # Some features not supported
            kwargs.pop("response_format", None)  # JSON mode may not work
            kwargs.pop("functions", None)  # Function calling limited

        messages = [{"role": "user", "content": prompt}]

        try:
            response = self.client.chat.completions.create(
                model=self._get_model(),
                messages=messages,
                **kwargs
            )
            return response.choices[0].message.content
        except Exception as e:
            # Fallback behavior
            print(f"Error with {self.provider}: {e}")
            return self._fallback_generate(prompt)

    def _fallback_generate(self, prompt: str) -> str:
        """Fallback to simpler generation if features fail."""
        response = self.client.chat.completions.create(
            model=self._get_model(),
            messages=[{"role": "user", "content": prompt}],
            max_tokens=256
        )
        return response.choices[0].message.content

Testing Your Swap

# script_id: day_075_quantization_and_swapping_models/universal_llm
def test_provider_swap():
    """Test that local model produces reasonable output."""

    providers = ["openai", "ollama"]
    test_prompt = "What is 2 + 2? Reply with just the number."

    results = {}
    for provider in providers:
        try:
            client = UniversalLLM(provider)
            response = client.chat_completion(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": test_prompt}]
            )
            results[provider] = {
                "success": True,
                "response": response,
                "contains_4": "4" in response
            }
        except Exception as e:
            results[provider] = {
                "success": False,
                "error": str(e)
            }

    # Compare results
    print("Provider Test Results:")
    for provider, result in results.items():
        status = "✅" if result.get("success") else "❌"
        print(f"  {status} {provider}: {result}")

test_provider_swap()
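The "4" in response check above is quick but loose: it also passes for answers such as "14" or "40". A slightly stricter check looks for the number as a standalone token. This is a heuristic sketch with a hypothetical helper name, not a general answer grader.

```python
import re


def contains_number(text: str, number: int) -> bool:
    """True if `number` appears in text and is not part of a longer run of digits."""
    pattern = rf"(?<!\d){re.escape(str(number))}(?!\d)"
    return re.search(pattern, text) is not None


for reply in ["4", "The answer is 4.", "2 + 2 = 4", "14", "40"]:
    print(f"{reply!r:22} substring: {'4' in reply!s:5}  standalone: {contains_number(reply, 4)}")
```

Local models are often chattier than hosted ones, so a check that tolerates surrounding words but rejects wrong numbers tends to give fewer false alarms in either direction.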

Checkpoint

Start ollama serve, pull the model, and run test_provider_swap(). Confirm the ollama path returns a response containing 4. If it errors, check Ollama is running on :11434 and the model is pulled.

Summary

Quick Reference

# script_id: day_075_quantization_and_swapping_models/swap_quick_ref
# fragment: illustrative cheat-sheet / not standalone-runnable
# Quick swap using base_url
client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama
    api_key="ollama"
)

# Model mapping
openai_to_local = {
    "gpt-4o-mini": "llama3.1:8b",
    "gpt-4o": "mixtral"
}

# Environment-based (run this in your shell, not in Python):
#   export LLM_PROVIDER=ollama

Exercises

One-line provider switch. Take a working OpenAI call and make it hit a local Ollama server by changing only the base_url and api_key. No other code should change.
Config-driven selection. Add an LLM_PROVIDER env var that picks between openai and ollama at startup, mapping a friendly model name to the right concrete model per provider.
Build a model map. Write a dict that maps each OpenAI model you use to its closest local equivalent, and a helper that resolves the name based on the active provider.
Add a fallback. Wrap the local call so that if the local server is down, it transparently falls back to OpenAI (and logs which path it took).

Solutions (approaches)

OpenAI(base_url="http://localhost:11434/v1", api_key="ollama") — the request body stays identical.
Read os.getenv("LLM_PROVIDER", "openai"); branch the client constructor and look the model up in a per-provider table.
{"gpt-4o-mini": "llama3.1:8b", "gpt-4o": "mixtral"}; resolve(name, provider) returns the mapped value when provider is local, else name.
try: local call; except openai.APIConnectionError (what the SDK raises when the server is unreachable): fall back to the OpenAI client and call logging.warning("fell back to openai").
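The fallback in the last solution can be sketched without any server by passing the two calls in as functions. The names below are hypothetical and the stand-in functions only simulate a local server that is down. With the real OpenAI SDK you would pass retry_on=(openai.APIConnectionError,).

```python
import logging

logger = logging.getLogger("llm_fallback")


def call_with_fallback(primary, fallback, prompt, retry_on=(ConnectionError,)):
    """Try primary(prompt); on a listed exception, log it and use fallback(prompt)."""
    try:
        result = primary(prompt)
        logger.info("served by primary")
        return result
    except retry_on as exc:
        logger.warning("primary failed (%s); fell back", exc)
        return fallback(prompt)


# Stand-ins so the sketch runs on its own; replace with real client calls.
def local_down(prompt):
    raise ConnectionError("local server unreachable")


def hosted(prompt):
    return f"hosted reply to: {prompt}"


print(call_with_fallback(local_down, hosted, "Hello!"))
```

Only connection-type errors trigger the fallback here, so a bug in your own code still surfaces instead of being masked. Keep in mind that falling back to a hosted API sends the prompt off your machine, which may not be acceptable if privacy was the reason for going local.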

What's Next?

Next up: vLLM for Production Inference — PagedAttention, continuous batching, and serving open-weight models at scale.
