Welcome to Phase 6! You've learned to build powerful AI systems. Now let's explore running models locally: no cloud API calls, no per-token costs, and your data stays on your machine.
Coming from Software Engineering? Ollama is like Docker for ML models: you pull a model, run it, and it exposes a local API on a port. If you've used Docker Hub to pull images and run containers locally, or even Homebrew to install services, the workflow is nearly identical. The local API is OpenAI-compatible, so your existing API integration code works largely unchanged: just swap the base URL to http://localhost:11434/v1.
Why Run Models Locally?
Benefits:
- Privacy: Data never leaves your machine
- Cost: No per-token charges
- Offline: Works without internet
- Customization: Fine-tune for your needs
- Speed: No network round trip, though generation speed depends on your hardware
Installing Ollama
Ollama makes running local models easy:
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows - Download from ollama.com
# Start the server
ollama serve
Pulling Models
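The number after a model (3B, 70B) is its size in billions of parameters: the learned weights, roughly like a compiled binary's size telling you how much it takes to load. Bigger models are generally more capable but need more RAM and run slower. Rule of thumb at the common 4-bit setting: the weights alone take roughly 0.5 to 0.6GB per billion parameters, so budgeting ~1GB of RAM per billion parameters leaves headroom for the context cache and the rest of your system.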
# Pull popular models
ollama pull llama3.2 # Meta's Llama 3.2 (3B, the default tag)
ollama pull llama3.1:70b # Larger 70B model
ollama pull mistral # Mistral 7B
ollama pull qwen2.5-coder # Code-specialized (modern alternative to codellama)
ollama pull phi4 # Microsoft's Phi-4 (14B)
# List downloaded models
ollama list
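The sizing rule of thumb from this section can be sketched as plain arithmetic. This is a rough estimate, not a measurement: the function names and the overhead_factor default are illustrative assumptions (a factor of 2.0 at 4 bits reproduces the ~1GB-per-billion budget), and real usage varies with the quantization variant and context length.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # (params_billion * 1e9 weights) * bytes_per_weight / 1e9 bytes per GB
    return params_billion * bytes_per_weight


def estimate_ram_gb(params_billion: float,
                    bits_per_weight: float = 4.0,
                    overhead_factor: float = 2.0) -> float:
    """Weights plus an ASSUMED multiplier for context cache and runtime overhead."""
    return estimate_weight_gb(params_billion, bits_per_weight) * overhead_factor


for size in (3, 8, 70):
    weights = estimate_weight_gb(size)
    budget = estimate_ram_gb(size)
    print(f"{size}B: ~{weights:.1f} GB of weights, budget ~{budget:.1f} GB of RAM")
```

Lowering overhead_factor gives a tighter estimate for short contexts; raising bits_per_weight to 16 shows why unquantized models rarely fit on a laptop.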
Using Ollama from Python
Direct API
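# script_id: day_074_ollama_local_models/ollama_api
import requests

def ollama_generate(prompt: str, model: str = "llama3.2") -> str:
    """Generate text using Ollama."""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False  # Return one JSON object instead of a stream of chunks
        },
        timeout=120,  # Local generation can be slow; fail rather than hang forever
    )
    response.raise_for_status()  # Surface HTTP errors such as an unknown model
    return response.json()["response"]

# Usage
result = ollama_generate("Explain Python in one sentence")
print(result)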
Chat API
# script_id: day_074_ollama_local_models/ollama_api
import requests

def ollama_chat(messages: list, model: str = "llama3.2") -> str:
    """Chat using Ollama."""
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": model,
            "messages": messages,
            "stream": False
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["message"]["content"]

# Usage
result = ollama_chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"}
])
print(result)
Using the Ollama Python Library
pip install ollama
# script_id: day_074_ollama_local_models/ollama_library
import ollama

# Simple generation
response = ollama.generate(model='llama3.2', prompt='Why is the sky blue?')
print(response['response'])

# Chat
response = ollama.chat(model='llama3.2', messages=[
    {'role': 'user', 'content': 'Hello!'}
])
print(response['message']['content'])

# Streaming
for chunk in ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Tell me a joke'}],
    stream=True
):
    print(chunk['message']['content'], end='', flush=True)
OpenAI-Compatible Interface
Use Ollama as a drop-in replacement for OpenAI:
# script_id: day_074_ollama_local_models/openai_compatible
from openai import OpenAI

# Point to local Ollama server
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required but not used
)

# Use exactly like OpenAI!
response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Python?"}
    ]
)
print(response.choices[0].message.content)
Day 075 covers env-var/config-driven provider switching between local and cloud.
Understanding Quantization
Local models use quantization to fit in memory. Quantization is lossy compression for model weights — like saving a photo as a smaller JPEG: fewer bits per number, smaller file, slightly lower fidelity. Q4 is aggressive but still good; Q2 is visibly degraded. Day 075 covers quantization formats and bit-level tradeoffs in depth.
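To make the JPEG analogy concrete, here is a toy sketch of uniform quantization on a handful of made-up weights. It is an illustration only: the function names are hypothetical, and real model formats use more elaborate block-wise schemes than this single scale and offset.

```python
def quantize(weights, bits=4):
    """Map floats to integer codes in [0, 2**bits - 1] using one scale and offset."""
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 if all weights are equal
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate floats from the integer codes."""
    return [lo + c * scale for c in codes]


def max_error(weights, bits):
    """Largest absolute difference between original and round-tripped weights."""
    codes, scale, lo = quantize(weights, bits)
    restored = dequantize(codes, scale, lo)
    return max(abs(a - b) for a, b in zip(weights, restored))


# Made-up example weights, not taken from any real model.
weights = [-0.82, -0.31, -0.05, 0.0, 0.12, 0.47, 0.9]
for bits in (8, 4, 2):
    print(f"{bits}-bit max error: {max_error(weights, bits):.4f}")
```

Fewer bits means fewer available levels, so each weight lands further from its original value; that rounding error is the fidelity you trade for a smaller file.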
Choosing Model Size
The practical takeaway you need now is matching a model to your available RAM:
# script_id: day_074_ollama_local_models/recommend_model
def recommend_model(available_ram_gb: int) -> str:
    """Recommend model based on available RAM."""
    if available_ram_gb >= 64:
        return "llama3.3:70b"  # Best quality (4-bit 70B weights are roughly 40GB+)
    elif available_ram_gb >= 32:
        return "qwen2.5:32b"  # Good quality; a 4-bit 70B model does not fit in 32GB
    elif available_ram_gb >= 16:
        return "llama3.1:8b"  # 8B model
    elif available_ram_gb >= 8:
        return "phi4-mini"  # Small but capable
    else:
        return "tinyllama"  # Minimal requirements
Model Comparison
# script_id: day_074_ollama_local_models/benchmark_models
import ollama
import time

def benchmark_models(prompt: str, models: list) -> dict:
    """Compare models on the same prompt."""
    results = {}
    for model in models:
        try:
            start = time.perf_counter()
            response = ollama.generate(model=model, prompt=prompt)
            elapsed = time.perf_counter() - start
            # eval_count is the number of generated tokens; eval_duration is the
            # generation time in nanoseconds. Unlike wall-clock time, these
            # exclude model loading and prompt processing.
            eval_count = response.get("eval_count") or 0
            eval_seconds = (response.get("eval_duration") or 0) / 1e9
            results[model] = {
                "response": response["response"][:200],
                "time_seconds": elapsed,
                "tokens_per_second": eval_count / eval_seconds if eval_seconds > 0 else 0
            }
        except Exception as e:
            results[model] = {"error": str(e)}
    return results

# Compare
prompt = "Explain recursion in programming"
models = ["llama3.2", "mistral", "phi4"]
results = benchmark_models(prompt, models)
for model, data in results.items():
    print(f"\n{model}:")
    if "error" in data:
        print(f"  Error: {data['error']}")
    else:
        print(f"  Time: {data['time_seconds']:.2f}s")
        # tokens/second: roughly how many word-pieces it generates per second;
        # the practical measure of speed (a comfortable reading pace is ~10+ tok/s).
        print(f"  Speed: {data['tokens_per_second']:.1f} tok/s")
Using Local Models in Your Code
Replace Cloud Calls
# script_id: day_074_ollama_local_models/llm_provider
from openai import OpenAI

class LLMProvider:
    """Unified LLM provider supporting local and cloud."""

    def __init__(self, provider: str = "openai"):
        self.provider = provider
        if provider == "ollama":
            self.client = OpenAI(
                base_url="http://localhost:11434/v1",
                api_key="ollama"
            )
            self.default_model = "llama3.2"
        elif provider == "openai":
            self.client = OpenAI()
            self.default_model = "gpt-4o-mini"
        else:
            # Fail early instead of leaving self.client undefined
            raise ValueError(f"Unknown provider: {provider!r}")

    def chat(self, messages: list, **kwargs) -> str:
        """Send chat completion request."""
        model = kwargs.pop("model", self.default_model)
        response = self.client.chat.completions.create(
            model=model,
            messages=messages,
            **kwargs
        )
        return response.choices[0].message.content

    def embed(self, text: str) -> list:
        """Get embeddings."""
        if self.provider == "ollama":
            import ollama
            # ollama.embed (input=) is the current API; it returns a list of
            # vectors under "embeddings". (The legacy ollama.embeddings(prompt=)
            # with a singular "embedding" key still exists but is deprecated.)
            response = ollama.embed(model="nomic-embed-text", input=text)
            return response["embeddings"][0]
        else:
            response = self.client.embeddings.create(
                model="text-embedding-3-small",
                input=text
            )
            return response.data[0].embedding

# Usage
llm = LLMProvider("ollama")  # or "openai"
response = llm.chat([{"role": "user", "content": "Hello!"}])
Local Embeddings
Recall from Day 019: embeddings turn text into a fixed-length list of numbers (768 for nomic-embed-text; that length is the "dimension") so you can compare meaning by distance. Chat models do not produce these, so you pull a dedicated embedding model like nomic-embed-text.
# script_id: day_074_ollama_local_models/local_embeddings
import ollama

# Pull embedding model
# ollama pull nomic-embed-text

def local_embed(texts: list) -> list:
    """Generate embeddings locally."""
    embeddings = []
    for text in texts:
        response = ollama.embed(
            model="nomic-embed-text",
            input=text
        )
        embeddings.append(response["embeddings"][0])
    return embeddings

# Usage
texts = ["Hello world", "Machine learning is cool"]
embeddings = local_embed(texts)
print(f"Got {len(embeddings)} embeddings of dimension {len(embeddings[0])}")
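Once you have vectors, "comparing meaning by distance" usually means something like cosine similarity. The sketch below uses tiny made-up 3-dimensional vectors in place of real embeddings so it runs without a model; the helper names are illustrative, not part of the ollama package.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, candidates):
    """Return (label, score) pairs sorted from most to least similar."""
    scored = [(label, cosine_similarity(query_vec, vec))
              for label, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up stand-ins for real embedding vectors.
toy = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.8, 0.2, 0.1],
    "tax law": [0.0, 0.1, 0.9],
}
for label, score in rank_by_similarity([1.0, 0.0, 0.0], toy):
    print(f"{label}: {score:.3f}")
```

With real data you would pass the vectors returned by your embedding function in place of the toy dictionary.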
Performance Tips
GPU Acceleration
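# Print timing stats (load time, tokens/s) after each response
ollama run llama3.2 --verbose
# Check whether a loaded model is running on GPU or CPU (see the PROCESSOR column)
ollama ps
# For NVIDIA GPUs, install CUDA drivers
# Models automatically use GPU if available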
Batch Processing
# script_id: day_074_ollama_local_models/concurrent_requests
import ollama

def process_batch(prompts: list, model: str = "llama3.2"):
    """Process multiple prompts (note: this loop sends one request at a time)."""
    results = []
    for prompt in prompts:
        response = ollama.generate(model=model, prompt=prompt)
        results.append(response["response"])
    return results

# Recent Ollama versions can serve several requests in parallel
# (see the OLLAMA_NUM_PARALLEL server setting); for production
# batching, consider vLLM
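If you want to issue requests concurrently from the client side, a thread pool is one simple option. The sketch below is an assumption-laden illustration: process_concurrently and fake_generate are hypothetical names, and fake_generate stands in for a real model call so the example runs without a server. Whether requests actually overlap on the server depends on its configuration and your hardware.

```python
from concurrent.futures import ThreadPoolExecutor


def process_concurrently(prompts, generate_fn, max_workers=4):
    """Call generate_fn on each prompt from a thread pool.

    pool.map returns results in the same order as the input prompts,
    regardless of which call finishes first.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(generate_fn, prompts))


def fake_generate(prompt):
    """Stand-in for a real model call (hypothetical, for illustration only)."""
    return f"echo: {prompt}"


print(process_concurrently(["a", "b", "c"], fake_generate))
```

To use it against a real server, pass a function that wraps your own generate call for a fixed model in place of fake_generate.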
Checkpoint
Run the benchmark_models script (or just ollama run llama3.2 from the shell) and confirm you get a coherent completion back with a tokens-per-second number printed. If the call hangs or you get a connection-refused error, check that the Ollama daemon is actually running (ollama serve) and that you've pulled the model first (ollama pull llama3.2).
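The two failure modes in this checkpoint can also be checked from Python. This is a minimal sketch using only the standard library: the helper names are hypothetical, and it assumes the server's /api/tags endpoint returns a JSON object with a "models" list whose entries carry a tagged "name" such as "llama3.2:latest".

```python
import json
import urllib.request


def list_local_models(base_url="http://localhost:11434", timeout=3.0):
    """Return names of pulled models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except OSError:  # Covers connection refused, timeouts, and URLError
        return None
    return [m["name"] for m in payload.get("models", [])]


def is_pulled(model, local_names):
    """True if model (with or without an explicit tag) is in local_names."""
    wanted = model if ":" in model else f"{model}:latest"
    return wanted in local_names


def diagnose(model, local_names):
    """Turn the server state into a short, actionable message."""
    if local_names is None:
        return "server unreachable: start it with `ollama serve`"
    if not is_pulled(model, local_names):
        return f"model missing: run `ollama pull {model}`"
    return "ok"
```

Calling diagnose("llama3.2", list_local_models()) before running the benchmark tells you which of the two fixes to try first.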
Summary
Quick Reference
# Ollama commands
ollama pull llama3.2 # Download model
ollama run llama3.2 # Interactive chat
ollama list # Show models
ollama rm llama3.2 # Delete model
# script_id: day_074_ollama_local_models/quick_reference
# fragment: illustrative cheat-sheet / not standalone-runnable
# Python usage
import ollama
response = ollama.chat(model='llama3.2', messages=[...])
# OpenAI-compatible
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
Exercises
- Pull and chat. Install Ollama, ollama pull llama3.2, and run a single chat turn from Python with the ollama package. Print the response text.
- Swap the client, keep the code. Point the OpenAI SDK at Ollama's http://localhost:11434/v1 base URL and send the same request shape you'd send to OpenAI. Confirm your existing code path works unchanged.
- Compare two local models. Pull a second model (e.g. qwen2.5), ask both the same prompt, and eyeball the difference in quality and latency.
- Inspect what you're running. Use ollama list and ollama show llama3.2 to report the model's parameter count and quantization level.
Solutions (approaches)
- import ollama; print(ollama.chat(model='llama3.2', messages=[{'role':'user','content':'hi'}])['message']['content']).
- client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama"), then client.chat.completions.create(model="llama3.2", messages=[...]); only the client constructor changes.
- Loop over ["llama3.2", "qwen2.5"], time each ollama.chat call with time.perf_counter(), print response + elapsed.
- ollama list shows each model's name (including its tag) and size on disk; ollama show <model> prints model details, including parameter count and quantization level.
What's Next?
Now let's go deeper on Quantization and Model Swapping — the GGUF/AWQ/GPTQ formats, bit-level tradeoffs, and choosing the right quantization for your hardware.