You've parsed documents, chunked them, and stored them in a vector database. Now comes the magic moment: combining retrieved context with LLM prompts to create a RAG system that actually works!
Coming from Software Engineering? Context injection in RAG is just dependency injection for knowledge. Instead of injecting a database connection, you're injecting relevant documents into the prompt. The same dependency-injection mindset applies to managing context.
The RAG Flow
Basic Context Injection
The simplest approach - stuff context into the prompt:
# script_id: day_026_context_injection/rag_with_context_management
from openai import OpenAI
import chromadb

openai_client = OpenAI()
chroma_client = chromadb.PersistentClient(path="./vectordb")
collection = chroma_client.get_collection("documents")


def simple_rag(question: str, n_results: int = 3) -> str:
    """Basic RAG implementation."""
    # 1. Get question embedding
    emb_response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=question
    )
    query_embedding = emb_response.data[0].embedding

    # 2. Retrieve relevant chunks
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
        include=["documents"]
    )
    context = "\n\n".join(results["documents"][0])

    # 3. Build prompt with context
    prompt = f"""Answer the question based on the following context.
If the context doesn't contain relevant information, say "I don't have enough information to answer that."
Context:
{context}
Question: {question}
Answer:"""

    # 4. Get LLM response
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    return response.choices[0].message.content


# Usage
answer = simple_rag("What are the key features of Python?")
print(answer)
Structured Prompt Templates
Use clear sections for better results:
# script_id: day_026_context_injection/rag_with_context_management
RAG_PROMPT_TEMPLATE = """You are a helpful assistant that answers questions based on provided context.
## Instructions
- Only use information from the provided context
- If the context doesn't contain the answer, clearly state that
- Cite specific parts of the context when possible
- Be concise but complete
## Context
{context}
## Question
{question}
## Answer"""


def format_context(chunks: list[dict]) -> str:
    """Format retrieved chunks with source info."""
    formatted_parts = []
    for i, chunk in enumerate(chunks, 1):
        source = chunk.get("metadata", {}).get("source", "Unknown")
        text = chunk.get("document", chunk.get("text", ""))
        formatted_parts.append(f"[Source {i}: {source}]\n{text}")
    return "\n\n---\n\n".join(formatted_parts)


def structured_rag(question: str, n_results: int = 5) -> dict:
    """RAG with structured prompt and metadata."""
    # Get embeddings and search
    emb_response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=question
    )
    results = collection.query(
        query_embeddings=[emb_response.data[0].embedding],
        n_results=n_results,
        include=["documents", "metadatas", "distances"]
    )

    # Format chunks with metadata
    chunks = [
        {
            "document": results["documents"][0][i],
            "metadata": results["metadatas"][0][i],
            # Chroma returns a *distance* (0 = identical, bigger = less similar);
            # flip it so higher = better, matching Day 20. This assumes the collection
            # uses cosine distance, where 1 - distance is the cosine similarity.
            # With the default L2 space the result is not bounded to 0-1.
            "similarity": 1 - results["distances"][0][i]
        }
        for i in range(len(results["ids"][0]))
    ]
    context = format_context(chunks)
    prompt = RAG_PROMPT_TEMPLATE.format(context=context, question=question)

    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    return {
        "answer": response.choices[0].message.content,
        "sources": [c["metadata"].get("source") for c in chunks],
        "top_similarity": chunks[0]["similarity"] if chunks else 0
    }
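The 1 - distance flip used in structured_rag is only a true similarity under cosine distance. The sketch below works through the arithmetic for unit-length vectors, assuming the store reports either cosine distance or squared L2 distance (which is my understanding of Chroma's "l2" space; verify for your version). The helper names are illustrative and not part of any library.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Plain cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_from_cosine_distance(distance: float) -> float:
    """Cosine distance is defined as 1 - cos, so the flip is exact."""
    return 1.0 - distance


def similarity_from_squared_l2(distance: float) -> float:
    """For unit vectors, ||a - b||^2 = 2 - 2*cos, so cos = 1 - distance / 2."""
    return 1.0 - distance / 2.0


# Two unit-length toy vectors (made up for illustration).
a = [1.0, 0.0]
b = [0.6, 0.8]

true_cosine = cosine_similarity(a, b)
cosine_distance = 1.0 - true_cosine
squared_l2 = sum((x - y) ** 2 for x, y in zip(a, b))

print(similarity_from_cosine_distance(cosine_distance))  # matches true_cosine
print(similarity_from_squared_l2(squared_l2))            # matches true_cosine
print(1.0 - squared_l2)                                  # naive flip on L2: too low
```

If the similarity scores in your pipeline look oddly low or go negative, checking which distance metric the collection was created with is a reasonable first step.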
Context Window Management
Don't overflow the context window!
# script_id: day_026_context_injection/rag_with_context_management
import tiktoken


def count_tokens(text: str, model: str = "gpt-4o-mini") -> int:
    """Count tokens in text."""
    encoder = tiktoken.encoding_for_model(model)
    return len(encoder.encode(text))


def fit_context_to_window(
    chunks: list[str],
    question: str,
    max_context_tokens: int = 3000,
    model: str = "gpt-4o-mini"
) -> str:
    """Select chunks that fit within token budget."""
    # Reserve tokens for question and response
    question_tokens = count_tokens(question, model)
    overhead_tokens = 500  # For prompt template and response buffer
    available_tokens = max_context_tokens - question_tokens - overhead_tokens

    selected_chunks = []
    current_tokens = 0
    for chunk in chunks:
        chunk_tokens = count_tokens(chunk, model)
        if current_tokens + chunk_tokens <= available_tokens:
            selected_chunks.append(chunk)
            current_tokens += chunk_tokens
        else:
            break  # Stop when budget exceeded
    return "\n\n".join(selected_chunks)


def token_aware_rag(question: str, max_tokens: int = 3000) -> str:
    """RAG with token budget management."""
    # Retrieve more chunks than needed
    emb_response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=question
    )
    results = collection.query(
        query_embeddings=[emb_response.data[0].embedding],
        n_results=10,  # Get extra for selection
        include=["documents"]
    )
    chunks = results["documents"][0]

    # Fit to token budget
    context = fit_context_to_window(chunks, question, max_tokens)
    prompt = f"""Based on the context below, answer the question.
Context:
{context}
Question: {question}"""

    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    return response.choices[0].message.content
Advanced: Multi-Query RAG
Generate multiple queries for better retrieval:
# script_id: day_026_context_injection/rag_with_context_management
import re


def generate_query_variations(question: str, n_variations: int = 3) -> list[str]:
    """Generate variations of the question for better retrieval."""
    prompt = f"""Generate {n_variations} different ways to ask this question.
Return only the questions, one per line.
Original question: {question}"""
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7
    )
    variations = response.choices[0].message.content.strip().split("\n")
    # Models often number or bullet their output ("1. ...", "- ..."); strip that prefix
    # so it doesn't end up in the text we embed.
    cleaned = [re.sub(r"^\s*(?:\d+[.)]|[-*])\s+", "", v).strip() for v in variations]
    return [question] + [v for v in cleaned if v]


def multi_query_rag(question: str, n_results: int = 3) -> str:
    """RAG with multiple query variations."""
    # Generate query variations
    queries = generate_query_variations(question)
    print(f"Searching with {len(queries)} query variations")

    # Collect unique chunks from all queries
    all_chunks = {}
    for query in queries:
        emb_response = openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=query
        )
        results = collection.query(
            query_embeddings=[emb_response.data[0].embedding],
            n_results=n_results,
            include=["documents", "distances"]
        )
        for i, doc_id in enumerate(results["ids"][0]):
            similarity = 1 - results["distances"][0][i]
            # Keep the best score seen for each chunk across all query variations
            if doc_id not in all_chunks or all_chunks[doc_id]["similarity"] < similarity:
                all_chunks[doc_id] = {
                    "text": results["documents"][0][i],
                    "similarity": similarity
                }

    # Sort by similarity and take top results
    sorted_chunks = sorted(all_chunks.values(), key=lambda x: x["similarity"], reverse=True)
    context = "\n\n".join(c["text"] for c in sorted_chunks[:n_results * 2])

    # Generate answer
    prompt = f"""Answer based on this context:
{context}
Question: {question}"""
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    return response.choices[0].message.content
Complete RAG System
# script_id: day_026_context_injection/complete_rag_system
from dataclasses import dataclass

import chromadb
from openai import OpenAI


@dataclass
class RAGResponse:
    answer: str
    sources: list[dict]
    confidence: float
    tokens_used: int


class RAGSystem:
    """Production-ready RAG system."""

    def __init__(self, collection_name: str, persist_dir: str = "./vectordb"):
        self.openai = OpenAI()
        self.chroma = chromadb.PersistentClient(path=persist_dir)
        # Use cosine distance so that 1 - distance in query() is a cosine similarity.
        # Chroma's default space is L2, where that flip is not a 0-1 score. The setting
        # only takes effect when the collection is first created.
        self.collection = self.chroma.get_or_create_collection(
            collection_name,
            metadata={"hnsw:space": "cosine"}
        )
        self.system_prompt = """You are a helpful assistant that answers questions based on provided context.
Guidelines:
- Only use information from the context provided
- If you can't answer from the context, say so clearly
- Be concise but thorough
- Cite sources when possible using [Source N] notation"""

    def query(
        self,
        question: str,
        n_results: int = 5,
        # 0.5 is a loose starting point. Calibrate it by printing similarities for
        # known-good vs off-topic queries on your own data, then raise it until junk is excluded.
        similarity_threshold: float = 0.5
    ) -> RAGResponse:
        """Process a question through the RAG pipeline."""
        # Embed question
        emb = self.openai.embeddings.create(
            model="text-embedding-3-small",
            input=question
        )

        # Retrieve
        results = self.collection.query(
            query_embeddings=[emb.data[0].embedding],
            n_results=n_results,
            include=["documents", "metadatas", "distances"]
        )

        # Filter by threshold
        sources = []
        context_parts = []
        for i in range(len(results["ids"][0])):
            similarity = 1 - results["distances"][0][i]
            if similarity >= similarity_threshold:
                sources.append({
                    "id": results["ids"][0][i],
                    "text": results["documents"][0][i][:200] + "...",
                    "metadata": results["metadatas"][0][i],
                    "similarity": similarity
                })
                context_parts.append(
                    f"[Source {len(sources)}]\n{results['documents'][0][i]}"
                )

        if not sources:
            return RAGResponse(
                answer="I couldn't find relevant information to answer your question.",
                sources=[],
                confidence=0.0,
                tokens_used=0
            )

        context = "\n\n---\n\n".join(context_parts)

        # Generate response (with error handling for production)
        try:
            response = self.openai.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": self.system_prompt},
                    {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
                ],
                temperature=0,
                timeout=30.0  # Don't let API calls hang
            )
        except Exception as e:
            return RAGResponse(
                answer=f"Sorry, I encountered an error generating a response: {type(e).__name__}",
                sources=sources,
                confidence=0.0,
                tokens_used=0
            )

        avg_similarity = sum(s["similarity"] for s in sources) / len(sources)
        return RAGResponse(
            answer=response.choices[0].message.content,
            sources=sources,
            confidence=avg_similarity,
            tokens_used=response.usage.total_tokens
        )
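# Usage
rag = RAGSystem("knowledge_base")

# Add some documents first. Embed them with the same model used at query time:
# if you pass only documents, Chroma falls back to its default embedding function,
# and those vectors won't match the dimensions of the OpenAI query embeddings.
docs = [
    "Python is a programming language known for its simplicity.",
    "Machine learning uses algorithms to learn from data.",
    "RAG combines retrieval with generation for better answers."
]
doc_embeddings = rag.openai.embeddings.create(
    model="text-embedding-3-small",
    input=docs
)
# upsert (rather than add) keeps the script re-runnable with the same ids
rag.collection.upsert(
    ids=["1", "2", "3"],
    embeddings=[item.embedding for item in doc_embeddings.data],
    documents=docs,
    metadatas=[{"topic": "python"}, {"topic": "ml"}, {"topic": "rag"}]
)

# Query
result = rag.query("What is Python?")
print(f"Answer: {result.answer}")
print(f"Confidence: {result.confidence:.2%}")
print(f"Sources: {len(result.sources)}")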
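The comment on similarity_threshold suggests calibrating it against your own data. One way to do that, sketched below under the assumption that you have collected top similarity scores for a handful of known-good and off-topic queries: pick a cutoff halfway between the weakest good match and the strongest off-topic match. The function name and the sample scores are made up for illustration.

```python
from typing import Optional


def suggest_threshold(
    relevant_scores: list[float],
    off_topic_scores: list[float],
) -> Optional[float]:
    """Suggest a cutoff between off-topic and relevant similarity scores.

    Returns None when the two groups overlap, which means no single
    threshold separates them cleanly on this sample.
    """
    lowest_relevant = min(relevant_scores)
    highest_off_topic = max(off_topic_scores)
    if highest_off_topic >= lowest_relevant:
        return None
    return (lowest_relevant + highest_off_topic) / 2


# Hypothetical scores you might print while probing your own collection.
known_good = [0.62, 0.71, 0.58]
off_topic = [0.21, 0.34, 0.29]

threshold = suggest_threshold(known_good, off_topic)
print(f"Suggested threshold: {threshold}")  # roughly 0.46 for these sample scores
```

A None result is itself informative: it suggests the embedding scores alone cannot separate relevant from irrelevant chunks on your data, which is the situation reranking is meant to help with.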
Multi-Tenant RAG: Security Considerations
Coming from Software Engineering? If you've built multi-tenant SaaS systems, you know that User A should never see User B's data. The same principle applies to RAG — but it's easy to accidentally leak documents across tenants through vector search.
In production RAG systems with multiple users or organizations, you must ensure document isolation:
The Problem
Without access control, a vector similarity search returns the closest matches by vector distance (the "nearest neighbors") across all documents in your collection — regardless of who uploaded them.
Solutions
1. Metadata filtering (simplest): Tag every document with an owner_id and filter at query time.
# script_id: day_026_context_injection/multi_tenant_metadata_filtering
# When indexing
collection.add(
documents=["Secret quarterly results..."],
metadatas=[{"owner_id": "org_123", "department": "finance"}],
ids=["doc_1"]
)
# When querying — ALWAYS include the owner filter
# This block highlights the where= filter; in this lesson's stack you'd embed the query
# and pass query_embeddings=[...] alongside where={...}, as in the blocks above.
results = collection.query(
query_texts=["quarterly results"],
where={"owner_id": "org_123"}, # Critical: never omit this
n_results=5
)
2. Separate collections per tenant: Stronger isolation but more operational overhead.
3. Row-level security in the vector DB: Pgvector with PostgreSQL RLS policies gives database-level enforcement.
Rule of thumb: Start with metadata filtering for prototypes. Move to separate collections or database-level RLS before handling sensitive data in production. Never rely on the application layer alone — defense in depth applies here just like any other data access pattern.
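With metadata filtering, the main risk is a call site that forgets the owner filter. A small sketch of one way to reduce that risk: build every where clause through a single helper that refuses to run without an owner_id. The helper name tenant_where is hypothetical; the "$and" operator is part of Chroma's where-filter syntax as I understand it, so verify it against the version you run.

```python
from typing import Optional


def tenant_where(owner_id: str, extra: Optional[dict] = None) -> dict:
    """Build a where-filter that always includes the tenant's owner_id."""
    if not owner_id:
        raise ValueError("owner_id is required for every query")
    if extra and "owner_id" in extra:
        # Don't let a caller-supplied filter override the tenant scope.
        raise ValueError("extra filters must not set owner_id")

    owner_clause = {"owner_id": owner_id}
    if not extra:
        return owner_clause
    # One condition per clause, combined with $and.
    return {"$and": [owner_clause, *({key: value} for key, value in extra.items())]}


print(tenant_where("org_123"))
print(tenant_where("org_123", {"department": "finance"}))

# Intended use (not executed here):
# results = collection.query(
#     query_embeddings=[query_embedding],
#     where=tenant_where("org_123", {"department": "finance"}),
#     n_results=5,
# )
```

This is still application-layer enforcement, so it complements rather than replaces separate collections or database-level RLS.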
Reranking: Improving Retrieval Quality
Vector similarity search is fast but approximate. The top-K chunks from embedding search are just the nearest neighbors in embedding space, and closest by raw distance is not the same as most useful for answering the question. Reranking adds a second pass: a more powerful model re-scores each candidate for actual relevance to the query.
Two-Stage Retrieval Pipeline
Coming from Software Engineering? This is the same pattern as a two-phase search: a fast, approximate index scan (like Elasticsearch BM25 or vector ANN) followed by a precise re-scoring pass. If you've built search systems with a "candidate generation -> ranking" pipeline, reranking in RAG is exactly the same idea.
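To make the candidate generation -> ranking shape concrete, here is a dependency-free toy version. Both scorers are stand-ins chosen only so the example runs anywhere: a term-frequency count plays the role of the fast first stage, and Jaccard overlap plays the role of the slower, more careful reranker. Neither is what a real system would use, and all names here are illustrative.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def term_frequency_score(query: str, doc: str) -> float:
    """Cheap stage-1 stand-in: count doc words that appear in the query."""
    query_words = set(query.lower().split())
    return float(sum(1 for word in doc.lower().split() if word in query_words))


def jaccard_score(query: str, doc: str) -> float:
    """Stage-2 stand-in: overlap of distinct words, penalizing off-topic filler."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def two_stage_retrieve(
    query: str,
    documents: list[str],
    cheap_score: Scorer,
    precise_score: Scorer,
    n_candidates: int = 20,
    top_n: int = 5,
) -> list[str]:
    # Stage 1: fast, approximate scoring over the whole corpus.
    candidates = sorted(
        documents, key=lambda doc: cheap_score(query, doc), reverse=True
    )[:n_candidates]
    # Stage 2: slower, more precise scoring over the short candidate list only.
    return sorted(
        candidates, key=lambda doc: precise_score(query, doc), reverse=True
    )[:top_n]


docs = [
    "authentication uses signed tokens issued at login",
    "login page styling and layout guide for the login screen and login button",
    "database backup schedule",
]
query = "how does login authentication work"

best = two_stage_retrieve(
    query, docs, term_frequency_score, jaccard_score, n_candidates=2, top_n=1
)
print(best)
```

In this toy corpus the first stage ranks the styling guide highest because it repeats "login", and the second stage promotes the authentication chunk. That reordering is the effect a real reranker is meant to provide.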
Cohere Rerank (Managed API)
# script_id: day_026_context_injection/cohere_rerank
# pip install cohere
# assumes `vector_results` = your top-20 candidates from the vector search above (each with a .text attribute)
import cohere
co = cohere.ClientV2()  # Reads the API key from the environment (CO_API_KEY in the current Python SDK; verify for your version)
# After vector search returns 20 candidates
results = co.rerank(
query="How does authentication work?",
documents=[chunk.text for chunk in vector_results],
top_n=5,
model="rerank-v3.5",
)
# Use reranked results for context injection
reranked_chunks = [vector_results[r.index] for r in results.results]
Cross-Encoder Reranking (Open-Source)
If you want to self-host and avoid API costs, cross-encoders give you the same two-stage pattern with no external API dependency.
A cross-encoder reads the question and a candidate chunk together and outputs one relevance score, unlike embedding search, which encodes them separately and then compares the vectors. That joint look is slower but more accurate, which is why it makes a good second pass:
# script_id: day_026_context_injection/cross_encoder_rerank
# pip install sentence-transformers
# assumes `vector_results` = your top-20 candidates from the vector search above (each with a .text attribute)
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "How does authentication work?"
pairs = [(query, chunk.text) for chunk in vector_results]
scores = reranker.predict(pairs)
# Sort by reranker score, highest first. Sorting on the score alone avoids
# comparing chunk objects when two scores tie.
reranked = sorted(zip(scores, vector_results), key=lambda pair: pair[0], reverse=True)[:5]
When to Use Reranking
- Large corpora where top-K embedding results are noisy
- High-stakes applications (legal, medical) where retrieval precision matters
- Existing RAG pipelines where answers sometimes miss the right context — reranking is often the highest-ROI fix
The Long-Context Alternative
Everything above assumes you need retrieval. But the ground is shifting fast.
SWE Analogy: RAG is like building a microservice with a search index to find relevant config files. Long-context is like just... loading the entire config directory into memory. If it fits, why build the plumbing?
With Claude and Gemini both supporting up to ~1M tokens (as of 2026-06; verify with the provider), many use cases that previously required RAG can now just stuff all the documents directly into the context window. No chunking, no embeddings, no vector database — just read the files and send them.
When to Use RAG vs Long-Context
Rules of thumb:
- Corpus fits in the window and doesn't change often → try long-context first
- Corpus exceeds the window or updates frequently → RAG is the way
- Need pinpoint retrieval across thousands of documents → RAG
- Building a quick prototype or internal tool → long-context saves you days of infra
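The first two rules of thumb reduce to a size check plus a freshness check. A rough sketch, using the len(text) // 4 token estimate from the exercises in this lesson; the function names, the reserved_tokens default, and the window size you pass in are all assumptions to replace with your own numbers.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token for English text."""
    return len(text) // 4


def choose_strategy(
    documents: list[str],
    context_window_tokens: int,
    reserved_tokens: int = 4000,  # placeholder budget for instructions, question, and answer
    updates_frequently: bool = False,
) -> str:
    """Return "long-context" when the whole corpus fits comfortably, else "rag"."""
    corpus_tokens = sum(estimate_tokens(doc) for doc in documents)
    fits = corpus_tokens + reserved_tokens <= context_window_tokens
    if fits and not updates_frequently:
        return "long-context"
    return "rag"


# Ten documents of 4,000 characters each: about 10,000 estimated tokens.
corpus = ["x" * 4000 for _ in range(10)]

print(choose_strategy(corpus, context_window_tokens=200_000))
print(choose_strategy(corpus, context_window_tokens=8_000))
print(choose_strategy(corpus, context_window_tokens=200_000, updates_frequently=True))
```

The estimate is deliberately crude. For a decision near the boundary, count tokens with the tokenizer for the model you actually plan to call.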
Tradeoffs at a Glance
| Aspect | Long-Context | RAG |
|---|---|---|
| Complexity | Simple (no chunking, no vector DB) | Complex (chunking, embedding, retrieval) |
| Cost per query | Higher (more input tokens) | Lower (only relevant chunks) |
| Indexing cost | None | Embedding + storage |
| Scalability | Limited by context window | Scales to millions of docs |
| Freshness | Re-send each query | Re-index only changed docs |
| Accuracy | Good for small corpora | Better for large corpora with precise retrieval |
In code, the long-context approach is trivial: read the whole corpus and drop it into the system prompt ({"role": "system", "content": f"Answer based on these docs:\n\n{all_docs}"}) — no embeddings, no vector DB, no chunking logic. If your documents fit the window, that's dramatically less code to ship and maintain than the RAG system above.
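As a minimal sketch of that idea, the helper below reads every matching file in a directory and builds the messages list. The function name, the "*.md" pattern, and the per-file header are illustrative choices, and the sketch assumes the combined files fit in the model's window.

```python
from pathlib import Path


def build_long_context_messages(
    doc_dir: str, question: str, pattern: str = "*.md"
) -> list[dict]:
    """Read every matching file and place the whole corpus in the system prompt."""
    parts = []
    for path in sorted(Path(doc_dir).glob(pattern)):
        # Prefix each document with its file name so answers can refer back to it.
        parts.append(f"### {path.name}\n{path.read_text(encoding='utf-8')}")
    all_docs = "\n\n".join(parts)
    return [
        {"role": "system", "content": f"Answer based on these docs:\n\n{all_docs}"},
        {"role": "user", "content": question},
    ]


# Intended use (not executed here; "./docs" is a placeholder path):
# messages = build_long_context_messages("./docs", "What is our refund policy?")
# response = openai_client.chat.completions.create(model="gpt-4o", messages=messages)
```

Sorting the paths keeps the prompt stable between runs, which makes answers easier to compare.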
The Hybrid Approach
In practice, production systems often combine both strategies: use RAG to retrieve the top-K relevant chunks from a massive corpus, then stuff those chunks plus surrounding context into a generous context window. You get the scalability of retrieval with the comprehension benefits of feeding the model more complete documents.
# script_id: day_026_context_injection/hybrid_rag_approach
# Sketch: retrieve_chunks() and get_surrounding_chunks() are placeholders for your own
# retrieval layer (they are not defined in this lesson). openai_client is the OpenAI()
# client created in the first example.
def hybrid_rag(question: str, n_chunks: int = 5, context_padding: int = 2) -> str:
    """Retrieve top chunks via RAG, then expand context for the LLM."""
    # Step 1: RAG retrieves the most relevant chunks
    top_chunks = retrieve_chunks(question, n_results=n_chunks)

    # Step 2: Expand each chunk with its neighbors for richer context
    expanded = []
    for chunk in top_chunks:
        neighbors = get_surrounding_chunks(
            chunk["doc_id"], chunk["chunk_index"], padding=context_padding
        )
        expanded.append("\n".join(neighbors))
    full_context = "\n\n---\n\n".join(expanded)

    # Step 3: Send the expanded context to a long-context model
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer from this context:\n\n{full_context}"},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content
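hybrid_rag calls get_surrounding_chunks without defining it. One possible shape for that helper, assuming you kept each document's chunks in their original order at indexing time; the in-memory CHUNK_STORE dict stands in for whatever storage you actually use, and its contents are made up.

```python
from typing import Optional

# Hypothetical store: doc_id -> that document's chunks, in original order.
CHUNK_STORE: dict[str, list[str]] = {
    "doc_a": ["chunk 0", "chunk 1", "chunk 2", "chunk 3", "chunk 4"],
}


def get_surrounding_chunks(
    doc_id: str,
    chunk_index: int,
    padding: int = 2,
    store: Optional[dict[str, list[str]]] = None,
) -> list[str]:
    """Return the chunk at chunk_index plus up to `padding` neighbors on each side."""
    chunks = (store if store is not None else CHUNK_STORE)[doc_id]
    start = max(0, chunk_index - padding)              # clamp at the document start
    end = min(len(chunks), chunk_index + padding + 1)  # clamp at the document end
    return chunks[start:end]


print(get_surrounding_chunks("doc_a", 2, padding=1))
print(get_surrounding_chunks("doc_a", 0, padding=2))
```

If two retrieved chunks sit close together in the same document, their windows will overlap, so you may want to merge overlapping ranges before building the prompt to avoid sending the same text twice.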
Bottom line: RAG isn't going away — it's essential for large, dynamic corpora. But before you reach for the vector database, ask yourself: does this actually need retrieval, or can I just send everything? The cheapest infrastructure is the infrastructure you don't build.
Checkpoint
Run the RAGSystem end to end on a question your documents can answer and confirm that the answer cites real retrieved chunks (check result.sources) rather than the model's own memory. Now ask something not in your corpus and confirm it declines or flags low confidence instead of confabulating. If answers ignore the context entirely, print the context string to check that the retrieved chunks are actually reaching the prompt, and confirm that the similarity threshold (or a token budget, if you added one) didn't filter them all away.
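Part of this check can be automated. The sketch below assumes answers use the [Source N] notation requested in RAGSystem's system prompt; it extracts the cited numbers and flags any that don't correspond to a retrieved source. The function names and the sample answer are illustrative.

```python
import re


def cited_source_numbers(answer: str) -> list[int]:
    """Return the distinct source numbers cited as [Source N], in ascending order."""
    return sorted({int(n) for n in re.findall(r"\[Source (\d+)\]", answer)})


def invalid_citations(answer: str, n_sources: int) -> list[int]:
    """Return cited numbers that fall outside 1..n_sources."""
    return [n for n in cited_source_numbers(answer) if not 1 <= n <= n_sources]


sample_answer = (
    "Python is known for its simplicity [Source 1]. "
    "It is also used in machine learning [Source 4]."
)

print(cited_source_numbers(sample_answer))
print(invalid_citations(sample_answer, n_sources=3))

# Intended use with the RAGSystem above (not executed here):
# bad = invalid_citations(result.answer, n_sources=len(result.sources))
```

A valid citation number only shows the model pointed at a chunk that exists. Whether the chunk actually supports the claim still needs a manual read or a separate evaluation step.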
Summary
Quick Reference
# script_id: day_026_context_injection/quick_reference
# Basic RAG prompt
prompt = f"""Context: {context}
Question: {question}
Answer based only on the context above."""
# With system prompt
messages = [
{"role": "system", "content": "Answer using only the provided context."},
{"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"}
]
Framework Note: Frameworks like LangChain and LlamaIndex (covered in Phase 3) can significantly simplify RAG pipeline construction -- handling chunking, embedding, retrieval, and prompt assembly with just a few lines of code. If you find yourself rebuilding these patterns from scratch, consider reaching for a framework.
Exercises
- Add source citations to answers. Number each retrieved chunk in the context ([1], [2], ...) and instruct the model to cite the number it used. Verify the answer references real chunks.
- Enforce a token budget. Before building the prompt, drop the lowest-ranked chunks until the context fits a fixed token count (estimate with len(text) // 4). Log how many chunks you kept.
- Make the system refuse politely. Add an instruction so that when no retrieved chunk answers the question, the model replies "I don't have that information" instead of hallucinating. Test it with an off-topic question.
- Compare RAG vs. long-context. For a small corpus that fits in the window, answer the same question two ways (RAG-injected chunks vs. dumping the whole corpus) and compare answer quality and token cost.
Solutions (approaches)
- Format context as f"[{i}] {chunk}"; add "Cite the bracketed source number(s) you used." to the prompt. Spot-check that cited numbers exist.
- Sort chunks by score, accumulate until the running token estimate hits your cap, then stop. Keep at least one chunk so the prompt is never empty.
- Add "If the context does not contain the answer, say you don't know." Threshold-filter weak matches first so truly irrelevant chunks never reach the prompt.
- Token-count both prompts; long-context is simpler but costs more per query and loses source attribution, which is exactly the tradeoff in the Long-Context section above.
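For the token-budget exercise, one possible shape of the selection loop is sketched below. It uses the len(text) // 4 estimate and the keep-at-least-one rule from the solution notes; the function name and the (score, text) tuple format are assumptions about how you hold your retrieved chunks.

```python
def select_within_budget(
    scored_chunks: list[tuple[float, str]], max_tokens: int
) -> list[str]:
    """Keep the highest-scoring chunks whose estimated tokens fit the budget."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    kept: list[str] = []
    used = 0
    for _score, text in ranked:
        cost = len(text) // 4  # rough token estimate
        # Always keep the first chunk so the prompt is never empty.
        if kept and used + cost > max_tokens:
            break
        kept.append(text)
        used += cost
    return kept


# Three chunks of roughly 100 estimated tokens each (made-up scores).
chunks = [(0.9, "a" * 400), (0.5, "b" * 400), (0.7, "c" * 400)]

kept = select_within_budget(chunks, max_tokens=250)
print(f"Kept {len(kept)} of {len(chunks)} chunks")
```

Stopping at the first chunk that doesn't fit preserves ranking order. Skipping it and trying smaller, lower-ranked chunks would pack the budget tighter at the cost of including weaker matches.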
What's Next?
Now you've built a complete RAG system! Next, we'll go beyond flat vector search with GraphRAG and Knowledge Graphs — handling multi-hop questions that single-shot retrieval struggles with. (Tool calling follows in Day 28.)