Prompt engineering gets you 80% of the way, but sometimes you need a model that deeply understands your domain, follows your exact format, or runs cheaply on a small model. Fine-tuning adapts a pre-trained model to your specific task -- and with LoRA and QLoRA, you can do it on consumer hardware.
Coming from Software Engineering? LoRA is like monkey-patching a large library -- instead of forking and modifying the whole codebase, you inject small changes at key points. The original weights stay frozen (like the original library), and you train tiny adapter matrices that modify behavior at specific layers. At inference time, these adapters merge cleanly into the base model, just like monkey patches get applied at runtime.
When Fine-tuning Beats Prompting
| Approach | Cost | Time | When to Use |
|---|---|---|---|
| Prompt engineering | Free | Minutes | First attempt, always |
| Few-shot prompting | Low (more tokens) | Hours | Need format/style guidance |
| Fine-tuning (LoRA) | Medium ($10-100) | Hours | Domain-specific behavior |
| Full fine-tuning | High ($100-10K) | Days | Maximum performance |
| Continued pre-training | Very high | Weeks | New language or domain |
Full Fine-tuning vs PEFT
Full fine-tuning updates every parameter in the model. For a 7B model, that means modifying 7 billion weights -- requiring massive GPU memory and risking catastrophic forgetting.
Parameter-Efficient Fine-Tuning (PEFT) methods update only a tiny fraction of parameters while keeping most weights frozen.
| Method | Trainable Params | VRAM (7B) | Quality |
|---|---|---|---|
| Full fine-tuning | 100% | ~56 GB | Best |
| LoRA | 0.1-1% | ~16 GB | ~98% of full |
| QLoRA | 0.1-1% | ~6 GB | ~96% of full |
LoRA Explained
LoRA (Low-Rank Adaptation) inserts small trainable matrices into the model's attention layers. Instead of updating a massive d_out x d_in weight matrix W directly, it freezes W and learns two small matrices, B (d_out x r) and A (r x d_in), so the effective weight becomes W + BA. Because the rank r is tiny compared to the matrix dimensions, BA has far fewer trainable parameters than W.
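To make the savings concrete, here is a quick back-of-the-envelope calculation in pure Python. The 4096 hidden size is an assumption chosen to match a typical 7B-class model:

```python
# Low-rank update: W (d_out x d_in) stays frozen; we train B (d_out x r) and A (r x d_in).
d_in = d_out = 4096   # hidden size of a typical 7B model (assumption)
r = 16                # LoRA rank

full_params = d_out * d_in            # parameters in W itself
lora_params = d_out * r + r * d_in    # parameters in B and A combined

print(full_params)                    # 16777216
print(lora_params)                    # 131072
print(lora_params / full_params)      # 0.0078125 -> under 1% of the full matrix
```

At r=16 the adapter for this single matrix is roughly 0.8% of the frozen weights, which is why whole-model trainable percentages land in the 0.1-1% range.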
Key Parameters
# script_id: day_078_finetuning_fundamentals/lora_config_example
# LoRA configuration
lora_config = {
"r": 16, # Rank: size of the low-rank matrices
"lora_alpha": 32, # Scaling factor (usually 2x rank)
"lora_dropout": 0.05, # Dropout for regularization
"target_modules": [ # Which layers to adapt
"q_proj", # Query projection
"k_proj", # Key projection
"v_proj", # Value projection
"o_proj", # Output projection
],
}
Rank (r): Controls adapter capacity
- `r=8`: Minimal, good for simple tasks
- `r=16`: Balanced (most common)
- `r=32-64`: For complex tasks requiring more capacity
- Higher rank = more parameters = more VRAM
Alpha: Scaling factor for the adapter output (the update BA is multiplied by alpha / r), typically alpha = 2 * r, which gives a constant scale of 2 regardless of rank
Target modules: Which weight matrices get LoRA adapters
- At minimum: `q_proj`, `v_proj` (attention queries and values)
- Better: add `k_proj`, `o_proj` (all attention projections)
- Maximum: add `gate_proj`, `up_proj`, `down_proj` (MLP layers too)
QLoRA: Fine-tuning on Consumer GPUs
QLoRA combines 4-bit quantization with LoRA, enabling fine-tuning on a single GPU with 6-8 GB of VRAM.
Key innovations in QLoRA:
- 4-bit NormalFloat (NF4): Quantization format optimized for normally-distributed neural network weights
- Double quantization: Quantize the quantization constants too, saving additional memory
- Paged optimizers: Page optimizer states between GPU and CPU memory during memory spikes, so training survives transient out-of-memory conditions
# script_id: day_078_finetuning_fundamentals/qlora_peft_workflow
from transformers import BitsAndBytesConfig
import torch
# QLoRA quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True, # Load model in 4-bit
bnb_4bit_quant_type="nf4", # NormalFloat4 quantization
bnb_4bit_compute_dtype=torch.bfloat16, # Compute in bfloat16
bnb_4bit_use_double_quant=True, # Double quantization
)
Training Hyperparameters
# script_id: day_078_finetuning_fundamentals/training_args
from transformers import TrainingArguments
training_args = TrainingArguments(
output_dir="./output",
# Learning rate
learning_rate=2e-4, # Standard for LoRA (higher than full FT)
lr_scheduler_type="cosine", # Cosine decay works well
# Batch size
per_device_train_batch_size=4,
gradient_accumulation_steps=4, # Effective batch size = 4 * 4 = 16
# Duration
num_train_epochs=3, # 1-3 epochs for most tasks
max_steps=-1, # Or set specific step count
# Precision
bf16=True, # Use bfloat16 mixed precision
# Logging
logging_steps=10,
save_strategy="steps",
save_steps=100,
# Optimization
warmup_ratio=0.03, # Warmup for 3% of training
weight_decay=0.01,
optim="paged_adamw_8bit", # Memory-efficient optimizer
)
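The batch and warmup settings above interact, so it helps to compute the resulting schedule explicitly. A small sketch in pure Python; the 5,000-example dataset size is an assumption for illustration:

```python
n_examples = 5_000                   # assumed dataset size
per_device_bs = 4
grad_accum = 4
epochs = 3
warmup_ratio = 0.03

effective_bs = per_device_bs * grad_accum        # gradients accumulate before each update
steps_per_epoch = n_examples // effective_bs     # optimizer steps per pass over the data
total_steps = steps_per_epoch * epochs
warmup_steps = int(total_steps * warmup_ratio)   # steps spent ramping up the LR

print(effective_bs)   # 16
print(total_steps)    # 936
print(warmup_steps)   # 28
```

This is worth checking before training: on small datasets a 3% warmup ratio can collapse to a handful of steps, in which case setting an explicit warmup step count is safer.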
Hyperparameter Guidelines
| Parameter | Small Dataset (<1K) | Medium (1K-10K) | Large (>10K) |
|---|---|---|---|
| Learning rate | 1e-4 | 2e-4 | 2e-4 to 5e-4 |
| Epochs | 3-5 | 2-3 | 1-2 |
| Batch size | 8-16 | 16-32 | 32-64 |
| LoRA rank | 8-16 | 16-32 | 16-64 |
| Warmup ratio | 0.05 | 0.03 | 0.03 |
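The guideline table can be codified as a starting-point helper. This is a hypothetical function of my own, not part of any library; it just encodes midpoints of the ranges above:

```python
def suggest_hyperparams(n_examples: int) -> dict:
    """Hypothetical helper: returns starting-point hyperparameters
    based on dataset size, following the guideline table above."""
    if n_examples < 1_000:
        return {"learning_rate": 1e-4, "epochs": 4, "batch_size": 8,
                "lora_r": 16, "warmup_ratio": 0.05}
    if n_examples <= 10_000:
        return {"learning_rate": 2e-4, "epochs": 3, "batch_size": 16,
                "lora_r": 16, "warmup_ratio": 0.03}
    return {"learning_rate": 2e-4, "epochs": 2, "batch_size": 32,
            "lora_r": 32, "warmup_ratio": 0.03}

print(suggest_hyperparams(500)["epochs"])   # 4 -- small datasets need more passes
```

Treat these as starting points to tune, not fixed rules; validation loss should drive the final choice.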
Hardware Requirements
| Model Size | Full FT (FP16) | LoRA (FP16) | QLoRA (4-bit) |
|---|---|---|---|
| 1-3B | 24 GB | 12 GB | 6 GB |
| 7-8B | 56 GB | 16 GB | 8 GB |
| 13B | 104 GB | 32 GB | 12 GB |
| 34B | 272 GB | 80 GB | 24 GB |
| 70B | 560 GB | 160 GB | 48 GB |
Consumer GPU options for QLoRA:
- RTX 3090 / 4090 (24 GB): Up to 13B models
- RTX 4080 (16 GB): Up to 8B models
- RTX 3070 (8 GB): Up to 3B models
Cloud options:
- A100 40GB ($1-2/hr): Up to 34B with QLoRA
- A100 80GB ($2-4/hr): Up to 70B with QLoRA
- H100 80GB ($3-5/hr): Fastest training for any size
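The table above follows a roughly linear GB-per-billion-parameters pattern, which you can fold into a crude estimator. The coefficients below are my own fit to the table, not measured values, and the QLoRA row is the least linear because of fixed overheads (CUDA context, activations, paged optimizer buffers):

```python
# Rough VRAM cost in GB per billion parameters, fitted to the table above (assumption).
GB_PER_B_PARAMS = {
    "full_ft_fp16": 8.0,   # weights + gradients + optimizer state
    "lora_fp16": 2.3,      # frozen fp16 weights + small adapter and its optimizer
    "qlora_4bit": 0.7,     # 4-bit weights; real usage adds a few GB of fixed overhead
}

def estimate_vram_gb(params_billion: float, method: str) -> float:
    """Crude linear VRAM estimate; treat as a lower bound for qlora_4bit."""
    return params_billion * GB_PER_B_PARAMS[method]

print(estimate_vram_gb(7, "full_ft_fp16"))            # 56.0 -- matches the table
print(round(estimate_vram_gb(70, "qlora_4bit"), 1))   # 49.0 -- close to the table's 48 GB
```

For small models under QLoRA, add a 2-3 GB floor to the linear estimate before deciding whether a GPU fits.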
Putting It Together: LoRA with PEFT
# script_id: day_078_finetuning_fundamentals/qlora_peft_workflow
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load base model
model_name = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config, # QLoRA: load in 4-bit
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Configure LoRA
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16,
lora_alpha=32,
lora_dropout=0.05,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
# Apply LoRA to model
model = get_peft_model(model, lora_config)
# Check trainable parameters
model.print_trainable_parameters()
# Output: trainable params: 13,631,488 || all params: 8,030,261,248 || trainable%: 0.17%
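That 13,631,488 figure can be sanity-checked by hand. A pure-Python recount, assuming Llama-3.1-8B's published shapes (hidden size 4096, 32 layers, and grouped-query attention that shrinks the k/v projection width to 1024):

```python
hidden = 4096    # Llama-3.1-8B hidden size
kv_width = 1024  # k_proj / v_proj output width under grouped-query attention
layers = 32
r = 16

per_layer = (
    r * (hidden + hidden)      # q_proj: A is (r x 4096), B is (4096 x r)
    + r * (hidden + kv_width)  # k_proj: A is (r x 4096), B is (1024 x r)
    + r * (hidden + kv_width)  # v_proj
    + r * (hidden + hidden)    # o_proj
)
print(per_layer * layers)  # 13631488 -- matches print_trainable_parameters()
```

Note that k_proj and v_proj adapters are smaller than q_proj and o_proj because grouped-query attention uses fewer key/value heads.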
Saving and Loading Adapters
# script_id: day_078_finetuning_fundamentals/qlora_peft_workflow
# Save only the LoRA adapter (tiny file, ~50-100 MB)
model.save_pretrained("./my-lora-adapter")
# Load adapter onto base model later
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained(model_name)
model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")
# Merge adapter into base model for faster inference
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-model")
Separate adapters are useful when you have multiple fine-tuned versions of the same base model -- just swap the adapter file instead of loading a whole new model.
Summary
Quick Reference
# script_id: day_078_finetuning_fundamentals/quick_reference
# QLoRA quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
# LoRA config
lora_config = LoraConfig(
r=16, lora_alpha=32, lora_dropout=0.05,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
# Apply and check
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Save adapter
model.save_pretrained("./adapter")
# Merge for deployment
merged = model.merge_and_unload()
Exercises
- Parameter Calculator: Write a script that calculates the number of trainable parameters for a given model size, LoRA rank, and target modules. Compare r=8, r=16, r=32, and r=64 for a 7B model.
- Config Explorer: Load a 7B model with QLoRA (using `BitsAndBytesConfig`) and try different target module combinations. Print `model.print_trainable_parameters()` for each and create a table comparing them.
- Cost Estimator: Build a calculator that estimates fine-tuning cost given: model size, dataset size, number of epochs, and cloud GPU pricing. Include both QLoRA (single GPU) and LoRA (multi-GPU) estimates.
What's Next?
Theory done! Let's get hands-on and fine-tune a real model using Unsloth!