Prompt engineering gets you 80% of the way, but sometimes you need a model that deeply understands your domain, follows your exact format, or runs cheaply on a small model. Fine-tuning adapts a pre-trained model to your specific task -- and with LoRA and QLoRA, you can do it on consumer hardware.
Coming from Software Engineering? LoRA is like monkey-patching a large library -- instead of forking and modifying the whole codebase, you inject small changes at key points. The original weights stay frozen (like the original library), and you train tiny adapter matrices that modify behavior at specific layers. At inference time, these adapters merge cleanly into the base model, just like monkey patches get applied at runtime.
When Fine-tuning Beats Prompting
| Approach | Cost | Time | When to Use |
|---|---|---|---|
| Prompt engineering | Free | Minutes | First attempt, always |
| Few-shot prompting | Low (more tokens) | Hours | Need format/style guidance |
| Fine-tuning (LoRA) | Medium ($10-100) | Hours | Domain-specific behavior |
| Full fine-tuning | High ($100-10K) | Days | Maximum performance |
| Continued pre-training | Very high | Weeks | New language or domain |
Full Fine-tuning vs PEFT
Full fine-tuning updates every parameter in the model. For a 7B model, that means modifying 7 billion weights -- requiring massive GPU memory and risking catastrophic forgetting.
Parameter-Efficient Fine-Tuning (PEFT) methods update only a tiny fraction of parameters while keeping most weights frozen.
| Method | Trainable Params | VRAM (7B) | Quality |
|---|---|---|---|
| Full fine-tuning | 100% | ~56 GB | Best |
| LoRA | 0.1-1% | ~16 GB | ~98% of full |
| QLoRA | 0.1-1% | ~6 GB | ~96% of full |
LoRA Explained
LoRA (Low-Rank Adaptation) inserts small trainable matrices into the model's attention layers. Instead of updating a massive d_out x d_in weight matrix W directly, it freezes W and learns two small matrices, B (d_out x r) and A (r x d_in), so the effective weight becomes W + BA. Because the rank r is tiny compared to the matrix dimensions, BA has far fewer trainable parameters than W.
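To make the savings concrete, here is a quick back-of-the-envelope calculation in pure Python. The 4096 hidden size is an assumption chosen to match a typical 7B-class model:

```python
# Low-rank update: W (d_out x d_in) stays frozen; we train B (d_out x r) and A (r x d_in).
d_in = d_out = 4096   # hidden size of a typical 7B model (assumption)
r = 16                # LoRA rank

full_params = d_out * d_in            # parameters in W itself
lora_params = d_out * r + r * d_in    # parameters in B and A combined

print(full_params)                    # 16777216
print(lora_params)                    # 131072
print(lora_params / full_params)      # 0.0078125 -> under 1% of the full matrix
```

At r=16 the adapter for this single matrix is roughly 0.8% of the frozen weights, which is why whole-model trainable percentages land in the 0.1-1% range.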
Key Parameters
# script_id: day_078_finetuning_fundamentals/lora_config_example
# LoRA configuration
lora_config = {
"r": 16, # Rank: size of the low-rank matrices
"lora_alpha": 32, # Scaling factor (usually 2x rank)
"lora_dropout": 0.05, # Dropout for regularization
"target_modules": [ # Which layers to adapt
"q_proj", # Query projection
"k_proj", # Key projection
"v_proj", # Value projection
"o_proj", # Output projection
],
}
Rank (r): Controls adapter capacity
- `r=8`: Minimal, good for simple tasks
- `r=16`: Balanced (most common)
- `r=32-64`: For complex tasks requiring more capacity
- Higher rank = more parameters = more VRAM
Alpha: Scaling factor for the adapter output (the update BA is multiplied by alpha / r), typically alpha = 2 * r, which gives a constant scale of 2 regardless of rank
Target modules: Which weight matrices get LoRA adapters
- At minimum: `q_proj`, `v_proj` (attention queries and values)
- Better: add `k_proj`, `o_proj` (all attention projections)
- Maximum: add `gate_proj`, `up_proj`, `down_proj` (MLP layers too)
QLoRA: Fine-tuning on Consumer GPUs
QLoRA combines 4-bit quantization with LoRA, enabling fine-tuning on a single GPU with 6-8 GB of VRAM.
Key innovations in QLoRA:
- 4-bit NormalFloat (NF4): Quantization format optimized for normally-distributed neural network weights
- Double quantization: Quantize the quantization constants too, saving additional memory
- Paged optimizers: Page optimizer states between GPU and CPU memory during memory spikes, so training survives transient out-of-memory conditions
# script_id: day_078_finetuning_fundamentals/qlora_peft_workflow
from transformers import BitsAndBytesConfig
import torch
# QLoRA quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True, # Load model in 4-bit
bnb_4bit_quant_type="nf4", # NormalFloat4 quantization
bnb_4bit_compute_dtype=torch.bfloat16, # Compute in bfloat16
bnb_4bit_use_double_quant=True, # Double quantization
)
Training Hyperparameters
# script_id: day_078_finetuning_fundamentals/training_args
from transformers import TrainingArguments
training_args = TrainingArguments(
output_dir="./output",
# Learning rate
learning_rate=2e-4, # Standard for LoRA (higher than full FT)
lr_scheduler_type="cosine", # Cosine decay works well
# Batch size
per_device_train_batch_size=4,
gradient_accumulation_steps=4, # Effective batch size = 4 * 4 = 16
# Duration
num_train_epochs=3, # 1-3 epochs for most tasks
max_steps=-1, # Or set specific step count
# Precision
bf16=True, # Use bfloat16 mixed precision
# Logging
logging_steps=10,
save_strategy="steps",
save_steps=100,
# Optimization
warmup_ratio=0.03, # Warmup for 3% of training
weight_decay=0.01,
optim="paged_adamw_8bit", # Memory-efficient optimizer
)
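The batch and warmup settings above interact, so it helps to compute the resulting schedule explicitly. A small sketch in pure Python; the 5,000-example dataset size is an assumption for illustration:

```python
n_examples = 5_000                   # assumed dataset size
per_device_bs = 4
grad_accum = 4
epochs = 3
warmup_ratio = 0.03

effective_bs = per_device_bs * grad_accum        # gradients accumulate before each update
steps_per_epoch = n_examples // effective_bs     # optimizer steps per pass over the data
total_steps = steps_per_epoch * epochs
warmup_steps = int(total_steps * warmup_ratio)   # steps spent ramping up the LR

print(effective_bs)   # 16
print(total_steps)    # 936
print(warmup_steps)   # 28
```

This is worth checking before training: on small datasets a 3% warmup ratio can collapse to a handful of steps, in which case setting an explicit warmup step count is safer.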
Hyperparameter Guidelines
| Parameter | Small Dataset (<1K) | Medium (1K-10K) | Large (>10K) |
|---|---|---|---|
| Learning rate | 1e-4 | 2e-4 | 2e-4 to 5e-4 |
| Epochs | 3-5 | 2-3 | 1-2 |
| Batch size | 8-16 | 16-32 | 32-64 |
| LoRA rank | 8-16 | 16-32 | 16-64 |
| Warmup ratio | 0.05 | 0.03 | 0.03 |
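The guideline table can be codified as a starting-point helper. This is a hypothetical function of my own, not part of any library; it just encodes midpoints of the ranges above:

```python
def suggest_hyperparams(n_examples: int) -> dict:
    """Hypothetical helper: returns starting-point hyperparameters
    based on dataset size, following the guideline table above."""
    if n_examples < 1_000:
        return {"learning_rate": 1e-4, "epochs": 4, "batch_size": 8,
                "lora_r": 16, "warmup_ratio": 0.05}
    if n_examples <= 10_000:
        return {"learning_rate": 2e-4, "epochs": 3, "batch_size": 16,
                "lora_r": 16, "warmup_ratio": 0.03}
    return {"learning_rate": 2e-4, "epochs": 2, "batch_size": 32,
            "lora_r": 32, "warmup_ratio": 0.03}

print(suggest_hyperparams(500)["epochs"])   # 4 -- small datasets need more passes
```

Treat these as starting points to tune, not fixed rules; validation loss should drive the final choice.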
Hardware Requirements
| Model Size | Full FT (FP16) | LoRA (FP16) | QLoRA (4-bit) |
|---|---|---|---|
| 1-3B | 24 GB | 12 GB | 6 GB |
| 7-8B | 56 GB | 16 GB | 8 GB |
| 13B | 104 GB | 32 GB | 12 GB |
| 34B | 272 GB | 80 GB | 24 GB |
| 70B | 560 GB | 160 GB | 48 GB |
Consumer GPU options for QLoRA:
- RTX 3090 / 4090 (24 GB): Up to 13B models
- RTX 4080 (16 GB): Up to 8B models
- RTX 3070 (8 GB): Up to 3B models
Cloud options:
- A100 40GB ($1-2/hr): Up to 34B with QLoRA
- A100 80GB ($2-4/hr): Up to 70B with QLoRA
- H100 80GB ($3-5/hr): Fastest training for any size
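The table above follows a roughly linear GB-per-billion-parameters pattern, which you can fold into a crude estimator. The coefficients below are my own fit to the table, not measured values, and the QLoRA row is the least linear because of fixed overheads (CUDA context, activations, paged optimizer buffers):

```python
# Rough VRAM cost in GB per billion parameters, fitted to the table above (assumption).
GB_PER_B_PARAMS = {
    "full_ft_fp16": 8.0,   # weights + gradients + optimizer state
    "lora_fp16": 2.3,      # frozen fp16 weights + small adapter and its optimizer
    "qlora_4bit": 0.7,     # 4-bit weights; real usage adds a few GB of fixed overhead
}

def estimate_vram_gb(params_billion: float, method: str) -> float:
    """Crude linear VRAM estimate; treat as a lower bound for qlora_4bit."""
    return params_billion * GB_PER_B_PARAMS[method]

print(estimate_vram_gb(7, "full_ft_fp16"))            # 56.0 -- matches the table
print(round(estimate_vram_gb(70, "qlora_4bit"), 1))   # 49.0 -- close to the table's 48 GB
```

For small models under QLoRA, add a 2-3 GB floor to the linear estimate before deciding whether a GPU fits.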
Putting It Together: LoRA with PEFT
# script_id: day_078_finetuning_fundamentals/qlora_peft_workflow
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load base model
model_name = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config, # QLoRA: load in 4-bit
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Configure LoRA
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16,
lora_alpha=32,
lora_dropout=0.05,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
# Apply LoRA to model
model = get_peft_model(model, lora_config)
# Check trainable parameters
model.print_trainable_parameters()
# Output: trainable params: 13,631,488 || all params: 8,030,261,248 || trainable%: 0.17%
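That 13,631,488 figure can be sanity-checked by hand. A pure-Python recount, assuming Llama-3.1-8B's published shapes (hidden size 4096, 32 layers, and grouped-query attention that shrinks the k/v projection width to 1024):

```python
hidden = 4096    # Llama-3.1-8B hidden size
kv_width = 1024  # k_proj / v_proj output width under grouped-query attention
layers = 32
r = 16

per_layer = (
    r * (hidden + hidden)      # q_proj: A is (r x 4096), B is (4096 x r)
    + r * (hidden + kv_width)  # k_proj: A is (r x 4096), B is (1024 x r)
    + r * (hidden + kv_width)  # v_proj
    + r * (hidden + hidden)    # o_proj
)
print(per_layer * layers)  # 13631488 -- matches print_trainable_parameters()
```

Note that k_proj and v_proj adapters are smaller than q_proj and o_proj because grouped-query attention uses fewer key/value heads.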
Saving and Loading Adapters
# script_id: day_078_finetuning_fundamentals/qlora_peft_workflow
# Save only the LoRA adapter (tiny file, ~50-100 MB)
model.save_pretrained("./my-lora-adapter")
# Load adapter onto base model later
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained(model_name)
model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")
# Merge adapter into base model for faster inference
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-model")
Separate adapters are useful when you have multiple fine-tuned versions of the same base model -- just swap the adapter file instead of loading a whole new model.
Summary
Quick Reference
# script_id: day_078_finetuning_fundamentals/quick_reference
# QLoRA quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
# LoRA config
lora_config = LoraConfig(
r=16, lora_alpha=32, lora_dropout=0.05,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
# Apply and check
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Save adapter
model.save_pretrained("./adapter")
# Merge for deployment
merged = model.merge_and_unload()
Exercises
- Parameter Calculator: Write a script that calculates the number of trainable parameters for a given model size, LoRA rank, and target modules. Compare r=8, r=16, r=32, and r=64 for a 7B model.
- Config Explorer: Load a 7B model with QLoRA (using `BitsAndBytesConfig`) and try different target module combinations. Print `model.print_trainable_parameters()` for each and create a table comparing them.
- Cost Estimator: Build a calculator that estimates fine-tuning cost given: model size, dataset size, number of epochs, and cloud GPU pricing. Include both QLoRA (single GPU) and LoRA (multi-GPU) estimates.
What's Next?
Theory done! Let's get hands-on and fine-tune a real model using Unsloth!