Phase 6 of 8 · Advanced Finetuning · 12 min read

Hands-on Fine-tuning with Unsloth

Time to get your hands dirty. Unsloth is an optimized fine-tuning library that reports roughly 2x faster training and about 50% less memory use than standard HuggingFace training, with the same output quality and much faster iteration. This guide walks you through fine-tuning a real model end-to-end.

Coming from Software Engineering? Fine-tuning with Unsloth is like switching to an optimized toolchain: same output, dramatically faster iteration cycles. Think of it as the difference between gcc -O0 and gcc -O3: your source code is identical, but the optimized build runs much faster. Unsloth achieves this through custom GPU kernels (written in Triton) and memory optimizations under the hood.


Why Unsloth?

Aspect              | Standard HF + PEFT | Unsloth
--------------------|--------------------|------------------------------------
Training speed      | Baseline           | 2x faster
Memory usage        | Baseline           | 50% less
Supported models    | All                | Llama, Mistral, Phi, Gemma, Qwen
API compatibility   | HuggingFace        | HuggingFace (drop-in)
Custom GPU kernels  | No                 | Yes
Free tier           | N/A                | Yes (open source)

Installation

# Basic installation
pip install unsloth

# With all dependencies for training
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes

# Optional: Weights & Biases for monitoring
pip install wandb

Step 1: Load the Base Model

# script_id: day_079_handson_finetuning/finetune_workflow
from unsloth import FastLanguageModel
import torch

# Configuration
max_seq_length = 2048
dtype = None  # Leave as None -- Unsloth picks the right number format for your GPU automatically (you don't need to choose).
load_in_4bit = True  # QLoRA: use 4-bit quantization

# Load model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.1-8B-Instruct",  # Or "unsloth/Phi-4"
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

print(f"Model loaded: {model.config._name_or_path}")
print(f"Parameters: {model.num_parameters():,}")

Unsloth provides pre-optimized model downloads that are faster to load:

  • unsloth/Llama-3.1-8B-Instruct -- Meta's Llama 3.1
  • unsloth/Phi-4 -- Microsoft's Phi-4
  • unsloth/Mistral-7B-Instruct-v0.3 -- Mistral AI
  • unsloth/gemma-2-9b-it -- Google's Gemma 2
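
To see why load_in_4bit=True matters, here is a back-of-the-envelope estimate of the memory needed for the weights alone. It is only a sketch: it assumes roughly 8 billion parameters and ignores activations, optimizer state, the LoRA weights, and quantization overhead, so treat the numbers as a lower bound and not a measurement.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1024**3


params = 8e9  # assumption: ~8B parameters for an "8B" model

for label, bits in [("16-bit", 16), ("4-bit", 4)]:
    print(f"{label} weights: ~{weight_memory_gb(params, bits):.1f} GiB")
```

Under these assumptions the 4-bit weights take about a quarter of the space of the 16-bit weights, which is what lets an 8B model fit on a single consumer GPU.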

Step 2: Configure LoRA

# script_id: day_079_handson_finetuning/finetune_workflow
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                          # LoRA rank
    # These are the model's internal weight matrices LoRA attaches to. This is the
    # standard "all layers" setting -- copy as-is; you rarely change it.
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
        "gate_proj", "up_proj", "down_proj",       # MLP
    ],
    lora_alpha=16,                 # Scaling factor
    lora_dropout=0,                # Any value works, but 0 is the setting Unsloth's fast path is optimized for
    bias="none",
    use_gradient_checkpointing="unsloth",  # Unsloth's optimized checkpointing
    random_state=42,
)

# Check what we're training
model.print_trainable_parameters()
# Example: trainable params ~42M of ~8B total (LoRA r=16) -- under 1% of the model.
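
The ~42M figure can be reproduced by hand. Each adapted weight matrix of shape (out, in) gains two small matrices, A (r x in) and B (out x r), so it adds r * (in + out) trainable parameters. The sketch below assumes the commonly published Llama 3.1 8B dimensions (hidden size 4096, MLP size 14336, 32 layers, and 1024-wide k/v projections due to grouped-query attention); check these against your model's config before relying on them.

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Trainable parameters LoRA adds to one (out x in) weight matrix."""
    return r * (in_features + out_features)


# Assumed Llama 3.1 8B dimensions (verify against model.config)
HIDDEN, KV_DIM, MLP_DIM, LAYERS = 4096, 1024, 14336, 32
R = 16

# (in_features, out_features) for each target module in one layer
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}

per_layer = sum(lora_params(i, o, R) for i, o in shapes.values())
total = per_layer * LAYERS
print(f"per layer: {per_layer:,}  total: {total:,}")  # total: 41,943,040
```

Doubling r doubles this count, which is the main cost to keep in mind when comparing ranks.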

Step 3: Prepare the Dataset

# script_id: day_079_handson_finetuning/finetune_workflow
from datasets import load_dataset
import json

# Tiny demo dataset so this runs end-to-end. Swap in your real Day 77 data later.
rows = [
    {"messages": [
        {"role": "user", "content": "Write a Python function to reverse a string."},
        {"role": "assistant", "content": "def reverse(s):\n    return s[::-1]"},
    ]},
    {"messages": [
        {"role": "user", "content": "How do I check if a list is empty in Python?"},
        {"role": "assistant", "content": "Use `if not my_list:` -- an empty list is falsy."},
    ]},
    {"messages": [
        {"role": "user", "content": "Write a function to sum a list of numbers."},
        {"role": "assistant", "content": "def total(nums):\n    return sum(nums)"},
    ]},
]
with open("coding_assistant_train.jsonl", "w") as f:
    for r in rows:
        f.write(json.dumps(r) + "\n")

# Load a dataset (or use your synthetic data from Day 77)
dataset = load_dataset("json", data_files="coding_assistant_train.jsonl", split="train")

# Define the chat template formatting function
def format_chat(example):
    """Format example into the model's chat template."""
    messages = example["messages"]

    # Use the tokenizer's built-in chat template
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False,
    )

    return {"text": text}

# Apply formatting
dataset = dataset.map(format_chat)

# Preview a formatted example
print(dataset[0]["text"][:500])
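
Before mapping format_chat over a real dataset, it can help to sanity-check each row. The helper below is a hypothetical validate_row function (not part of Unsloth or datasets) that assumes the messages layout used above: a non-empty list of dicts with a role and non-empty string content, ending on an assistant turn.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_row(row: dict) -> list:
    """Return a list of problems found in one chat-format row (empty list = OK)."""
    messages = row.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not a dict")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unknown role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty or non-string content")

    last = messages[-1]
    if not isinstance(last, dict) or last.get("role") != "assistant":
        problems.append("last message should come from the assistant")
    return problems


good_row = {"messages": [
    {"role": "user", "content": "Write a Python function to reverse a string."},
    {"role": "assistant", "content": "def reverse(s):\n    return s[::-1]"},
]}
bad_row = {"messages": [
    {"role": "human", "content": "Hi"},       # wrong role name
    {"role": "assistant", "content": "   "},  # blank answer
]}

print(validate_row(good_row))
print(validate_row(bad_row))
```

Running a check like this over every line of the JSONL file catches malformed rows before they silently degrade training.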

If Your Data Is in Alpaca Format

# script_id: day_079_handson_finetuning/format_alpaca
# Alternative to Step 3: use this INSTEAD of the chat-format loader above.
# Re-run load_dataset for your Alpaca file first.
# Render each Alpaca-style record into a single prompt string
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""

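def format_alpaca(example):
    """Format an Alpaca-style example, ending with EOS so the model learns where to stop."""
    text = alpaca_prompt.format(
        instruction=example["instruction"],
        input=example.get("input", ""),
        output=example["output"],
    ) + tokenizer.eos_token  # Without EOS, the fine-tuned model tends to keep generating
    return {"text": text}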

dataset = dataset.map(format_alpaca)

If Your Data Is in ShareGPT Format

# script_id: day_079_handson_finetuning/format_sharegpt
# Alternative to Step 3: use this INSTEAD of the chat-format loader above.
# Re-run load_dataset for your ShareGPT file first.
from unsloth.chat_templates import get_chat_template

# Unsloth has built-in support for ShareGPT format
tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3.1",  # Or "chatml", "mistral", "phi-3", etc.
)

# Map ShareGPT speaker names to chat roles (unrecognized speakers fall back to assistant)
ROLE_MAP = {"human": "user", "user": "user", "system": "system", "gpt": "assistant"}

def format_sharegpt(example):
    """Format ShareGPT-style conversations."""
    messages = [
        {"role": ROLE_MAP.get(turn["from"], "assistant"), "content": turn["value"]}
        for turn in example["conversations"]
    ]

    text = tokenizer.apply_chat_template(messages, tokenize=False)
    return {"text": text}

dataset = dataset.map(format_sharegpt)

Step 4: Train

# script_id: day_079_handson_finetuning/finetune_workflow
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        dataset_text_field="text",
        max_seq_length=max_seq_length,  # Some newer TRL releases name this max_length; check your installed version
        dataset_num_proc=2,           # Parallel data processing
        packing=False,                # packing crams several short examples into one sequence to waste less space (faster) but can blur where one ends -- leave False while learning.
        output_dir="./output",

        # Training duration
        num_train_epochs=3,
        max_steps=-1,             # Set to positive number to override epochs

        # Batch size
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # Effective batch = 2 * 4 = 8

        # Learning rate
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        warmup_steps=10,

        # Precision and optimization
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        optim="adamw_8bit",

        # Logging
        logging_steps=10,
        save_strategy="steps",
        save_steps=100,

        # Weights & Biases (optional)
        report_to="wandb",        # Set to "none" if not using W&B

        # Misc
        seed=42,
    ),
)

# Show GPU memory before training
gpu_stats = torch.cuda.get_device_properties(0)
print(f"GPU: {gpu_stats.name}")
print(f"VRAM: {gpu_stats.total_memory / 1024**3:.1f} GB")

# Start training
trainer_stats = trainer.train()

# Print results
print(f"Training time: {trainer_stats.metrics['train_runtime']:.0f}s")
print(f"Final loss: {trainer_stats.metrics['train_loss']:.4f}")

In plain terms: each step runs the examples through the model (forward pass), measures how far off it was (loss), figures out how to nudge the LoRA weights closer (backward pass -- the nudges are the "gradients"), and applies the nudge.
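
The same four phases can be seen in a toy example. This is a minimal sketch in plain Python with a one-parameter model (prediction = w * x) and made-up data, not what SFTTrainer does internally, but the forward, loss, backward, and update steps have the same roles.

```python
def train_step(w, xs, ys, lr):
    """One training step for the toy model y = w * x with mean squared error."""
    preds = [w * x for x in xs]                                    # forward pass
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)  # how far off?
    grad = sum(2 * (p - y) * x                                     # backward pass:
               for p, y, x in zip(preds, ys, xs)) / len(xs)        # d(loss)/d(w)
    new_w = w - lr * grad                                          # apply the nudge
    return new_w, loss


xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # made-up data; the ideal weight is 2.0
w, losses = 0.0, []
for step in range(20):
    w, loss = train_step(w, xs, ys, lr=0.05)
    losses.append(loss)

print(f"learned w = {w:.4f}")
print(f"first loss = {losses[0]:.3f}, last loss = {losses[-1]:.2e}")
```

The loss falls at every step and w approaches 2.0; in real fine-tuning the "weight" is the full set of LoRA matrices and the optimizer is AdamW, not plain gradient descent.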


Step 5: Monitor with Weights & Biases

Run this wandb.init block BEFORE trainer.train() in Step 4 -- since report_to="wandb" auto-starts a run, initializing here sets the project name and config for that run. Keep wandb.log/wandb.finish after training.

# script_id: day_079_handson_finetuning/finetune_workflow
import wandb

# Initialize W&B (run before training)
wandb.init(
    project="my-finetune",
    name="llama-3.1-8b-coding-assistant",
    config={
        "model": "Llama-3.1-8B-Instruct",
        "lora_r": 16,
        "lora_alpha": 16,
        "learning_rate": 2e-4,
        "epochs": 3,
        "dataset_size": len(dataset),
    },
)

# After training, log final metrics
wandb.log({
    "final_loss": trainer_stats.metrics["train_loss"],
    "training_time_seconds": trainer_stats.metrics["train_runtime"],
    "samples_per_second": trainer_stats.metrics["train_samples_per_second"],
})

wandb.finish()

What to watch in W&B:

  • Loss is the model's error on your training examples -- think of it like a test-failure count. Lower means it matches your data better; you want it to fall and then flatten.
  • Loss curve: Should decrease smoothly, flatten by end of training
  • Learning rate: Should follow your scheduler (cosine decay)
  • GPU memory: Stable, no OOM spikes
  • Training-step size ("gradient norm" in W&B): how big each adjustment is. A sudden huge spike usually means the learning rate is too high or a bad example slipped in -- restart with a lower learning rate.
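
To know what the learning-rate panel should look like, the sketch below computes a linear warmup followed by cosine decay, the general shape that lr_scheduler_type="cosine" with warmup_steps produces. The total of 100 steps is an assumption for illustration, and the library's exact implementation may differ in small details.

```python
import math


def lr_at(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


BASE_LR, WARMUP, TOTAL = 2e-4, 10, 100  # TOTAL is an assumed run length

for step in (0, 5, 10, 55, 100):
    print(f"step {step:3d}: lr = {lr_at(step, BASE_LR, WARMUP, TOTAL):.2e}")
```

The curve climbs to the peak of 2e-4 at step 10, passes half the peak midway through the decay, and reaches zero on the last step. A W&B curve that stays flat or never reaches the peak suggests the run is shorter than the warmup.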

Step 6: Test the Fine-tuned Model

# script_id: day_079_handson_finetuning/finetune_workflow
# Switch to inference mode
FastLanguageModel.for_inference(model)

# Test with a prompt
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to find the longest common subsequence of two strings."},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=512,
    do_sample=True,       # temperature/top_p only take effect when sampling is enabled
    temperature=0.7,
    top_p=0.9,
)

# Decode and print (skip the input tokens)
response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(response)
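
For intuition about the temperature and top_p arguments, here is a small plain-Python sketch of what they do to next-token probabilities. The logits are made up, and real generation code works on tensors over the full vocabulary, so this only illustrates the idea.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their total reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens

for t in (1.0, 0.7, 0.3):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature {t}: {[round(p, 3) for p in probs]}")

candidates = top_p_filter(softmax_with_temperature(logits, 0.7), top_p=0.9)
print("tokens kept by top_p=0.9:", sorted(candidates))
```

Lower temperatures concentrate probability on the top token, and top_p then discards the unlikely tail before a token is sampled.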

Batch Testing

# script_id: day_079_handson_finetuning/finetune_workflow
test_prompts = [
    "Explain what a decorator is in Python.",
    "Write a SQL query to find duplicate emails in a users table.",
    "What's the difference between a list and a tuple?",
    "Debug this: TypeError: 'NoneType' object is not iterable",
]

for prompt in test_prompts:
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    ).to("cuda")

    outputs = model.generate(input_ids=inputs, max_new_tokens=256, do_sample=True, temperature=0.3)
    response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)

    print(f"Q: {prompt}")
    print(f"A: {response[:200]}...")
    print("---")

Step 7: Save the LoRA Adapter

# script_id: day_079_handson_finetuning/finetune_workflow
# Save just the LoRA adapter (small, ~50-100 MB)
model.save_pretrained("./lora-adapter")
tokenizer.save_pretrained("./lora-adapter")

print("Adapter saved! Contents:")
import os
for f in os.listdir("./lora-adapter"):
    size = os.path.getsize(f"./lora-adapter/{f}") / 1024 / 1024
    print(f"  {f}: {size:.1f} MB")

Step 8: Merge and Export

For deployment, you can merge the LoRA adapter back into the base model.

# script_id: day_079_handson_finetuning/finetune_workflow
# Option 1: Save merged model in HuggingFace format (for vLLM, TGI)
model.save_pretrained_merged(
    "./merged-model",
    tokenizer,
    save_method="merged_16bit",  # LoRA merged into 16-bit (half-precision) weights
)

# Option 2: Save as GGUF for Ollama / llama.cpp
model.save_pretrained_gguf(
    "./model-gguf",
    tokenizer,
    quantization_method="q4_k_m",  # Recommended quantization
)

# Option 3: Push to HuggingFace Hub
model.push_to_hub_merged(
    "your-username/my-coding-assistant",
    tokenizer,
    save_method="merged_16bit",
    token="hf_your_token",  # Placeholder: load a real token from an environment variable, never commit it
)
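
Conceptually, merging folds the adapter into the base weights using the standard LoRA formula W' = W + (alpha / r) * B A. The sketch below shows this with tiny hand-written matrices in plain Python; it is an illustration of the math only, and the real save_pretrained_merged also handles details such as dequantizing the 4-bit base weights, which are not shown here.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A)."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Toy shapes: W is (out=2, in=3), rank r=1, so A is (1 x 3) and B is (2 x 1)
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]
B = [[0.5], [-1.0]]

merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)  # [[2.0, 2.0, 3.0], [-2.0, -3.0, -6.0]]

# The merged matrix gives the same output as base + adapter applied separately
x = [[1.0], [1.0], [1.0]]
print(matmul(merged, x))  # [[7.0], [-11.0]]
```

After merging there is a single weight matrix per layer, so inference needs no adapter code and pays no extra latency.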

Loading Your Model in Ollama

After exporting to GGUF, create a Modelfile for Ollama:

# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./model-gguf/unsloth.Q4_K_M.gguf

TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""

PARAMETER temperature 0.7
PARAMETER top_p 0.9
SYSTEM "You are a helpful coding assistant."
EOF

# Build and run
ollama create my-coding-assistant -f Modelfile
ollama run my-coding-assistant "Write a Python decorator for caching"

Common Issues and Fixes

Issue               | Cause                             | Fix
--------------------|-----------------------------------|---------------------------------------------------------
CUDA out of memory  | Model too large for GPU           | Reduce per_device_train_batch_size or max_seq_length
Loss not decreasing | Learning rate too low/high        | Try 1e-4 to 5e-4 range
Loss spikes         | Batch too small or bad data       | Increase gradient_accumulation_steps, check data quality
Garbage output      | Overfitting or wrong template     | Reduce epochs, verify chat template matches model
Slow training       | Not using Unsloth optimizations   | Ensure use_gradient_checkpointing="unsloth"

Checkpoint

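Run the finetune_workflow end to end on the tiny sample dataset and confirm that the training loss prints and that a checkpoint directory is written under ./output at the end. With only three rows the run lasts just a few optimizer steps, so temporarily set logging_steps=1 to see the loss at each step and check that it trends downward. If loss is flat or NaN, check that your data made it through the formatter (format_chat, or format_alpaca/format_sharegpt if you used one of the alternatives) and that the tokenizer's pad token is set.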
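
Step 4 logs every 10 steps, so it is worth estimating how many optimizer steps a run will have before expecting per-step loss output. The helper below is a rough sketch (optimizer_steps is a hypothetical name, and the trainer's exact rounding can vary by version).

```python
import math


def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Approximate number of optimizer (weight-update) steps in a training run."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    steps_per_epoch = max(1, math.ceil(batches_per_epoch / grad_accum))
    return steps_per_epoch * epochs


# Step 4 settings: batch size 2, gradient accumulation 4, 3 epochs
print(optimizer_steps(3, 2, 4, 3))    # tiny demo dataset -> 3
print(optimizer_steps(500, 2, 4, 3))  # a 500-example dataset -> 189
```

With about three steps, logging_steps=10 never fires and warmup_steps=10 never completes, which is why the demo run looks quiet; on a few hundred examples the Step 4 defaults behave as intended.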

Quick Reference

# script_id: day_079_handson_finetuning/quick_reference
# fragment: illustrative cheat-sheet / not standalone-runnable
# Load model
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Llama-3.1-8B-Instruct", load_in_4bit=True
)

# Add LoRA
model = FastLanguageModel.get_peft_model(model, r=16, target_modules=[...])

# Train
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(model=model, processing_class=tokenizer, train_dataset=dataset,
                     args=SFTConfig(dataset_text_field="text", output_dir="./output"))
trainer.train()

# Save adapter
model.save_pretrained("./adapter")

# Export GGUF
model.save_pretrained_gguf("./gguf", tokenizer, quantization_method="q4_k_m")

Exercises

  1. Your First Fine-tune: Fine-tune Llama 3.1 8B (or Phi-4) on a small dataset (100-500 examples) using QLoRA. Export to GGUF and run in Ollama. Compare the fine-tuned model's responses to the base model on 10 test prompts.
Solution

The expected flow follows this lesson end to end:

  1. Load with 4-bit QLoRA: FastLanguageModel.from_pretrained("unsloth/Llama-3.1-8B-Instruct", load_in_4bit=True).
  2. Attach LoRA adapters: FastLanguageModel.get_peft_model(model, r=16, target_modules=[...]).
  3. Train with SFTTrainer + SFTConfig on your formatted dataset (trainer.train()).
  4. Export to GGUF: model.save_pretrained_gguf("./gguf", tokenizer, quantization_method="q4_k_m").
  5. Create and run in Ollama: ollama create my-model -f Modelfile then ollama run my-model.

"Better" means the fine-tuned model follows your dataset's style and format more closely than the base model on the 10 prompts -- not necessarily smarter, but more on-pattern.

  2. Hyperparameter Sweep: Train the same model 3 times with different LoRA ranks (r=8, r=16, r=32). Compare final loss, training time, adapter size, and output quality. Which rank gives the best quality-to-cost ratio?

  3. End-to-End Pipeline: Combine Day 77 (synthetic data) with today's lesson. Generate 500 synthetic examples for a specific domain (e.g., customer support, code review), fine-tune a model on them, export to GGUF, and deploy with Ollama. Evaluate with 20 held-out test cases.


What's Next?

Now let's fine-tune specifically for agentic tasks -- tool calling, JSON adherence, and function execution!