Phase 1: LLM Foundations · 8 min read

Temperature, Top-P, and Frequency Penalties Part 2

Phase 1 of 8

Coming from Software Engineering? Configuring sampling parameters is like tuning a system's performance knobs — connection pool sizes, cache TTLs, retry intervals. There's no universal best setting; it depends on your use case. The decision tree here is your equivalent of a runbook for model configuration.

Temperature vs Top-P: When to Use Which?

Quick Guide

| Goal | Use | Settings |
| --- | --- | --- |
| Deterministic output | Temperature | temperature=0 |
| Creative but safe | Top-P | top_p=0.9 (leave temperature at its default 1.0) |
| Maximum creativity | Temperature | temperature=1.5 |
| Balanced general use | Either | temperature=0.7 or top_p=0.9 |

Leaving temperature at its default 1.0 means you are effectively only tuning top_p — that is still one knob.


Frequency and Presence Penalties

These parameters help prevent repetition in outputs.

Coming from Software Engineering? Think of these as two rate-limiting strategies. Frequency penalty is a per-request throttle that gets stricter the more times a token is used (more uses = bigger penalty). Presence penalty is a one-time flag: once a token has appeared at all it is penalized a flat amount regardless of count — like a feature flag flipping on after first use.

Frequency Penalty

Reduces the likelihood of tokens that have already appeared, proportional to how often they've appeared.

Presence Penalty

Reduces the likelihood of tokens that have appeared at all, regardless of how many times.

A penalty simply lowers a token's chance of being picked before sampling — the bigger the penalty, the less likely that token. Frequency scales that reduction by how many times the token already appeared; presence applies it once, flat.
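
The arithmetic can be sketched in a few lines. This is a minimal illustration, assuming the formula described in OpenAI's API documentation: each token's logit is reduced by (count × frequency_penalty), plus a flat presence_penalty if the token has appeared at least once. The function name `apply_penalties` and the toy logits are made up for this example; they are not part of any library.

```python
from collections import Counter

def apply_penalties(logits, generated_tokens, frequency_penalty=0.0, presence_penalty=0.0):
    """Return a copy of `logits` with repetition penalties subtracted.

    logits: dict mapping token -> raw score before softmax (toy values here).
    generated_tokens: tokens already produced in this response.
    """
    counts = Counter(generated_tokens)
    adjusted = {}
    for token, logit in logits.items():
        count = counts[token]
        # Frequency part grows with every reuse; presence part is a one-time flat cost.
        penalty = count * frequency_penalty + (presence_penalty if count > 0 else 0.0)
        adjusted[token] = logit - penalty
    return adjusted

# Hypothetical state: three candidate tokens with equal scores.
logits = {"cat": 2.0, "dog": 2.0, "fish": 2.0}
history = ["cat", "cat", "cat", "dog"]  # "cat" used 3 times, "dog" once, "fish" never

freq_only = apply_penalties(logits, history, frequency_penalty=0.5)
pres_only = apply_penalties(logits, history, presence_penalty=0.5)

print(freq_only)  # {'cat': 0.5, 'dog': 1.5, 'fish': 2.0}
print(pres_only)  # {'cat': 1.5, 'dog': 1.5, 'fish': 2.0}
```

With the frequency penalty, "cat" is pushed down three times as far as "dog". With the presence penalty, both drop by the same flat amount and only the unused "fish" is untouched.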

Code Example

# script_id: day_005_temperature_and_sampling_part2/penalties_example
from openai import OpenAI

client = OpenAI()

def generate_with_penalties(
    prompt: str,
    frequency_penalty: float = 0,
    presence_penalty: float = 0
):
    """Generate text with repetition penalties."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        max_tokens=100,
        frequency_penalty=frequency_penalty,
        presence_penalty=presence_penalty
    )
    return response.choices[0].message.content

prompt = "Write a paragraph about cats. Include lots of details."

print("--- No penalties ---")
print(generate_with_penalties(prompt))

print("\n--- With frequency_penalty=0.5 ---")
print(generate_with_penalties(prompt, frequency_penalty=0.5))

print("\n--- With presence_penalty=0.5 ---")
print(generate_with_penalties(prompt, presence_penalty=0.5))

print("\n--- With both penalties ---")
print(generate_with_penalties(prompt, frequency_penalty=0.5, presence_penalty=0.5))

Penalty Value Ranges

Both penalties range from -2.0 to 2.0:

| Value | Effect |
| --- | --- |
| -2.0 | Strongly ENCOURAGE repetition |
| 0 | No effect (default) |
| 0.5 | Mild discouragement |
| 1.0 | Moderate discouragement |
| 2.0 | Strong discouragement |

Putting It All Together: A Practical Configuration Guide

# script_id: day_005_temperature_and_sampling_part2/practical_config_guide
from openai import OpenAI

client = OpenAI()

# Different configurations for different tasks
# Note: a few presets nudge both temperature and top_p off-default as a pragmatic recipe.
# In your own code, start by tuning just one and reach for the second only if the first is not enough.
CONFIGS = {
    "code_generation": {
        "temperature": 0,
        "top_p": 1,
        "frequency_penalty": 0,
        "presence_penalty": 0
    },
    "creative_writing": {
        "temperature": 0.9,
        "top_p": 0.95,
        "frequency_penalty": 0.5,
        "presence_penalty": 0.5
    },
    "factual_qa": {
        "temperature": 0.3,
        "top_p": 0.9,
        "frequency_penalty": 0,
        "presence_penalty": 0
    },
    "brainstorming": {
        "temperature": 1.2,
        "top_p": 0.95,
        "frequency_penalty": 0.7,
        "presence_penalty": 0.7
    },
    "chat_assistant": {
        "temperature": 0.7,
        "top_p": 0.9,
        "frequency_penalty": 0.3,
        "presence_penalty": 0.3
    }
}

def smart_generate(prompt: str, task_type: str):
    """Generate text with task-appropriate settings."""
    config = CONFIGS.get(task_type, CONFIGS["chat_assistant"])

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
        **config
    )
    return response.choices[0].message.content

# Usage examples
print(smart_generate("Write a Python function to sort a list", "code_generation"))
print(smart_generate("Give me 5 unique startup ideas", "brainstorming"))
print(smart_generate("What is the capital of France?", "factual_qa"))
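
Because smart_generate unpacks a preset straight into the API call, a misspelled key or an out-of-range value only surfaces once the request is made. A small pre-flight check can catch these earlier. The sketch below is one possible approach, not part of the OpenAI client: `validate_config` and `RANGES` are hypothetical names, the temperature and penalty ranges are the ones quoted in this lesson, and the 0 to 1 range for top_p is an assumption.

```python
# Allowed ranges. temperature and the penalties follow this lesson;
# the top_p range of 0 to 1 is an assumption.
RANGES = {
    "temperature": (0.0, 2.0),
    "top_p": (0.0, 1.0),
    "frequency_penalty": (-2.0, 2.0),
    "presence_penalty": (-2.0, 2.0),
}

def validate_config(config):
    """Return a list of problems found in a preset; an empty list means it looks sane."""
    problems = []
    for name, value in config.items():
        if name not in RANGES:
            problems.append(f"unknown parameter: {name}")
            continue
        low, high = RANGES[name]
        if not low <= value <= high:
            problems.append(f"{name}={value} is outside [{low}, {high}]")
    return problems

good = {"temperature": 0.7, "top_p": 0.9, "frequency_penalty": 0.3, "presence_penalty": 0.3}
# Deliberately broken preset: two out-of-range values and one misspelled key.
bad = {"temperature": 2.5, "top_p": 0.9, "frequency_penalty": 3.0, "presance_penalty": 0.3}

print(validate_config(good))  # []
for problem in validate_config(bad):
    print(problem)
```

Running it over every entry in a preset dictionary such as CONFIGS at import time keeps bad presets from reaching production calls.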

Decision Tree: Choosing Your Settings
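
One way to express such a tree in code is shown below. It is a sketch built only from the Quick Guide table and the rule of thumb of tuning one sampling knob and adding a small penalty when repetition shows up; it is not a reproduction of any official diagram. The function name `choose_settings` and its boolean parameters are hypothetical.

```python
def choose_settings(needs_determinism, wants_creativity=False,
                    must_stay_safe=True, sees_repetition=False):
    """Walk a small decision tree and return keyword arguments for a completion call."""
    if needs_determinism:
        # Code, extraction, tests: repeatable output matters most.
        settings = {"temperature": 0}
    elif wants_creativity and must_stay_safe:
        # Creative but safe: trim the unlikely tail, leave temperature at its default.
        settings = {"top_p": 0.9}
    elif wants_creativity:
        # Maximum creativity: raise temperature, leave top_p at its default.
        settings = {"temperature": 1.5}
    else:
        # Balanced general use.
        settings = {"temperature": 0.7}

    if sees_repetition:
        # Add a mild penalty only after repetition has actually been observed.
        settings["frequency_penalty"] = 0.5
    return settings

print(choose_settings(needs_determinism=True))
# {'temperature': 0}
print(choose_settings(needs_determinism=False, wants_creativity=True, sees_repetition=True))
# {'top_p': 0.9, 'frequency_penalty': 0.5}
```

Each branch sets only one of temperature or top_p, so the result can be unpacked into a call without creating conflicting settings.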


Common Mistakes to Avoid

Mistake 1: Using High Temperature AND Low Top-P

# script_id: day_005_temperature_and_sampling_part2/conflicting_settings_mistake
# fragment: illustrative cheat-sheet / not standalone-runnable
# BAD: Conflicting settings
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[...],
    temperature=1.5,  # "Be creative!"
    top_p=0.1  # "But only use common tokens!"
)

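# GOOD: Pick one approach and leave the other knob at its default
# For creativity:
creative = {"temperature": 1.2, "top_p": 1.0}

# For focus, lower temperature...
focused = {"temperature": 0.3, "top_p": 1.0}
# ...OR lower top_p instead
focused_alt = {"temperature": 1.0, "top_p": 0.5}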
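
To see why the BAD combination works against itself, the toy calculation below applies temperature scaling and then counts how many tokens survive the top_p cutoff. It is a sketch under two assumptions: the logits are invented, and temperature is applied before the nucleus filter (a common order in open-source samplers; this lesson does not state the API's internals). `softmax_with_temperature` and `nucleus_size` are illustrative helpers, not library functions.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by temperature (must be > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_size(probs, top_p):
    """Count the tokens kept by top-p: the smallest top set whose mass reaches top_p."""
    cumulative = 0.0
    for kept, p in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += p
        if cumulative >= top_p:
            return kept
    return len(probs)

logits = [3.0, 2.0, 1.0, 0.5, 0.0]  # made-up scores for five candidate tokens

hot = softmax_with_temperature(logits, 1.5)  # flattened, "creative" distribution
print([round(p, 3) for p in hot])
print(nucleus_size(hot, 1.0))  # 5: every token remains a candidate
print(nucleus_size(hot, 0.1))  # 1: only the top token survives
```

Even after the high temperature flattens the distribution, the top token alone still holds more than 10% of the mass, so top_p=0.1 reduces sampling to a single choice and the temperature setting has no visible effect.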

Mistake 2: Extreme Penalties

# script_id: day_005_temperature_and_sampling_part2/extreme_penalties_mistake
# BAD: Penalties too high
frequency_penalty=2.0
presence_penalty=2.0
# Result: every reused token is heavily penalized, including necessary ones like "the" and "is", so the text turns stilted or incoherent

# GOOD: Moderate penalties
frequency_penalty=0.5
presence_penalty=0.3

Checkpoint

Run smart_generate across the three task types and confirm that the factual-QA call comes back tight and consistent while the brainstorming call produces varied, non-repeating ideas. The difference comes from the function setting temperature and the penalties differently per task. If brainstorming output still loops on the same phrase, check that frequency_penalty and presence_penalty are actually being passed through, and that they aren't set so high that the output degrades into gibberish.


Summary


Quick Reference

| Parameter | What it does | Typical range | Reach for it when |
| --- | --- | --- | --- |
| temperature | Reshapes the whole distribution | 0 – 2 | You want a single creativity dial |
| top_p | Keeps only the top probability mass | 0.1 – 1.0 | You want nucleus sampling (instead of temperature) |
| frequency_penalty | Penalizes tokens by how often they've appeared | 0.0 – 1.0 typical (API accepts -2 to 2) | Output keeps repeating the same words |
| presence_penalty | Penalizes tokens that have appeared at all | 0.0 – 1.0 typical (API accepts -2 to 2) | You want the model to introduce new topics |

Rule of thumb: tune one of temperature / top_p, then add a small penalty only if you see repetition.


Exercises

  1. Temperature Explorer: Create a script that runs the same prompt at temperatures from 0 to 2 in 0.2 increments. Visualize how the outputs change.
Solution

Loop temperature from 0 to 2 in 0.2 steps, calling the model with the same prompt each time and printing the result alongside its temperature. You should see near-identical, "safe" outputs at low values and increasingly varied (eventually erratic) outputs as you climb past ~1.2. A simple list of (temperature, output) pairs is enough to eyeball the trend.

  2. Repetition Fighter: Take a prompt that tends to produce repetitive output. Find the optimal frequency/presence penalty combination.
Solution

Start with both penalties at 0 to confirm the repetition, then sweep frequency_penalty in steps of 0.2 up to ~1.0. Add a small presence_penalty (~0.3) only if whole topics keep recurring rather than individual words. The "optimal" value is the lowest one that removes the loop without making the text feel forced, usually in the 0.3–0.7 range.

  3. Task Matcher: Given these tasks, choose appropriate settings:
    • Generating unit tests
    • Writing poetry
    • Extracting dates from text
    • Generating product descriptions
Solution
  • Generating unit tests → temperature=0 (you want deterministic, correct code)
  • Writing poetry → temperature≈1.2 plus frequency_penalty≈0.5 (creative and non-repetitive)
  • Extracting dates from text → temperature=0 (exact, repeatable extraction)
  • Generating product descriptions → temperature≈0.8 or top_p≈0.9 (lively but on-topic)
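
For the first two exercises, the helpers below may be a useful starting point. They are a sketch, not a reference solution: `temperature_grid`, `repetition_score`, `sweep`, and the offline stand-in `fake_generate` are all hypothetical names, and the repeated-trigram ratio is just one simple way to put a number on "loops on the same phrase".

```python
from collections import Counter

def temperature_grid(start=0.0, stop=2.0, step=0.2):
    """Values from start to stop inclusive, computed by index to avoid float drift."""
    count = round((stop - start) / step)
    return [round(start + i * step, 2) for i in range(count + 1)]

def repetition_score(text, n=3):
    """Fraction of word n-grams that repeat an earlier one (0.0 means no repeats)."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    repeats = sum(count - 1 for count in Counter(ngrams).values())
    return repeats / len(ngrams)

def sweep(generate, prompt, parameter, values):
    """Call generate(prompt, parameter=value) per value; return (value, output, score) tuples."""
    results = []
    for value in values:
        output = generate(prompt, **{parameter: value})
        results.append((value, output, repetition_score(output)))
    return results

def fake_generate(prompt, **params):
    """Offline stand-in for a model call so the sketch runs without an API key."""
    if params.get("frequency_penalty", 0) < 0.4:
        return "cats are great " * 3
    return "cats nap, hunt, and purr"

print(temperature_grid())
for value, output, score in sweep(fake_generate, "Write about cats",
                                  "frequency_penalty", [0.0, 0.2, 0.4, 0.6]):
    print(value, round(score, 2))
```

To run a real sweep, pass a function that calls the model in place of `fake_generate`. Any callable that accepts the prompt plus the swept parameter as a keyword argument will work, such as generate_with_penalties for the penalty sweep.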

What's Next?

You now understand the three pillars of LLM interaction:

  1. How they process text (Transformers)
  2. How they see text (Tokenization)
  3. How to control their output (Sampling Parameters)

Tomorrow (Day 6) we start prompting techniques proper with Zero-Shot vs Few-Shot Prompting — how giving the model a few worked examples changes what you get back.