You ran an LLM with Ollama, got a decent response, and then someone told you to “lower the temperature” for more consistent output. You changed the number, the responses shifted slightly, and now you are wondering what exactly you changed. Then you noticed the API also accepts top_p and top_k parameters and you are not sure whether to touch those at all.
These three parameters control how the model picks the next word at each step. They are not magic dials. Each one does something specific and understanding the mechanism lets you tune outputs deliberately rather than by trial and error.
This tutorial explains how each parameter works and what happens to the output when you change it, then runs real experiments using Ollama’s REST API and a Python script on Ubuntu. By the end, you will know exactly what to change for a given use case and why.
Who This Is For
This is for developers and engineers who are already running a local LLM with Ollama and want to understand the generation parameters well enough to tune them intentionally. You do not need a machine learning background. You need to be comfortable running commands in a terminal and reading a short Python script.
If you have not set up Ollama yet, read Ollama vs LM Studio: Choosing the Right Tool to Run Local LLMs on Ubuntu first. That article covers installation and getting a model running. Come back here once ollama run llama3.2 works.
How Text Generation Actually Works
Before touching any parameters, you need a mental model of what the LLM is doing at each step.
When you send a prompt, the model does not produce the entire response in one shot. It generates one token at a time. A token is roughly a word or a word fragment: “running” might be one token, while “unexpected” might be two. As a rule of thumb for English text, 1,000 tokens correspond to about 750 words.
At each step, the model produces a list of all possible next tokens along with a probability score for each one. Think of it as a ranked list:
"the" → 28.4%
"a" → 18.1%
"this" → 9.7%
"an" → 7.3%
"that" → 6.2%
... (tens of thousands of other tokens with tiny probabilities)
The model could simply pick the highest-probability token every time. That is called greedy decoding. It produces consistent, predictable output, but it also tends to be repetitive and formulaic because the model always plays it safe.
Temperature, Top-P, and Top-K are ways to reshape or filter this probability list before the model makes its selection. Each one changes a different aspect of how that final pick happens.
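To make the ranked list concrete, here is a minimal sketch of how raw scores become probabilities and how greedy decoding picks from them. The token scores are invented for illustration and the function name is hypothetical; a real model does the same thing over its full vocabulary.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

# Hypothetical logits for a five-token vocabulary (illustrative values only).
toy_logits = {"the": 2.0, "a": 1.5, "this": 0.9, "an": 0.6, "that": 0.4}

probs = softmax(toy_logits)
greedy_choice = max(probs, key=probs.get)  # greedy decoding: always take the top token

for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{token!r:8} {p:.3f}")
print("greedy pick:", greedy_choice)
```

Greedy decoding ignores everything except the first row of that list, which is why it never varies between runs.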
The Three Parameters
Temperature
Temperature is a scaling factor applied to the model’s raw scores (called logits) before they are converted into probabilities. Each logit is divided by the temperature, so values below 1.0 widen the gaps between scores and values above 1.0 narrow them.
- Temperature = 1.0 is the neutral setting. Probabilities are used as-is. (Ollama’s own default is 0.8.)
- Temperature < 1.0 (e.g., 0.2) makes high-probability tokens even more dominant. The top token becomes overwhelmingly likely, and lower-ranked options effectively disappear.
- Temperature > 1.0 (e.g., 1.5) flattens the distribution. The gap between the top token and lower-ranked tokens shrinks, so more tokens become plausible candidates.
A concrete way to think about it: at temperature 0.1, the model almost always picks the token it was most confident about. At temperature 1.5, it will sometimes pick a token it was only mildly confident about. At temperature 2.0+, it starts sampling from tokens that are outright unlikely, and responses start degrading into incoherent text.
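The effect is easy to see numerically. The sketch below divides a set of made-up logits by the temperature before normalizing them, and prints how much probability the top token ends up with. The scores and the function name are assumptions for illustration only.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize into probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as greedy decoding instead")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.5, 0.9, 0.6, 0.4]  # hypothetical scores, highest first

for t in (0.2, 1.0, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"T={t}: top token probability = {probs[0]:.3f}")
```

The top token's share shrinks as the temperature rises, which is the flattening described above.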
When to lower temperature: code generation, factual Q&A, structured data extraction, anything where you want the most likely correct answer reliably.
When to raise temperature: creative writing, brainstorming, generating varied outputs from the same prompt, cases where you want the model to surprise you.
Top-K
Top-K sampling truncates the probability list before sampling. If top_k = 40, the model discards every token except the 40 highest-probability ones, then redistributes probability mass across those 40, then samples.
The effect: no matter how flat or peaked the distribution is, the model will never pick a token ranked below 40th. This prevents the model from selecting very rare or nonsensical tokens that happen to get a non-zero probability.
- top_k = 1 is equivalent to greedy decoding: always pick the top token, no randomness.
- top_k = 40–100 is the common range for conversational use.
- top_k = 0 (or unset) means no Top-K filtering is applied.
Top-K is a blunt filter. A vocabulary typically has 32,000–100,000 tokens. Setting top_k = 40 means 99.9% of the vocabulary is always excluded regardless of how the actual probabilities are distributed. This can be too aggressive when the probability distribution is naturally flat, and too permissive when it is naturally sharp.
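A minimal sketch of the truncation step, assuming a tiny made-up vocabulary. The function name and probabilities are hypothetical; the point is only that everything below rank K is dropped and the survivors are rescaled to sum to 1.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize; k <= 0 disables the filter."""
    if k <= 0 or k >= len(probs):
        return dict(probs)
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

# Hypothetical next-token probabilities (illustrative values only).
toy_probs = {"the": 0.40, "a": 0.25, "this": 0.15, "an": 0.10, "that": 0.06, "zebra": 0.04}

filtered = top_k_filter(toy_probs, 3)
print(filtered)  # only the three highest-ranked tokens remain
```

Note that the cutoff is the same whether the fourth token had 10% or 0.001% probability, which is the bluntness the paragraph above describes.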
Top-P (Nucleus Sampling)
Top-P sampling is an adaptive version of Top-K. Instead of cutting off at a fixed number of tokens, it cuts off based on cumulative probability.
If top_p = 0.9, the model ranks all tokens by probability (highest first), then keeps adding tokens to the candidate set until the cumulative probability reaches 90%. Everything below that threshold is discarded.
The key difference from Top-K: the candidate pool size varies with the distribution.
- When the model is very confident (one token has 85% probability), the 0.9 threshold might include only 2–3 tokens.
- When the model is uncertain (probability spread across many tokens), the 0.9 threshold might include 200+ tokens.
This adaptive behavior is why Top-P tends to produce better results than Top-K alone in most scenarios. It limits randomness when the model is confident and allows more variety when the model has genuine uncertainty.
- top_p = 1.0 means no Top-P filtering; all tokens are eligible.
- top_p = 0.9 is a common practical value for general use, and it is Ollama’s default.
- top_p = 0.5 is aggressive filtering. You are only ever sampling from the top half of probability mass, which often produces safe but generic output.
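The adaptive pool size can be sketched in a few lines. The two distributions below are invented to represent a confident step and an uncertain step, and the function name is hypothetical.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

# Hypothetical distributions: one sharply peaked, one spread out.
confident = {"Paris": 0.85, "London": 0.08, "Rome": 0.04, "Berlin": 0.03}
uncertain = {"red": 0.22, "blue": 0.20, "green": 0.18, "gold": 0.15, "grey": 0.13, "pink": 0.12}

print("confident pool:", list(top_p_filter(confident, 0.9)))
print("uncertain pool:", list(top_p_filter(uncertain, 0.9)))
```

The same threshold keeps a couple of tokens in the confident case and the whole list in the uncertain case, with no change to the setting itself.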
How They Interact
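Conceptually, the sampling steps run in this order, although the exact sequence can differ between inference engines (some apply temperature after the truncation steps):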
- Compute the probability distribution.
- Apply temperature to reshape the distribution.
- Apply Top-K to truncate to the top K candidates (if set).
- Apply Top-P to further truncate to the nucleus (if set).
- Sample from the remaining candidates.
Using all three simultaneously can be counterproductive. If you set top_k = 40 and top_p = 0.9, the tighter of the two constraints always wins. Most practitioners pick one or the other. The most common production combinations are:
- temperature = 0.2, top_p = 1.0, top_k = 0: deterministic-ish output, good for structured tasks
- temperature = 0.7, top_p = 0.9, top_k = 0: balanced for conversational use
- temperature = 1.0, top_p = 0.9, top_k = 0: creative tasks with Top-P as the only constraint
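Putting the steps together, a toy sampler might look like the sketch below. It follows the order listed in this section, which is an assumption about the pipeline rather than a description of Ollama's internals. The function name, the logits, and the default argument values are all illustrative.

```python
import math
import random

def sample_next_token(logits, temperature=0.8, top_k=40, top_p=0.9, rng=random):
    """Temperature -> Top-K -> Top-P -> sample, in the order described in the text."""
    if temperature <= 0:
        return max(logits, key=logits.get)  # temperature 0 is treated as greedy
    # Scale logits by temperature and turn them into unnormalized weights.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    # Top-K: keep only the k best candidates (k <= 0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    # Top-P: keep the smallest prefix whose share of the remaining mass reaches top_p.
    total = sum(w for _, w in ranked)
    nucleus, cumulative = [], 0.0
    for tok, w in ranked:
        nucleus.append((tok, w))
        cumulative += w / total
        if cumulative >= top_p:
            break
    # Sample from what is left; random.choices normalizes the weights itself.
    tokens = [tok for tok, _ in nucleus]
    return rng.choices(tokens, weights=[w for _, w in nucleus], k=1)[0]

# Hypothetical logits (illustrative values only).
toy_logits = {"the": 2.0, "a": 1.5, "this": 0.9, "zebra": -3.0}
print([sample_next_token(toy_logits, temperature=1.0) for _ in range(5)])
```

Passing a seeded `random.Random` instance as `rng` makes the draws repeatable, which is handy when comparing settings.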
Prerequisites
Before running the examples:
- Ubuntu 20.04, 22.04, or 24.04
- Ollama installed and running (ollama serve should start without error)
- At least one model pulled; llama3.2 or mistral work well for these experiments
- curl and jq for REST API experiments
- Python 3.8+ with pip for the comparison script
Verify Ollama is up:
curl -s http://localhost:11434/api/version | jq
You should see a JSON response with the Ollama version. If it fails, start Ollama:
ollama serve
Pull a model if you have not already:
ollama pull llama3.2
Step 1: Baseline Request Without Parameters
First, make a plain request to see default behavior. This uses Ollama’s /api/generate endpoint.
curl -s http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"prompt": "Write a one-sentence description of a sunset.",
"stream": false
}' | jq -r '.response'
Run this three times in a row. You will likely get slightly different sentences each time, though occasionally the same one. This is the model at its default settings. Ollama’s default temperature is 0.8, which adds a small amount of randomness.
Note one of the responses for comparison.
Step 2: Experimenting With Temperature
Now send the same prompt with different temperature values to see the effect directly.
Low temperature (0.1): highly deterministic:
curl -s http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"prompt": "Write a one-sentence description of a sunset.",
"stream": false,
"options": {
"temperature": 0.1
}
}' | jq -r '.response'
Run this three times. You will get nearly identical or completely identical output each time. The model is picking its most confident token at every step.
High temperature (1.4): more variety:
curl -s http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"prompt": "Write a one-sentence description of a sunset.",
"stream": false,
"options": {
"temperature": 1.4
}
}' | jq -r '.response'
Run this three times. Responses will vary more. Some will be more inventive, some may be awkward or grammatically loose. That is the wider sampling at work.
Temperature 0.0: fully greedy:
curl -s http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"prompt": "Write a one-sentence description of a sunset.",
"stream": false,
"options": {
"temperature": 0
}
}' | jq -r '.response'
This should produce the same response on every run. There is no randomness in the sampling step; the model always takes the top token.
Step 3: Experimenting With Top-K
Use the same prompt and a moderate temperature so the randomness is visible, then change top_k:
Default comparison (top_k = 40, Ollama’s built-in default):
curl -s http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"prompt": "Describe the feeling of cold water on your skin in one sentence.",
"stream": false,
"options": {
"temperature": 1.0,
"top_k": 40
}
}' | jq -r '.response'
Very low top_k (top_k = 5):
curl -s http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"prompt": "Describe the feeling of cold water on your skin in one sentence.",
"stream": false,
"options": {
"temperature": 1.0,
"top_k": 5
}
}' | jq -r '.response'
With top_k = 5 and temperature = 1.0, you are sampling randomly but only ever from the top five candidates. Output will be coherent but less varied than top_k = 40.
High top_k (top_k = 200):
curl -s http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"prompt": "Describe the feeling of cold water on your skin in one sentence.",
"stream": false,
"options": {
"temperature": 1.0,
"top_k": 200
}
}' | jq -r '.response'
With 200 candidates in play at each step, the model can reach for more unusual word choices. Quality still depends heavily on temperature.
Step 4: Experimenting With Top-P
Top-P is most useful to test by comparing its behavior when the model has varying confidence.
top_p = 0.9 (common default):
curl -s http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"prompt": "Explain what DNS is in one sentence.",
"stream": false,
"options": {
"temperature": 1.0,
"top_p": 0.9,
"top_k": 0
}
}' | jq -r '.response'
top_p = 0.3 (aggressive filtering):
curl -s http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"prompt": "Explain what DNS is in one sentence.",
"stream": false,
"options": {
"temperature": 1.0,
"top_p": 0.3,
"top_k": 0
}
}' | jq -r '.response'
At top_p = 0.3 with a factual prompt like DNS, you will likely see very standard, textbook-style answers. The model is only sampling from the top 30% of probability mass. For well-known concepts, that almost always means the canonical phrasing.
top_p = 0.99 (nearly unconstrained):
curl -s http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"prompt": "Explain what DNS is in one sentence.",
"stream": false,
"options": {
"temperature": 1.0,
"top_p": 0.99,
"top_k": 0
}
}' | jq -r '.response'
With top_p = 0.99, only the least likely 1% of probability mass is cut, so almost every plausible token stays in the nucleus. Combined with temperature = 1.0, outputs will vary noticeably between runs.
Step 5: Side-by-Side Comparison With Python
Running curl commands manually makes comparison tedious. This Python script sends the same prompt five times for each of two parameter sets (ten requests in total) and prints all responses for direct comparison.
Create the script:
mkdir -p ~/llm-params && nano ~/llm-params/compare.py
Paste this content:
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3.2"
PROMPT = "In one sentence, explain why the sky is blue."
RUNS_PER_CONFIG = 5

# Two parameter sets to compare; top_k = 0 disables Top-K filtering.
configs = {
    "deterministic (temp=0.1)": {
        "temperature": 0.1,
        "top_p": 1.0,
        "top_k": 0,
    },
    "creative (temp=1.0, top_p=0.9)": {
        "temperature": 1.0,
        "top_p": 0.9,
        "top_k": 0,
    },
}

for label, options in configs.items():
    print(f"\n{'='*60}")
    print(f"Config: {label}")
    print(f"Options: {options}")
    print(f"{'='*60}")
    for i in range(RUNS_PER_CONFIG):
        # Sampling parameters must be nested under "options".
        resp = requests.post(OLLAMA_URL, json={
            "model": MODEL,
            "prompt": PROMPT,
            "stream": False,
            "options": options,
        }, timeout=120)
        resp.raise_for_status()  # fail loudly if Ollama returns an HTTP error
        text = resp.json().get("response", "").strip()
        print(f"  [{i+1}] {text}")
Install the requests library if needed, then run:
pip3 install requests
python3 ~/llm-params/compare.py
You will see five responses for each configuration. The deterministic config will produce nearly identical sentences across all five runs. The creative config will produce varied sentences, some with different vocabulary, different word order, or different framing.
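Reading the responses by eye works for five runs, but a rough count of distinct outputs helps when you scale up. The helper below is a hypothetical addition, not part of the script above, and the sample sentences are made up to stand in for real model output.

```python
from collections import Counter

def summarize_variety(responses):
    """Count how many distinct responses a batch of runs produced."""
    # Collapse whitespace and case so trivial differences do not count as variety.
    normalized = [" ".join(text.lower().split()) for text in responses]
    if not normalized:
        return {"runs": 0, "distinct": 0, "top_share": 0.0}
    counts = Counter(normalized)
    top_count = counts.most_common(1)[0][1]
    return {
        "runs": len(normalized),
        "distinct": len(counts),
        "top_share": top_count / len(normalized),  # share held by the most frequent response
    }

# Hypothetical outputs standing in for five responses from one configuration.
sample_runs = [
    "The sky is blue because air scatters blue light.",
    "The sky is blue because air scatters blue light.",
    "the sky is blue because air  scatters blue light.",
    "Blue light scatters more than red light in the atmosphere.",
    "The sky is blue because air scatters blue light.",
]
print(summarize_variety(sample_runs))
```

A low-temperature config should show few distinct responses and a high top share; a creative config should show the opposite.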
This side-by-side view is the most practical way to build intuition for how the parameters behave on your specific model.
Step 6: Practical Configuration for Common Tasks
Here are parameter sets that work well in practice for common LLM use cases.
Code generation and structured output:
curl -s http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"prompt": "Write a Python function that checks if a string is a valid IPv4 address.",
"stream": false,
"options": {
"temperature": 0.2,
"top_p": 1.0,
"top_k": 0
}
}' | jq -r '.response'
Low temperature reduces hallucinated syntax. You want the model’s most confident output for code.
Conversational assistant:
curl -s http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"prompt": "What are three things I can do to improve my morning routine?",
"stream": false,
"options": {
"temperature": 0.7,
"top_p": 0.9,
"top_k": 0
}
}' | jq -r '.response'
A moderate temperature with Top-P gives natural-sounding responses that vary slightly between runs without going off the rails.
Creative writing:
curl -s http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"prompt": "Write the opening sentence of a noir detective novel set in a rainy city.",
"stream": false,
"options": {
"temperature": 1.1,
"top_p": 0.95,
"top_k": 0
}
}' | jq -r '.response'
Slightly above 1.0 temperature encourages unexpected word choices. Keep top_p below 1.0 to prevent complete incoherence.
Common Mistakes and Troubleshooting
Setting temperature to 0 and expecting it to fix hallucinations
Temperature 0 makes output deterministic, not accurate. If the model’s most confident answer is wrong, it will be confidently and consistently wrong. Lowering temperature is useful for repeatability, not for correctness. Fix hallucinations with better prompts, retrieval (RAG), or a model fine-tuned on your domain.
Using top_k and top_p together without understanding which constrains more
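If you set top_k = 10 and top_p = 0.95, the top_k filter limits candidates to 10, and top_p then trims that shortlist further. Which one actually constrains the output changes from token to token. When the model is confident, a couple of tokens already cover 95% of the mass, so top_p is the binding limit and top_k does nothing. When the distribution is flat, top_k caps the pool at 10 and top_p barely changes it. That makes the combined behavior hard to reason about. Use one or the other. Most practitioners disable Top-K (top_k = 0) and rely on Top-P alone.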
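To see which filter is doing the work, you can count the candidates left after each stage. This sketch uses two invented 30-token distributions and a hypothetical helper name; it assumes Top-K runs before Top-P and that Top-P is measured against the mass that survives Top-K.

```python
def pool_sizes(probs, top_k, top_p):
    """Return (candidates after Top-K, candidates after Top-K then Top-P)."""
    ranked = sorted(probs, reverse=True)
    after_k = ranked[:top_k] if top_k > 0 else ranked
    total = sum(after_k)
    cumulative, after_both = 0.0, 0
    for p in after_k:
        after_both += 1
        cumulative += p / total  # share of the mass that survived Top-K
        if cumulative >= top_p:
            break
    return len(after_k), after_both

# Hypothetical 30-token distributions: one sharply peaked, one completely flat.
sharp = [0.90, 0.05, 0.02] + [0.03 / 27] * 27
flat = [1 / 30] * 30

print("sharp:", pool_sizes(sharp, top_k=10, top_p=0.95))
print("flat: ", pool_sizes(flat, top_k=10, top_p=0.95))
```

On the sharp distribution Top-P shrinks the pool well below 10, while on the flat one the pool stays at the Top-K cap.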
Increasing temperature to fix repetitive output
If your responses are repetitive or looping, temperature is not usually the problem. Check the repeat_penalty parameter (Ollama supports this). Repetition is often caused by the model getting stuck in a probability loop, not by the temperature being too low. Set repeat_penalty: 1.1 to 1.2 to discourage repeated phrases.
Expecting consistent output with temperature > 0 in production
If you are building a system that depends on consistent output structure (for parsing, for example), do not rely on temperature alone to enforce it. Use structured output formats or prompt the model explicitly to respond in JSON. Then set a low temperature as an additional safety net.
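One way to combine these ideas is sketched below: request JSON through the `format` field of Ollama's generate endpoint, keep the temperature low, and still validate what comes back. The helper names and the prompt are hypothetical, and the request is only built here, not sent.

```python
import json

def build_json_request(model, prompt, temperature=0.2):
    """Build a request body that asks for JSON output at a low temperature."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "format": "json",  # asks Ollama to constrain the response to valid JSON
        "options": {"temperature": temperature},
    }

def parse_model_json(text):
    """Return the parsed object, or None when the text is not valid JSON."""
    try:
        return json.loads(text)
    except (ValueError, TypeError):
        return None

payload = build_json_request(
    "llama3.2",
    "Return a JSON object with keys 'host' and 'port' for a local Ollama server.",
)
print(json.dumps(payload, indent=2))
print(parse_model_json('{"host": "localhost", "port": 11434}'))
print(parse_model_json("Sure! Here is the JSON you asked for..."))
```

Treating a `None` result as a retry or an error keeps a malformed response from reaching the rest of your system.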
Ollama not responding to options
Make sure you are passing options as a nested JSON object, not at the top level:
{
"model": "llama3.2",
"prompt": "...",
"stream": false,
"options": {
"temperature": 0.2
}
}
Passing "temperature": 0.2 directly at the top level will be silently ignored.
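If requests are assembled in code, a small guard can catch this mistake before the request is sent. The helper below is a hypothetical sketch; `SAMPLING_KEYS` lists only the parameters discussed in this article, not every option Ollama accepts.

```python
# Assumption: only the sampling parameters covered in this article are checked.
SAMPLING_KEYS = {"temperature", "top_p", "top_k", "repeat_penalty"}

def nest_sampling_options(payload):
    """Return a copy of payload with top-level sampling keys moved into "options"."""
    fixed = {k: v for k, v in payload.items() if k not in SAMPLING_KEYS}
    options = dict(payload.get("options", {}))
    for key in SAMPLING_KEYS.intersection(payload):
        options.setdefault(key, payload[key])  # an explicitly nested value wins
    if options:
        fixed["options"] = options
    return fixed

wrong = {"model": "llama3.2", "prompt": "...", "stream": False, "temperature": 0.2}
print(nest_sampling_options(wrong))
```

The original dictionary is left untouched, so the helper is safe to call on shared config objects.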
Best Practices
Start with temperature only. For most use cases, adjusting temperature alone gets you most of the way there. Start at the default (0.8 in Ollama) and move it down for precision tasks or up for creative tasks. Add Top-P only if you need finer control over variety at higher temperatures.
Disable Top-K in favor of Top-P for language generation. Top-K uses a fixed cutoff that ignores how the probability is actually distributed. For modern LLMs with 32k+ token vocabularies, Top-P adapts better to the distribution shape. Set top_k = 0 and control sampling with top_p.
Keep a config object per use case. If you have a code assistant feature and a creative brainstorming feature in the same application, define separate option objects for each and pass them explicitly. Do not use a single global temperature setting for the whole application.
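A minimal sketch of that idea, reusing the parameter sets from Step 6. The preset names and the `build_request` helper are hypothetical.

```python
# Values taken from the Step 6 examples; the task names are illustrative.
PRESETS = {
    "code": {"temperature": 0.2, "top_p": 1.0, "top_k": 0},
    "chat": {"temperature": 0.7, "top_p": 0.9, "top_k": 0},
    "creative": {"temperature": 1.1, "top_p": 0.95, "top_k": 0},
}

def build_request(task, prompt, model="llama3.2"):
    """Build an /api/generate request body using the preset for the given task."""
    if task not in PRESETS:
        raise KeyError(f"unknown task {task!r}; expected one of {sorted(PRESETS)}")
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": dict(PRESETS[task]),  # copy so callers cannot mutate the preset
    }

print(build_request("code", "Write a Python function that reverses a string."))
```

Failing on an unknown task name is deliberate: silently falling back to a default would reintroduce the single global setting this practice is meant to avoid.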
Test with multiple runs before settling on parameters. A single response does not reveal much. Run five to ten samples with a fixed prompt and read all of them before adjusting parameters. The comparison script from Step 5 is useful for this.
Log the parameters alongside the model name in production. When debugging unexpected output, knowing that the request used temperature=0.9, top_p=0.95, model=llama3.2:3b is essential. Without this, the same prompt can produce wildly different output depending on which parameter set was active.
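One lightweight way to do this is to serialize the model name and options into a single JSON log line. The sketch below uses Python's standard `logging` module; the logger name, field names, and the choice to log prompt length instead of the prompt itself are assumptions.

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(name)s %(message)s")
logger = logging.getLogger("llm_requests")  # hypothetical logger name

def describe_request(model, options, prompt):
    """Serialize the model name and sampling options as one JSON log line."""
    record = {"model": model, "options": options, "prompt_chars": len(prompt)}
    return json.dumps(record, sort_keys=True)

line = describe_request("llama3.2:3b", {"temperature": 0.9, "top_p": 0.95}, "Explain DNS.")
logger.info(line)
```

Sorting the keys keeps log lines stable, so two requests with the same settings produce identical records that are easy to grep or diff.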
For deterministic testing or CI, use temperature 0. If you have a test that checks model output against an expected string, you need deterministic output. Set temperature to 0 for tests only. Keep production settings at a sensible non-zero value unless your use case truly requires identical output every time.
Conclusion
Temperature, Top-P, and Top-K all operate on the same step of the generation process but each controls a different aspect of it.
Temperature sets how sharp or flat the probability distribution is before sampling. Top-K truncates the candidate pool to a fixed number. Top-P adapts the candidate pool size to the distribution’s actual shape.
In practice: adjust temperature first, use Top-P if needed, and avoid stacking Top-K on top of Top-P. For deterministic tasks set temperature low or to zero. For creative tasks push it toward 1.0 or slightly above, and let Top-P keep the output from degrading.