Setting Top-K, Top-P and Temperature in LLMs

Ruman

Imagine you have the ability to control how smart or dumb ChatGPT, Mistral, or any other LLM sounds. We can control an LLM's output using Top-K, Top-P, and Temperature. In this article, we'll dive into what these settings do.

https://www.dezeen.com/2022/04/21/openai-dall-e-2-unseen-images-basic-text-technology/

Outline

  • Pottery Wheel Analogy
  • Top-p, Top-k and Temperature
  • How do these parameters work together?
  • Tuning for Your Use Case
  • Conclusion

Pottery Wheel Analogy

https://i.gifer.com/origin/0c/0cd45746e40a2488f7c4d00940d91b00.gif

Let’s take an example of making pottery or a dish on a pottery wheel. When working on a pottery wheel, the wheel spins at a constant speed, and it all comes down to how you shape the clay. You can either create a mess or craft a beautiful piece of pottery.

Similarly, LLMs are like pottery wheels: they will eventually produce some result, but as developers, we can control the output of LLMs with the help of Top-p, Top-k, and Temperature. Just as an artist uses their hands on the pottery wheel to create a perfect piece of art, we can shape the LLM's output to be more creative if our task is creative, like generating poetry, or more precise if our task is critical, like generating code.

This analogy highlights the importance of these settings. We’ll explore each one in detail.

Before moving ahead, let's quickly look at greedy sampling and random sampling —

Greedy sampling and Random Sampling in Context of LLMs

I promise this long explanation will make sense! 😊

Greedy Sampling

If ordering in a restaurant, greedy sampling would be equivalent to always ordering the single most common or popular dish on the menu. For example, if the most frequently ordered dish is the Caesar salad, then greedy sampling would result in ordering “I’ll have the Caesar salad” no matter what all the time.

This relates to language models using Top-K=1, where the model always chooses the single most likely next word according to its probability distribution (with only one candidate left, Temperature no longer matters). Just like always defaulting to the most popular menu item, this lacks creativity and variety.
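
Here's a minimal sketch of greedy sampling in Python, assuming we already have the model's next-word probabilities as a plain dictionary (the menu items and numbers are made up purely for illustration):

# Minimal sketch of greedy sampling: always pick the highest-probability token.
# The menu items and probabilities below are made up purely for illustration.
next_token_probs = {
    "Caesar salad": 0.45,
    "burger": 0.25,
    "pasta": 0.15,
    "soup": 0.10,
    "cheesecake": 0.05,
}

# Greedy decoding is just an argmax over the distribution; no randomness at all.
greedy_choice = max(next_token_probs, key=next_token_probs.get)
print(greedy_choice)  # "Caesar salad", every single time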

Random Sampling

If ordering in a restaurant, random sampling would be the equivalent of choosing your order by literally pulling a menu item out of a hat at random, with no regard for what type of dish it is or whether it even makes sense.

So you might end up ordering something like "I'll have the chicken fried steak soup" or "I'll have the cheesecake burger": completely random combinations that don't form a coherent dish or meal.

This relates to language models using moderate Top-K values, such as Top-K=50 or Top-K=100, along with a high Temperature like 1.5 or 2.0. With such settings, the model can generate creative and surprising outputs by sampling from a broad set of potential next words. However, an extremely high Top-K value like 10,000 essentially approximates random sampling across the entire vocabulary, producing nonsensical gibberish.
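
And here's the same toy setup with pure random sampling, where every item on the menu has some chance of being picked on any given draw (again, the probabilities are invented for illustration):

import random

# Minimal sketch of pure random sampling from the full distribution.
# With no filtering, even very unlikely tokens can be drawn, which is
# where the "cheesecake burger" style orders come from.
next_token_probs = {
    "Caesar salad": 0.45,
    "burger": 0.25,
    "pasta": 0.15,
    "soup": 0.10,
    "cheesecake": 0.05,
}

tokens = list(next_token_probs)
weights = list(next_token_probs.values())

# random.choices samples proportionally to the weights, so every token
# has some chance of being picked on each draw.
for _ in range(3):
    print(random.choices(tokens, weights=weights, k=1)[0])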

So in summary:

  • Greedy sampling (Top-K=1) leads to coherent but uncreative outputs, like always ordering the single most popular/likely menu item
  • Random sampling enables maximum creativity, but the outputs can be incoherent, like ordering by pulling any menu item out of a hat at random

Techniques like Top-K, Top-P, and Temperature let us control the trade-off between coherence and creativity.

The goal is tuning these parameters to achieve a desired balance between coherence (sticking to conventional menu items) and creativity (ordering something new or unexpected sometimes) for the specific use case.

Top-p, Top-k and Temperature

Photo by Google DeepMind

Each of these parameters is very important, and they all work in coordination with each other. Let's discuss each:

Top-k

Top-k limits the model’s output to the top-k most probable tokens at each step. This can help reduce incoherent or nonsensical output by restricting the model’s vocabulary.

For example, let's say for "The cat sat on the…" the candidate next words and their probabilities are:

mat: 0.6
couch: 0.2
bed: 0.1
chair: 0.05
bucket: 0.03
bike: 0.01
car: 0.003
……

With top-k sampling, say K=5, the model does the following:

  • It considers only the top 5 highest probability words in the distribution after sorting them.
  • It re-normalizes the probabilities among just those 5 words to sum to 1.
  • It samples the next word from this re-normalized distribution over the top 5 words.

So if K=5, it will only consider the words {mat, couch, bed, chair, bucket} and sample from the re-normalized probabilities among just those 5.

This “limits the output” by not even considering any words beyond the top-k most probable tokens according to the original distribution.

This allows trading off between focusing on most likely/coherent tokens or allowing more creative/random sampling.
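
Here's a minimal sketch of top-k sampling over the toy distribution above (the probabilities are illustrative, not from any real model):

import random

# Toy next-word distribution from the example above (illustrative values).
probs = {"mat": 0.6, "couch": 0.2, "bed": 0.1, "chair": 0.05,
         "bucket": 0.03, "bike": 0.01, "car": 0.003}

def top_k_sample(probs, k):
    # 1. Keep only the k highest-probability tokens.
    top_k = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    # 2. Re-normalize the surviving probabilities so they sum to 1.
    total = sum(p for _, p in top_k)
    tokens = [t for t, _ in top_k]
    weights = [p / total for _, p in top_k]
    # 3. Sample the next word from this restricted distribution.
    return random.choices(tokens, weights=weights, k=1)[0]

print(top_k_sample(probs, k=5))  # only ever returns mat, couch, bed, chair or bucket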

Top-p

Top-p (also known as nucleus sampling) keeps the smallest set of top tokens whose cumulative probability reaches a specified threshold (p) and filters out the rest. It allows for more diversity in the output while still avoiding very low-probability tokens.

For example, let’s say after “I’ll have the…” the words and their probabilities are:

salad: 0.4 
burger: 0.3
pasta: 0.1
steak: 0.08

With Top-P=0.8, it will include salad (0.4), burger (0.3), pasta (0.1) since 0.4 + 0.3 + 0.1 = 0.8. This covers 80% of the probability mass in just the top 3 words.

Probability mass — refers to the total probability value that is distributed across all the possible next word choices.

So the model now samples from just {salad, burger, pasta} instead of the full vocabulary.

Top-P will consider a broader, more inclusive set of word choices compared to using Top-K

Let me break down this statement —

Top-P sampling with P=0.8 will often consider a broader, more inclusive set of word choices than Top-K=5. With Top-K=5, the model only ever considers the 5 highest-probability words after the context, no matter how the remaining probability mass is distributed. With Top-P=0.8, it will include as many words as needed for their cumulative probability to reach 0.8; if the distribution is flat, that can be far more than 5 words, and if it is sharply peaked, it can be fewer.
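
Here's a minimal sketch of Top-P (nucleus) sampling over the toy food example above, again with purely illustrative probabilities:

import random

# Toy distribution from the "I'll have the..." example (illustrative values).
probs = {"salad": 0.4, "burger": 0.3, "pasta": 0.1, "steak": 0.08}

def top_p_sample(probs, p):
    # Sort tokens from most to least probable.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    # Keep adding tokens until their cumulative probability reaches p.
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Re-normalize over the nucleus and sample from it.
    total = sum(prob for _, prob in nucleus)
    tokens = [t for t, _ in nucleus]
    weights = [prob / total for _, prob in nucleus]
    return random.choices(tokens, weights=weights, k=1)[0]

print(top_p_sample(probs, p=0.8))  # samples only from {salad, burger, pasta}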

Temperature

Temperature adjusts the randomness or confidence level of the model’s predictions by scaling the log probabilities. Higher temperatures lead to more diverse but potentially nonsensical outputs, while lower temperatures yield more focused and predictable responses.

A low temperature value < 1 (e.g. 0.2 or 0.5):

  • Makes the model more confident and peaks the probability distribution
  • It concentrates most of the probability mass on most likely next words
  • This results in more coherent but also more repetitive text generations, with less creativity or exploration of less likely options

A high temperature value > 1 (e.g. 1.5 or 2.0):

  • Makes the model’s predictions more spread out and “uncertain”
  • It flattens the probability distribution, spreading it more evenly over words
  • This allows the model to more frequently sample from less likely word choices, enabling more creative, exploratory, and "surprising" text generations
  • But potentially at the cost of coherence or plausibility if taken to extremes

Earlier, I mentioned that "temperature adjusts the confidence level of the model's predictions by scaling the log probabilities."

So, here's how temperature scaling works:

  1. The language model first computes the unnormalized log probability scores for each word in the vocabulary given the previous context.
  2. These log probabilities are then divided by the temperature value: log_prob_scaled = log_prob / temperature
  3. If temperature < 1, this widens the gaps between the log probabilities, so the most likely words dominate even more strongly after normalization.
  4. If temperature > 1, it has the opposite effect: it shrinks the log probabilities towards 0, pulling them closer together.
  5. After scaling by temperature, a softmax function is applied to convert these scaled log probabilities into a proper probability distribution over the vocabulary that sums to 1.

So in essence, temperature acts as a scaling factor on the log probability values before applying softmax.

A low temperature < 1 amplifies the difference between high and low probability values, leading to a sharper distribution focused on a few likely words.

A high temperature > 1 reduces the difference between log probabilities, resulting in a flatter distribution that gives more chance to low probability words.

The probability distribution for low and high temperatures will look somewhat like this :

This explicit scaling of the log probability values is what allows temperature to control the overall confidence and spread of the resulting normalized probability distribution over the vocabulary.
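
Here's a minimal sketch of temperature scaling, assuming we start from made-up raw logits (unnormalized log probabilities) for a handful of words:

import math

# Minimal sketch of temperature scaling: divide the raw logits by the
# temperature, then apply a softmax. The logits are purely illustrative.
logits = {"mat": 4.0, "couch": 2.9, "bed": 2.2, "chair": 1.5}

def softmax_with_temperature(logits, temperature):
    # Scale every logit by the temperature.
    scaled = {t: v / temperature for t, v in logits.items()}
    # Softmax: exponentiate and normalize so the values sum to 1.
    denom = sum(math.exp(v) for v in scaled.values())
    return {t: math.exp(v) / denom for t, v in scaled.items()}

print(softmax_with_temperature(logits, temperature=0.5))  # sharper, heavily peaked on "mat"
print(softmax_with_temperature(logits, temperature=1.0))  # the model's original distribution
print(softmax_with_temperature(logits, temperature=2.0))  # flatter, more spread out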

How do these parameters work together?

Photo by Natalia Goryaeva: https://www.pexels.com/photo/bicycles-parked-beside-the-street-10033832/

How would the text generation process work if you set Temperature=0.8, Top-K=35, and Top-P=0.7?

1. First, the model computes the full unnormalized log probability distribution over the entire vocabulary based on the previous context.

2. It applies the Temperature=0.8

  • It applies the Temperature=0.8 scaling by dividing each log probability by 0.8.
  • Since 0.8 < 1, this widens the gaps between the log probabilities, amplifying the differences between likely and unlikely words
  • Effectively making the model “more confident” in its predictions before normalization

3. It applies the Top-K=35 filtering

  • It selects the 35 tokens with the highest scaled log probabilities

4. It applies the Top-P=0.7 filtering

  • It goes through the 35 surviving tokens from highest to lowest scaled probability
  • It keeps tokens, in order, until their cumulative probability mass reaches 0.7 (70%)
  • Let's say this gives a final set of 25 tokens from the original Top-K=35

5. It then renormalizes just the scaled log probabilities of these 25 final tokens to sum to 1

  • Applying a softmax to convert them to proper probabilities

6. Finally, it samples the next token from this temperature-scaled and Top-K/Top-P filtered probability distribution over the 25 token set

So in summary:

  • Temperature scaling is applied first, sharpening the distribution here since 0.8 < 1
  • Then Top-K=35 filters down to the top 35 tokens
  • Top-P=0.7 further filters that set down to the highest cumulative 70% mass
  • The final renormalized probabilities over this filtered set are used for sampling

This allows Temperature=0.8 to make the model more confident overall, while Top-K and Top-P control the breadth of sampling from 35 down to around 25 tokens.
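
Putting it all together, here's a sketch of the whole pipeline in the order described above (temperature, then Top-K, then Top-P). The vocabulary and logits are made up, and real implementations may differ in small numerical details:

import math
import random

def sample_next_token(logits, temperature=0.8, top_k=35, top_p=0.7):
    # 1-2. Temperature scaling of the raw logits, then softmax into probabilities.
    scaled = {t: v / temperature for t, v in logits.items()}
    denom = sum(math.exp(v) for v in scaled.values())
    ranked = sorted(((t, math.exp(v) / denom) for t, v in scaled.items()),
                    key=lambda kv: kv[1], reverse=True)
    # 3. Top-K: keep only the k most probable tokens.
    ranked = ranked[:top_k]
    # 4. Top-P: keep tokens, in order, until the cumulative mass reaches top_p.
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    # 5. Renormalize the surviving probabilities so they sum to 1.
    total = sum(prob for _, prob in nucleus)
    tokens = [t for t, _ in nucleus]
    weights = [prob / total for _, prob in nucleus]
    # 6. Sample the next token from the filtered, renormalized distribution.
    return random.choices(tokens, weights=weights, k=1)[0]

# Tiny made-up vocabulary just to show the function running end to end.
toy_logits = {"salad": 3.0, "burger": 2.5, "pasta": 1.8, "steak": 1.2, "soup": 0.5}
print(sample_next_token(toy_logits, temperature=0.8, top_k=35, top_p=0.7))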

Tuning for Your Use Case

https://petapixel.com/2022/07/20/ai-image-generator-dall-e-is-now-available-in-beta/

There is no single optimal setting for these parameters; the ideal values depend on your specific needs.

For creative writing, you may want higher Top-K/Top-P values along with a higher temperature to encourage more surprising and diverse generations.

For analytical tasks where precision is crucial, lower Top-K/Top-P values with a lower temperature keep the model focused on the most likely tokens.

It often takes some experimentation to find the right balance. As a starting point, values like Top-K=50, Top-P=0.95, and Temperature=0.7 provide a reasonable trade-off between coherence and creativity for open-ended language generation.

You can iteratively adjust up or down based on the level of randomness and coherence desired.
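
For instance, if you're using the Hugging Face transformers library, these parameters map directly onto generate() arguments. This is just a sketch (GPT-2 is used only as a small example model, and exact defaults can vary across versions):

# A sketch of setting these parameters in practice, assuming the Hugging Face
# transformers library (exact defaults and behaviour can vary across versions).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("I'll have the", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,      # enable sampling instead of greedy decoding
    temperature=0.7,     # < 1: sharper, more confident distribution
    top_k=50,            # keep only the 50 most likely tokens
    top_p=0.95,          # then keep the top 95% of the probability mass
    max_new_tokens=20,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))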

If you enjoyed this article, your applause would be greatly appreciated!

Conclusion

In conclusion, understanding how to effectively adjust parameters like Top-K, Top-P, and Temperature in language models is crucial for optimizing their performance and refining the quality of their outputs. By learning the nuances of each parameter and tuning them accordingly, we can fully utilize the capabilities of LLMs across a range of applications.

Check out my series of articles covering PyTorch fine-tuning techniques.

Ultimate Guide to Fine-Tuning in PyTorch



Written by Ruman

Senior ML Engineer | Sharing what I know, work on, learn and come across :)