Comment by Imnimo
14 days ago
>T ∈ (0, 1] is a parameter called temperature which controls the “softness” of the probability distribution. In our experiments we choose T = 1.0 for maximum response variation.
Why is temperature bounded to be <=1? If you want more "creativity" out of the chat model, can you just set T higher and recover a similar distribution to the base model?
Not after RLHF tuning, due to the 'flattened logits' phenomenon (the logit-level version of the mode collapse OP documents at higher levels). All the temperature settings wind up yielding pretty much the same output, until you ramp it up so high that the output falls apart entirely. Completely unlike the base models, where you can productively tune the temperature or use very high temperatures with some screening.
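A toy sketch of why this happens (illustrative logit values, not from any real model): temperature divides the logits before the softmax, so when the top logit dominates by a huge margin, moderate temperature changes barely move the distribution, whereas a base-model-like distribution reshapes visibly.

```python
import math

def softmax_with_temperature(logits, T):
    """Convert logits to probabilities after dividing by temperature T."""
    scaled = [x / T for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

# Base-model-like "sharp but informative" logits: T visibly reshapes them.
sharp = [5.0, 2.0, 0.0]
# "Flattened"/collapsed logits: one token dominates by such a margin that
# T = 0.5 vs 2.0 makes almost no difference.
flat = [20.0, 2.0, 0.0]

for T in (0.5, 1.0, 2.0):
    print("sharp", T, [round(p, 3) for p in softmax_with_temperature(sharp, T)])
    print("flat ", T, [round(p, 3) for p in softmax_with_temperature(flat, T)])
```

With the `flat` logits the top token keeps >99% of the mass even at T = 2, while the `sharp` distribution spreads out substantially.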
Hmm, it's hard to check without access to the prompts used in the paper, but I'm skeptical that the distributions seen in e.g. Figure 2 are so different that you would have to crank up the temperature very much to bridge the gap. It looks to me like the entries that are 1-in-100 in the base model are just falling off the top-p cliff and getting set to 0.
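The "top-p cliff" is easy to see in a sketch (hypothetical next-token probabilities, not from the paper): nucleus sampling keeps only the smallest set of tokens whose cumulative probability reaches `top_p`, so a 1-in-100 tail entry gets zeroed out entirely.

```python
def top_p_filter(probs, top_p):
    """Nucleus sampling: keep the smallest set of tokens whose cumulative
    probability reaches top_p; zero out the rest and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = set(), 0.0
    for i in order:
        kept.add(i)
        cum += probs[i]
        if cum >= top_p:
            break
    filtered = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    z = sum(filtered)
    return [p / z for p in filtered]

# Hypothetical distribution: the 0.01 entry survives plain sampling
# but falls off the cliff at top_p = 0.9.
probs = [0.55, 0.30, 0.14, 0.01]
print(top_p_filter(probs, 0.9))  # last entry -> 0.0
```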
Figure 2 (https://arxiv.org/pdf/2406.05587#page=10) is not at the logit level, it's at the whole completion level (entire names classified by nationality).
So you don't know how any sampling would affect that. There could be only a few options at each token giving rise to that pattern, and higher-temperature sampling may shift it around, but it never restores the original base model behavior or the names erased by mode collapse. (Remember, the LLM is an agent, and when you are sampling, it is on-policy because you are letting it make choices of tokens, and it is steering the completion as a whole back to where it wants to be. With mode collapse, all roads lead to Rome, whether you like it or not.)
People do observe that increasing the temperature does not help, eg. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10936766/ finds basically no difference going from 0 to 0.9 (!): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10936766/bin/po... Just the flattened logits (https://arxiv.org/pdf/2303.08774#page=12&org=openai) at work.
They'll tell you "no" and claim you'll ruin your sampling, but good dynamic samplers like min_p or typical sampling are robust to high temperatures, so in practice yes.
Cite? I don't see how either of those could deal with the fact that the logits become uninformative and 'flattened' after the tuning. How can a sampler undo the erasure of information?