
Comment by freehorse

12 days ago

People often think that RLHF is just about "politics", but in reality it is generally about aligning the model's output with what a human would expect or want from interacting with it. This is how ChatGPT and the like become appealing. Finetuning a model primarily serves to make it respond to instructions in an expected way, e.g. you ask something and it does not start autocompleting some reddit-like dialogue resembling what it was trained on. It is there to bias the model toward certain outputs. Reducing entropy is exactly the goal, so it is no surprise they find that. The problem is that there is no inherent meaning in the finetuning set from the perspective of the model. Reduction of entropy will not happen only by removing "bad entropy", because there is no such thing.
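
To make the "reducing entropy" point concrete, here is a toy sketch (my own illustration with made-up numbers, not from the paper): finetuning/RLHF concentrates probability mass on the "expected" continuation, which is exactly a drop in the entropy of the next-token distribution.

```python
# Toy illustration: a finetuned/RLHF'd model puts most of its probability
# mass on the "expected" continuation, so the next-token distribution has
# lower Shannon entropy than the flatter base-model distribution.
# The distributions below are made up purely for illustration.
import math

def entropy(probs):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

base_model = [0.30, 0.25, 0.20, 0.15, 0.10]  # flatter: many plausible continuations
rlhf_model = [0.85, 0.08, 0.04, 0.02, 0.01]  # peaked: one "expected" continuation

print(f"base model entropy: {entropy(base_model):.2f} bits")  # ~2.23 bits
print(f"RLHF model entropy: {entropy(rlhf_model):.2f} bits")  # ~0.86 bits
```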

So is the reason LLMs don't say when they don't know something, and instead make up something that "sounds right", that RLHF has taught them to always give an answer?

And if that's the case, why? Is that really what people want an LLM to do? I feel like I would rather it say when it doesn't know something.

  • LLMs do not know what "they know" or don't know. They just autocomplete with whatever sounds most relevant based on their training set. Most probably they do not have enough "I don't know" in their training set in the first place. To have them say "I don't know" you have to finetune them heavily. So, if anything, they hallucinate a lot more without RLHF, which in this paper they call "creativity".

    • In the GPT3 days, when everyone was doing few-shot tasks (giving the LLM a couple of example question/answer pairs in the prompt), one of the big insights was that adding question/answer pairs with answers like "I don't know" and "this question doesn't make sense" caused the model to actually use those answers appropriately instead of overconfidently stating nonsense (a rough sketch of such a prompt is at the end of this comment).

      Of course that method isn't perfect (GPT-3.0 was far from perfect in general). But both in principle and in practice the models do have a notion of what they "know": knowledge is a strong activation, random noise a weaker one; you "just" have to get the model to override those weaker activations by admitting failure.

      You could draw parallels to allowing LLMs to emit pause tokens to get more time to think (https://arxiv.org/abs/2310.02226 and similar). At some level of abstraction that's also just training the model to replace uncertain answers with a special token, in the hope that it eventually reaches more certainty.
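
      For concreteness, here is a rough sketch of the kind of few-shot prompt described above (the wording is illustrative, not taken from any particular paper):

```python
# Illustrative GPT-3-era few-shot prompt: including "I don't know" and
# "this question doesn't make sense" among the example answers makes the
# model much more willing to produce them for the final question.
few_shot_prompt = """\
Q: What is the capital of France?
A: Paris.

Q: How many legs does a spider have?
A: Eight.

Q: What did Albert Einstein have for breakfast on his 30th birthday?
A: I don't know.

Q: How tall is the color blue?
A: This question doesn't make sense.

Q: {question}
A:"""

# Fill in the user's question and send the result to a base (non-chat)
# completion model via whatever completion API you use.
print(few_shot_prompt.format(question="Who was the king of the Moon in 1820?"))
```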

  • It's the other way around. RLHF is needed for the model to say "I don't know".

    • Oh, well, that's kind of what I mean. I assume the RLHF that's being done isn't teaching it to say "I don't know".

      Which makes me wonder whether it's intentional. A fairly big complaint about these systems is how they can sound confidently correct about things they don't actually know. So why train them to behave like this, if that is indeed an intentional training direction?


  • All the chat LLMs run with a non-zero temperature, which means they can be looser with the truth, or more creative.
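
    For what it's worth, "temperature" here is just a scaling of the logits before the softmax; a toy sketch (illustrative, not any vendor's actual sampler):

```python
# Toy sketch of temperature sampling: logits are divided by the temperature
# before the softmax, so higher temperatures flatten the distribution and
# make unlikely tokens (including wrong-but-plausible ones) more probable.
import math
import random

def sample_with_temperature(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    probs = [e / sum(exps) for e in exps]
    token_id = random.choices(range(len(logits)), weights=probs)[0]
    return token_id, probs

logits = [4.0, 2.0, 1.0, 0.5]                # made-up scores for four candidate tokens
_, p_low = sample_with_temperature(logits, temperature=0.2)   # near-greedy
_, p_high = sample_with_temperature(logits, temperature=1.5)  # more "creative"
print("T=0.2:", [round(p, 3) for p in p_low])
print("T=1.5:", [round(p, 3) for p in p_high])
```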

This just makes it worse. It's so much harder to get JSON output when it's RLHF'd to give a bunch of flowery language BS.
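
A common workaround is to demand JSON only and then strip whatever prose the chat-tuned model wraps around it; a best-effort sketch (the helper below is hypothetical, not part of any library):

```python
# Best-effort extraction of a JSON object from a chatty model reply:
# grab the first {...} span and parse it, ignoring the surrounding prose.
import json
import re

def extract_json(reply: str):
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model reply")
    return json.loads(match.group(0))

reply = 'Sure, happy to help! Here is the data you asked for: {"name": "Ada", "score": 42} Let me know if you need anything else!'
print(extract_json(reply))  # {'name': 'Ada', 'score': 42}
```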