Comment by pessimizer

12 days ago

I think that works mathematically, but kicks the can down the road to how your original data was assembled, which was definitely with the knowledge of and usually in the belief in the usefulness of the characteristics that you're trying to extract.

The idea that the good data is secretly encoded in uncorrupted form within the bad data I think is a bad idea. It reminds me of trying to make bad mortgages into good CDOs.

> But really the purpose of RLHF is to reduce toxicity.

I don't think that's the goal, I think it's some people's goal. Those people have defined what "toxicity" means to them, and they're mistaking it for a universal. It's just a metaphor about poison, because poison is bad. It's not a coherent concept. For a business, it should be anything that drives customers away and affects profit. That can only be considered statistically: if some people think something is toxic, and other people think that not mentioning that thing is toxic, the winner is whoever improves the bottom line more or damages it less.

That's how the raw data ended up like it is in the first place.

2 comments

pessimizer

Mathnerd314 11 days ago

> it kicks the can down the road to how your original data was assembled

Well, it kicks it to a bias dataset, used in the tuning process. The raw data has no constraints, it can be the same huge corpus it is now.

> The bias dataset must be assembled with the knowledge of and usually in the belief in the usefulness of the characteristics that you're trying to extract.

Certainly, it is subjective, as I said. But that hasn't stopped research in this area, there are existing bias datasets and bias detection algorithms. Like https://huggingface.co/blog/evaluating-llm-bias#toxicity, it would be simple to complete those prompts and build a he/she dataset, and then the debiasing procedure could remove gender biases for those sorts of occupation-related prompts. It is certainly possible to argue over each data point and whether it actually reflects bias, but so far people have been more concerned with algorithms than data set quality, partly because with better algorithms you can algorithmically generate data sets.

> The idea that the good data is secretly encoded in uncorrupted form within the bad data I think is a bad idea. It reminds me of trying to make bad mortgages into good CDOs.

It is empirically true though? Like if you get the model to say something racist, and then ask it if that's racist, it will generally say yes. So the model "knows", it just is not using that knowledge effectively. Similarly with CDOs, there were people complaining about mortgage quality for years before the crisis.

> I don't think [the purpose of RLHF is to reduce toxicity] If some people think something is toxic, and other people think that not mentioning that thing is toxic, the winner is whoever improves the bottom line more or damages it less.

Well, it is true that toxicity is subjective too. But in practice it has a precise meaning, you build a dataset and score each item for toxicity. That's actually one of the things I find cool about LLMs, is that all these previously "vague" or "subjective" terms are now encoded in the model precisely. Arguably since nobody has the last say in what words mean, the LLM's opinions are as good as any, and given the amount of text the LLM has ingested I consider its opinions on language and word choice "first among equals".