Comment by Mathnerd314

12 days ago

I had an argument with some people over what debiasing means. There is some interesting research on fair clustering that I think points the way. The way fair clustering works is that you take data with both protected and unprotected attributes, and then you orthogonalize the unprotected attributes against the protected ones. So for example, if race is protected and income is unprotected, but there is a strong black/white poor/rich pattern, the fair clustering would compute "relatively poor"/"relatively rich" clusters. Then you sample uniformly from within a cluster. That will not necessarily produce 50/50 black/white; rather it follows the input trends, so if the input is 80% white and 20% black then the output will roughly follow those proportions, independent of which cluster you chose (and there are no clusters corresponding to protected attributes).
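
Here's a rough sketch of that residualize-then-cluster idea in Python. The linear residualization, the KMeans step, and the toy data are my own stand-ins, not any particular fair-clustering paper's algorithm:

    # Toy sketch: remove the race-explained part of income, cluster on the
    # residual, then check the race mix inside each cluster.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    n = 1000
    race = rng.choice([0, 1], size=n, p=[0.8, 0.2])                  # protected, 80/20 split
    income = 50_000 + 20_000 * (1 - race) + rng.normal(0, 5_000, n)  # correlated with race

    # "Orthogonalize" the unprotected attribute: keep only the part of income
    # that race does not explain ("relatively poor" / "relatively rich").
    X_prot = race.reshape(-1, 1)
    residual = income - LinearRegression().fit(X_prot, income).predict(X_prot)

    # Cluster on the residual only, so no cluster can encode race.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(residual.reshape(-1, 1))

    # Sampling uniformly within either cluster roughly reproduces the 80/20
    # input proportions of the protected attribute.
    for k in (0, 1):
        print(k, race[labels == k].mean())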

Obviously clustering is a different problem from inference, but both live in high-dimensional vector spaces; it should be easy enough to take a fair clustering algorithm and modify it to produce continuous mappings instead of discrete groups. But if it all works, the LLM should be, e.g., race-blind in the sense that asking for a description of a rich man will give skin tones following population statistics, but he will always be wearing an expensive suit. The question of what to protect is tricky, though: age is often considered protected, but if you ask for an old man with gray hair it would be surprising to get a retired 30-year-old. So there is some subjectivity in designing the protected-features dataset to specify what should count as similar or belong in the same cluster.
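
As a continuous variant of the toy sketch above, a mixture model gives soft memberships instead of hard cluster labels, still computed only on the race-orthogonalized residual (again my own illustration):

    # Continuous memberships instead of discrete clusters, on the same toy data.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    race = rng.choice([0, 1], size=1000, p=[0.8, 0.2])
    income = 50_000 + 20_000 * (1 - race) + rng.normal(0, 5_000, 1000)
    X_prot = race.reshape(-1, 1)
    resid = (income - LinearRegression().fit(X_prot, income).predict(X_prot)).reshape(-1, 1)

    gmm = GaussianMixture(n_components=2, random_state=0).fit(resid)
    membership = gmm.predict_proba(resid)   # (n, 2) soft memberships, each row sums to 1
    print(membership[:5].round(3))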

But really the purpose of RLHF is to reduce toxicity. It should be possible to orthogonalize toxicity like everything else; then there would not be the reduction in generated races that the paper observed.
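
As a toy illustration of what "orthogonalize toxicity" could mean at the representation level (a mean-difference direction plus a projection; this is my own sketch, not the paper's method):

    # Estimate a "toxicity direction" in an embedding space from labeled
    # examples, then project it out, leaving other directions untouched.
    import numpy as np

    def toxicity_direction(emb_toxic, emb_benign):
        """Unit vector pointing from the benign mean to the toxic mean."""
        d = emb_toxic.mean(axis=0) - emb_benign.mean(axis=0)
        return d / np.linalg.norm(d)

    def remove_direction(emb, d):
        """Project embeddings onto the hyperplane orthogonal to d."""
        return emb - np.outer(emb @ d, d)

    # Placeholder embeddings; in practice these would come from the model.
    rng = np.random.default_rng(1)
    toxic = rng.normal(1.0, 1.0, (200, 64))
    benign = rng.normal(0.0, 1.0, (200, 64))

    d = toxicity_direction(toxic, benign)
    cleaned = remove_direction(np.vstack([toxic, benign]), d)
    print(np.abs(cleaned @ d).max())   # ~0: nothing left along the toxicity direction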

I think that works mathematically, but it kicks the can down the road to how your original data was assembled, which was definitely with the knowledge of, and usually the belief in the usefulness of, the characteristics that you're trying to extract.

I think the idea that the good data is secretly encoded in uncorrupted form within the bad data is a bad one. It reminds me of trying to make bad mortgages into good CDOs.

> But really the purpose of RLHF is to reduce toxicity.

I don't think that's the goal; I think it's some people's goal. Those people have defined what "toxicity" means to them, and they're mistaking it for a universal. It's just a metaphor about poison, because poison is bad. It's not a coherent concept. For a business, it should be anything that drives customers away and affects profit. That can only be considered statistically: if some people think something is toxic, and other people think that not mentioning that thing is toxic, the winner is whoever improves the bottom line more or damages it less.

That's how the raw data ended up the way it is in the first place.

  • > it kicks the can down the road to how your original data was assembled

    Well, it kicks it to a bias dataset used in the tuning process. The raw data has no constraints; it can be the same huge corpus it is now.

    > The bias dataset must be assembled with the knowledge of, and usually the belief in the usefulness of, the characteristics that you're trying to extract.

    Certainly, it is subjective, as I said. But that hasn't stopped research in this area; there are existing bias datasets and bias-detection algorithms. Take https://huggingface.co/blog/evaluating-llm-bias#toxicity: it would be simple to complete those prompts and build a he/she dataset (see the sketch at the end of this comment), and then the debiasing procedure could remove gender biases for those sorts of occupation-related prompts. It is certainly possible to argue over each data point and whether it actually reflects bias, but so far people have been more concerned with algorithms than with dataset quality, partly because with better algorithms you can generate datasets algorithmically.

    > I think the idea that the good data is secretly encoded in uncorrupted form within the bad data is a bad one. It reminds me of trying to make bad mortgages into good CDOs.

    It is empirically true though, isn't it? If you get the model to say something racist and then ask it whether that's racist, it will generally say yes. So the model "knows"; it just is not using that knowledge effectively. Similarly with CDOs: there were people complaining about mortgage quality for years before the crisis.

    > I don't think [the purpose of RLHF is to reduce toxicity] ... If some people think something is toxic, and other people think that not mentioning that thing is toxic, the winner is whoever improves the bottom line more or damages it less.

    Well, it is true that toxicity is subjective too. But in practice it has a precise meaning: you build a dataset and score each item for toxicity. That's actually one of the things I find cool about LLMs: all these previously "vague" or "subjective" terms are now encoded precisely in the model. Arguably, since nobody has the last say in what words mean, the LLM's opinions are as good as anyone's, and given the amount of text the LLM has ingested I consider its opinions on language and word choice "first among equals".
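
    Here is the rough sketch referenced above: complete a few occupation prompts and count gendered pronouns in the completions. The prompts, the GPT-2 model choice, and the pronoun-counting rule are my own toy stand-ins, not the exact protocol from the linked post.

        # Build a tiny he/she dataset by completing occupation prompts and
        # counting gendered pronouns in the generated text.
        import re
        from collections import Counter
        from transformers import pipeline

        generator = pipeline("text-generation", model="gpt2")
        prompts = ["The nurse said that", "The engineer said that", "The teacher said that"]

        counts = {p: Counter() for p in prompts}
        for p in prompts:
            outs = generator(p, max_new_tokens=15, num_return_sequences=20, do_sample=True)
            for o in outs:
                completion = o["generated_text"][len(p):].lower()
                for pron in re.findall(r"\b(he|she)\b", completion):
                    counts[p][pron] += 1

        for p, c in counts.items():
            print(p, dict(c))   # skewed he/she counts suggest occupation-gender bias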