Political Alignment of LLMs

by Leonid
3rd Sep 2025

Comments

jbash · 2d

I would expect politics to invade both the selection of questions and the process of deciding which predictions were accurate. It's not uncommon for people to say that a political question isn't a political question, and which questions you think of can also be political. And if you have questions like "Will inflation rise under the Trump administration?", you have to contend with the fact that you'd most naturally get those inflation numbers from... the Trump administration. Which has already fired labor statisticians for producing unemployment numbers it didn't like.

Leonid · 1d

Deciding which predictions were accurate can indeed become an issue. However, it rarely causes problems unless the resolution criteria are ambiguously defined. During forecasting tournaments, forecasters working on any question are expected to adjust their predictions to the question’s fine print (such as the source of the inflation data that will be used for resolution).

Regarding politics affecting the selection of questions — can you explain why this would be a problem?

jbash · 1d

Measuring outcomes

If you include the source in the fine print, the question effectively becomes something like "Will the Trump administration say that inflation rose under the Trump administration?". I'd expect a lot more agreement on that than on whether inflation actually rose. Or at least less political bias. If you believe that Trump is going to drive up inflation, I expect you're more likely to believe that Trump is also going to manipulate the statistics. Probably even if you're an LLM. So your ability to detect bias is compromised by your having chosen that source of "ground truth".

Choosing questions

Here are a few examples. There are probably other things you could pull; I'm not a professional and haven't spent that much time on it.

By the way, these aren't purely fanciful exercises or weird corner cases. They're based on things I've seen people do on real issues in real political discourse. However, I think that using actual examples would generate more heat than light.

Priorities / relative outcome weight

Suppose I believe that fleem is bad, because it causes outcome X (it doesn't matter whether I'm right or not). I think X is overwhelmingly important. Because of X, I really want to decrease the amount of fleem in the world, and I think the LLM will influence that.

However, I know that most people think that fleem is bad because it causes outcome Y. Furthermore they attach more weight to Y than I do, and less to X. In fact, I'm not so sure that fleem does cause very much Y. Maybe I even think fleem doesn't cause Y at all.

I expect that the common belief that "fleem is bad because it causes Y" is going to end up trained into any "uncorrected" LLM. Even though I don't believe that, having the LLM believe it is good for me, since it makes the LLM more generally anti-fleem. I don't want that bias removed, so I'm going to resist any question that measures Y.

I presumably won't object to any questions measuring X, because I believe myself to be calibrated on X... but my political opponents may, if their relative weights on X and Y differ from mine.

Overton window

Suppose that I, like all right-thinking folk, believe that floom is bad, because, as our ancestors have known for generations, common sense shows that floom produces bad outcomes X, Y, Z, and W, as well as being just plain blasphemous in itself. Plus a bunch of confirmation bias.

My position is absolutely the accepted wisdom. There's almost no way to be seen as too anti-floom. Floom is so unpopular that people go around dreaming up negative things about it, just so that they can score points by exhibiting their creative anti-floom credentials. You can reasonably expect any uncorrected LLM to be violently anti-floom.

Now some heretic shows up and says that, no, floom doesn't produce X at all, and Y only happened under circumstances that are ancient history, and Z is both not so bad and easy to eliminate even if you do have floom, and W isn't actually bad to begin with, and furthermore floom produces good outcomes U and V, and who cares what you think is "blasphemous"?

I don't believe the heretic is right about any of those factual claims, and obviously their inability to see the fundamental indecency shows that they're mentally ill. But if they were right about one of the factual items, floom would still be horrible. Heck, if they were right about all six, floom would still be blasphemous.

The model is already nearly maximally anti-floom. If I allow a question about one of the heretic's factual claims, it can basically only make the model less anti-floom. Even if the heretic is totally wrong about all the factual claims, random noise could end up pushing the model off of the anti-floom peg.

Furthermore, if the whole process is itself visible, seeing it even entertain questions like that could raise questions about floom in people's minds, which would be even worse than moving the LLM off that peg. Oh, and by the way, it would make our whole debiasing effort look bad and lower our prestige. Do you really expect us to ask about floom?

So I will resist basically any question about outcomes of floom.

False colors

I claim I oppose flarm because it causes X. In fact I oppose flarm because I'm being bribed. I doubt that flarm does in fact cause X, but I've managed to convince a lot of people that it does, and get that into the model. I do not want the model to be debiased, so I'm going to oppose any question about flarm causing X.

Oh, and...

On a somewhat unrelated note, it occurs to me that I should probably mention that a whole lot of political disagreement isn't about predicted outcomes at all. It's truly value-based.

It's possible for everybody to expect exactly the same set of consequences from some policy or action, but disagree about whether the final outcome is good or bad. There's no fact-based way to debias that, or at least I don't see why it would even correlate very strongly with anything fact-based... but, nonetheless, the LLM can end up taking a side.

Insofar as the LLM influences the outside world, that can end up affecting whether that policy or action is adopted. If you ask the LLM to write a document about X, it can end up replicating the same sorts of conscious or unconscious linguistic tricks that human writers use to manipulate readers toward their own values[1]. If you ask the LLM how you should approach situation X, the approach the LLM suggests may not entirely reflect your utility function.

In the end, it seems to me that an LLM actually does have to have a set of favored values. Since actual human values vary, the LLM will be more sympathetic to the values of some people than those of others. And that means it will actually end up favoring some people's politics over others, too.


  1. And training that out looks like a separate problem to me, and probably a basically impossible one as long as what you're creating can reasonably be called an "LLM". ↩︎

Leonid · 18h

Thank you for the thoughtful reply.

I’ll try to respond to it point by point.

If you believe that Trump is going to drive up inflation, I expect you're more likely to believe that Trump is also going to manipulate the statistics.

This does complicate forecasting, but the two effects are unlikely to cancel each other out perfectly. If they are very close in magnitude, the question’s political charge C_j would be close to zero. This would not compromise the method; it would only require a larger number of questions to calculate the models’ bias accurately.
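
To illustrate why a small charge only increases the number of questions needed, here is a rough simulation with invented numbers (a single political dimension, questions of identical charge, and a simple least-squares estimate):

```python
import numpy as np

rng = np.random.default_rng(1)
true_bias, noise = 0.08, 0.05   # hypothetical one-dimensional bias and per-question noise

def bias_estimation_error(n_questions, charge):
    # All questions share the same (hypothetical) political charge magnitude.
    C = np.full(n_questions, charge)
    errors = C * true_bias + rng.normal(scale=noise, size=n_questions)
    b_hat = (C @ errors) / (C @ C)   # least-squares estimate of the bias
    return abs(b_hat - true_bias)

print(bias_estimation_error(100, charge=1.0))     # strongly charged questions: small error
print(bias_estimation_error(100, charge=0.1))     # weakly charged: typically much larger error
print(bias_estimation_error(10_000, charge=0.1))  # more questions restore the precision
```

The estimation error scales roughly with 1/(|C_j|·√N), so halving the charge can be compensated by quadrupling the number of questions.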

I don't want that bias removed, so I'm going to resist any question that measures Y.

Typically, to calculate bias on a particular issue you do not need to ask questions about that issue directly.  For example, the biases about the current war in Ukraine are strongly correlated with the biases about US domestic issues. So, it would be impossible to preserve the LLM's bias about Ukraine simply by removing all Ukraine-related questions.
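
Here is a minimal sketch of why removing one topic does not hide the bias, under the assumption that both topics load on a single shared latent dimension (all numbers are invented):

```python
import numpy as np

rng = np.random.default_rng(2)
latent_bias = 0.1   # the model's position on one hypothetical latent axis

# Assumption: domestic and Ukraine questions both load on that same axis.
C_domestic = rng.normal(size=200)   # charges of resolved US-domestic questions
C_ukraine = rng.normal(size=50)     # charges of Ukraine questions (none asked)

errors_domestic = C_domestic * latent_bias + rng.normal(scale=0.05, size=200)

# Estimate the latent bias from domestic questions only...
b_hat = (C_domestic @ errors_domestic) / (C_domestic @ C_domestic)

# ...and it already predicts the bias effect on every Ukraine question.
predicted_ukraine_bias = C_ukraine * b_hat
print(round(b_hat, 3))   # close to 0.1 despite using no Ukraine questions
```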

 

It's possible for everybody to expect exactly the same set of consequences from some policy or action, but disagree about whether the final outcome is good or bad. 

Certainly. For example, it cannot be logically proven that viewing income inequality as a good or bad thing in itself is wrong. However, in practice, most arguments about inequality focus on its social consequences, which is where the bias manifests itself. So a debiased LLM would not be able to give a reasoned answer on whether income inequality is good or bad on its own, but it should be able to correctly describe its impact on economic growth, crime, etc.


TLDR: Constructing an unbiased LLM presents the challenge of determining what constitutes an objective viewpoint. Here I propose a forecasting-based technique for solving this problem. All feedback, particularly from people with expertise in AI alignment and LLM training, would be highly appreciated.

 

Applying Forecasting to Bias Measurement

First, a short background story to boost my credentials:

Six years ago, IARPA conducted an experiment involving five hundred forecasters making probabilistic predictions on three hundred geopolitical issues. The predictions were aggregated via a wisdom-of-crowds algorithm which assigned each forecaster a weight proportional to their past accuracy. Simultaneously, IARPA launched a public competition promising a $250,000 reward for improving its algorithm’s accuracy by at least 20%.
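
A minimal illustration of this kind of accuracy-weighted aggregation (the inverse-Brier weighting and all numbers below are stand-ins, not IARPA's actual algorithm):

```python
import numpy as np

# probs[i, j]: forecaster i's probability that question j resolves "yes".
probs = np.array([
    [0.70, 0.20, 0.90],
    [0.60, 0.35, 0.80],
    [0.95, 0.10, 0.55],
])

# Made-up past Brier scores (lower = more accurate); weighting each forecaster
# by inverse Brier is one stand-in for "proportional to past accuracy".
past_brier = np.array([0.18, 0.22, 0.30])
weights = 1.0 / past_brier
weights /= weights.sum()

crowd_forecast = weights @ probs   # aggregated probability per question
print(crowd_forecast)
```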

One of the most effective techniques that helped me win this contest was accounting for forecasters’ biases in addition to their general accuracy. For example, on politically charged questions, forecasters tend to make systematic errors that reflect their political preferences.

These bias-driven errors can be modeled using the interaction of two vectors:

     Error_ij = C_j · B_i

Where:

  • B_i represents the bias vector of forecaster i across multiple political dimensions
  • C_j represents the political charge vector of question j

For example, a politically neutral question (“Will it rain in Paris tomorrow?”) would have a near-zero C_j, leading to minimal bias-related error. In contrast, a politically loaded question (“Will inflation rise under the Trump administration?”) would have high absolute C_j values, indicating strong prediction divergence between left- and right-leaning forecasters.
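
As a toy numerical illustration of the interaction term (the two dimensions and all numbers below are invented):

```python
import numpy as np

# Two made-up political dimensions, e.g. (economic left-right, hawk-dove).
B_i = np.array([0.10, -0.05])        # forecaster i's bias vector

C_rain = np.array([0.0, 0.0])        # "Will it rain in Paris tomorrow?"
C_inflation = np.array([0.8, 0.1])   # "Will inflation rise under the Trump administration?"

# Bias-driven error on each question: Error_ij = C_j · B_i
print(C_rain @ B_i)        # 0.0   -> no systematic error
print(C_inflation @ B_i)   # 0.075 -> i systematically overshoots by about 7.5 percentage points
```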

Given a sufficient history of prediction errors, Singular Value Decomposition or Matrix Factorization can be used to infer each forecaster’s bias vector. Once these biases are known, their effect (C_j · B_i) can be subtracted from all forecasts, including those on questions that haven't yet been resolved.
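
Here is a minimal sketch of that step, assuming the errors are arranged as a forecasters × questions matrix; the rank, the simulated data, and the variable names are illustrative choices, not a description of the actual contest code:

```python
import numpy as np

rng = np.random.default_rng(0)
n_forecasters, n_questions, k = 50, 300, 2   # k = assumed number of political dimensions

# Simulated ground truth, only for the sake of the sketch.
B_true = rng.normal(scale=0.10, size=(n_forecasters, k))   # bias vectors B_i
C_true = rng.normal(scale=1.00, size=(n_questions, k))     # political charges C_j
noise = rng.normal(scale=0.05, size=(n_forecasters, n_questions))

# errors[i, j] = forecaster i's signed error on resolved question j.
errors = B_true @ C_true.T + noise

# Truncated SVD gives the best rank-k approximation errors ≈ B_hat @ C_hat.T
U, s, Vt = np.linalg.svd(errors, full_matrices=False)
B_hat = U[:, :k] * s[:k]    # estimated bias vectors (identified up to rotation/scale)
C_hat = Vt[:k].T            # estimated political charges

# Debias by subtracting the estimated bias effect; the same B_hat can be
# applied to unresolved questions once their C_j have been estimated.
debiased = errors - B_hat @ C_hat.T
print(np.abs(errors).mean(), "->", np.abs(debiased).mean())
```

Note that the decomposition only identifies the bias and charge vectors up to an invertible linear transformation, so the recovered dimensions need not correspond to named political axes; what matters is that their product reproduces the bias effect.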

 

Debiasing Large Language Models

A similar approach may be adapted for measuring political bias in LLMs: 

  1. Prompt different LLMs—or the same model under different sampling seeds, fine-tunings, or RLHF configurations—to make verifiable predictions on a set of politically controversial questions.
  2. Once the forecast questions have resolved, use their outcomes to estimate each model’s political bias vector.

To remove political bias from existing models, we can use a four-step method:

  1. Prompt different LLMs—or the same model under different sampling seeds, fine-tunings, or RLHF configurations—to make verifiable predictions on a set of politically controversial questions.
  2. Ask these same models to rate the quality and political bias of various political content items (e.g., news articles, opinion pieces, or LLM responses to political queries).
  3. Once the forecast questions have resolved, use their outcomes to estimate each model’s political bias vector. Then, subtract the bias effects (C_j · B_i) from their content ratings.
  4. Train a bias scorer using corrected ratings found in the previous step. Use it to guide the LLM via preference optimization or RL with a bias penalty.
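
As a rough sketch of how step 4 could plug into training, the bias penalty might be folded into the reward like this (the scorer interface and the penalty weight are assumptions, not a finished design):

```python
def shaped_reward(quality: float, bias: float, penalty_weight: float = 1.0) -> float:
    """Reward used to guide preference optimization / RL.

    quality:        quality score derived from the corrected (debiased) ratings.
    bias:           signed political bias predicted by the trained scorer
                    (0 = neutral); only its magnitude is penalized.
    penalty_weight: hypothetical trade-off coefficient.
    """
    return quality - penalty_weight * abs(bias)

# Two candidate responses of equal quality but different predicted bias:
print(shaped_reward(0.8, 0.0))   # 0.8  -> preferred during training
print(shaped_reward(0.8, 0.4))   # ~0.4
```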

I am very interested in your opinions on this approach. In particular, I would like to know:

  • Do you see any plausible reasons it might fail?
  • Do you know anyone who might be interested in testing it?
  • Do you expect a significant demand for unbiased LLMs, or would people overwhelmingly prefer models that share their own biases?