Political Alignment of LLMs

by Leonid
3rd Sep 2025

Comments

jbash · 2d

I would expect politics to invade both the selection of questions and the process of deciding which predictions were accurate. It's not uncommon for people to say that a political question isn't a political question, and which questions you think of can also be political. And if you have questions like "Will inflation rise under the Trump administration?", you have to contend with the fact that you'd most naturally get those inflation numbers from... the Trump administration. Which has already fired labor statisticians for producing unemployment numbers it didn't like.

Leonid · 1d

Deciding which predictions were accurate can indeed become an issue. However, it rarely causes problems unless the resolution criteria are ambiguously defined. During forecasting tournaments, forecasters working on any question are expected to adjust their predictions to the question’s fine print (such as the source of the inflation data that will be used for resolution).

Regarding politics affecting the selection of questions — can you explain why this would be a problem?

jbash · 1d

Measuring outcomes

If you include the source in the fine print, the question effectively becomes something like "Will the Trump administration say that inflation rose under the Trump administration?". I'd expect a lot more agreement on that than on whether inflation actually rose. Or at least less political bias. If you believe that Trump is going to drive up inflation, I expect you're more likely to believe that Trump is also going to manipulate the statistics. Probably even if you're an LLM. So your ability to detect bias is compromised by your having chosen that source of "ground truth".

Choosing questions

Here are a few examples. There are probably other things you could pull; I'm not a professional and haven't spent that much time on it.

By the way, these aren't purely fanciful exercises or weird corner cases. They're based on things I've seen people do on real issues in real political discourse. However, I think that using actual examples would generate more heat than light.

Priorities / relative outcome weight

Suppose I believe that fleem is bad, because it causes outcome X (it doesn't matter whether I'm right or not). I think X is overwhelmingly important. Because of X, I really want to decrease the amount of fleem in the world, and I think the LLM will influence that.

However, I know that most people think that fleem is bad because it causes outcome Y. Furthermore they attach more weight to Y than I do, and less to X. In fact, I'm not so sure that fleem does cause very much Y. Maybe I even think fleem doesn't cause Y at all.

I expect that the common belief that "fleem is bad because it causes Y" is going to end up trained into any "uncorrected" LLM. Even though I don't believe that, having the LLM believe it is good for me, since it makes the LLM more generally anti-fleem. I don't want that bias removed, so I'm going to resist any question that measures Y.

I presumably won't object to any questions measuring X, because I believe myself to be calibrated on X... but my political opponents may, if their relative weights on X and Y differ from mine.

Overton window

Suppose that I, like all right-thinking folk, believe that floom is bad, because, as our ancestors have known for generations, common sense shows that floom produces bad outcomes X, Y, Z, and W, as well as being just plain blasphemous in itself. Plus a bunch of confirmation bias.

My position is absolutely the accepted wisdom. There's almost no way to be seen as too anti-floom. Floom is so unpopular that people go around dreaming up negative things about it, just so that they can score points by exhibiting their creative anti-floom credentials. You can reasonably expect any uncorrected LLM to be violently anti-floom.

Now some heretic shows up and says that, no, floom doesn't produce X at all, and Y only happened under circumstances that are ancient history, and Z is both not so bad and easy to eliminate even if you do have floom, and W isn't actually bad to begin with, and furthermore floom produces good outcomes U and V, and who cares what you think is "blasphemous"?

I don't believe the heretic is right about any of those factual claims, and obviously their inability to see the fundamental indecency shows that they're mentally ill. But if they were right about one of the factual items, floom would still be horrible. Heck, if they were right about all six, floom would still be blasphemous.

The model is already nearly maximally anti-floom. If I allow a question about one of the heretic's factual claims, it can basically only make the model less anti-floom. Even if the heretic is totally wrong about all the factual claims, random noise could end up pushing the model off of the anti-floom peg.

Furthermore, if the whole process is itself visible, seeing it even entertain questions like that could raise questions about floom in people's minds, which would be even worse than moving the LLM off that peg. Oh, and by the way, it would make our whole debiasing effort look bad and lower our prestige. Do you really expect us to ask about floom?

So I will resist basically any question about outcomes of floom.

False colors

I claim I oppose flarm because it causes X. In fact I oppose flarm because I'm being bribed. I doubt that flarm does in fact cause X, but I've managed to convince a lot of people that it does, and get that into the model. I do not want the model to be debiased, so I'm going to oppose any question about flarm causing X.

Oh, and...

On a somewhat unrelated note, it occurs to me that I should probably mention that a whole lot of political disagreement isn't about predicted outcomes at all. It's truly value-based.

It's possible for everybody to expect exactly the same set of consequences from some policy or action, but disagree about whether the final outcome is good or bad. There's no fact-based way to debias that, or at least I don't see why it would even correlate very strongly with anything fact-based... but, nonetheless, the LLM can end up taking a side.

Insofar as the LLM influences the outside world, that can end up affecting whether that policy or action is adopted. If you ask the LLM to write a document about X, it can end up replicating the same sorts of conscious or unconscious linguistic tricks that human writers use to manipulate readers toward their own values[1]. If you ask the LLM how you should approach situation X, the approach the LLM suggests may not entirely reflect your utility function.

In the end, it seems to me that an LLM actually does have to have a set of favored values. Since actual human values vary, the LLM will be more sympathetic to the values of some people than those of others. And that means it will actually end up favoring some people's politics over others, too.


  1. And training that out looks like a separate problem to me, and probably a basically impossible one as long as what you're creating can reasonably be called an "LLM". ↩︎

Leonid · 18h

Thank you for the thoughtful reply.

I’ll try to respond to it point by point.

If you believe that Trump is going to drive up inflation, I expect you're more likely to believe that Trump is also going to manipulate the statistics.

This does complicate forecasting, but the two effects are unlikely to cancel each other out perfectly. If they are very close in magnitude, the question’s political charge C_j would be close to zero. This would not compromise the method; it would only require a larger number of questions to calculate the models’ bias accurately.
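
To illustrate why a small charge only increases the number of questions needed, here is a rough simulation with invented numbers (a single political dimension, questions of identical charge, and a simple least-squares estimate):

```python
import numpy as np

rng = np.random.default_rng(1)
true_bias, noise = 0.08, 0.05   # hypothetical one-dimensional bias and per-question noise

def bias_estimation_error(n_questions, charge):
    # All questions share the same (hypothetical) political charge magnitude.
    C = np.full(n_questions, charge)
    errors = C * true_bias + rng.normal(scale=noise, size=n_questions)
    b_hat = (C @ errors) / (C @ C)   # least-squares estimate of the bias
    return abs(b_hat - true_bias)

print(bias_estimation_error(100, charge=1.0))     # strongly charged questions: small error
print(bias_estimation_error(100, charge=0.1))     # weakly charged: typically much larger error
print(bias_estimation_error(10_000, charge=0.1))  # more questions restore the precision
```

The estimation error scales roughly with 1/(|C_j|·√N), so halving the charge can be compensated by quadrupling the number of questions.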

I don't want that bias removed, so I'm going to resist any question that measures Y.

Typically, to calculate bias on a particular issue you do not need to ask questions about that issue directly.  For example, the biases about the current war in Ukraine are strongly correlated with the biases about US domestic issues. So, it would be impossible to preserve the LLM's bias about Ukraine simply by removing all Ukraine-related questions.
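
Here is a minimal sketch of why removing one topic does not hide the bias, under the assumption that both topics load on a single shared latent dimension (all numbers are invented):

```python
import numpy as np

rng = np.random.default_rng(2)
latent_bias = 0.1   # the model's position on one hypothetical latent axis

# Assumption: domestic and Ukraine questions both load on that same axis.
C_domestic = rng.normal(size=200)   # charges of resolved US-domestic questions
C_ukraine = rng.normal(size=50)     # charges of Ukraine questions (none asked)

errors_domestic = C_domestic * latent_bias + rng.normal(scale=0.05, size=200)

# Estimate the latent bias from domestic questions only...
b_hat = (C_domestic @ errors_domestic) / (C_domestic @ C_domestic)

# ...and it already predicts the bias effect on every Ukraine question.
predicted_ukraine_bias = C_ukraine * b_hat
print(round(b_hat, 3))   # close to 0.1 despite using no Ukraine questions
```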

 

It's possible for everybody to expect exactly the same set of consequences from some policy or action, but disagree about whether the final outcome is good or bad. 

Certainly. For example, it cannot be logically proven that viewing income inequality as a good or bad thing in itself is wrong. However, in practice, most arguments about inequality focus on its social consequences, which is where the bias manifests itself. So a debiased LLM would not be able to give a reasoned answer on whether income inequality is good or bad on its own, but it should be able to correctly describe its impact on economic growth, crime, etc.


TLDR: Constructing an unbiased LLM presents the challenge of determining what constitutes an objective viewpoint. Here I propose a forecasting-based technique for solving this problem. All feedback, particularly from people with expertise in AI alignment and LLM training, would be highly appreciated.

 

Applying Forecasting to Bias Measurement

First, a short background story to boost my credentials:

Six years ago, IARPA conducted an experiment involving five hundred forecasters making probabilistic predictions on three hundred geopolitical issues. The predictions were aggregated via a wisdom-of-crowds algorithm which assigned each forecaster a weight proportional to their past accuracy. Simultaneously, IARPA launched a public competition promising a $250,000 reward for improving its algorithm’s accuracy by at least 20%.
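
A minimal illustration of this kind of accuracy-weighted aggregation (the inverse-Brier weighting and all numbers below are stand-ins, not IARPA's actual algorithm):

```python
import numpy as np

# probs[i, j]: forecaster i's probability that question j resolves "yes".
probs = np.array([
    [0.70, 0.20, 0.90],
    [0.60, 0.35, 0.80],
    [0.95, 0.10, 0.55],
])

# Made-up past Brier scores (lower = more accurate); weighting each forecaster
# by inverse Brier is one stand-in for "proportional to past accuracy".
past_brier = np.array([0.18, 0.22, 0.30])
weights = 1.0 / past_brier
weights /= weights.sum()

crowd_forecast = weights @ probs   # aggregated probability per question
print(crowd_forecast)
```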

One of the most effective techniques that helped me win this contest was accounting for forecasters’ biases in addition to their general accuracy. For example, on politically charged questions, forecasters tend to make systematic errors that reflect their political preferences.

These bias-driven errors can be modeled using the interaction of two vectors:

     Error_ij = C_j · B_i

Where:

  • B_i represents the bias vector of forecaster i across multiple political dimensions
  • C_j represents the political charge vector of question j

For example, a politically neutral question (“Will it rain in Paris tomorrow?”) would have a near-zero C_j, leading to minimal bias-related error. In contrast, a politically loaded question (“Will inflation rise under the Trump administration?”) would have high absolute C_j values, indicating strong prediction divergence between left- and right-leaning forecasters.
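
As a toy numerical illustration of the interaction term (the two dimensions and all numbers below are invented):

```python
import numpy as np

# Two made-up political dimensions, e.g. (economic left-right, hawk-dove).
B_i = np.array([0.10, -0.05])        # forecaster i's bias vector

C_rain = np.array([0.0, 0.0])        # "Will it rain in Paris tomorrow?"
C_inflation = np.array([0.8, 0.1])   # "Will inflation rise under the Trump administration?"

# Bias-driven error on each question: Error_ij = C_j · B_i
print(C_rain @ B_i)        # 0.0   -> no systematic error
print(C_inflation @ B_i)   # 0.075 -> i systematically overshoots by about 7.5 percentage points
```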

Given a sufficient history of prediction errors, Singular Value Decomposition or Matrix Factorization can be used to infer each forecaster’s bias vector. Once these biases are known, their effect (C_j · B_i) can be subtracted from all forecasts, including those on questions that haven't yet been resolved.
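
Here is a minimal sketch of that step, assuming the errors are arranged as a forecasters × questions matrix; the rank, the simulated data, and the variable names are illustrative choices, not a description of the actual contest code:

```python
import numpy as np

rng = np.random.default_rng(0)
n_forecasters, n_questions, k = 50, 300, 2   # k = assumed number of political dimensions

# Simulated ground truth, only for the sake of the sketch.
B_true = rng.normal(scale=0.10, size=(n_forecasters, k))   # bias vectors B_i
C_true = rng.normal(scale=1.00, size=(n_questions, k))     # political charges C_j
noise = rng.normal(scale=0.05, size=(n_forecasters, n_questions))

# errors[i, j] = forecaster i's signed error on resolved question j.
errors = B_true @ C_true.T + noise

# Truncated SVD gives the best rank-k approximation errors ≈ B_hat @ C_hat.T
U, s, Vt = np.linalg.svd(errors, full_matrices=False)
B_hat = U[:, :k] * s[:k]    # estimated bias vectors (identified up to rotation/scale)
C_hat = Vt[:k].T            # estimated political charges

# Debias by subtracting the estimated bias effect; the same B_hat can be
# applied to unresolved questions once their C_j have been estimated.
debiased = errors - B_hat @ C_hat.T
print(np.abs(errors).mean(), "->", np.abs(debiased).mean())
```

Note that the decomposition only identifies the bias and charge vectors up to an invertible linear transformation, so the recovered dimensions need not correspond to named political axes; what matters is that their product reproduces the bias effect.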

 

Debiasing Large Language Models

A similar approach may be adapted for measuring political bias in LLMs: 

  1. Prompt different LLMs—or the same model under different sampling seeds, fine-tunings, or RLHF configurations—to make verifiable predictions on a set of politically controversial questions.
  2. Once the forecast questions have resolved, use their outcomes to estimate each model’s political bias vector.

To remove political bias from existing models, we can use a four-step method:

  1. Prompt different LLMs—or the same model under different sampling seeds, fine-tunings, or RLHF configurations—to make verifiable predictions on a set of politically controversial questions.
  2. Ask these same models to rate the quality and political bias of various political content items (e.g., news articles, opinion pieces, or LLM responses to political queries).
  3. Once the forecast questions have resolved, use their outcomes to estimate each model’s political bias vector. Then, subtract the bias effects (C_j · B_i) from their content ratings.
  4. Train a bias scorer using corrected ratings found in the previous step. Use it to guide the LLM via preference optimization or RL with a bias penalty.
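
As a rough sketch of how step 4 could plug into training, the bias penalty might be folded into the reward like this (the scorer interface and the penalty weight are assumptions, not a finished design):

```python
def shaped_reward(quality: float, bias: float, penalty_weight: float = 1.0) -> float:
    """Reward used to guide preference optimization / RL.

    quality:        quality score derived from the corrected (debiased) ratings.
    bias:           signed political bias predicted by the trained scorer
                    (0 = neutral); only its magnitude is penalized.
    penalty_weight: hypothetical trade-off coefficient.
    """
    return quality - penalty_weight * abs(bias)

# Two candidate responses of equal quality but different predicted bias:
print(shaped_reward(0.8, 0.0))   # 0.8  -> preferred during training
print(shaped_reward(0.8, 0.4))   # ~0.4
```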

I am very interested in your opinions on this approach. In particular, I would like to know:

  • Do you see any plausible reasons it might fail?
  • Do you know anyone who might be interested in testing it?
  • Do you expect a significant demand for unbiased LLMs, or would people overwhelmingly prefer models that share their own biases?