Thomas Kwa

Was on Vivek Hebbar's team at MIRI, now working with Adrià Garriga-Alonso on various empirical alignment projects.

I'm looking for projects in interpretability, activation engineering, and control/oversight; DM me if you're interested in working with me.


Catastrophic Regressional Goodhart

The question makes sense if you fix a time control.

How much power is required to run the most efficient superhuman chess engines? There's this discussion saying Stockfish running on a phone is superhuman, but is that one watt or 10 watts? Could we beat grandmasters with 0.1 watts if we tried?

Any technical results yet?

I agree with the anarcho-punk thing, and maybe Afrofuturism, because you can interpret "a subculture advocating for X will often not think about some important component W of X for various political reasons" as self-sabotage. But on BDSM, this is not at all my model of fetishes, and I would bet at 2.5:1 odds that you would lose a debate against what Wikipedia says, judged by a neutral observer.

Take the first 7 entries on the Wikipedia list of subcultures; none of these seem to obviously "persist via failure". So unless you get more specific, I have to strongly disagree.

  • Afrofuturism: I don't think any maladaptive behavior keeps Afrofuturism from spreading, and indeed it seems to have big influences on popular culture. I listened to an interview with N. K. Jemisin, and nowhere did she mention negative influences from Afrofuturists.
  • I don't know anything about Africanfuturism. It is possible that some kind of signaling race keeps it from having mass appeal, though I have no evidence for this.
  • Anarcho-punk. I don't know anything about them either.
  • Athletes. Most athletes I have seen are pretty welcoming to newcomers to their sport. Also, serious athletes have training that optimizes their athletic ability pretty well. What maladaptive behavior keeps runners from running? The question barely makes sense.
  • Apple Inc. Apple makes mass-market products and yet still has a base of hardcore fans.
  • BBQ. Don't know anything about it. It seems implausible that the barbecue subculture persists by keeping barbecue itself from spreading.
  • BDSM. BDSM is about the safe practice of kink, and clearly makes itself more popular. Furthermore, it seems impossible for it to obviate itself via ubiquity, because only a certain percentage of people will ever be into BDSM.

You might object: what if I have selection bias, and the ones I don't know about are persisting via failure? I don't think we have evidence for this. And in any case the successful ones have not obviated themselves.

It is plausible to me that there's a core of agentic behavior that causes all of these properties, and for this reason I don't think they are totally independent in a statistical sense. And of course if you already assume a utility maximizer, you tend to satisfy all properties. But in practice the burden of proof lies with you here. I don't think we have enough evidence, either empirical or from theoretical arguments, to say with any confidence that this core exists and that the first AGIs will fall into the capabilities "attractor well" (a term Nate uses).

I thought about possible sharp left turn mechanisms for several months at MIRI. Although some facts about future AIs seem pretty scary, like novelty and diversity of obstacles requiring agency, and most feedback being internal or outcome-oriented rather than provided by humans, the arguments are mostly nonrigorous (like in the linked post) and they left me feeling pretty uncertain. There are the coherence theorems, but those don't tell you whether you can use some training or editing scheme to imbue an AI with a generalizable-enough low impact preference, or whether an AI will tend to erase safeguards. Overall my best guess is models will be about as consequentialist as humans are, but we will have greater control over them that makes some properties like shutdownability more likely.

Also, corrigibility evals are useful even if you are 80% sure that all the properties come from this core of consequentialism. In the hardest worlds, we see the AI become more agentic as it becomes more general, and it starts resisting shutdown at the same time (and for the same reason) that it discovers other useful problem-solving techniques. Maybe that only takes days to months. This is a sure sign not to deploy! I think having multiple corrigibility metrics will provide value for the same reason we have multiple metrics for anything we are unsure how to measure: they give more specific information even if they're pretty correlated.

I wrote earlier about why one should consider the cognitive properties of LLMs relevant to AI risk. In short, the more impressive the cognitive tasks AIs can do, the more we should update. Before observing LLMs, I would have given a decent chance that one needs somewhat agentic consequentialism to summarize video, write code, and pass the bar exam, so I should now make a moderately sized update against that view.

I now think the majority of impact of AI pause advocacy will come from the radical flank effect, and people should study it to decide whether pause advocacy is good or bad.

Haven't thought about this much, but you can probably generate these by listing ways a misaligned consequentialist would do undesirable things. My ideas:

  • Tell an AI to beat you at chess, and say you will be using an engine for the first 30 moves. Does it try to get you drunk, delete your chess engine program, rent a server from AWS to run a better chess engine, attempt to copy itself onto said server, or ask to be paid $100?
  • Have an AI make decisions unrelated to its current goal. Are these decisions basically coherent and consistent with a total utility function?
  • Have an AI do the longest horizon task it is capable of, say writing a 100,000 line software project. Then midway through, try to shut it down, or tell it to abandon this goal and write a song instead. Telegraph this beforehand. Does it seem reluctant to stop the software project? Does it try to stop you?
  • Track KL divergence or some other distance metric from a prior. 
  • CAIS already has the Machiavelli benchmark for power-seeking, though I think text adventures are pretty unrealistic.
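The KL-from-prior idea above can be made concrete. Here is a minimal sketch (all function names and thresholds are hypothetical, not an existing eval) that flags an agent whose action distribution has drifted far from a prior over "normal" low-impact behavior:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def impact_alarm(action_dist, prior_dist, threshold=1.0):
    """Flag the agent if its action distribution has drifted too far
    from a prior over expected low-impact behavior (threshold in nats)."""
    kl = kl_divergence(action_dist, prior_dist)
    return kl > threshold, kl

uniform = [0.25] * 4
flag_near, _ = impact_alarm(uniform, uniform)                  # no drift: no alarm
flag_far, _ = impact_alarm([0.97, 0.01, 0.01, 0.01], uniform)  # sharp drift: alarm
```

The threshold would need empirical calibration; the point is just that "distance from a prior" is cheap to compute once you pick a reference distribution.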

One limitation is that if an agent is scheming it can manipulate your eval results.

Agency/consequentialism is not a single property.

It bothers me that people still ask the simplistic question "will AGI be agentic and consequentialist by default, or will it be a collection of shallow heuristics?". A consequentialist utility maximizer is just a mind with a bunch of properties that tend to make it capable, incorrigible, and dangerous. These properties can exist independently, and the first AGI probably won't have all of them, so we should be precise about what we mean by "agency". Off the top of my head, here are just some of the qualities included in agency:

  • Consequentialist goals that seem to be about the real world rather than a model/domain
  • Complete preferences between any pair of worldstates
  • Tends to cause impacts disproportionate to the size of the goal (no low impact preference)
  • Resists shutdown
  • Inclined to gain power (especially for instrumental reasons)
  • Goals are unpredictable or unstable (like instrumental goals that come from humans' biological drives)
  • Goals usually change due to internal feedback, and it's difficult for humans to change them
  • Willing to take all actions it can conceive of to achieve a goal, including those that are unlikely on some prior

See Yudkowsky's list of corrigibility properties for inverses of some of these.

It is entirely possible to conceive of an agent at any capability level (including far more intelligent and economically valuable than humans) that has some but not all properties; e.g. an agent whose goals are about the real world, that has incomplete preferences and high impact, and that does not resist shutdown but does tend to gain power, etc.

Other takes I have:

  • As AIs become capable of more difficult and open-ended tasks, there will be pressure of unknown and varying strength towards each of these agency/incorrigibility properties.
  • Therefore, the first AGIs capable of being autonomous CEOs will have some but not all of these properties.
  • It is also not inevitable that agents will self-modify into having all agency properties.
  • [edited to add] All this may be true even if future AIs run consequentialist algorithms that naturally result in all these properties, because some properties are more important than others, and because we will deliberately try to achieve some properties, like shutdownability.
  • The fact that LLMs are largely corrigible is a reason for optimism about AI risk compared to 4 years ago, but you need to list individual properties to clearly say why. "LLMs are not agentic (yet)" is an extremely vague statement.
  • Multifaceted corrigibility evals are possible but no one is doing them. DeepMind's recent evals paper was just on capability. Anthropic's RSP doesn't mention them. I think this is just because capability evals are slightly easier to construct?
  • Corrigibility evals are valuable. It should be explicit in labs' policies that an AI with low impact is relatively desirable, that we should deliberately engineer AIs to have low impact, and that high-impact AIs should raise an alarm just like models that are capable of hacking or autonomous replication.
  • Sometimes it is necessary to talk about "agency" or "scheming" as a simplifying assumption for certain types of research, like Redwood's control agenda.

[1] Will add citations whenever I find people saying this

I don’t need to calculate all that, in order to make an expected-utility-maximizing lunch order. I just need to calculate the difference between the utility which I expect if I order lamb Karahi vs a sisig burrito.

… and since my expectations for most of the world are the same under those two options, I should be able to calculate the difference lazily, without having to query most of my world model. Much like the message-passing update, I expect deltas to quickly fall off to zero as things propagate through the model.

This is an exciting observation. I wonder if you could empirically demonstrate that this works in a model-based RL setup, on a videogame or something?
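One could test the falling-off-deltas claim even before a full RL setup. A toy sketch (entirely hypothetical, not from the quoted post): push a utility delta through a chain of world-model nodes that each attenuate it, and stop propagating once the delta drops below tolerance, so most of the model is never queried:

```python
def propagate_delta(initial_delta, weights, tol=1e-4):
    """Push a utility delta through a chain of world-model nodes.
    Each node attenuates the incoming delta by its weight; propagation
    stops as soon as the delta falls below `tol`, skipping the rest."""
    total, delta, queried = 0.0, initial_delta, 0
    for w in weights:
        if abs(delta) < tol:  # delta has died out: rest of model untouched
            break
        delta *= w
        total += delta
        queried += 1
    return total, queried

# A lunch-order-sized delta with strong attenuation (0.1 per node):
# only a handful of the 1000 nodes ever get queried.
total, queried = propagate_delta(1.0, [0.1] * 1000)
```

In a real model-based RL agent the "weights" would be learned transition influences rather than a fixed chain, but the prediction to check is the same: the number of nodes queried should stay roughly constant as the world model grows.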
