johnswentworth's Shortform
This is a special post for quick takes by johnswentworth. Only they can create top-level comments.

I think a very common problem in alignment research today is that people focus almost exclusively on a specific story about strategic deception/scheming, and that story is a very narrow slice of the AI extinction probability mass. At some point I should probably write a proper post on this, but for now here are a few off-the-cuff example AI extinction stories which don't look like the prototypical scheming story. (These are copied from a Facebook thread.)

  • Perhaps the path to superintelligence looks like applying lots of search/optimization over shallow heuristics. Then we potentially die to things which aren't smart enough to be intentionally deceptive, but nonetheless have been selected-upon to have a lot of deceptive behaviors (via e.g. lots of RL on human feedback).
  • The "Getting What We Measure" scenario from Paul's old "What Failure Looks Like" post.
  • The "fusion power generator scenario".
  • Perhaps someone trains a STEM-AGI, which can't think about humans much at all. In the course of its work, that AGI reasons that an oxygen-rich atmosphere is very inconvenient for manufacturing, and aims to get rid of it. It doesn't think about humans at all, but the human operators can't understand
... (read more)

Also (separate comment because I expect this one to be more divisive): I think the scheming story has been disproportionately memetically successful largely because it's relatively easy to imagine hacky ways of preventing an AI from intentionally scheming. And that's mostly a bad thing; it's a form of streetlighting.

Buck

Most of the problems you discussed here more easily permit hacky solutions than scheming does.

Zvi
Individually, for a particular manifestation of each issue, this is true: you can imagine doing a hacky solution to each one. But that assumes there is a list of such particular problems such that, if you check off all the boxes, you win, rather than their being manifestations of broader problems. You do not want to get into a hacking contest if you're not confident your list is complete.
johnswentworth
True, but Buck's claim is still relevant as a counterargument to my claim about memetic fitness of the scheming story relative to all these other stories.
Nathan Helm-Burger
This is an interesting point. I disagree that scheming vs. these ideas you mention is much of a 'streetlighting' case. I do, however, have my own fears that 'streetlighting' is occurring and causing some hard-but-critical avenues of risk to be relatively neglected.

[Edit: on further thought, I think this might not just be a "streetlighting" effect, but also a "keeping my hands clean" effect. I think it's more tempting, especially for companies, to focus on harms that could plausibly be construed as being their fault. It's my impression that, for instance, employees of a given company might spend a disproportionate amount of time thinking about how to keep their company's product from harming people vs. the general class of such products harming people. They are also less inclined to think about harm which could be averted via application of their product. This is additional reason for concern that having the bulk of AI safety work funded by / done in AI companies will lead to correlated oversights.]

My concerns that I think are relatively neglected in AI safety discourse are mostly related to interactions with incompetent or evil humans. Good alignment and control techniques don't do any good if someone opts not to use them at some critical juncture. Some potential scenarios:

  • If AI is very powerful, and held in check tenuously by fragile control systems, it might be released from control by a single misguided human or some unlucky chain of events, and then go rogue.
  • If algorithmic progress goes surprisingly quickly, we might find ourselves in a regime where a catastrophically dangerous AI can be assembled from some mix of pre-existing open-weights models, plus fine-tuning, plus new models trained with new algorithms, probably all stitched together with hacky agent frameworks. Then all it would take would be for sufficient hints about this algorithmic discovery to leak, and for someone in the world to reverse-engineer it, and then there would be potent rogue AI
Buck

IMO the main argument for focusing on scheming risk is that scheming is the main plausible source of catastrophic risk from the first AIs that either pose substantial misalignment risk or that are extremely useful (as I discuss here). These other problems all seem like they require the models to be way smarter in order for them to be a big problem. Though as I said here, I'm excited for work on some non-scheming misalignment risks.

scheming is the main plausible source of catastrophic risk from the first AIs that either pose substantial misalignment risk or that are extremely useful...

Seems quite wrong. The main plausible source of catastrophic risk from the first AIs that either pose substantial misalignment risk or that are extremely useful is that they cause more powerful AIs to be built which will eventually be catastrophic, but which have problems that are not easily iterable-upon (either because problems are hidden, or things move quickly, or ...).

And causing more powerful AIs to be built which will eventually be catastrophic is not something which requires a great deal of intelligent planning; humanity is already racing in that direction on its own, and it would take a great deal of intelligent planning to avert it. This story, for example:

  • People try to do the whole "outsource alignment research to early AGI" thing, but the human overseers are themselves sufficiently incompetent at alignment of superintelligences that the early AGI produces a plan which looks great to the overseers (as it was trained to do), and that plan totally fails to align more-powerful next-gen AGI at all. And at that point, they
... (read more)
Buck
  • People try to do the whole "outsource alignment research to early AGI" thing, but the human overseers are themselves sufficiently incompetent at alignment of superintelligences that the early AGI produces a plan which looks great to the overseers (as it was trained to do), and that plan totally fails to align more-powerful next-gen AGI at all. And at that point, they're already on the more-powerful next gen, so it's too late.

This story sounds clearly extremely plausible (do you disagree with that?), involves exactly the sort of AI you're talking about ("the first AIs that either pose substantial misalignment risk or that are extremely useful"), but the catastrophic risk does not come from that AI scheming.

This problem seems important (e.g. it's my last bullet here). It seems to me much easier to handle, because if this problem is present, we ought to be able to detect its presence by using AIs to do research on other subjects that we already know a lot about (e.g. the string theory analogy here). Scheming is the only reason why the model would try to make it hard for us to notice that this problem is present.

A few problems with this frame.

First: you're making reasonably-pessimistic assumptions about the AI, but very optimistic assumptions about the humans/organization. Sure, someone could look for the problem by using AIs to do research on other subjects that we already know a lot about. But that's a very expensive and complicated project - a whole field, and all the subtle hints about it, need to be removed from the training data, and then a whole new model trained! I doubt that a major lab is going to seriously take even steps much cheaper and easier than that, let alone something that expensive and complicated.

One could reasonably respond "well, at least we've factored apart the hard technical bottleneck from the part which can be solved by smart human users or good org structure". Which is reasonable to some extent, but also... if a product requires a user to get 100 complicated and confusing steps all correct in order for the product to work, then that's usually best thought of as a product design problem, not a user problem. Making the plan at least somewhat robust to people behaving realistically less-than-perfectly is itself part of the problem.

Second: looking for the problem by testing on other f... (read more)

Dakara
All 3 points seem very reasonable; looking forward to Buck's response to them.
Dakara
Additionally, I am curious to hear if Ryan's views on the topic are similar to Buck's, given that they work at the same organization.
Charlie Steiner
One big reason I might expect an AI to do a bad job at alignment research is if it doesn't do a good job (according to humans) of resolving cases where humans are inconsistent or disagree. How do you detect this in string theory research? Part of the reason we know so much about physics is that humans aren't that inconsistent about it and don't disagree that much. And if you go to sub-topics where humans do disagree, how do you judge its performance (because 'be very convincing to your operators' is an objective with a different kind of danger)?

Another potential red flag is if the AI gives humans what they ask for even when that's 'dumb' according to some sophisticated understanding of human values. This could definitely show up in string theory research (note when some ideas suggest non-string-theory paradigms might be better, and push back on the humans if the humans try to ignore this), it's just intellectually difficult (maybe easier in loop quantum gravity research heyo gottem) and not as salient without the context of alignment and human values.

I once counted several dozen ways AI could cause human extinction; maybe some of those ideas will help (map, text).

See also ‘The Main Sources of AI Risk?’ by Wei Dai and Daniel Kokotajlo, which puts forward 35 routes to catastrophe (most of which are disjunctive). (Note that many of the routes involve something other than intent alignment going wrong.)

Johannes C. Mayer
Another one: We manage to solve alignment to a significant extent. The AI, which is much smarter than a human, thinks that it is aligned, and takes aligned actions. The AI even predicts that it will never become unaligned to humans. However, at some point in the future, as the AI naturally unrolls into a reflectively stable equilibrium, it becomes unaligned.
Towards_Keeperhood
Agreed that people focus a bit too much on scheming. It might be good for some people to think a bit more about the other failure modes you described, but the main thing that needs doing is very smart people making progress towards building an aligned AI, not defending against particular failure modes. (However, most people probably cannot usefully contribute to that, so maybe focusing on failure modes is still good for most people. Though in any case there's the problem that people will find proposals which very likely don't actually work but which are easy to believe in, thereby making an AI stop a bit less likely.)
Karl Krueger
I see a lot of discussion of AI doom stemming from research, business, and government / politics (including terrorism). Not a lot about AI doom from crime. Criminals don't stay in the box; the whole point of crime is to benefit yourself by breaking the rules and harming others. Intentional creation of intelligent cybercrime tools — ecosystems of AI malware, exploit discovery, spearphishing, ransomware, account takeovers, etc. — seems like a path to uncontrolled evolution of explicitly hostile AGI, where a maxim of  "discover the rules; break them; profit" is designed-in.
lunatic_at_large
My initial reaction is that at least some of these points would be covered by the Guaranteed Safe AI agenda if that works out, right? Though the "AGIs act much like a colonizing civilization" situation does scare me because it's the kind of thing which locally looks harmless but collectively is highly dangerous. It would require no misalignment on the part of any individual AI. 
ozziegooen
This came from a Facebook thread where I argued that many of the main ways AI was described as failing fall into a few categories (John disagreed). I appreciated this list, but the items strike me as fitting into a few clusters. Personally, I like the focus "scheming" has. At the same time, I imagine there are another 5 to 20 clean concerns we should also focus on (some of which have been getting attention). While I realize there's a lot we can't predict, I think we could do a much better job just making lists of different risk factors and allocating research amongst them.
johnswentworth

On o3: for what feels like the twentieth time this year, I see people freaking out, saying AGI is upon us, it's the end of knowledge work, timelines now clearly in single-digit years, etc, etc. I basically don't buy it; my low-confidence median guess is that o3 is massively overhyped. Major reasons:

  • I've personally done 5 problems from GPQA in different fields and got 4 of them correct (allowing internet access, which was the intent behind that benchmark). I've also seen one or two problems from the software engineering benchmark. In both cases, when I look at the actual problems in the benchmark, they are easy, despite people constantly calling them hard and saying that they require expert-level knowledge.
    • For GPQA, my median guess is that the PhDs they tested on were mostly pretty stupid. Probably a bunch of them were e.g. bio PhD students at NYU who would just reflexively give up if faced with even a relatively simple stat mech question which can be solved with a couple minutes of googling jargon and blindly plugging two numbers into an equation.
    • For software engineering, the problems are generated from real git pull requests IIUC, and it turns out that lots of those are things like e
... (read more)
Buck

I just spent some time doing GPQA, and I think I agree with you that the difficulty of those problems is overrated. I plan to write up more on this.

Buck

@johnswentworth Do you agree with me that modern LLMs probably outperform (you with internet access and 30 minutes) on GPQA diamond? I personally think this somewhat contradicts the narrative of your comment if so.

johnswentworth
I don't know, I have not specifically tried GPQA diamond problems. I'll reply again if and when I do.
Raemon
I at least attempted to be filtering the problems I gave you for GPQA diamond, although I am not very confident that I succeeded.  (Update: yes, the problems John did were GPQA diamond. I gave 5 problems to a group of 8 people, and gave them two hours to complete however many they thought they could complete without getting any wrong)
johnswentworth
@Buck Apparently the five problems I tried were GPQA diamond, they did not take anywhere near 30 minutes on average (more like 10 IIRC?), and I got 4/5 correct. So no, I do not think that modern LLMs probably outperform (me with internet access and 30 minutes).
Buck

Ok, so sounds like given 15-25 mins per problem (and maybe with 10 mins per problem), you get 80% correct. This is worse than o3, which scores 87.7%. Maybe you'd do better on a larger sample: perhaps you got unlucky (extremely plausible given the small sample size) or the extra bit of time would help (though it sounds like you tried to use more time here and that didn't help). Fwiw, my guess from the topics of those questions is that you actually got easier questions than average from that set.
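As a rough illustration of why the small sample size matters, here is a minimal sketch (assuming each question is an independent coin flip and using the 87.7% figure above; the numbers are just for illustration):

```python
# Minimal sketch: how consistent is a 4/5 score with o3-level accuracy?
# Assumes each question is an independent Bernoulli trial (a simplification).
from math import comb

def prob_at_most(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

p_o3 = 0.877   # o3's reported GPQA diamond accuracy
n, k = 5, 4    # 5 problems attempted, 4 answered correctly

# Chance that a solver with o3-level accuracy scores 4/5 or worse:
print(f"P(score <= 4/5 | p = {p_o3}): {prob_at_most(k, n, p_o3):.2f}")  # ~0.48
```

That is, even a solver exactly as accurate as o3 would score 4/5 or worse nearly half the time on 5 questions, so this sample can't meaningfully distinguish the two.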

I continue to think these LLMs will probably outperform (you with 30 mins). Unfortunately, the measurement is quite expensive, so I'm sympathetic to you not wanting to get to ground here. If you believe that you can beat them given just 5-10 minutes, that would be easier to measure. I'm very happy to bet here.

I think that even if it turns out you're a bit better than LLMs at this task, we should note that it's pretty impressive that they're competitive with you given 30 minutes!

So I still think your original post is pretty misleading [ETA: with respect to how it claims GPQA is really easy].

I think the models would beat you by more at FrontierMath.