My blog is here. You can contact me using this form.
Start from the intuition that deception in a system is a property of the person being deceived more than it is the deceiver. It follows pretty naturally that deception is better viewed as a property of the composite system that is the agent and its environment.
The first part here feels unfair to the deceived. The second part seems like a property of successful deception, which depends crucially on the environment in addition to the AI. But this seems like too high a bar; successful deception of us, by definition, is not noticed, so if we ever notice deception it can't have been successful. I care less about whether deception will succeed and more about whether the AI will try to be deceptive in the first place. The core intuition is that if we have the latter, I assume we'll eventually get the former through better models (though I think there's a decent chance that control works for a long time, and there you care specifically about whether complex environment interactions lead to deception succeeding or not, but I don't think that's what you mean?).
The thing that seems close to this and correct, and that I think you maybe mean, is something like: deception arises in an AI if (NB: "if", not "if and only if") (1) the AI system has some goal G, (2) the environment is such that deceiving the humans is a good strategy for achieving G, and (3) there are no limits in the AI that prevent it from finding and executing that strategy (e.g. the architecture is expressive enough, the inductive biases don't massively reduce the probability of that strategy, or RLHFed constraints against being bad aren't enough). And here, (2) is of course about the environment. But to see whether this argument goes through, it doesn't seem like we need to care all that much about the real-world environment (as opposed to toy settings), because "does the real world incentivize deception" seems much less cruxy than (1) or (3).
So my (weakly held) claim is that you can study whether deception emerges in sufficiently simple environments that the environment complexity isn't a core problem. This will not let you determine whether a particular output in a complicated environment is part of a deceptive plan, but it should be fairly good evidence of whether or not deception is a problem at all.
(Also: do you mean a literal complexity class or something more informal? I assume the latter, and in that case I think it's better to not overload the term.)
1a) I got the impression that the post emphasises upper bounds more than existing proofs from the introduction, which has a long paragraph on the upper bound problem, and from reading the other comments. The rest of the post doesn't really bear this emphasis out though, so I think this is a misunderstanding on my part.
1b) I agree we should try to be able to make claims like "the model will never X". But if models are genuinely dangerous, by default I expect a good chance that teams of smart red-teamers and eval people (e.g. Apollo) to be able to unearth scary demos. And the main thing we care about is that danger leads to an appropriate response. So it's not clear to me that effective policy (or science) requires being able to say "the model will never X".
1c) The basic point is that a lot of the safety cases we have for existing products rely less on the product not doing bad things across a huge range of conditions, but on us being able to bound the set of environments where we need the product to do well. E.g. you never put the airplane wing outside its temperature range, or submerge it in water, or whatever. Analogously, for AI systems, if we can't guarantee they won't do bad things if X, we can work to not put them in situation X.
2a) Partly I was expecting the post to be more about the science and less about the field-building. But field-building is important to talk about and I think the post does a good job of talking about it (and the things you say about science are good too, just that I'd emphasise slightly different parts and mention prediction as the fundamental goal).
2b) I said the post could be read in a way that produces this feeling; I know this is not your intention. This is related to my slight hesitation around not emphasising the science over the field-building. What standards etc. are possible in a field is downstream of what the objects of study turn out to be like. I think comparing to engineering safety practices in other fields is a useful intuition pump and inspiration, but I sometimes worry that this could lead to trying to imitate those, over following the key scientific questions wherever they lead and then seeing what you can do. But again, I was assuming a post focused on the science (rather than being equally concerned with field-building), and responding with things I feel are missing if the focus had been the science.
3) It is true that optimisation requires computation, and that for your purposes, FLOPS is the right thing to care about because e.g. if doing something bad takes 1e25 FLOPS, the number of actors who can do it is small. However, I think compute should be called, well, "compute". To me, "optimisation power" sounds like a more fundamental/math-y concept, like how many bits of selection can some idealised optimiser apply to a search space, or whatever formalisation of optimisation you have. I admit that "optimisation power" is often used to describe compute for AI models, so this is in line with (what is unfortunately) conventional usage. As I said, this is a nitpick.
It seems to me that there are two unstated perspectives behind this post that inform a lot of it.
First, that you specifically care about upper-bounding capabilities, which in turn implies being able to make statements like "there does not exist a setup X where model M does Y". This is a very particular and often hard-to-reach standard, and you don't really motivate why the focus on this. A much simpler standard is "here is a setup X where model M did Y". I think evidence of the latter type can drive lots of the policy outcomes you want: "GPT-6 replicated itself on the internet and designed a bioweapon, look!". Ideally, we want to eventually be able to say "model M will never do Y", but on the current margin, it seems we mainly want to reach a state where, given an actually dangerous AI, we can realise this quickly and then do something about the danger. Scary demos work for this. Now you might say "but then we don't have safety guarantees". One response is: then get really good at finding the scary demos quickly.
Also, very few existing safety standards have a "there does not exist an X where..." form. Airplanes aren't safe because we have an upper bound on how explosive they can be, they're safe because we know the environments in which we need them to operate safely, design them for that, and only operate them within those. By analogy, this weakly suggests to control AI operating environments and develop strong empirical evidence of safety in those specific operating environments. A central problem with this analogy, though, is that airplane operating environments are much lower-dimensional. A tuple of (temperature, humidity, pressure, speed, number of armed terrorists onboard) probably captures most of the variation you need to care about, whereas LLMs are deployed in environments that vary on very many axes.
Second, you focus on the field, in the sense of its structure and standards and its ability to inform policy, rather than in the sense of the body of knowledge. The former is downstream of the latter. I'm sure biologists would love to have as many upper bounds as physicists, but the things they work on are messier and less amenable to strict bounds (but note that policy still (eventually) gets made when they start talking about novel coronaviruses).
If you focus on evals as a science, rather than a scientific field, this suggests a high-level goal that I feel is partly implicit but also a bit of a missing mood in this post. The guiding light of science is prediction. A lot of the core problem in our understanding of LLMs is that we can't predict things about them - whether they can do something, which methods hurt or help their performance, when a capability emerges, etc. It might be that many questions in this space, and I'd guess upper-bounding capabilities is one, just are hard. But if you gradually accumulate cases where you can predict something from something else - even if it's not the type of thing you'd like to eventually predict - the history of science shows you can get surprisingly far. I don't think it's what you intend or think, but I think it's easy to read this post and come away with a feeling of more "we need to find standardised numbers to measure so we can talk to serious people" and less "let's try to solve that thing where we can't reliably predict much about our AIs".
Also, nitpick: FLOPS are a unit of compute, not of optimisation power (which, if it makes sense to quantify at all, should maybe be measured in bits).
Do you have a recommendation for a good research methods textbook / other text?
This post summarises and elaborates on Neil Postman's underrated "Amusing Ourselves to Death", about the effects of mediums (especially television) on public discourse. I wrote this post in 2019 and posted it to LessWrong in 2022.
Looking back at it, I continue to think that Postman's book is a valuable and concise contribution and formative to my own thinking on this topic. I'm fond of some of the sharp writing that I managed here (and less fond of other bits).
The broader question here is: "how does civilisation set up public discourse on important topics in such a way that what is true and right wins in the limit?" and the weaker one of "do the incentives of online platforms mean that this is doomed?". This has been discussed elsewhere, e.g. here by Eliezer.
The main limitations that I see in my review are:
Overall, this seems like an important topic that could benefit greatly from more thought, and even more from evidence and plans. I hope that my review and Postman's book helped bring a bit more attention to it, but they are still far from addressing the points I have listed above.
I don't currently know of any not-extremely-gerry-mandered task where [scaffolding] actually improves task performance compared to just good prompt engineering. I've been looking for examples of this for a while, so if you do have any, I would greatly appreciate it.
Voyager is a scaffolded LLM agent that plays Minecraft decently well (by pulling in a textual description of the game state, and writing code interfacing with an API). It is based on some very detailed prompting (see the appendix), but obviously could not function without the higher-level control flow and several distinct components that the scaffolding implements.
It does much better than AutoGPT, and also the paper does ablations to show that the different parts of the scaffolding in Voyager do matter. This suggests that better scaffolding does make a difference, and I doubt Voyager is the limit.
I agree that an end-to-end trained agent could be trained to be better. But such training is expensive, and it seems like for many tasks, before we see an end-to-end trained model doing well at it, someone will hack together some scaffold monstrosity that does it passably well. In general, the training/inference compute asymmetry means that using even relatively large amounts of inference to replicate the performance of a larger / more-trained system on a task may be surprisingly competitive. I think it's plausible this gap will eventually mostly close at some capability threshold, especially for many of the most potentially-transformative capabilities (e.g. having insights that draw on a large basis of information not memorised in a base model's weights, since this seems hard to decompose into smaller tasks), but it seems quite plausible the gap will be non-trivial for a while.
This seems like an impressive level of successfully betting on future trends before they became obvious.
apparently this doom path polls much better than treacherous turn stories
Are you talking about literal polling here? Are there actual numbers on what doom stories the public finds more and less plausible, and with what exact audience?
I held onto the finished paper for months and waited for GPT-4's release before releasing it to have good timing[...]I recognize this paper was around a year ahead of its time and maybe I should have held onto it to release it later.
I held onto the finished paper for months and waited for GPT-4's release before releasing it to have good timing
I recognize this paper was around a year ahead of its time and maybe I should have held onto it to release it later.
It's interesting that paper timing is so important. I'd have guessed earlier is better (more time for others to build on it, the ideas to seep into the field, and presumably gives more "academic street cred"), and any publicity boost from a recent paper (e.g. journalists more likely to be interested or whatever) could mostly be recovered later by just pushing it again when it becomes relevant (e.g. "interview with scientists who predicted X / thought about Y already a year ago" seems pretty journalist-y).
Currently, the only way to become an AI x-risk expert is to live in Berkeley.
There's an underlying gist here that I agree with, but the this point seems too strong; I don't think there is literally no one who counts as an expert who hasn't lived in the Bay, let alone Berkeley alone. I would maybe buy it if the claim were about visiting.
These are good questions!
I agree that this exact scenario is unlikely, but I think this class of failure mode is quite plausible, for reasons I hope I've managed to spell out more directly above.
Note that all of this relies on the assumption that we get AIs of a particular power level, and of a particular goodharting level, and a particular agency/coherency level. The AIs controlling future Earth are not wildly superhuman, are plausibly not particularly coherent in their preferences and do not have goals that stretch beyond Earth, no single system is a singleton, and the level of goodharting is just enough that humans go extinct but not so extreme that nothing humanly-recognisable still exists (though the Blight implies that elsewhere in the story universe there are AI systems that differ in at least some of these). I agree it is not at all clear whether these are true assumptions. However, it's not obvious to me that LLMs (and in particular AIs using LLMs as subcomponents in some larger setup that encourages agentic behaviour) are not on track towards this. Also note that a lot of the actual language that many of the individual AIs see is actually quite normal and sensible, even if the physical world has been totally transformed. In general, LLMs being able to use language about maximising shareholder value exactly right (and even including social responsibility as part of it) does not seem like strong evidence for LLM-derived systems not choosing actions with radical bad consequences for the physical world.
Thank you for your comment! I'm glad you enjoyed the review.
Before you pointed it out, I hadn't made the connection between the type of thing that Postman talks about in the book and increasing cultural safety-ism. Another interesting take you might be interested in is by J. Storrs Hall in Where is my flying car? - he argues that increasing cultural safety-ism is a major force slowing down technological progress. You can read a summary of the argument in my review here (search for "perception" to jump to the right part of the review).
That line was intended to (mildly humorously) make the point that we realise and are aware that there are many other serious risks in the popular imagination. Our central point is that AI x-risk is grand civilisational threat #1, so we wanted to lead with that, and since people think many other things are potential civilisational catastrophes (if not x-risks) we thought it made sense to mention those (and also implicitly put AI into the reference class of "serious global concern"). We discussed, and got feedback from several others, on this opener and while there was some discussion we didn't see any fundamental problem with it. The main consideration for keeping it was that we prefer specific and even provocative-leaning writing that makes its claims upfront and without apology (e.g. "AI is a bigger threat than climate change" is a provocative statement; if that is a relevant part of our world model, seems honest to point that out).
The general point we got from your comment is that we judged the way the tone of it comes across very wrongly. Thanks for this feedback; we've changed it. However, we're confused about the specifics of your point, and unfortunately haven't acquired any concrete model of how to avoid similar errors in the future apart from "be careful about the tone of any statements that even vaguely imply something about geopolitics". (I'm especially confused about how you got the reading that we equated the threat level from Putin and nuclear weapons, and it seems to me that the extent that it is "mudslinging" or "propaganda" seems to be the extent to which acknowledging that many people think Putin is a major threat is either of those things.)
In addition to the general tone, an additional thing we got wrong here was not sufficiently disambiguating between "we think these other things are plausible [or, in your reading, equivalent?] sources of catastrophe, and therefore you need a high bar of evidence before thinking AI is a greater one", versus "many people think these are more concrete and plausible sources of catastrophe than AI". The original intended reading was "bold" as in "socially bold, relative to what many people think", and therefore making points only about public opinion.
Correcting the previous mistake might have looked like:
"If human civilisation is destroyed this century, the most likely cause is advanced AI systems. This might sound like a bold claim to many, given that we live on a planet full of existing concrete threats like climate change, over ten thousand nuclear weapons, and Vladimir Putin"
Based on this feedback, however, we have now removed any comparison or mention of non-AI threats. For the record, the entire original paragraph is:
If human civilisation is destroyed this century, the most likely cause is advanced AI systems. This is a bold claim given that we live on a planet that includes climate change, over ten thousand nuclear weapons, and Vladimir Putin. However, it is a conclusion that many people who think about the topic keep coming to. While it is not easy to describe the case for risks from advanced AI in a single piece, here we make an effort that assumes no prior knowledge. Rather than try to argue from theory straight away, we approach it from the angle of what computers actually can and can’t do.