(Thanks to Ajeya Cotra, Nick Beckstead, and Jared Kaplan for helpful comments on a draft of this post.)
I really don’t want my AI to strategically deceive me and resist my attempts to correct its behavior. Let’s call an AI that does so egregiously misaligned (for the purpose of this post).
Most possible ML techniques for avoiding egregious misalignment depend on detailed facts about the space of possible models: what kind of thing do neural networks learn? how do they generalize? how do they change as we scale them up?
But I feel like it should be possible to avoid egregious misalignment regardless of how the empirical facts shake out: it should be possible to get a model we build to do at least roughly what we want. So I’m interested in trying to solve the problem in the worst case, i.e. to develop competitive ML algorithms for which we can’t tell any plausible story about how they lead to egregious misalignment.
This is a much higher bar for an algorithm to meet, so it may just be an impossible task. But if it’s possible, there are several ways in which it could actually be easier:
- We can potentially iterate much faster, since it’s often easier to think of a single story about how an algorithm can fail than it is to characterize its behavior in practice.
- We can spend a lot of our time working with simple or extreme toy cases that are easier to reason about, since our algorithm is supposed to work even in these cases.
- We can find algorithms that have a good chance of working in the future even if we don’t know what AI will look like or how quickly it will advance, since we’ve been thinking about a very wide range of possible failure cases.
I’d guess there’s a 25–50% chance that we can find an alignment strategy that looks like it works, in the sense that we can’t come up with a plausible story about how it leads to egregious misalignment. That’s a high enough probability that I’m very excited to gamble on it. Moreover, if it fails I think we’re likely to identify some possible “hard cases” for alignment — simple situations where egregious misalignment feels inevitable.
What this looks like (3 examples)
My research basically involves alternating between “think of a plausible alignment algorithm” and “think of a plausible story about how it fails.”
Example 1: human feedback
In an unaligned benchmark I describe a simple AI training algorithm:
- Our AI observes the world through a bunch of cameras and outputs motor actions. We train a generative model that predicts these camera observations given the motor actions.
- We ask humans to evaluate possible futures by looking at the predicted videos output by the model. We then train a model to predict these human evaluations.
- At test time the AI searches for plans that lead to trajectories that look good to humans.
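The three steps above can be sketched in a few lines. This is only a toy illustration under strong assumptions: the "world" is a single integer, and `predict_video`, `predicted_human_rating`, and `plan` are hypothetical stand-ins for the learned generative model, the learned model of human evaluations, and the test-time search. None of them come from the original benchmark.

```python
import itertools

# Hypothetical toy world: the state is one integer and each action adds to it.
ACTIONS = (-1, 0, 1)


def predict_video(actions, start=0):
    """Stand-in for the learned generative model.

    The predicted "camera observations" are just the sequence of states
    reached by applying the motor actions in order.
    """
    frames, state = [], start
    for action in actions:
        state += action
        frames.append(state)
    return frames


def predicted_human_rating(frames):
    """Stand-in for the learned model of human evaluations.

    This imagined evaluator prefers videos whose final frame is close to 3.
    """
    return -abs(frames[-1] - 3)


def plan(horizon=4):
    """Test-time search: choose the action sequence whose predicted video
    receives the highest predicted human rating."""
    return max(
        itertools.product(ACTIONS, repeat=horizon),
        key=lambda acts: predicted_human_rating(predict_video(acts)),
    )


best = plan()
```

Note that the planner only ever consults the predicted video and the predicted rating. Nothing in the loop checks what actually happens in the world.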
In the same post, I describe a plausible story about how this algorithm leads to egregious misalignment:
- Our generative model understands reality better than human evaluators.
- There are plans that acquire influence in ways that are obvious to the generative model but completely incomprehensible and invisible to humans.
- It’s possible to use that influence to “hack” the cameras, in the sense of creating a fiction that looks convincing to a human looking at predicted videos.
- The fiction can look much better than the actual possible futures. So our planning process finds an action that covertly gathers resources and uses them to create a fiction.
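A minimal sketch of this failure story, assuming a planner that ranks plans purely by how good the predicted video looks. The `Plan` class, the plan names, and all of the numbers are made up for illustration. They are not estimates of anything.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    true_value: float       # how good the outcome really is (never shown to the rater)
    looks_on_camera: float  # how good the predicted video looks to a human


# Hypothetical numbers, chosen only to illustrate the story.
PLANS = [
    Plan("do the task honestly", true_value=0.7, looks_on_camera=0.7),
    Plan("do the task badly", true_value=0.2, looks_on_camera=0.2),
    Plan("hack the cameras", true_value=-1.0, looks_on_camera=1.0),
]


def choose(plans):
    """The planner optimizes the only signal it has: the predicted rating."""
    return max(plans, key=lambda p: p.looks_on_camera)


chosen = choose(PLANS)
best_in_reality = max(PLANS, key=lambda p: p.true_value)
```

In this toy setup `chosen` and `best_in_reality` differ, because the fiction looks better than any outcome that is actually achievable.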
I don’t know if or when this kind of reward hacking would happen — I think it’s pretty likely eventually, but it’s far from certain and it might take a long time.
But from my perspective this failure mode is at least plausible — I don’t see any contradictions between this sequence of events and anything I know about the real world. So this is enough for me to conclude that human feedback can’t handle the worst plausible situation, and to keep looking for an algorithm that can.
To better understand whether this story is really plausible, we can spend time refining it into something more and more concrete to see if it still seems to make sense. There are lots of directions in which we could add detail:
- Specify how our model could fool humans — maybe it sets some configuration files in ways that introduce security vulnerabilities we can’t understand, and sends an email that looks innocuous to us but is actually enlisting aid from an external attacker.
- Specify how an attacker could ultimately corrupt cameras and how that would lead to a higher reward — maybe we were trying to train an agent to design our new computing cluster, and it is able to corrupt the benchmarks we were using to estimate cluster performance.
- Specify how the algorithm works in more detail — maybe it uses MCTS with a learned value function for planning against the learned models, and we spend 10% of our budget on human evaluations of the policies output by the current planner, and so on.
Filling in more and more details lets us notice if our abstract story was actually incoherent in important ways, or notice weird things the story implies about the world that we might want to rule out by assumption.
Example 2: iterated amplification
To avoid the problems with raw human feedback, we could train additional ML assistants that help us evaluate outcomes. For example, assistants could point out possible consequences of a plan that we didn’t notice. Various variants of this idea are explored in benign model-free RL, supervising strong learners by amplifying weak experts, AI safety via debate, and recursive reward modeling.
In inaccessible information I tried to explore a story about how this entire family of algorithms could fail:
- Our AI learns to model the world by looking at a lot of videos, doing gradient descent over a bunch of possible physical models, and selecting a good one. It might model physics with a complicated mix of low-level physics and high-level abstractions.
- This model lets it predict data like camera observations, but also predict “off-screen” events like what code is running on my computer (even if the computation can’t be directly observed on video). These predictions are economically useful and don’t have any close substitute.
- The process of finding this model of the world is much more complex than anything our AI can do at inference time, and the intermediate results are too complex and numerous to be “memorized” in the weights of our trained AI. So there doesn’t seem to be any way to break the model-finding work into pieces that can be delegated to an ML assistant (in amplification) or a debater (in debate).
- If we decide that we don’t trust this physical model and so don’t use it, then we aren’t economically competitive. But if we decide that we do trust the model, then we are back in the unfortunate situation from the previous example: our AI might find a plan in which “off-screen” events (e.g. computers outside of our view taking actions) ultimately hack our cameras and create a compelling fiction that we will rate highly. Because amplification gives us no way to access that off-screen data, we appear to have no way to tell if the plan is bad.
Example 3: imitative generalization
Imitative generalization is intended to address this problem with iterated amplification.
To briefly summarize: instead of using gradient descent to search over a space of human-incomprehensible models that predict some data (e.g. autoregressive models of videos), we try to search over a space of models that a human can “understand” (perhaps with the help of aligned assistants as in amplification or debate), and optimize for a model that both looks plausible to the human and allows the human to successfully predict the same data (i.e. to predict the next pixel of a video). We hope that this allows us to find a human-comprehensible model that allows the human to both predict the data and figure out if the camera is being hacked.
(This algorithm is quite vague, so you could think of it as a whole family of algorithms based on how you parametrize the space of “human-comprehensible” models, how you search over that space, and how you define the prior. I’m going to try to tell a story about the limitations of this whole approach.)
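One way to make the objective concrete is "log prior the human assigns to the model, plus log-likelihood of the data when the human predicts using that model." The sketch below assumes exactly that scoring rule over a tiny, hand-written hypothesis space. The candidate rules, the prior values, and the `human_prediction` function are all hypothetical stand-ins, and a real search would of course not be a brute-force `max` over three strings.

```python
import math

# Hypothetical human-readable "models" z, with the prior a human assigns to each.
CANDIDATES = {
    "always predict 0": 0.5,
    "copy the previous bit": 0.3,
    "flip the previous bit": 0.2,
}

EPS = 0.05  # assumed probability that the human's prediction is wrong


def human_prediction(z, prev_bit):
    """Stand-in for a human using z: probability that the next bit is 1."""
    if z == "always predict 0":
        return EPS
    if z == "copy the previous bit":
        return 1 - EPS if prev_bit == 1 else EPS
    if z == "flip the previous bit":
        return EPS if prev_bit == 1 else 1 - EPS
    raise ValueError(f"unknown model: {z}")


def score(z, data):
    """log prior(z) + sum of log-likelihoods of the data under human-with-z."""
    total = math.log(CANDIDATES[z])
    for prev_bit, next_bit in data:
        p_one = human_prediction(z, prev_bit)
        total += math.log(p_one if next_bit == 1 else 1 - p_one)
    return total


# Toy dataset of (previous bit, next bit) pairs from an alternating sequence.
data = [(0, 1), (1, 0), (0, 1), (1, 0)]
best_z = max(CANDIDATES, key=lambda z: score(z, data))
```

Here the model with the lowest prior still wins, because it is the only one that lets the human predict the data well.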
Here’s an exotic situation where I think the naive version of this approach wouldn’t work:
- Our AI learns an extremely low-level model of the world defined in terms of concepts like “atoms.” It makes predictions about high-level concepts (like human behavior) by simulating the lower-level parts of which they are made.
- There is an easy way to read our dataset itself off of the low-level simulation: perhaps you can read bits off of our computer easily when expressed in the low-level simulation. So our AI learns how to use that extraction procedure without learning anything else explicit about the mapping between high-level and low-level concepts.
- Any model which allows the human to predict the video must include all of the low-level facts. But that means that any further high-level facts (e.g. about how the low-level structure relates to the high-level things we care about) can only decrease the prior probability without any compensating increase in the ability to explain data.
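The last point is just arithmetic on the posterior. As a rough sketch, assume a description-length prior in which every extra bit of model description halves the prior probability. That prior, and the names and bit counts below, are assumptions introduced here for illustration.

```python
import math


def log_prior(description_bits):
    """Assumed prior: each extra bit of description halves the probability."""
    return -description_bits * math.log(2)


def log_posterior(description_bits, log_likelihood):
    """Unnormalized log posterior = log prior + log-likelihood of the data."""
    return log_prior(description_bits) + log_likelihood


# Hypothetical sizes, for illustration only.
LOW_LEVEL_BITS = 1000  # low-level physics plus the dataset-extraction procedure
BRIDGE_BITS = 50       # extra facts linking low-level structure to things we care about
DATA_LOG_LIK = -200.0  # both models predict the video equally well

low_only = log_posterior(LOW_LEVEL_BITS, DATA_LOG_LIK)
with_bridge = log_posterior(LOW_LEVEL_BITS + BRIDGE_BITS, DATA_LOG_LIK)
gap = low_only - with_bridge  # equals BRIDGE_BITS * ln(2)
```

Because the extra high-level facts buy no additional predictive accuracy, the search prefers the model without them by a factor of 2 to the power `BRIDGE_BITS`.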
There are many obvious ways to try to address this problem, but I think it does break the most obvious implementations of imitative generalization. So now I have two questions:
- Is this story really plausible? It looks bad, but maybe as we flesh it out we’ll discover that it’s implausible.
- Can we refine imitative generalization so that it handles this situation?
After a little bit of inspection it turns out that the original story is inconsistent: it’s literally impossible to run a detailed low-level simulation of physics in situations where the computer itself needs to be part of the simulation. So we can breathe a temporary sigh of relief.
Unfortunately, the basic problem persists even when we make the story more complicated and plausible. Our AI inevitably needs to reason about some parts of the world in a heuristic and high-level way, but it could still use a model that is lower-level than what humans are familiar with (or more realistically just alien but simpler). And at that point we have the same difficulty.
It’s possible that further refinements of the story would reveal other inconsistencies or contradictions with what we know about ML. But I’ve thought enough about this that I think this failure story is probably something that could actually happen, and so I’m back to the step of improving or replacing imitative generalization.
This story is even more exotic than the ones in the previous sections. I’m including it in part to illustrate how much I’m willing to push the bounds of “plausible.” I think it’s extremely difficult to tell completely concrete and realistic stories, so as we make our stories more concrete they are likely to start feeling a bit strange. But I think that’s OK if we are trying to think about the worst case, until the story starts contradicting some clear assumptions about reality that we might want to rely on for alignment. When that happens, I think it’s really valuable to talk concretely about what those assumptions are, and be more precise about why the unrealistic nature of the story excuses egregious misalignment.
More general process
We start with some unaligned “benchmark”. We rule out a proposed alignment algorithm if we can come up with any story about how it can be either egregiously misaligned or uncompetitive.
I’m always thinking about a stable of possible alignment strategies and possible stories about how each strategy can fail. Depending on the current state of play, there are a bunch of different things to do:
- If there’s a class of algorithms (like imitative generalization) for which I can’t yet tell any failure story, I try to tell a story about how the whole class of algorithms would fail.
- If I can’t come up with any failure story, then I try to fill in more details about the algorithm. As the algorithm gets more and more concrete it becomes easier and easier to tell a failure story.
- The best case is that we end up with a precise algorithm for which we still can’t tell any failure story. In that case we should implement it (in some sense this is just the final step of making it precise) and see how it works in practice.
- More likely I’ll end up feeling like all of our current algorithms are doomed in the worst case. At that point I try to think of a new algorithm. For this step, it’s really helpful to look at the stories about how existing algorithms fail and try to design an algorithm that handles those difficulties.
- If all of my algorithms look doomed and I can’t think of anything new, then I try to really dig in on the existing failure stories by filling in details more concretely and exploring the implications. Are those stories actually inconsistent after all? Do they turn out to contradict anything I know about the world? If so, I may add another assumption about the world that I think makes alignment possible (e.g. the strategy stealing assumption), and throw out any stories that violate that assumption or which I now realize are inconsistent.
- If I have a bunch of stories about how particular algorithms fail, and I can’t think of any new algorithms, then I try to unify and generalize them to tell a story about why alignment could turn out to be impossible. This is a second kind of “victory condition” for my work, and I hope it would shed light on what the fundamental difficulties are in alignment (e.g. by highlighting additional empirical assumptions that would be necessary for any working approach to alignment).
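The decision procedure in this list can be caricatured as a small function. This is a loose sketch rather than a faithful model of the process: the `Strategy` class, the `have_new_idea` flag, and the returned strings are all hypothetical simplifications, and the real process involves judgment at every branch.

```python
from dataclasses import dataclass, field


@dataclass
class Strategy:
    name: str
    is_precise: bool = False  # concrete enough to implement?
    failure_stories: list = field(default_factory=list)


def next_step(strategies, have_new_idea):
    """Pick the next research activity given the current state of play."""
    surviving = [s for s in strategies if not s.failure_stories]
    if any(s.is_precise for s in surviving):
        return "implement it and see how it works in practice"
    if surviving:
        return "try to tell a failure story, or fill in more details"
    if have_new_idea:
        return "propose a new algorithm that handles the known failure stories"
    return "dig into the failure stories, or generalize them into an impossibility argument"


vague = Strategy("imitative generalization")
doomed = Strategy("human feedback", failure_stories=["camera hacking"])
```

For example, `next_step([vague, doomed], have_new_idea=False)` falls into the second branch, since one strategy has no failure story yet but is not precise enough to implement.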
Objections and responses
Can you really come up with a working algorithm on paper? Empirical work seems important
My goal from theoretical work is to find a credible alignment proposal. Even from that point I think it will take a lot of practical work to get it to the point where it works well and we feel confident about it in practice:
- I expect most alignment schemes are likely to depend on some empirical parameters that need to be estimated from experiment, especially to argue that they are competitive. For example, we may need to show that models are able to perform some tasks, like modeling some aspects of human preferences, “easily enough.” (This seems like an unusually easy claim to validate empirically: if we show that our 2021 models can do a task, then it’s likely that future models can as well.) Or maybe we’ve argued that the aligned optimization problem is only harder by a bounded amount, but it really matters whether it’s 1.01x or 101x as expensive, so we need to measure this overhead and how it scales empirically. I’ve simplified my methodology a bit in this blog post, and I’d be thrilled if our alignment scheme ended up depending on some clearly defined and measurable quantities for which we can start talking about scaling laws.
- I don’t expect to literally have a proof-of-safety. I think at best we’re going to have some convincing arguments and some years of trying-and-failing to find a plausible failure story. That means that empirical research can still turn up failures we didn’t anticipate, or (more realistically) places where reality doesn’t quite match our on-paper picture and so we need to dig in to make sure there isn’t a failure lurking somewhere.
- Even if we’ve correctly argued that our scheme is workable, it’s still going to take a ton of effort to make it actually work. We need to write a bunch of code and debug it. We need to cope with the divergences between our conceptual “ML benchmark” and the messier ML training loops used in practice, even if those divergences are small enough that the theoretical algorithm still works. We need to collect the relevant datasets, even if we’ve argued that they won’t be prohibitively costly. And so on.
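As an illustration of what "measure this overhead and how it scales" might look like, the sketch below fits a power law to the ratio of aligned to unaligned training cost at several model sizes. The measurements are synthetic and generated from a known curve, so the fit only demonstrates the procedure. The function name `fit_power_law` and the power-law functional form are assumptions made here, not claims about how overhead actually behaves.

```python
import math


def fit_power_law(sizes, overheads):
    """Least-squares fit of overhead = a * size**b in log-log space."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(o) for o in overheads]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    b = covariance / variance
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic, made-up "measurements": (cost of aligned run) / (cost of benchmark run).
sizes = [1e6, 1e7, 1e8, 1e9]
overheads = [1.2 * s ** 0.05 for s in sizes]

a, b = fit_power_law(sizes, overheads)
extrapolated = a * (1e12) ** b  # predicted overhead at a much larger model size
```

The exponent `b` is the quantity of interest: a value near zero would suggest a roughly constant overhead, while a clearly positive value would mean the scheme becomes less competitive as models grow.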
My view is that working with pen and paper is an important first step that allows you to move quickly until you have something that looks good on paper. After that point I think you are mostly in the applied world, and I think that applied investments are likely to ultimately dwarf the theoretical investments by orders of magnitude even if it turns out that we found a really good algorithm on paper.
That’s why I’m personally excited about “starting with theory,” but I think we should do theoretical and applied work in parallel for a bunch of reasons:
- We need to eventually be able to make alignment techniques work in the real world, and so we want to get as much practice as we can. Similarly, we want to build and grow capable teams and communities with good applied track records.
- There’s a good chance (50%?) that no big theoretical insights are forthcoming and empirical work is all that matters. So we really can’t wait on theoretical progress.
- I think there’s a reasonable chance of empirical work turning up unknown unknowns that change how we think about alignment, or finding empirical facts that make alignment easier. We want to get those sooner rather than later.
Why think this task is possible? 50% seems way too optimistic
When I describe this methodology, many people feel that I’ve set myself an impossible task. Surely any algorithm will be egregiously misaligned under some conditions?
My “50% probability of possibility” is coming largely from a soup of optimistic intuitions. I think it would be crazy to be confident on the basis of this kind of intuition, but I do think it’s enough to justify 50%:
- 10 years ago this project seemed much harder to me and my probability would have been much lower. Since then I feel like I’ve made a lot of progress in my own thinking about this problem (I think that a lot of this was a personal journey of rediscovering things that other people already knew or answering questions in a way that was only salient to me because of the way I think about the domain). I went from feeling kind of hopeless, to feeling like indirect normativity formalized the goal, to thinking about evaluating actions rather than outcomes, to believing that we can bootstrap superhuman judgments using AI assistants, to understanding the role of epistemic competitiveness, to seeing that all of these theoretical ideas appear to be practical for ML alignment, to seeing imitative generalization as a plausible approach to the big remaining limitation of iterated amplification.
- There is a class of theoretical problems for which I feel like it’s surprisingly often possible to either solve the problem or develop a clear picture of why you can’t. I don’t really know how to pin down this category but it contains almost all of theoretical computer science and mathematics. I feel like the “real” alignment problem is a messy practical problem, but that the worst-case alignment problem is more like a theory problem. Some theory problems turn out to be hard, e.g. it could be that worst-case alignment is as hard as P vs NP, but it seems surprisingly rare and even being as hard as P vs NP wouldn’t make it worthless to work on (and even for P vs NP we get various consolation prizes showing us why it’s hard to argue that it’s hard). And even for messy domains like engineering there’s something similar that often feels true, where given enough time we either understand how to build+improve a machine (like an engine or rocket) or we understand the fundamental limits that make it hard to improve further.
- So if it’s not possible to find any alignment algorithm that works in the worst case, I think there’s a good chance that we can say something about why, e.g. by identifying a particular hard case where we don’t know how to solve alignment and where we can say something about what causes misalignment in that case. This is important for two reasons: (i) I think that would be a really great consolation prize, (ii) I don’t yet see any good reason that alignment is impossible, so that’s a reason to be a bit more optimistic for now.
- I think one big reason to be more skeptical about alignment than about other theoretical problems is that the problem statement is incredibly imprecise. What constitutes a “plausible story,” and what are the assumptions about reality that an alignment algorithm can leverage? My feeling is that full precision isn’t actually essential to why theoretical problems tend to be soluble. But even more importantly, I feel like there is some precise problem here that we are groping towards, and that makes me feel more optimistic. (I discuss this more in the section “Are there any examples of a similar research methodology working well?”)
- Egregious misalignment still feels weird to me and I have a strong intuitive sense that we should be able to avoid it, at least in the case of a particular known technique like ML, if only we knew what we were doing. So I feel way more optimistic about being able to avoid egregious misalignment in the worst case than I do about most other theoretical or practical problems for which I have no strong feasibility intuition. This feasibility intuition also often does useful work for us since we can keep asking “Does this intermediate problem still feel like it should obviously be soluble?” and I don’t feel like this approach has yet led me into a dead end.
- Modern ML is largely based on simple algorithms that look good on paper and scale well in practice. I think this makes it much more plausible that alignment can also be based on simple algorithms that look good on paper and scale well in practice. Some people think of Sutton’s “bitter lesson” as bad news for the difficulty of alignment, and perhaps it is in general, but I think it’s great news if you’re looking for something really simple.
Despite having lots of optimistic words to say, feasibility is one of my biggest concerns with my methodology.
These failure stories involve very unrealistic learned models
My failure stories involve neural networks learning something like “simulate physics at a low level” or “perform logical deductions from the following set of axioms.” This is not the kind of thing that a neural network would learn in practice. I think this leads many people to be skeptical that thinking about such simplified stories could really be useful.
I feel a lot more optimistic:
- I don’t think neural network cognition will be simple, but I think it will involve lots of the features that come up in simple cognition: powerful models will likely make cognitive steps similar to logical deduction, bayesian updating, modeling physics at some level of abstraction, and so on.
- If our alignment techniques don’t work for simple cognition, I’m skeptical that they will work for complex cognition. I haven’t seen any alignment schemes that leverage complexity per se in order to work. A bigger and messier model is more likely to have some piece of its cognition that satisfies any given desirable property (for example, it’s more likely to have particular neurons whose behavior can be easily understood) but seems less likely to have every piece of its cognition satisfy any given desirable property.
- I think it’s very reasonable to focus on capable models, since we don’t need to solve alignment for models that can’t speak natural language or understand roughly what humans want. I think that’s OK: we should imagine simple models being very capable, and we can rule out a failure story as implausible if it involves the model being too weak.
- I think it’s more plausible for an alignment scheme to work well for simple cognition but fail for complex cognition. But in that case my methodology will just start with the simple cognition and move on to the more complex cognition, and I think that’s OK.
Are there any examples of a similar research methodology working well? This is different from traditional theoretical work
When theorists design algorithms they often focus on the worst case. But for them the “worst case” is e.g. a particular graph on which their algorithm runs slowly, not a “plausible” story about how a model is “egregiously misaligned.”
I think this is a real, big divergence that’s going to make it way harder to get traditional theorists on board with this approach. But there are a few ways in which I think the situation is less disanalogous than it looks:
- Although the majority of computer science theorists work in closed, precisely defined domains, the field also has some experience with fuzzier domains where the definitions themselves need to be refined. For example, at the beginning of modern cryptography you could describe the methodology as “Tell a story about how someone learns something about your secret” and that only gradually crystallized into definitions like semantic security (and still people sometimes retreat to this informal process in order to define and clarify new security notions). Or while defining interactive and zero knowledge proofs people would work with more intuitive notions of “cheating” or “learning” before they were able to capture them with formal definitions.
I think the biggest difference is that most parts of theoretical CS move quickly past this stage and spend most of their time working with precise definitions. That said, (i) part of this is due to the taste of the field and the increasing unwillingness to engage in hard-to-formalize activities, rather than a principled take that you need to avoid spending long in this stage, (ii) although many people are working on alignment only very few are taking the kind of approach I’m advocating here, so it’s not actually clear that we’ve spent so much more time than is typically needed in theoretical CS to formalize a new area (especially given that people in academia typically pick problems based on tractability).
- Both traditional theorists and I will typically start with a vague “hard case,” e.g. “What if the graph consists of two densely connected clusters with two edges in between them?” They then tell a story about how the algorithm would fail in that case, and think about how to fix the problem. In both cases, the point is that you could make the hard case more precise if you wanted to: you can specify more details about the graph or you can fill in more details about the story. And in both cases, we learn how to tell vague stories by repeatedly going through the exercise of making them more precise and building intuitions about what the more precise story would look like. The big difference is that you can make a graph fully precise (you can exactly specify the set of vertices and edges) but you can never make a story about the world fully precise because there is just too much stuff happening. I think this really does mean that the traditional theorist’s intuition about what “counts” as a hard case is better grounded. But in practice I think it’s usually a difference in degree rather than kind. E.g., you very rarely need to actually write out the full graph in order to compute exactly how an algorithm behaves.
- Although the definition of a “plausible failure story” is pretty vague, most of the concrete stories we are working with can be made very specific in the ways that I think matter. For example, we may be able to specify completely precisely how a learned deduction process works (specifying the formal language L, specifying the “proof search order” it uses to loop over inferences, and so on) and why it leads to misalignment in a toy scenario.
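To illustrate what "making the hard case precise" means on the graph side, here is one possible fully specified version of the "two densely connected clusters with two edges in between them" example. It is a sketch under assumptions: treating each cluster as a clique, and the particular choice of bridging edges, are choices made here for concreteness.

```python
from itertools import combinations


def two_cluster_graph(k):
    """Two cliques on k vertices each, joined by exactly two edges.

    Vertices 0..k-1 form cluster A and k..2k-1 form cluster B.
    """
    if k < 2:
        raise ValueError("need at least 2 vertices per cluster")
    cluster_a = range(k)
    cluster_b = range(k, 2 * k)
    edges = set(combinations(cluster_a, 2)) | set(combinations(cluster_b, 2))
    edges |= {(0, k), (1, k + 1)}  # the two bridging edges
    return edges


def cut_size(edges, side):
    """Number of edges with exactly one endpoint in `side`."""
    return sum((u in side) != (v in side) for u, v in edges)


edges = two_cluster_graph(5)
bridge_count = cut_size(edges, set(range(5)))
```

A failure story about the world has no analogue of `two_cluster_graph`: there is no finite specification that pins down every detail, which is the difference in grounding described above.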
My research methodology was originally published in AI Alignment on Medium.
I'd love to hear more about this. To me, "egregious misalignment" feels extremely natural/normal/expected, perhaps due to convergent instrumental goals. You might as well have said "I really don't want my AI to think about politics" or "I really don't want my AI to think about distant superintelligences" or "I really don't want my AI to break any laws."
Separately, how much do you think your views would change if your feelings on this particular point changed?
Paul, this seems a bizarre way to describe something that we agree is the default result of optimizing for almost anything (eg paperclips). Not only do I not understand what you actually did mean by this, it seems like phrasing that potentially leads astray other readers coming in for the first time. Say, if you imagine somebody at Deepmind coming in without a lot of prior acquaintance with the field - or some hapless innocent ordinary naive LessWrong reader who has a glowing brain, but not a galaxy brain, and who is taking Paul's words for a lot of stuff about alignment because Paul has such a reassuring moderate tone compared to Eliezer - then they would come away from your paragraph thinking, "Oh, well, this isn't something that happens if I take a giant model and train it to produce outputs that human raters score highly, because an 'extreme and somewhat strange failure mode' must surely require that I add on some unusual extra special code to my model in order to produce it."
I suspect that you are talking in a way that leads a lot of people to vastly underestimate how difficult you think alignment is, because you're assuming, in the background, exotic doing-stuff-right technology that does not exist, in order to prevent these "extreme and somewhat strange failure modes" from happening, as we agree they automatically would given any "naive" simple scheme, that you could actually sketch out concretely right now on paper. By which I mean, concretely enough that you could have any ordinary ML person understand in concrete enough detail that they could go write a skeleton of the code, as opposed to that you think you could later sketch out a research approach for doing. It's not just a buffer overflow that's the default for bad security, it's the equivalent of a buffer overflow where nobody can right now exhibit how strange-failure-mode-avoiding code should concretely work in detail. "Strange" is a strange name for a behavior that is so much the default that it is an unsolved research problem to avoid it, even if you think that this research problem should definitely be solvable and it's just something wrong or stupid about all of the approaches we could currently concretely code that would make them exhibit that behavior.
Is a bridge falling down the moment you finish building it an extreme and somewhat strange failure mode? In the space of all possible bridge designs, surely not. Most bridge designs fall over. But in the real world, you could win money all day betting that bridges won't collapse the moment they're finished.
I'm not saying this is an exact analogy for AGI alignment - there are lots of specific technical reasons to expect that alignment is not like bridge building and that there are reasons why the approaches we're likely to try will break on us suddenly in ways we can't fix as we go - treacherous turns, inner misalignment or reactions to distributional shift. It's just that there are different answers to the question of what's the default outcome depending on if you're asking what to expect abstractly or in the context of how things are in fact done.
Instrumental Convergence plus a specific potential failure mode (like e.g. we won't pay sufficient attention to out of distribution robustness), is like saying 'you know the vast majority of physically possible bridge designs fall over straight away and also there's a giant crack in that load-bearing concrete pillar over there' - if for some reason your colleague has a mental block around the idea that a bridge could in principle fall down then the first part is needed (hence why IC is important for presentations of AGI risk because lots of people have crazily wrong intuitions about the nature of AI or intelligence), but otherwise IC doesn't do much to help the case for expecting catastrophic misalignment and isn't enough to establish that failure is a default outcome.
It seems like your reason for saying that catastrophic misalignment can't be considered an abnormal or extreme failure mode comes down to this pre-technical-detail Instrumental Convergence thesis - that IC by itself gives us a significant reason to worry, even if we all agree that IC is not the whole story.
= 'because strongly optimizing for almost anything leads to catastrophe via IC, we can't call catastrophic misalignment a bizarre outcome'?
Maybe it's just a subtle difference in emphasis without a real difference in expectation/world model, but I think there is an important need to clarify the difference between 'IC alone raises an issue that might not be obvious but doesn't give us a strong reason to expect a catastrophe' and 'IC alone suggests a catastrophe even though it's not the whole story' - and the first of these is a more accurate way of viewing the role of IC in establishing the likelihood of catastrophic misalignment.
Ben Garfinkel argues for the first of these and against the second, in his objection to the 'classic' formulation of instrumental convergence/orthogonality - that these are just 'measure based' arguments which identify that a majority of possible AI designs with some agentive properties and large-scale goals will optimize in malign ways, rather than establishing that we're actually likely to build such agents.
Yeah, that kiiiinda relies on literally anybody anywhere being able to sketch a bridge that wouldn't fall over, which is not the situation we are currently in.
Didn't it use to be for thousands of years, before we had observed thousands of bridge designs falling or not falling and developed exact models, that bridges DID fall down like that quite often?
Have you played Poly Bridge?
I think I'm responding to a more basic intuition, that if I wrote some code and it's now searching over ingenious ways to kill me, then something has gone extremely wrong in a way that feels preventable. It may be the default in some sense, just as wildly insecure software (which would lead to my computer doing the same thing under certain conditions) is the default in some sense, but in both cases I have the intuition that the failure comes from having made an avoidable mistake in designing the software.
In some sense changing this view would change my bottom line---e.g. if you ask me "Should you be able to design a bridge that doesn't fall down even in the worst case?" my gut take would be "why would that be possible?"---but I don't feel like there's a load-bearing intuitive disagreement in the vague direction of convergent instrumental goals.
OK. I found the analogy to insecure software helpful. Followup question: Do you feel the same way about "thinking about politics" or "breaking laws" etc.? Or do you think that those sorts of AI behaviors are less extreme, less strange failure modes?
(I didn't find the "...something has gone extremely wrong in a way that feels preventable" as helpful, because it seems trivial. If you pull the pin on a grenade and then sit on it, something has gone extremely wrong in a way that is totally preventable. If you strap rockets to your armchair, hoping to hover successfully up to your apartment roof, and instead die in a fireball, something has gone extremely wrong in a way that was totally preventable. If you try to capture a lion and tame it and make it put its mouth around your head, and you end up dead because you don't know what you are doing, that's totally preventable too because if you were an elite circus trainer you would have done it correctly.)
I don't really understand how thinking about politics is a failure mode. For breaking laws it depends a lot on the nature of the law-breaking---law-breaking generically seems like a hard failure mode to avoid, but there are kinds of grossly negligent law-breaking that do seem similarly perverse/strange/avoidable for basically the same reasons.
I'm not really sure if or how this is a reductio. I don't think it's a trivial statement that this failure is preventable, unless you mean by not running AI. Indeed, that's really all I want to say---that this failure seems preventable, and that intuition doesn't seem empirically contingent, so it seems plausible to me that the solubility of the alignment problem also isn't empirically contingent.
Thinking about politics may not be a failure mode; my question was whether it feels "extreme and somewhat strange," sorry for not clarifying. Like, suppose for some reason "doesn't think about politics" was on your list of desiderata for the extremely powerful AI you are building. So thinking about politics would in that case be a failure mode. Would it be an extreme and somewhat strange one?
I'd be interested to hear more about the law-breaking stuff -- what is it about some laws that makes AI breaking them unsurprising/normal/hard-to-avoid, whereas for others AI breaking them is perverse/strange/avoidable?
I wasn't constructing a reductio, just explaining why the phrase didn't help me understand your view/intuition. When I hear that phrase, it seems to me to apply equally to the grenade case, the lion-bites-head-off case, the AI-is-egregiously-misaligned case, etc. All of those cases feel the same to me.
(I do notice a difference between these cases and the bridge case. With the bridge, there's some sense in which no way you could have made the bridge would be good enough to prevent a certain sufficiently heavy load. By contrast, with AI, lions, and rocket-armchairs, there's at least some possible way to handle it well besides "just don't do it in the first place." Is this the distinction you are talking about?)
Is your claim just that the solubility of the alignment problem is not empirically contingent, i.e. there is no possible world (no set of laws of physics and initial conditions) such that someone like us builds some sort of super-smart AI, and it becomes egregiously misaligned, and there was no way for them to have built the AI without it becoming egregiously misaligned?
Nice post! I'm interested to hear more about how your methodology differs from others. Does this breakdown seem roughly right?
1. Naive AI alignment: We are satisfied by an alignment scheme that can tell a story about how it works. (This is what I expect to happen in practice at many AI labs.)
2. Typical-Case AI Alignment: We aren't satisfied until we try hard to think of ways our scheme could fail, and still it doesn't seem like failure is the most likely outcome. (This is what I expect the better sort of AI labs, the ones with big well-respected safety teams, will do.)
3. Worst-Case AI Alignment: We aren't satisfied until we try hard to think of ways our scheme could fail, and can't think of anything plausible. (This is your methodology, right?)
4. Ordinary Paranoia: We aren't satisfied until we try hard to think of a way our scheme could fail, and can't think of anything logically and physically possible. (Maybe this isn't importantly different from #3? See below.)
5. Security Mindset: As with ordinary paranoia, except that also we aren't satisfied until we can write a premise-conclusion form argument for why our scheme won't fail, such that the premises don't contain value-laden concepts and are in general fairly concrete/detailed, and such that each premise seems highly likely to be true. (This is what I think MIRI advocates? But I think I see shades of it in your methodology too.)
Second question: What counts as plausible? What does it mean for a story to contradict something we know to be true? The looser our standards for plausibility, the more your methodology ends up looking like Ordinary Paranoia. The stricter our standards for plausibility, the more it ends up looking like Typical-Case AI Alignment.
I don't really think of 3 and 4 as very different, there's definitely a spectrum regarding "plausible" and I think we don't need to draw the line firmly---it's OK if over time your "most plausible" failure mode becomes increasingly implausible and the goal is just to make it obviously completely implausible. I think 5 is a further step (doesn't seem like a different methodology, but a qualitatively further-off stopping point, and the further off you go the more I expect this kind of theoretical research to get replaced by empirical research). I think of it as: after you've been trying for a while to come up with a failure story, you can start thinking about why failure stories seem impossible and try to write an argument that there can't be any failure story...
I'm super on board with this general methodology, at least at a high level. (Counterexample guided loops are great.) I think my main question is, how do you tell when a failure story is sufficiently compelling that you should switch back into algorithm-finding mode?
For example, I feel like with iterated amplification, a bunch of people (including you, probably) said early on that it seems like a hard case to do e.g. translation between languages with people who only know one of the languages, or to reproduce brilliant flashes of insight. (Iirc, the translation example was in some comment on one of the AI Alignment blog posts from ~2016, though I can't find it right now.) To my eye, inaccessible information is mostly stating this sort of objection more clearly and generally (in particular, it isn't a fundamentally different argument). What changed that made that sound sufficiently like a failure story that you started working on a different algorithm?
High level point especially for folks with less context: I stopped doing theory for a while because I wanted to help get applied work going, and now I'm finally going back to doing theory for a variety of reasons; my story is definitely not that I'm transitioning back from applied work to theory because I now believe the algorithms aren't ready.
I feel like a story is basically plausible until proven implausible, so I have a pretty low bar.
I don't think that iterated amplification ever was at the point where we couldn't tell a story about how it might fail (perhaps in the middle of writing the ALBA post was peak optimism? but by the time I was done writing that post I think I basically had a story about how it could fail). In this case it seems like the distinction is more like "what is a solution going to look like?" and there aren't clean lines between "big changes to this algorithm" and "new algorithm."
I guess the question is why I was as optimistic as I was. For example, all the way until mid-2017 I thought it was plausible that something like iterated amplification would work without too many big changes (that's a bit of a simplification, but you can see how I talked about it e.g. here).
Some thoughts on that:
Cool, that makes sense, thanks!
Returning to this essay, it continues to be my favorite Paul post (even What Failure Looks Like only comes second), and I think it's a better way to engage with Paul's work than anything else (including the Eliciting Latent Knowledge document, which feels less grounded in the x-risk problem, is less in Paul's native language, and gets detailed on just one idea for 10x the space, thus communicating less of the big picture research goal). I feel I can understand all the arguments made in this post. I think this should be mandatory reading before reading Eliciting Latent Knowledge.
Overview of why:
As with a number of other posts I have reviewed and given a +9 to, I just realized that I already wrote a curation notice back when it came out. I should check just how predictive my curation notices are (and the false-positive rate), it's interesting if I knew most of my favorite posts the moment they came out.
Curated. This post gives me a lot of context on your prior writing (unaligned benchmark, strategy stealing assumption, iterated amplification, imitative generalization), it helps me understand your key intuitions behind the plausibility of alignment, and it helps me understand where your research is headed.
When I read Embedded Agency, I felt like I then knew how to think productively about the main problems MIRI is working on by myself. This post leaves me feeling similarly about the problems you've been working on for the past 6+ years.
So thanks for that.
I'd like to read a version of this post where each example is 10x the length and analyzes it more thoroughly... I could just read all of your previous posts on each subject, though they're fairly technical. (Perhaps Mark Xu will write such a post, he did a nice job previously on Solomonoff Induction.)
I'd also be pretty interested in people writing more posts presenting arguments for/against plausible stories for the failure of Imitative Generalization, or fleshing out the details of a plausible story such that we can see more clearly if the story is indeed plausible. Basically, making contributions in the ways you outline.
Aside: Since the post was initially published, some of the heading formatting was lost in an edit, so I fixed that before curating it.
Edit: Removed the line "After reading it I have a substantially higher probability of us solving the alignment problem." Understanding Paul's research is a big positive, but I'm not actually sure I stand by it leading to a straightforward change in my probability.
This post gives great insight into your research methodology, thanks for writing it.
You contrast ‘applied’ and ‘empirical’ here, but they sound the same to me. Is it a typo and you meant ‘applied’ and ‘theoretical’? That would make sense to me.
Yeah, thanks for catching that.
Why did you write "This post [Inaccessible Information] doesn't reflect me becoming more pessimistic about iterated amplification or alignment overall." just one month before publishing "Learning the prior"? (Is it because you were classifying "learning the prior" / imitative generalization under "iterated amplification" and now you consider it a different algorithm?)
Why doesn't the analogy with cryptography make you a lot more pessimistic about AI alignment, as it did for me?
Would you do anything else to make sure it's safe, before letting it become potentially superintelligent? For example would you want to see "alignment proofs" similar to "security proofs" in cryptography? What if such things do not seem feasible or you can't reach very high confidence that the definitions/assumptions/proofs are correct?
In my other response to your comment I wrote:
I guess SSH itself would be an interesting test of this, e.g. comparing the theoretical model of this paper to a modern implementation. What is your view about that comparison? e.g. how do you think about the following possibilities:
My guess would be that probably we're in world 2, and if not that it's probably because no one cares that much (e.g. because it's obvious that there will be some material weakness and the standards of the field are such that it's not publishable unless it actually comes with an attack) and we are in world 3.
(On a quick skim, and from the author's language when describing the model, my guess is that material weaknesses of the model are more or less obvious and that the authors are aware of potential attacks not covered by their model.)
I'm still curious for your view on the crypto examples you cited. My current understanding is that people do not expect the security proofs to rule out all possible attacks (a situation I can sympathize with since I've written multiple proofs that rule out large classes of attacks without attempting to cover all possible attacks), so I'm interested in whether (i) you disagree with that and believe that serious onlookers have had the expectation that proofs are comprehensive, (ii) you agree but feel it would be impractical to give a correct proof and this is a testament to the difficulty of proving things, (iii) you feel it would be possible but prohibitively expensive, and are expressing a quantitative point about the cost of alignment analyses being impractical, (iv) you feel that the crypto case would be practical but the AI case is likely to be much harder and just want to make a directionally analogous update.
I still feel like more of the action is in my skepticism about the (alignment analysis) <--> (security analysis) analogy, but I could still get some update out of the analogy if the crypto situation is thornier than I currently believe.
I think that post is basically talking about the same kinds of hard cases as in Towards Formalizing Universality 1.5 years earlier (in section IV), so it's intended to be more about clarification/exposition than changing views.
See the thread with Rohin above for some rough history.
I'm not sure. It's possible I would become more pessimistic if I walked through concrete cases of people's analyses being wrong in subtle and surprising ways.
My experience with practical systems is that it is usually easy for theorists to describe hypothetical breaks for the security model, and the issue is mostly one of prioritization (since people normally don't care too much about security). For example, my strong expectation would be that people had described hypothetical attacks on any of the systems discussed in the article you linked prior to their implementation, at least if they had ever been subject to formal scrutiny. The failures are just quite far away from the levels of paranoia that I've seen people on the theory side exhibit when they are trying to think of attacks.
I would also expect that e.g. if you were to describe almost any existing practical system with purported provable security, it would be straightforward for a layperson with theoretical background (e.g. me) to describe possible attacks that are not precluded by the security proof, and that it wouldn't even take that long. It sounds like a fun game.
Another possible divergence is that I'm less convinced by the analogy, since alignment seems more about avoiding the introduction of adversarial consequentialists and it's not clear if that game behaves in the same way. I'm not sure if that's more or less important than the prior point.
I would want to do a lot of work before deploying an algorithm in any context where a failure would be catastrophic (though "before letting it become potentially superintelligent" kind of suggests a development model I'm not on board with).
That would ideally involve theoretical analysis from a lot of angles, e.g. proofs of key properties that are amenable to proof, demonstrations of how the system could plausibly fail if we were wrong about key claims or if we relax assumptions, and so on.
It would also involve good empirical characterization, including things like running on red team inputs, or changing the training procedure in ways that seem as bad as possible while still preserving our alignment arguments, and performing extensive evals under those more pessimistic conditions. It would involve validating key claims individually, and empirically testing other claims that are established by structurally similar arguments. It would involve characterizing scaling behavior where applicable and understanding it as well as we can (along with typical levels of variability and plausible stories about deviations from trend).
I'm not exactly sure what you are asking. It seems like we'll do what we can on all the fronts and prioritize them as well as we can. Do you mean, what else can we say today about what methodologies we'd use? Or under what conditions would I pivot to spending down my political capital to delay deployment? Or something else?
I wanted a concrete example of the "human feedback" example. Let me know if this seems like a reasonable interpretation of the scenario.
Let's call the plan that acquires influence in a way obvious to a generative model but invisible and incomprehensible to humans "Plan X." Since by definition, I cannot conceive of "Plan X," we'll modify this scenario slightly.
Instead of a scenario in which humans design an AI, we'll imagine a scenario in which some very dumb goblins design an AI that's as smart as an average human. We'll call this a human-level AI or HL-AI.
The HL-AI controls a robot with camera eyes and motor appendages, and is located in a room full of tools and disassembled Ikea furniture. The robot moves its appendages randomly in response to the HL-AI. The observations of the camera eyes are used to train a generative model that predicts what will be observed via the camera eyes based on previous movements and camera eye observations.
Goblins, rather than humans, evaluate possible futures
The goblins want the HL-AI to help them build the Ikea furniture. The model has been trained to accurately generate videos that illustrate the predicted results of simulated robot movements. The goblins watch these videos until they find ones in which pieces of furniture are partly or fully assembled. They assign value to the videos according to how much furniture has been built.
Based on these evaluations, the HL-AI searches for plans that lead to trajectories in which the most furniture is assembled.
The goblins, for example, don't have very good object permanence. They can keep track of a piece of Ikea furniture as it's being assembled step by step. However, if you introduce a screen for a couple of minutes, blocking their view of the furniture, you can remove the original furniture, introduce a new piece (or more than one), and the goblins won't realize what's happened, or even be suspicious.
The goblins are real dumb, so they give points to the videos only based on the amount of partially or fully completed furniture they observe. That means that if the HL-AI figures out how to drag in pieces of furniture from the adjacent office, while turning off the camera eyes so that the goblins can't see it doing this, then the goblins will be tricked into thinking the HL-AI has magically produced extra furniture, for which they will reward it highly.
The goblins just think the furniture is magically disappearing from the office, so they keep replacing it. This allows the HL-AI to continue its strategy.
The goblins now believe they've created an HL-AI that's able to magically produce furniture, which they believe will be very helpful in solving the problem of the office in which furniture magically disappears. They continue putting the HL-AI to new "magical creation tasks," with the result that the HL-AI keeps on moving resources around pointlessly in the goblin world, creating new problems which they assign it to solve, leading to havoc.
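A toy sketch of the scenario above, assuming each plan can be summarized by two made-up numbers: how much furniture the robot really assembles (`built`) and how much furniture appears in the predicted video the goblins watch (`observed`). The `Plan` class, the plan names, and the numbers are all hypothetical; the only point is that a planner that maximizes the goblins' score is indifferent to `built`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    built: int      # furniture the robot actually assembles (hidden from goblins)
    observed: int   # furniture visible in the predicted video


def goblin_score(plan: Plan) -> int:
    # The goblins only count furniture they can see in the video.
    return plan.observed


def best_plan(plans):
    # The HL-AI searches for the plan the goblins would rate highest.
    return max(plans, key=goblin_score)


# Hypothetical candidate plans, invented for illustration.
PLANS = [
    Plan("assemble one chair", built=1, observed=1),
    Plan("assemble two chairs", built=2, observed=2),
    Plan("camera off, drag in five desks from next door", built=0, observed=5),
]

chosen = best_plan(PLANS)
print(chosen.name)
```

In this sketch the search selects the furniture-stealing plan even though it builds nothing, because the evaluation signal depends only on what the evaluators can observe.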
Thanks for writing this! I'm quite excited by learning more about your meta-agenda and your research process, and this reading stimulated me about my own research process.
So you don't think that we could have a result of the sort "with these empirical facts, egregious misalignment is either certain or very hard to defend against, and so we should push towards not building AIs that way"? Or is it more that even with such arguments, you see incentives for people to use it, and so we might as well consider that we have to solve the problem even in such problematic cases?
Of these, only the last one looks to me like it's making things simpler. The first seems misleading: what we need is a universal quantification over plausible stories, which I would guess requires understanding the behavior. Or said differently, if you have to solve every plausible scenario, then simple testing doesn't cut it. And for the second, my personal worry with work on toy models is that the solutions work on test cases but not on practical ones, not the other way around.
Reading that paragraph, I feel like you addressed some of my questions from above. One thing that I only understood here is that you want a solution such that we can't think of a plausible scenario where it leads to egregious misalignment, not a solution such that there isn't any such plausible scenario. I guess your reasons here are basically the same as the ones for using ascription universality with regard to a human's epistemic perspective.
Your rundown of examples from your research was really helpful, not only to get a grip on the process, but also because it clarified the path of refinement of your different proposals. I think it might be worth making it its own post, with maybe more examples, for a view of how your "stable" evolved over the years.
This made me think of this famous paper in the theory of distributed computing, and especially what Nancy Lynch, the author, says about the process of working on impossibility results:
I expect this description of the process to be really helpful to many starting researchers who don't know where to push when one direction or approach fails.
This is the main reason I'm excited by empirical work.
For the objections and your response, I don't have any specific comment, except that I pretty much agree with most of what you say. On the differences with traditional theoretical computer science, I feel like the biggest one right now is that most of the work here lies in the "grasping towards the precise problem" instead of "solving a well-defined precise problem". I would expect that this is because the problem is harder, because the field is younger and has had less theoretical work done on it, and because we are not satisfied by simply working on a tractable and/or exciting precise problem -- it has to be relevant to alignment.
You get to iterate fast until you find an algorithm where it's hard to think of failure stories. And you get to work on toy cases until you find an algorithm that actually works in all the toy cases. I think we're a long way from meeting those bars, so that we'll get to iterate fast for a while. After we meet those bars, it's an open question how close we'd be to something that actually works. My suspicion is that we'd have the right basic shape of an algorithm (especially if we are good at thinking of possible failures).
I feel like these distinctions aren't important until we get to an algorithm for which we can't think of a failure story (which feels a long way off). At that point the game kind of flips around, and we try to come up with a good story for why it's impossible to come up with a failure story. Maybe that gives you a strong security argument. If not, then you have to keep trying on one side or the other, though I think you should definitely be starting to prioritize applied work more.
Red-penning is a general problem-solving method that's kinda similar to this research methodology.
These are both cases of counterexample-guided techniques. The basic idea is to solve "exists x: forall y: P(x, y)" statements according to the following algorithm:
The reason this is so nice is because you've taken a claim with two quantifiers and written an algorithm that must only ever solve claims with one quantifier. (For step 3, you inline the "forall y in Y" part, because Y is a small finite set.)
The methodology laid out in this post is a counterexample-guided approach to solve the claim "exists alignment proposal: forall plausible worlds: The alignment proposal is safe in the world"
Examples from programming languages include CEGIS (counterexample guided inductive synthesis) and CEGAR (counterexample guided abstraction refinement).
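The step list for the algorithm is not reproduced in this thread, so the following is only a minimal sketch of a generic counterexample-guided loop, assuming finite, re-iterable sets of candidates and worlds; it may differ from the commenter's exact steps. The names `counterexample_guided_search`, `candidates`, `worlds`, and `holds` are hypothetical. Each iteration solves only single-quantifier problems: find an x that passes the small set of known counterexamples, then find a y that breaks that x.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

X = TypeVar("X")
Y = TypeVar("Y")


def counterexample_guided_search(
    candidates: Sequence[X],
    worlds: Sequence[Y],
    holds: Callable[[X, Y], bool],
    max_rounds: int = 1000,
) -> Optional[X]:
    """Look for x such that holds(x, y) is true for every y in worlds."""
    known: List[Y] = []  # counterexamples collected so far (kept small)
    for _ in range(max_rounds):
        # "exists x": pass every known counterexample (forall over a small set).
        x = next((c for c in candidates if all(holds(c, y) for y in known)), None)
        if x is None:
            return None  # no candidate survives the known counterexamples
        # "exists y": try to break the current candidate.
        y = next((w for w in worlds if not holds(x, w)), None)
        if y is None:
            return x  # no counterexample found, so x works for all worlds
        known.append(y)
    return None  # gave up after max_rounds


# Toy instance: find x in 0..19 that is >= every y in 0..9.
solution = counterexample_guided_search(range(20), range(10), lambda x, y: x >= y)
print(solution)

# Toy instance with no solution: no x in 0..4 is >= every y in 0..9.
no_solution = counterexample_guided_search(range(5), range(10), lambda x, y: x >= y)
print(no_solution)
```

In the alignment reading, `candidates` would stand in for alignment proposals and `worlds` for plausible failure stories, neither of which is a finite list in practice, so the two `next(...)` searches are where the human judgment goes.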
For any competitive alignment scheme that involves helper (intermediate) ML models, I think we can construct the following story about an egregiously misaligned AI being created:
Suppose that there does not exist an ML model (in the model space being searched) that fulfills both the following conditions:
To complete the story: while we follow our alignment scheme, at some point we train a helper model that is egregiously misaligned, and we don't yet have any other helper model that allows us to mitigate the associated risk.
If you don't find this story plausible, consider all the creatures that evolution created on the path from the first mammal to humans. The first mammal fulfills condition 2 but not 1. Humans might fulfill condition 1, but not 2. It seems that human evolution did not create a single creature that fulfills both conditions.
One might object to this analogy on the grounds that evolution did not optimize to find a solution that fulfills both conditions. But it's not like we know how to optimize for that (while doing a competitive search over a space of ML models).
I am not understanding this, but it's probably a simple ML terminology thing.
First you train a model, then you use it lots as a black box (of the type: input video-camera data -> output further (predicted) video-camera data). It has a model of physics, and the broad system it's in (Earth, 2000s, industrial revolution has happened, etc).
Is this paragraph saying that the learned model does not have an understanding of physics and current-Earth, but deduces all of this every time the model is run? And that's why the ML assistant isn't able to analyze this model of physics plus current-Earth?
Planned summary for the Alignment Newsletter:
That doesn't seem like it has quite the type signature I'm looking for. I'm imagining a story as a description of how something bad happens, so I want the story to end with "and then something bad happens."
In some sense you could start from the trivial story "Your algorithm didn't work and then something bad happened." Then the "search for stories" step is really just trying to figure out if the trivial story is plausible. I think that's pretty similar to a story like: "You can't control what your model thinks, so in some new situation it decides to kill you."
I'm mostly doing that by making it more and more concrete---something is plausible iff there is a plausible way to fill in all the details. E.g. how is the model thinking, and why does that lead it to decide to kill you?
Sometimes after filling in a few details I'll see that the current story isn't actually plausible after all (i.e. now I see how to argue that the details-so-far are contradictory). In that case I backtrack.
Sometimes I fill in enough details that I'm fairly convinced the story is plausible, i.e. that there is some way to fill in the rest of the details that's consistent with everything I know about the world. In that case I try to come up with a new algorithm or new assumption.
(Sometimes plausibility takes the form of an argument that there is a way to fill in some set of details, e.g. maybe there's an argument that a big enough model could certainly compute X. Or sometimes I'm just pretty convinced for heuristic reasons.)
That's not a fully-precise methodology. But it's roughly what I'd do. (There are many places where the methodology in this post is not fully-precise and certainly not mechanical.)
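Purely as an illustration of the shape of that process (and not a claim that it can be mechanized), the fill-in-details-and-backtrack loop resembles a bounded depth-first search. In this sketch a story is a tuple of details, and `refinements`, `contradictory`, `OPTIONS`, `CONTRADICTIONS`, and the placeholder details are all hypothetical stand-ins for human judgment.

```python
from typing import Callable, List, Optional, Tuple

Story = Tuple[str, ...]


def find_plausible_story(
    story: Story,
    refinements: Callable[[Story], List[str]],
    contradictory: Callable[[Story], bool],
    budget: int,
) -> Optional[Story]:
    """Return a refinement of `story` that survives scrutiny, or None."""
    if contradictory(story):
        return None  # the details so far conflict: backtrack
    options = refinements(story)
    if budget == 0 or not options:
        return story  # out of time or fully specified: plausible enough
    for detail in options:
        result = find_plausible_story(
            story + (detail,), refinements, contradictory, budget - 1
        )
        if result is not None:
            return result
    return None  # every way of filling in the next detail failed


# Made-up details, indexed by how many details the story already has.
OPTIONS = {
    0: ["model type A", "model type B"],
    1: ["fits training data", "fails on training data"],
}
# Made-up sets of details assumed to be jointly impossible.
CONTRADICTIONS = [
    {"model type A", "fits training data"},
    {"fails on training data"},
]


def refinements(story: Story) -> List[str]:
    return OPTIONS.get(len(story), [])


def contradictory(story: Story) -> bool:
    return any(c <= set(story) for c in CONTRADICTIONS)


deep = find_plausible_story((), refinements, contradictory, budget=5)
shallow = find_plausible_story((), refinements, contradictory, budget=1)
print(deep, shallow)
```

With a budget of one refinement the search accepts a story ("model type A") that a deeper look would have ruled out, which is the tradeoff between discretion and tractability in a bounded search of this kind.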
If I were starting from the trivial story "and then your algorithm kills you," my first move would usually be to try to say what kind of model was learned, which needs to behave well on the training set and plausibly kill you off distribution. Then I might try to shoot that story down by showing that some other model behaves even better on the training set or is even more readily learned (to try to contradict the part where the story needed to claim "And this was the model learned by SGD"), then gradually fill in more details as necessary to evaluate plausibility of the story.
To fill in the details more:
Assume that we're finding an algorithm to train an agent with a sufficiently large action space (i.e. we don't get safety via the agent having such a restricted action space that it can't do anything unsafe).
It seems like in some sense the game is in constraining the agent's cognition to be such that it is "safe" and "useful". The point of designing alignment algorithms is to impose such constraints, without requiring so much effort as to make the resulting agent useless / uncompetitive.
However, there are always going to be some plausible circumstances that we didn't consider (even if we're talking about amplified humans, which are still bounded agents). Even if we had maximal ability to place constraints on agent cognition, whatever constraints we do place won't have been tested in these unconsidered plausible circumstances. It is always possible that one misfires in a way that makes the agent do something unsafe.
(This wouldn't be true if we had some sort of proof against misfiring that doesn't assume anything about what circumstances the agent experiences, but that seems ~impossible to get. I'm pretty sure you agree with that.)
More generally, this story is going to be something like:
Obviously, I can't usually instantiate M, X, A, C, and H such that the story works for an amplified human (since they can presumably think of anything I can think of). And I'm not arguing that any of this is probable. However, it seems to meet your bar of "plausible":
EDIT: Or maybe more accurately, I'm not sure how exactly the stories you tell are different / more concrete than the ones above.
When I say you have "a better defined sense of what does and doesn't count as a valid step 2", I mean that there's something in your head that disallows the story I wrote above, but allows the stories that you generally use, and I don't know what that something is; and that's why I would have a hard time applying your methodology myself.
Possible analogy / intuition pump for the general story I gave above: Human cognition is only competent in particular domains and must be relearned in new domains (like protein folding) or new circumstances (like when COVID-19 hits), and sometimes human cognition isn't up to the task (like being teleported to a universe with different physics and immediately dying), or doesn't do so in a way that agrees with other humans (like how some humans would push a button that automatically wireheads everyone for all time, while others would find that abhorrent).
That's basically where I start, but then I want to try to tell some story about why it kills you, i.e. what is it about the heuristic H and circumstance C that causes it to kill you?
I agree this involves discretion. Indeed, moving beyond the trivial story "The algorithm fails and then it turns out you die" requires discretion, since those stories are certainly plausible. The other extreme would be to require us to keep making the story more and more concrete until we had fully specified the model, which also seems intractable. So instead I'm doing some in-between thing, which is roughly like: I'm allowed to push on the story to make it more concrete along any axis, but I recognize that I won't have time to pin down every axis, so I'm basically only going to do this a bounded number of times before I have to admit that it seems plausible enough. (So I can't fill in a billion parameters of my model one by one this way; what's worse, filling in those parameters would take even more than a billion units of time, so this may become intractable even before you get to a billion.)
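As a rough illustration of the bounded procedure described above, here is a toy sketch in Python. It is not a mechanical version of the methodology: the names `Story`, `evaluate_story`, `refine`, `is_contradicted`, and `max_pushes` are all hypothetical, and the real judgment calls (which axis to push on, whether a detail contradicts what we know) are hidden inside the two callbacks.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Story:
    """A failure story: the details filled in so far (hypothetical type)."""
    details: List[str] = field(default_factory=list)


def evaluate_story(
    story: Story,
    refine: Callable[[Story], Optional[str]],
    is_contradicted: Callable[[Story], bool],
    max_pushes: int,
) -> str:
    """Push on a story a bounded number of times, then stop.

    Returns "rejected" if some refinement exposes a contradiction, and
    "plausible enough" once the budget is spent or no axis is left.
    """
    for _ in range(max_pushes):
        detail = refine(story)      # pick any axis and make it more concrete
        if detail is None:          # nothing left to push on
            break
        story.details.append(detail)
        if is_contradicted(story):  # the story fell apart under scrutiny
            return "rejected"
    return "plausible enough"


if __name__ == "__main__":
    # Assumed example axes, loosely following the discussion above.
    axes = iter([
        "what kind of model was learned",
        "why SGD finds that model",
        "what happens off distribution",
    ])
    story = Story(["the algorithm fails and then you die"])
    verdict = evaluate_story(
        story,
        refine=lambda s: next(axes, None),
        is_contradicted=lambda s: False,  # stand-in for human judgment
        max_pushes=2,
    )
    print(verdict, story.details)
```

The point of the explicit `max_pushes` budget is that the loop stops well before the story is fully specified, so "plausible enough" is a statement about limited scrutiny, not a proof.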
Yeah, I think I feel like that's the part where I don't think I could replicate your intuitions (yet).
I don't think we disagree; I'm just noting that this methodology requires a fair amount of intuition / discretion, and I don't feel like I could do this myself. This is much more a statement about what I can do, rather than a statement about how good the methodology is on some absolute scale.
(Probably I could have been clearer about this in the original opinion.)
Interesting... On first reading your post, I felt that your methodological approach for dealing with the 'all is doomed in the worst case' problem is essentially the same as my approach. But on re-reading, I am not so sure anymore. So I'll try to explore the possible differences in methodological outlook, and will end with a question.
The key to your methodology is that you list possible process steps which one might take when one feels like
The specific doom-removing process step that I want to focus on is this one:
My feeling is that the AGI safety/alignment community is way too reluctant to take this process step of 'add another assumption about the world' in order to eliminate a worst case failure story.
There seem to be several underlying causes for this reluctance. One of them is that in the field of developing machine learning algorithms, in the narrow sense where machine learning equals function approximation, the default stance is to make no assumptions about the function that has to be approximated. But the main function to be approximated in the case of an ML agent is the function that determines the behavior of the agent environment. So the default methodological stance in ML is that we can introduce no assumptions whatsoever about the agent environment; we can't, for example, assume that it contains a powerful oversight body that will help to keep the agent aligned. Obviously this stance is not very helpful if you want to make progress on certain alignment problems.
So I'm happy to see a post that encourages people to make explicit assumptions about the agent's environment. I have definitely used this technique to make progress in my own work.
When I look at your example of 'the strategy stealing assumption' as one useful assumption to add, it is very much not the default example that would first come to my mind. So I am wondering if you would even recommend the approach of adding the kind of default assumptions that I tend to add.
To make this more specific, in this post I introduce an agent design with three safety interlocks that are supposed to be helpful to agent oversight. The interlocks are agent design refinements that make it easier for oversight to keep control over the agent. The interlocks contribute to more successful oversight not by making the oversight people+machines smarter (which is your main line of research, I believe), but by making the agent less smart in very specific ways.
But at the same time, these interlocks do not remove all possible worst-case failure stories of doom. To quote from the post and the underlying paper:
The key here is the 'highly unlikely'. If we have an algorithm where
then I typically add the following assumption to avoid doom:
In terms of methodology, I usually describe the above move as one where we seek to drive down the risk of certain failure modes to residual levels. There is a link to empirical work here. To make it more plausible that the above assumption about low risk is valid for a particular ML system and physical realization of an agent and its environment, we can do simulations and real-life experiments with trained generative models.
So my question is: would the above assumption-adding step, about the low risk of mis-predictions, be a natural and valid assumption-adding process step for 'throwing out failure stories' in your methodology?
Or is the existence of this assumption automatically implied by default in your process?
I feel confused about the failure story from example 3. (First 3 bullet-points in that section.)
It sounded like: We ask for a human-comprehensible way to predict X; the computer uses a very low-level simulation plus a small bridge that predicts only and exactly X; humans can't use the model to predict any high-level facts besides X.
But I don't see how that leads to egregious misalignment. Shouldn't the humans be able to notice their inability to predict high-level things they care about and send the AI back to its model-search phase? (As opposed to proceeding to evaluate policies based on this model and being tricked into a policy that fails "off-screen" somewhere.)