# All of paulfchristiano's Comments + Replies

My research methodology

As a result, the model learned heuristic H, that works in all the circumstances you did consider, but fails in circumstance C.

That's basically where I start, but then I want to try to tell some story about why it kills you, i.e. what is it about the heuristic H and circumstance C that causes it to kill you?

I agree this involves discretion, and indeed moving beyond the trivial story "The algorithm fails and then it turns out you die" requires discretion, since those stories are certainly plausible. The other extreme would be to require us to keep making the...

2 rohinmshah 3d: Yeah, I think I feel like that's the part where I don't think I could replicate your intuitions (yet). I don't think we disagree; I'm just noting that this methodology requires a fair amount of intuition / discretion, and I don't feel like I could do this myself. This is much more a statement about what I can do, rather than a statement about how good the methodology is on some absolute scale. (Probably I could have been clearer about this in the original opinion.)
Another (outer) alignment failure story

I'd say that every single machine in the story is misaligned, so hopefully that makes it easy :)

I'm basically always talking about intent alignment, as described in this post.

(I called the story an "outer" misalignment story because it focuses on the---somewhat improbable---case in which the intentions of the machines are all natural generalizations of their training objectives. I don't have a precise definition of inner or outer alignment and think they are even less well defined than intent alignment in general, but sometimes the meaning seems unambiguou...

My research methodology

In my other response to your comment I wrote:

I would also expect that e.g. if you were to describe almost any existing practical system with purported provable security, it would be straightforward for a layperson with theoretical background (e.g. me) to describe possible attacks that are not precluded by the security proof, and that it wouldn't even take that long.

I guess SSH itself would be an interesting test of this, e.g. comparing the theoretical model of this paper to a modern implementation. What is your view about that comparison? e.g. how do you t...

My research methodology

Why did you write "This post [Inaccessible Information] doesn't reflect me becoming more pessimistic about iterated amplification or alignment overall." just one month before publishing "Learning the prior"? (Is it because you were classifying "learning the prior" / imitative generalization under "iterated amplification" and now you consider it a different algorithm?)

I think that post is basically talking about the same kinds of hard cases as in Towards Formalizing Universality 1.5 years earlier (in section IV), so it's intended to be more about clarificat...

My research methodology

From my perspective, there is a core reason for worry, which is something like "you can't fully control what patterns of thought your algorithm learns, and how they'll behave in new circumstances", and it feels like you could always apply that as your step 2.

That doesn't seem like it has quite the type signature I'm looking for. I'm imagining a story as a description of how something bad happens, so I want the story to end with "and then something bad happens."

In some sense you could start from the trivial story "Your algorithm didn't work and then something ...

2 rohinmshah 3d: To fill in the details more:

Assume that we're finding an algorithm to train an agent with a sufficiently large action space (i.e. we don't get safety via the agent having such a restricted action space that it can't do anything unsafe).

It seems like in some sense the game is in constraining the agent's cognition to be such that it is "safe" and "useful". The point of designing alignment algorithms is to impose such constraints, without requiring so much effort as to make the resulting agent useless / uncompetitive.

However, there are always going to be some plausible circumstances that we didn't consider (even if we're talking about amplified humans, which are still bounded agents). Even if we had maximal ability to place constraints on agent cognition, whatever constraints we do place won't have been tested in these unconsidered plausible circumstances. It is always possible that one misfires in a way that makes the agent do something unsafe. (This wouldn't be true if we had some sort of proof against misfiring, that doesn't assume anything about what circumstances the agent experiences, but that seems ~impossible to get. I'm pretty sure you agree with that.)

More generally, this story is going to be something like:

1. Suppose you trained your model M to do X using algorithm A.
2. Unfortunately, when designing algorithm A / constraining M with A, you (or amplified-you) failed to consider circumstance C as a possible situation that might happen.
3. As a result, the model learned heuristic H, that works in all the circumstances you did consider, but fails in circumstance C.
4. Circumstance C then happens in the real world, leading to an actual failure.

Obviously, I can't usually instantiate M, X, A, C, and H such that the story works for an amplified human (since they can presumably think of anything I can think of). And I'm not arguing that any of this is probable. However, it seems to meet your bar of "plausible":

EDIT: Or maybe more accur
Another (outer) alignment failure story

I think the upshot of those technologies (and similarly for ML assistants) is:

1. It takes longer before you actually face a catastrophe.
2. In that time, you can make faster progress towards an "out".

By an "out" I mean something like: (i) figuring out how to build competitive aligned optimizers, (ii) coordinating to avoid deploying unaligned AI.

Unfortunately I think [1] is a bit less impactful than it initially seems, at least if we live in a world of accelerating growth towards a singularity. For example, if the singularity is in 2045 and it's 2035, and you were ...

Another (outer) alignment failure story

I don't think, from the perspective of humans monitoring a single ML system running a concrete, quantifiable process - industry or mining or machine design - that it will be unexplainable. Just like today, tech stacks are already enormously complex, but at each layer someone does know how they work, and we know what they do at the layers that matter.

This seems like the key question.

Ever more complex designs for, say, a mining robot might start to resemble more and more some mix of living creatures and artwork out of a fractal, but we'll sti

Misalignment and misuse: whose values are manifest?

It seems like if Bob deploys an aligned AI, then it will ultimately yield control of all of its resources to Bob. It doesn't seem to me like this would result in a worthless future even if every single human deploys such an AI.

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

The attractor I'm pointing at with the Production Web is that entities with no plan for what to do with resources---other than "acquire more resources"---have a tendency to win out competitively over entities with non-instrumental terminal values like "humans having good relationships with their children".

Quantitatively I think that entities without instrumental resources win very, very slowly. For example, if the average savings rate is 99% and my personal savings rate is only 95%, then by the time that the economy grows 10,000x my share of the world...
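
One way to make the arithmetic in this example concrete is the sketch below. It rests on a simple model that is an assumption here, not something stated in the comment: all capital earns the same return, and an entity's wealth compounds at a rate proportional to its savings rate. Under that assumption, if the economy grows by a factor G, wealth saved at rate s grows by G^(s / s_avg). The function name `relative_share` is hypothetical.

```python
def relative_share(growth_factor, my_savings, avg_savings):
    """Ratio of my final share of world wealth to my initial share.

    Assumed model (illustrative only): wealth compounds at a rate
    proportional to the savings rate, so when the economy has grown by
    `growth_factor`, my wealth has grown by
    growth_factor ** (my_savings / avg_savings).
    """
    my_growth = growth_factor ** (my_savings / avg_savings)
    return my_growth / growth_factor


# Numbers from the comment: 99% average savings, 95% personal savings,
# economy grows 10,000x.
share = relative_share(10_000, my_savings=0.95, avg_savings=0.99)
print(f"final share relative to initial share: {share:.3f}")
```

Other models of saving and returns would give different numbers; the point of the sketch is only that a small gap in savings rates erodes relative share slowly compared with the overall growth.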

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Yes, you understand me here. I'm not (yet?) in the camp that we humans have "mostly" lost sight of our basic goals, but I do feel we are on a slippery slope in that regard. Certainly many people feel "used" by employers/institutions in ways that are disconnected from their values. People with more job options feel less this way, because they choose jobs that don't feel like that, but I think we are a minority in having that choice.

I think this is an indication of the system serving some people (e.g. capitalists, managers, high-skilled l...

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

If trillion-dollar tech companies stop trying to make their systems do what they want, I will update that marginal deep-thinking researchers should allocate themselves to making alignment (the scalar!) cheaper/easier/better instead of making bargaining/cooperation/mutual-governance cheaper/easier/better. I just don't see that happening given the structure of today's global economy and tech industry.

In your story, trillion-dollar tech companies are trying to make their systems do what they want and failing. My best understanding of your position is: "...

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

It seems to me you are using the word "alignment" as a boolean, whereas I'm using it to refer to either a scalar ("how aligned is the system?") or a process ("the system has been aligned, i.e., has undergone a process of increasing its alignment"). I prefer the scalar/process usage, because it seems to me that people who do alignment research (including yourself) are going to produce ways of increasing the "alignment scalar", rather than ways of guaranteeing the "perfect alignment" boolean. (I sometimes use "misaligned" as a boolean due to it

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Overall, I think I agree with some of the most important high-level claims of the post:

• The world would be better if people could more often reach mutually beneficial deals. We would be more likely to handle challenges that arise, including those that threaten extinction (and including challenges posed by AI, alignment and otherwise). It makes sense to talk about "coordination ability" as a critical causal factor in almost any story about x-risk.
• The development and deployment of AI may provide opportunities for cooperation to become either easier or harder
What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

How are you inferring this? From the fact that a negative outcome eventually obtained? Or from particular misaligned decisions each system made?

I also thought the story strongly suggested single-single misalignment, though it doesn't get into many of the concrete decisions made by any of the systems so it's hard to say whether particular decisions are in fact misaligned.

The objective of each company in the production web could loosely be described as "maximizing production" within its industry sector.

Why does any company have this goal, or eve...

> The objective of each company in the production web could loosely be described as "maximizing production" within its industry sector.

Why does any company have this goal, or even roughly this goal, if they are aligned with their shareholders?

It seems to me you are using the word "alignment" as a boolean, whereas I'm using it to refer to either a scalar ("how aligned is the system?") or a process ("the system has been aligned, i.e., has undergone a process of increasing its alignment"). I prefer the scalar/process usage, because it seems to me t...

Formal Solution to the Inner Alignment Problem

I broadly think of this approach as "try to write down the 'right' universal prior." I don't think the bridge rules / importance-weighting consideration is the only way in which our universal prior is predictably bad. There are also issues like anthropic update and philosophical considerations about what kind of "programming language" to use and so on.

I'm kind of scared of this approach because I feel unless you really nail everything there is going to be a gap that an attacker can exploit. I guess you just need to get close enough that  is man...

2 Vanessa Kosoy 5d: I think that not every gap is exploitable. For most types of biases in the prior, it would only promote simulation hypotheses with baseline universes conformant to this bias, and attackers who evolved in such universes will also tend to share this bias, so they will target universes conformant to this bias and that would make them less competitive with the true hypothesis. In other words, most types of bias affect both ϵ and δ in a similar way.

More generally, I guess I'm more optimistic than you about solving all such philosophical liabilities.

I don't understand the proposal. Is there a link I should read?

So, you can let your physics be a dovetailing of all possible programs, and delegate to the bridge rule the task of filtering the outputs of only one program. But the bridge rule is not "free complexity" because it's not coming from a simplicity prior at all. For a program of length n, you need a particular DFA of size Ω(n). However, the actual DFA is of expected size m with m ≫ n. The probability of having the DFA you need embedded in that is something like $\frac{m!}{(m-n)!} m^{-2n} \approx m^{-n} \ll 2^{-n}$. So moving everything to the bridge makes a much less likely hypothesis.
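
The approximation at the end of this reply can be checked numerically. The sketch below works in log space to avoid underflow and uses illustrative values m = 10^6 and n = 100, which are assumptions chosen only to satisfy m ≫ n. The function name `log_embed_probability` is hypothetical.

```python
import math


def log_embed_probability(m, n):
    """Natural log of m!/(m-n)! * m**(-2n), computed term by term.

    m!/(m-n)! is the falling factorial m * (m-1) * ... * (m-n+1).
    """
    log_falling_factorial = sum(math.log(m - i) for i in range(n))
    return log_falling_factorial - 2 * n * math.log(m)


m, n = 10**6, 100  # illustrative values with m >> n (an assumption)

log_p = log_embed_probability(m, n)
log_m_pow = -n * math.log(m)    # log of m**(-n)
log_two_pow = -n * math.log(2)  # log of 2**(-n)

print(f"log probability : {log_p:.4f}")
print(f"log m^(-n)      : {log_m_pow:.4f}")
print(f"log 2^(-n)      : {log_two_pow:.4f}")
```

For these values the first two quantities nearly coincide and both sit far below the third, which matches the chain of relations in the formula.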
My research methodology
• I still feel fine about what I said, but that's two people finding it confusing (and thinking it is misleading) so I just changed it to something that is somewhat less contentful but hopefully clearer and less misleading.
• Clarifying what I mean by way of analogy: suppose I'm worried about unzipping a malicious file causing my computer to start logging all my keystrokes and sending them to a remote server. I'd say that seems like a strange and extreme failure mode that you should be able to robustly avoid if we write our code right, regardless of how the log
My research methodology

High level point especially for folks with less context: I stopped doing theory for a while because I wanted to help get applied work going, and now I'm finally going back to doing theory for a variety of reasons; my story is definitely not that I'm transitioning back from applied work to theory because I now believe the algorithms aren't ready.

I think my main question is, how do you tell when a failure story is sufficiently compelling that you should switch back into algorithm-finding mode?

I feel like a story is basically plausible until proven implausibl...

4 rohinmshah 17d: Cool, that makes sense, thanks!
My research methodology

I don't really think of 3 and 4 as very different, there's definitely a spectrum regarding "plausible" and I think we don't need to draw the line firmly---it's OK if over time your "most plausible" failure mode becomes increasingly implausible and the goal is just to make it obviously completely implausible. I think 5 is a further step (doesn't seem like a different methodology, but a qualitatively further-off stopping point, and the further off you go the more I expect this kind of theoretical research to get replaced by empirical research). I think of it...

2 Daniel Kokotajlo 18d: Thanks!
My research methodology

OK. I found the analogy to insecure software helpful. Followup question: Do you feel the same way about "thinking about politics" or "breaking laws" etc.? Or do you think that those sorts of AI behaviors are less extreme, less strange failure modes?

I don't really understand how thinking about politics is a failure mode. For breaking laws it depends a lot on the nature of the law-breaking---law-breaking generically seems like a hard failure mode to avoid, but there are kinds of grossly negligent law-breaking that do seem similarly perverse/strange/avoidable...

Demand offsetting

If we expected increased outreach and proselytization from vegetarians to uniformly make further outreach harder, would we expect to see the rapid and exponential growth of vegetarianism (as it seems to be)?

Is this true? e.g. Gallup shows the fraction of US vegetarians at 6% in 2000 and 5% in 2020 (link), so if there is exponential growth it seems like either their numbers are wrong or the growth is very slow.

The primary argument for convincing someone to not eat meat is that the long term costs outweigh the short term benefits, so I'm not sure that you can c

1 sxae 18d: Well, the nature of exponential growth includes a long tail, but yes, it does appear that over the past few decades there has been substantial growth in many areas [https://www.bbc.co.uk/news/business-44488051], with the UK reporting 150,000 vegans in 2006 compared to 600,000 vegans in 2018. Additionally, the vegan food industry: "$14.2 billion in 2018 and is expected to reach $31.4 billion by 2026, registering a CAGR of 10.5% from 2019 to 2026." [https://www.alliedmarketresearch.com/vegan-food-market] That's a really high growth rate. I doubt that there is any other sector of the food industry expanding as rapidly as that, though I can't say for sure.

Culture is a thing, and the decisions that you express shape the social valuations of the people around you. A single person going against a carnivorous tide will indeed change nothing, but a single person choosing to engage in a wider, growing movement can have substantial knock-on effects.

I think you may be underestimating the impact of modern animal agriculture here. I would say that a timeline that drastically reduces its meat intake would be measurably better environmentally, primarily because it would drastically reduce the land requirements of feeding the world, which would in turn mean we could rewild large parts of it for a lot cheaper. No drastic change means that the freefall collapse of the biosphere continues unabated, whereas change could plausibly improve the situation like I describe.
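
The growth figures quoted in this reply can be turned into implied annual rates with the standard compound annual growth rate formula, (end / start)^(1 / years) - 1. The sketch below is only a rough check: the helper name `cagr` is hypothetical, and treating the market figures as spanning the 8 years from 2018 to 2026 is an assumption, since the quoted 10.5% is stated for 2019 to 2026.

```python
def cagr(start, end, years):
    """Compound annual growth rate implied by growing from start to end."""
    return (end / start) ** (1.0 / years) - 1.0


# UK vegans: 150,000 in 2006 to 600,000 in 2018 (figures from the reply).
uk_vegans = cagr(150_000, 600_000, 2018 - 2006)

# Vegan food market: $14.2 billion in 2018 to $31.4 billion in 2026.
# The 8-year span is an assumption made for this check.
market = cagr(14.2, 31.4, 2026 - 2018)

print(f"UK vegans, implied annual growth: {uk_vegans:.1%}")
print(f"Vegan food market, implied annual growth: {market:.1%}")
```

Both figures describe vegans and vegan food sales in particular, so they do not directly contradict the roughly flat Gallup numbers for US vegetarians cited in the parent comment.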
Demand offsetting

At a minimum they also impose harms on the people who you convinced not to eat meat (since you are assuming that eating meat was a benefit to you that you wanted to pay for). And of course they make further vegetarian outreach harder. And in most cases they also won't be such a precise offset, e.g. it will apply to different animal products or at different times or with unclear probability.

That said, I agree that I can offset "me eating an egg" by paying Alice enough that she's willing to skip eating an egg, and in some sense that's an even purer offset than the one in this post.

3 sxae 18d:
• The primary argument for convincing someone to not eat meat is that the long term costs outweigh the short term benefits, so I'm not sure that you can categorically state that convincing someone to stop eating meat is causing them harm. Sure, they don't get to eat a steak, but the odds of their grandchildren not dying from catastrophic climate collapse go up.
• If we expected increased outreach and proselytization from vegetarians to uniformly make further outreach harder, would we expect to see the rapid and exponential growth of vegetarianism (as it seems to be)?
My research methodology

The first seems misleading: what we need is a universal quantification over plausible stories, which I would guess requires understanding the behavior.

You get to iterate fast until you find an algorithm where it's hard to think of failure stories. And you get to work on toy cases until you find an algorithm that actually works in all the toy cases. I think we're a long way from meeting those bars, so that we'll get to iterate fast for a while. After we meet those bars, it's an open question how close we'd be to something that actually works. My suspicion i...

My research methodology

I think I'm responding to a more basic intuition, that if I wrote some code and it's now searching over ingenious ways to kill me, then something has gone extremely wrong in a way that feels preventable. It may be the default in some sense, just as wildly insecure software (which would lead to my computer doing the same thing under certain conditions) is the default in some sense, but in both cases I have the intuition that the failure comes from having made an avoidable mistake in designing the software.

In some sense changing this view would change my bott...

Against evolution as an analogy for how humans will create AGI

## Outside view #1: How biomimetics has always worked

It seems like ML is different from other domains in that it already relies on incredibly massive automated search, with massive changes in the quality of our inner algorithms despite very little change in our outer algorithms. None of the other domains have this property. So it wouldn't be too surprising if the only domain in which all the early successes have this property is also the only domain in which the later successes have this property.

## Outside view #2: How learning algorithms have always been devel

My research methodology

Yeah, thanks for catching that.

Demand offsetting

Carl Shulman wrote a related post here.

Demand offsetting

Commenters pointed out two examples of this that are already done in practice:

• Luis Costigan says that this is done with cage-free credits in Asia, and links to 00:17:19 in this podcast.
• Florian H says this is how sustainable energy credits work in the EU in this comment.
Demand offsetting

If we lived in a different world then e.g. restaurants could still repackage them at the last mile, selling humane egg credits along with their omelette. But in practice this probably wouldn't check the same box for most consumers.

5 justin 21d: Or, imagine if this were a service available to restaurants such that they could have an option on menu items: +1 for ethically sourced eggs. Now, the service is transparent for them (maybe integrate with a payment provider willing to facilitate the network for free marketing) and they don't have to deal with buying two sets of eggs or taking supply risks. Hm, starting to think there's a version of this that's viable.

Demand offsetting

In retrospect I think I should have called this post "Demand offsetting" to highlight the fact that you are offsetting the demand for eggs that you create (and hence hopefully causing no/minimal harm) rather than causing some harm and then offsetting that harm (the more typical situation, which is not obviously morally acceptable once you are in the kind of non-consequentialist framework that cares a lot about offsetting per se).

6 Ben Pace 21d: I think it is not too late to change the name.

Formal Solution to the Inner Alignment Problem

I think that there are roughly two possibilities: either the laws of our universe happen to be strongly compressible when packed into a malign simulation hypothesis, or they don't. In the latter case, shouldn't be large. In the former case, it means that we are overwhelmingly likely to actually be inside a malign simulation.

It seems like the simplest algorithm that makes good predictions and runs on your computer is going to involve e.g. reasoning about what aspects of reality are important to making good predictions and then attending to those....

3 Vanessa Kosoy 21d: I think it works differently. What you should get is an infra-Bayesian hypothesis which models only those parts of reality that can be modeled within the given computing resources. More generally, if you don't endorse the predictions of the prediction algorithm then either you are wrong or you should use a different prediction algorithm.
How can the laws of physics be extra-compressible within the context of a simulation hypothesis? More compression means more explanatory power. I think that it must look something like, we can use the simulation hypothesis to predict the values of some of the physical constants. But, it would require a very unlikely coincidence for physical constants to have such values unless we are actually in a simulation.

I agree that we won't have a perfect match but I think we can get a "good enough" match (similarly to how any two UTMs that are not too crazy give similar Solomonoff measures.)

I think that infra-Bayesianism solves a lot of philosophical confusions, including anthropics and logical uncertainty, although some of the details still need to be worked out. (But, I'm not sure what specifically you mean by "logical facts they observe during evolution"?) Ofc this doesn't mean I am already able to fully specify the correct infra-prior: I think that would take us most of the way to AGI.

I have all sorts of ideas, but still nowhere near the solution ofc. We can do deep learning while randomizing initial conditions and/or adding some noise to gradient descent (e.g. simulated annealing), producing a population of networks that progresses in an evolutionary way. We can, for each prediction, train a model that produces the opposite prediction and compare it to the default model in terms of convergence time and/or weight magnitudes. We can search for the algorithm using meta-learning. We can do variational Bayes with a "multi-modal" model space: mixtures of some "base" type of model. We can do progressive refinement of infra-Bayesian

Formal Solution to the Inner Alignment Problem

I agree that this settles the query complexity question for Bayesian predictors and deterministic humans. I expect it can be generalized to have complexity in the case with stochastic humans where treacherous behavior can take the form of small stochastic shifts.
I think that the big open problems for this kind of approach to inner alignment are:

• Is bounded? I assign significant probability to it being or more, as mentioned in the other thread between me and Michael Cohen, in which case we'd have trouble. (I belie ...

4 Vanessa Kosoy 9d: Yes, you're right. A malign simulation hypothesis can be a very powerful explanation to the AI for why it found itself at a point suitable for this attack, thereby compressing the "bridge rules" by a lot. I believe you argued as much in your previous writing, but I managed to confuse myself about this.

Here's the sketch of a proposal how to solve this. Let's construct our prior to be the convolution of a simplicity prior with a computational easiness prior. As an illustration, we can imagine a prior that's sampled as follows:

• First, sample a hypothesis h from the Solomonoff prior
• Second, choose a number n according to some simple distribution with high expected value (e.g. $n^{-1-\alpha}$) with $\alpha \ll 1$
• Third, sample a DFA A with n states and a uniformly random transition table
• Fourth, apply A to the output of h

We think of the simplicity prior as choosing "physics" (which we expect to have low description complexity but possibly high computational complexity) and the easiness prior as choosing "bridge rules" (which we expect to have low computational complexity but possibly high description complexity).

Ofc this convolution can be regarded as another sort of simplicity prior, so it differs from the original simplicity prior merely by a factor of O(1), however the source of our trouble is also "merely" a factor of O(1). Now the simulation hypothesis no longer has an advantage via the bridge rules, since the bridge rules have a large constant budget allocated to them anyway.
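
The four sampling steps in the proposal above can be sketched as a toy program. Several pieces are assumptions made only so the sketch runs: the Solomonoff prior is uncomputable, so `toy_hypothesis` is a stand-in that emits a short periodic bit string; the distribution over n is truncated at a hypothetical `n_max`; and each DFA state is given a random output bit so that "applying A" yields an output sequence, which the reply does not specify. All function names are hypothetical.

```python
import random


def toy_hypothesis(length, rng):
    """Stand-in for a Solomonoff sample (assumption): a periodic bit string."""
    period = rng.randint(1, 4)
    pattern = [rng.randrange(2) for _ in range(period)]
    return [pattern[i % period] for i in range(length)]


def sample_state_count(alpha, n_max, rng):
    """Draw n in 1..n_max with probability proportional to n**(-1 - alpha)."""
    support = list(range(1, n_max + 1))
    weights = [n ** (-1.0 - alpha) for n in support]
    return rng.choices(support, weights=weights, k=1)[0]


def sample_dfa(n, rng):
    """Uniformly random transition table over a binary alphabet.

    The per-state output bit is an assumption added for this sketch.
    """
    transitions = [[rng.randrange(n), rng.randrange(n)] for _ in range(n)]
    outputs = [rng.randrange(2) for _ in range(n)]
    return transitions, outputs


def apply_dfa(dfa, bits):
    """Run the automaton from state 0, emitting one output bit per input bit."""
    transitions, outputs = dfa
    state = 0
    result = []
    for bit in bits:
        state = transitions[state][bit]
        result.append(outputs[state])
    return result


def sample_observations(alpha=0.01, n_max=1000, length=32, seed=0):
    rng = random.Random(seed)
    physics = toy_hypothesis(length, rng)      # step 1: "physics"
    n = sample_state_count(alpha, n_max, rng)  # step 2: DFA size
    bridge = sample_dfa(n, rng)                # step 3: "bridge rules"
    return apply_dfa(bridge, physics)          # step 4: observed sequence


print(sample_observations())
```

The split mirrors the framing in the reply: the hypothesis plays the role of "physics" and the randomly sampled automaton plays the role of the "bridge rules".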
I think it should be possible to make this into some kind of theorem (two agents with this prior in the same universe that have access to roughly the same information should have similar posteriors, in the α→0 limit).

2 Vanessa Kosoy 22d: I think that there are roughly two possibilities: either the laws of our universe happen to be strongly compressible when packed into a malign simulation hypothesis, or they don't. In the latter case, ϵδ shouldn't be large. In the former case, it means that we are overwhelmingly likely to actually be inside a malign simulation. But, then AI risk is the least of our troubles. (In particular, because the simulation will probably be turned off once the attack-relevant part is over.) [EDIT: I was wrong, see this [https://www.alignmentforum.org/posts/CnruhwFGQBThvgJiX/formal-solution-to-the-inner-alignment-problem?commentId=pNWxPL6YWQ83G9fsg].]

Probably efficient algorithms are not running literally all hypotheses, but they can probably consider multiple plausible hypotheses. In particular, the malign hypothesis itself is an efficient algorithm and it is somehow aware of the two different hypotheses (itself and the universe it's attacking). Currently I can only speculate about neural networks, but I do hope we'll have competitive algorithms amenable to theoretical analysis, whether they are neural networks or not.

I think that the problem you describe in the linked post can be delegated to the AI. That is, instead of controlling trillions of robots via counterfactual oversight, we will start with just one AI project that will research how to organize the world. This project would top any solution we can come up with ourselves.

The case for aligning narrowly superhuman models

That's what I have in mind. If all goes well you can think of it like "a human thinking a long time." We don't know if all will go well.
It's also not really clear what "a human thinking 10,000 years" means. HCH is kind of an operationalization of that, but there's a presumption of alignment in the human-thinking-a-long-time that we don't get for free here. (Of course you also wouldn't get it for free if you somehow let a human live for 10,000 years...)

Formal Solution to the Inner Alignment Problem

So treacherous models won't be trying to avoid collisions in order to make queries be linear in p(treachery)/p(truth). If P(discovery) increases when multiple models are trying to be treacherous at the same time--which we could go on to discuss; it's not obvious to me either way as of yet--that will be balanced against the inherent variation in some timesteps being better for treachery than others.

If you are willing to query only a billion points, and there are a billion competent treacherous models, and if the game is zero-sum between you and the models, the...

1 michaelcohen 2mo: The treacherous models make their decisions first, and whether we query at a timestep depends on whether they chose that moment to be treacherous. We don't have to say: this timestep is one of the billion best for treachery--guess we had better query.

The decision process a treacherous model faces is:

1. From past observations, does it seem alpha is low enough that an honest model is in the top few?
2. If no, [let's just assume whatever happens here is bad]
3. If yes, is the honest model going to leave the top few any time?
4. If yes, wait. [let's assume this ends badly]
5. If no, admit defeat, because whenever you err, that will be when a query gets triggered. Salvage the situation by trying to get as much mileage as you can by decreasing probability on some action the demonstrator would take that you don't like much in order to shift probability mass to some other action they would take that you like a little better. OR
6. Hope that 1) other treacherous models, like soldiers on the front line storming the enemy and facing certain death so that their brothers behind them can win the battle, give erring a go anyway. And hope that 2) the operators quietly increase alpha after the billionth query. Acting on this hope, wait. (The front-line treacherous models are not entirely self-sacrificial; they might figure they might as well hope they were wrong about alpha. But the models know when queries happen, because every observation they get comes from a query, and it wouldn't take long to figure out that alpha really truly isn't high enough for them to execute treachery successfully).

Above in the thread, I was talking about the game theory of step 6 (although I maybe wasn't making the case as strongly as I could have). Which models would ever volunteer? They don't care about the long term goal of exhausting our query resources so other treacherous models can capitalize on a quietly increased alpha. All that

Formal Solution to the Inner Alignment Problem

The first treacherous model works by replacing the bad simplicity prior with a better prior, and then using the better prior to more quickly infer the true model. No reason for the same thing to happen a second time.

(Well, I guess the argument works if you push out to longer and longer sequence lengths---a treacherous model will beat the true model on sequence lengths a billion, and then for sequence lengths a trillion a different treacherous model will win, and for sequence lengths a quadrillion a still different treacherous model will win. Before even thinking about the fact that each particular treacherous model will in fact defect at some point and at that point drop out of the posterior.)

1 michaelcohen 2mo: Does it make sense to talk about $\check{\mu}_1$, which is like $\mu_1$ in being treacherous, but uses the true model $\mu_0$ instead of the honest model $\hat{\mu}_0$? I guess you would expect $\check{\mu}_1$ to have a lower posterior than $\mu_0$?
Formal Solution to the Inner Alignment Problem

I don't follow. Can't races to the bottom destroy all value for the agents involved?

You are saying that a special moment is a particularly great one to be treacherous. But if P(discovery) is 99.99% during that period, and there is any other treachery-possible period where P(discovery) is small, then that other period would have been better after all. Right?

This doesn't seem analogous to producers driving down profits to zero, because those firms had no other opportunity to make a profit with their machine. It's like you saying: there are tons of countries ... (read more)

1 michaelcohen 2mo

This is a bit of a sidebar: I'm curious what you make of the following argument.

1. When an infinite sequence is sampled from a true model $\mu_0$, there is likely to be another treacherous model $\mu_1$ which is likely to end up with greater posterior weight than an honest model $\hat{\mu}_0$, and greater still than the posterior on the true model $\mu_0$.
2. If the sequence were sampled from $\mu_1$ instead, the eventual posterior weight on $\mu_1$ will probably be at least as high.
3. When an infinite sequence is sampled from a true model $\mu_1$, there is likely to be another treacherous model $\mu_2$, which is likely to end up with greater posterior weight than an honest model $\hat{\mu}_1$, and greater still than the posterior on the true model $\mu_1$.
4. And so on.

4 michaelcohen 2mo

Okay, sure. It's not clear to me that there isn't meaningful overhead involved. I agree with what you're saying but I don't see how it contradicts what I was saying. First, what I had in mind when saying that some timesteps are better for treachery was that when the agent acts on a false prediction it has a greater effect on the world, though of course P(discovery) is also relevant. But my point is that when multiple treacherous models pick the same timestep to err, there may be pros and cons to doing this, but one thing that isn't on the cons list is that, in the long run, it makes our lives easier if they do.
Making our lives difficult is a public good for treacherous models. So treacherous models won't be trying to avoid collisions in order to make queries be linear in p(treachery)/p(truth). If P(discovery) increases when multiple models are trying to be treacherous at the same time--which we could go on to discuss; it's not obvious to me either way as of yet--that will be balanced against the inherent variation in some timesteps being better for treachery than others.

Formal Solution to the Inner Alignment Problem

I don't think incompetence is the only reason to try to pull off a treacherous turn at the same time that other models do. Some timesteps are just more important, so there's a trade off. And what's traded off is a public good: among treacherous models, it is a public good for the models' moments of treachery to be spread out.

Trying to defect at time T is only a good idea if it's plausible that your mechanism isn't going to notice the uncertainty at time T and then query the human. So it seems to me like this argument can never drive P(successful treachery)... (read more)

1 michaelcohen 2mo

I don't follow. Can't races to the bottom destroy all value for the agents involved? Consider the claim: firms will never set prices to equal marginal cost because that would destroy all profit.

Yeah, okay. This is my mainline opinion, and I just think about inner alignment in case it's wrong. So for now we can assume that it is wrong, since there's not much more to say if it's right.

That is my next guess, but I hadn't thought in terms of percentages before this. I had thought of normally distributed noise in the number of bits.

It's occurring to me that this question doesn't matter to our present discussion. What makes the linear regime linear rather than logarithmic is that if p(treachery)/p(honest model) is high, that allows for a large number of treacherous models to have greater posterior weight than the truth.
But if a single model has n times the posterior weight of the truth, it still only takes one query to the demonstrator to interrupt its treacherous turn, not n queries.

Some thoughts on risks from narrow, non-agentic AI

There are a bunch of things that differ between part I and part II; I believe they are correlated with each other, but not at all perfectly. In the post I'm intending to illustrate what I believe some plausible failures look like, in a way intended to capture a bunch of the probability space. I'm illustrating these kinds of bad generalizations and ways in which the resulting failures could be catastrophic. I don't really know what "making the claim" means, but I would say that any ways in which the story isn't realistic are interesting to me (and we've alre... (read more)

4 rohinmshah 2mo

Fwiw I think I didn't realize you weren't making claims about what post-singularity looked like, and that was part of my confusion about this post. Interpreting it as "what's happening until the singularity" makes more sense. (And I think I'm mostly fine with the claim that it isn't that important to think about what happens after the singularity.)

Formal Solution to the Inner Alignment Problem

I think this is doable with this approach, but I haven't proven it can be done, let alone said anything about a dependence on epsilon. The closest bound I show not only has a constant factor of like 40; it depends on the prior on the truth too. I think (75% confidence) this is a weakness of the proof technique, not a weakness of the algorithm.

I just meant the dependence on epsilon; it seems like there are unavoidable additional factors (especially the linear dependence on p(treachery)). I guess it's not obvious if you can make these additive or if they are... (read more)

1 michaelcohen 2mo

No matter how much data you have, my bound on the KL divergence won't approach zero.
Formal Solution to the Inner Alignment Problem

I understand that the practical bound is going to be logarithmic "for a while" but it seems like the theorem about runtime doesn't help as much if that's what we are (mostly) relying on, and there's some additional analysis we need to do. That seems worth formalizing if we want to have a theorem, since that's the step that we need to be correct.

There is at most a linear cost to this ratio, which I don't think screws us.

If our models are a trillion bits, then it doesn't seem that surprising to me if it takes 100 bits extra to specify an intended model... (read more)

1 michaelcohen 2mo

I agree. Also, it might be very hard to formalize this in a way that's not: write out in math what I said in English, and name it Assumption 2.

I don't think incompetence is the only reason to try to pull off a treacherous turn at the same time that other models do. Some timesteps are just more important, so there's a trade off. And what's traded off is a public good: among treacherous models, it is a public good for the models' moments of treachery to be spread out. Spreading them out exhausts whatever data source is being called upon in moments of uncertainty. But an individual model never reaps the benefit of helping to exhaust our resources. Given that treacherous models don't have an incentive to be inconvenient to us in this way, I don't think a failure to do so qualifies as incompetence. This is also my response to

Yeah, I've been assuming all treacherous models will err once. (Which is different from the RL case, where they can err repeatedly on off-policy predictions).

I can't remember exactly where we left this in our previous discussions on this topic. My point estimate for the number of extra bits to specify an intended model relative to an effective treacherous model is negative. The treacherous model has to compute the truth, and then also decide when and how to execute treachery.
So the subroutine they run to compute the truth, considered as a model in its own right, seems to me like it must be simpler.

Formal Solution to the Inner Alignment Problem

I haven't read the paper yet, looking forward to it. Using something along these lines to run a sufficiently-faithful simulation of HCH seems like a plausible path to producing an aligned AI with a halting oracle. (I don't think that even solves the problem given a halting oracle, since HCH is probably not aligned, but I still think this would be noteworthy.)

First I'm curious to understand this main result so I know what to look for and how surprised to be. In particular, I have two questions about the quantitative behavior described here:

if an event would ... (read more)

Here's the sketch of a solution to the query complexity problem.

Simplifying technical assumptions:

• The action space is
• All hypotheses are deterministic
• Predictors output maximum likelihood predictions instead of sampling the posterior

I'm pretty sure removing those is mostly just a technical complication.

Safety assumptions:

• The real hypothesis has prior probability lower bounded by some known quantity , so we discard all hypotheses of probability less than from the onset.
• Malign hypotheses have total prior probability mass upper bounded by some ... (read more)

2 Vanessa Kosoy 2mo

Re 1: This is a good point. Some thoughts: [EDIT: See this [https://www.lesswrong.com/posts/CnruhwFGQBThvgJiX/formal-solution-to-the-inner-alignment-problem?commentId=GieL6GD9KupGhbEEo] ]

* We can add the assumption that the probability mass of malign hypotheses is small and that following any non-malign hypothesis at any given time is safe. Then we can probably get query complexity that scales with p(malign) / p(true)?
* However, this is cheating because a sequence of actions can be dangerous even if each individual action came from a (different) harmless hypothesis.
So, instead we want to assume something like: any dangerous strategy has high Kolmogorov complexity relative to the demonstrator. (More realistically, some kind of bounded Kolmogorov complexity.)
* A related problem is: in reality "user takes action a" and "AI takes action a" are not physically identical events, and there might be an attack vector that exploits this.
* If p(malign)/p(true) is really high that's a bigger problem, but then doesn't it mean we actually live in a malign simulation and everything is hopeless anyway?

Re 2: Why do you need such a strong bound?

3 michaelcohen 2mo

They're not quite symmetrical: midway through, some bad hypotheses will have been ruled out for making erroneous predictions about the demonstrator previously. But your conclusion is still correct. And it's certainly no better than that, and the bound I prove is worse.

Yeah. Error bounds on full Bayesian prediction are logarithmic in the prior weight on the truth, but error bounds on maximum a posteriori prediction are just inverse in the prior weight, and your example above is the one to show it. If each successive MAP model predicts wrong at each successive timestep, it could take N timesteps to get rid of N models, which is how many might begin with a prior weight exceeding the truth, if the truth has a prior weight of 1/N.

But this situation seems pretty preposterous to me in the real world. If the agent's first observation is, say, this paragraph, the number of models with prior weight greater than the truth that predicted something else as the first observation will probably be a number way, way different from one. I'd go so far as to say at least half of models with prior weight greater than the truth would predict a different observation than this very paragraph. As long as this situation keeps up, we're in a logarithmic regime.
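The contrast drawn here between the worst case for MAP prediction and the "at least half get ruled out" case can be made concrete with a toy count. This is only a sketch under assumptions that are not in the thread: all hypotheses are deterministic, each bad hypothesis errs at exactly one timestep, and the names `map_error_count` and `lowest_set_bit` are hypothetical helpers introduced for illustration.

```python
def map_error_count(err_times, horizon):
    """Count mistakes of a maximum a posteriori (MAP) predictor.

    Assumed toy setup: every hypothesis is deterministic, so one wrong
    prediction eliminates it. err_times[i] is the single timestep at which
    bad hypothesis i mispredicts. Bad hypotheses are listed in decreasing
    prior weight, all above the truth, which never errs.
    """
    alive = list(range(len(err_times)))  # surviving bad hypotheses, best prior first
    errors = 0
    for t in range(horizon):
        # The MAP hypothesis is the highest-prior survivor (the truth if none remain).
        if alive and err_times[alive[0]] == t:
            errors += 1
        # The observation at t rules out every hypothesis that mispredicted at t.
        alive = [i for i in alive if err_times[i] != t]
    return errors


def lowest_set_bit(x):
    """Index of the lowest set bit of a positive integer."""
    return (x & -x).bit_length() - 1


n = 1024
# Worst case: the k-th ranked bad hypothesis errs exactly at timestep k.
unique_times = list(range(n))
# Halving case: about half of the surviving bad hypotheses err at each timestep.
halving_times = [lowest_set_bit(i + 1) for i in range(n)]

worst_case_errors = map_error_count(unique_times, horizon=n)
halving_errors = map_error_count(halving_times, horizon=n)
print(worst_case_errors, halving_errors)
```

In this toy model the first count grows linearly with the number of bad hypotheses, while the second grows roughly like its base-2 logarithm, which is the sense in which the second situation stays in a logarithmic regime.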
I'm not convinced this logarithmic regime ever ends, but I think the case is more convincing that we at least start there, so let's suppose now that it eventually ends, and after this point the remaining models with posterior weight exceeding the truth are deliberately erring at unique timesteps. What's the posterior on the truth now? This is a phase where all the top models are "capable" of predicting correctly. This shouldn't look like $2^{-10^{14}}$ at all. It will look more like p(correct model)/p(treachery). And actually, depending on the composition of treacherous models, it could be better. To the extent some constant fraction of them are being treacherous at particular times, the logarithmic regime will continue. There are two reasons why

Some thoughts on risks from narrow, non-agentic AI

This doesn't seem right. We design type 1 feedback so that resulting agents perform well on our true goals. This only matches up with type 2 feedback insofar as type 2 feedback is closely related to our true goals.

But type 2 feedback is (by definition) our best attempt to estimate how well the model is doing what we really care about. So in practice any results-based selection for "does what we care about" goes via selecting based on type 2 feedback. The difference only comes up when we reason mechanically about the behavior of our agents and how they are ... (read more)

4 Richard_Ngo 2mo

I agree with the two questions you've identified as the core issues, although I'd slightly rephrase the former. It's hard to think about something being aligned indefinitely. But it seems like, if we have primarily used a given system for carrying out individual tasks, it would take quite a lot of misalignment for it to carry out a systematic plan to deceive us. So I'd rephrase the first option you mention as "feeling pretty confident that something that generalises from 1 week to 1 year won't become misaligned enough to cause disasters".
This point seems more important than the second point (the nature of “be helpful” and how that’s a natural motivation for a mind), but I'll discuss both.

I think the main disagreement about the former is over the relative strength of "results-based selection" versus "intentional design". When I said above that "we design type 1 feedback so that resulting agents perform well on our true goals", I was primarily talking about "design" as us reasoning about our agents, and the training process they undergo, not the process of running them for a long time and picking the ones that do best. The latter is a very weak force! Almost all of the optimisation done by humans comes from intentional design plus rapid trial and error (on the timeframe of days or weeks). Very little of the optimisation comes from long-term trial and error (on the timeframe of a year) - by necessity, because it's just so slow.

So, conditional on our agents generalising from "one week" to "one year", we should expect that it's because we somehow designed a training procedure that produces scalable alignment (or at least scalable non-misalignment), or because they're deceptively aligned (as in your influence-seeking agents scenario), but not because long-term trial and error was responsible for steering us towards getting what we can measure.

Then there's the second question, of whether "do things that look to a human like you're achieving X" is a plausible genera

Some thoughts on risks from narrow, non-agentic AI

I think that by default we will search for ways to build systems that do well on type 2 feedback. We do likely have a large dataset of type-2-bad behaviors from the real world, across many applications, and can make related data in simulation.
It also seems quite plausible that this is a very tiny delta, if we are dealing with models that have already learned everything they would need to know about the world and this is just a matter of selecting a motivation, so that you can potentially get good type 2 behavior using a very small amount of data. Relatedl... (read more)

Extracting Money from Causal Decision Theorists

I like the following example:

• Someone offers to play rock-paper-scissors with me.
• If I win I get $6. If I lose, I get $5.
• Unfortunately, I've learned from experience that this person beats me at rock-paper-scissors 40% of the time, and I only beat them 30% of the time, so in expectation I lose $0.20 by playing.
• My decision is set up as allowing 4 options: rock, paper, scissors, or "don't play."
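
The $0.20 figure can be checked with a few lines of arithmetic. This is a minimal sketch under two labeled assumptions: a loss costs $5 (the reading suggested in the reply below, rather than the literal "I get $5"), and the remaining 30% of games are ties that pay nothing. The function name `expected_payoff` is hypothetical.

```python
from fractions import Fraction


def expected_payoff(p_win, p_lose, win_amount, loss_amount):
    """Expected dollars per game; ties are assumed to pay nothing."""
    return p_win * win_amount - p_lose * loss_amount


# Assumption: losing costs $5, and the 30% of games that are ties pay $0.
play = expected_payoff(Fraction(3, 10), Fraction(4, 10), 6, 5)
dont_play = Fraction(0)

print(float(play), float(dont_play))
```

Exact fractions are used so that the comparison against zero is not affected by floating-point rounding.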

This seems like a nice relatable example to me---it's not uncommon for someone to offer to bet on a rock paper scissors game, or to offer slightly favorab... (read more)

5 Caspar42 2mo

> If I win I get $6. If I lose, I get $5.

I assume you meant to write: "If I lose, I lose $5."

Yes, these are basically equivalent. (I even mention rock-paper-scissors bots in a footnote.)
Learning the prior

Even if you were taking D as input and ignoring tractability, IDA still has to decide what to do with D, and that needs to be at least as useful as what ML does with D (and needs to not introduce alignment problems in the learned model). In the post I'm kind of vague about that and just wrapping it up into the philosophical assumption that HCH is good, but really we'd want to do work to figure out what to do with D, even if we were just trying to make HCH aligned (and I think even for HCH competitiveness matters because it's needed for HCH to be stable/aligned against internal optimization pressure).

1 jon_crescent 3mo

Okay, that makes sense (and seems compelling, though not decisive, to me). I'm happy to leave it here - thanks for the answers!
Learning the prior

was optimized to imitate H on D

It seems like you should either run separate models for D and D*, or jointly train the model on both D and D*. You definitely shouldn't train on D and then run on D* (and you don't need to!).

I suppose this works, but then couldn't we just have run IDA on D* without access to Mz (which itself can still access superhuman performance)?

The goal is to be as good as an unaligned ML system though, not just to be better than humans. And the ML system updates on D, so we need to update on D too.

1 jon_crescent 3mo

Sorry yes, you're completely right. (I previously didn't like that there's a model trained on $\mathbb{E}_{z \sim Z, D}[P_{H^A}(y \mid x, z)]$ which only gets used for finding z*, but realized it's not a big deal.)

I agree - I mean for the alternative to be running IDA on D*, using D as an auxiliary input (rather than using indirection through Mz). In other words, if we need IDA to access a large context Mz, we could also use IDA to access a large context D? Without something like the distilled core assumption, I'm not sure if there are major advantages one way or the other? OTOH, with something like the distilled core assumption, it's clearly better to go through Mz, because Mz is much smaller than D (I think of this as amortizing the cost of distilling D).
Learning the prior

I think your description is correct.

The distilled core assumption seems right to me because the neural network weights are already a distilled representation of D, and we only need to compete with that representation. For that reason, I expect z* to have roughly the same size as the neural network parameters.

My main reservation is that this seems really hard (and maybe in some sense just a reframing of the original problem). We want z to be a representation of what the neural network learned that a human can manipulate in order to reason about what it impl... (read more)

1 jon_crescent 3mo

Thanks. This is helpful. I agree that LTP with the distilled core assumption buys us a lot, both theoretically and probably in practice too.

> The distilled core assumption seems right to me because the neural network weights are already a distilled representation of D, and we only need to compete with that representation... My main reservation is that this seems really hard... If we require competitiveness then it seems like z has to look quite a lot like the weights of a neural network

Great, agreed with all of this.

> In writing the original post I was imagining z* being much bigger than a neural network but distilled by a neural network in some way. I've generally moved away from that kind of perspective, partly based on the kinds of considerations in this post [https://www.alignmentforum.org/posts/PJLABqQ962hZEqhdB/debate-update-obfuscated-arguments-problem]

I share the top-line view, but I'm not sure what issues obfuscated arguments present for large z*, other than generally pushing more difficulty onto alignment/debate. (Probably not important to respond to, just wanted to flag in case this matters elsewhere.)

> That said, I'm not sure we require OOD generalization even if we represent z via a model Mz. E.g. suppose that Mz(i) is the ith word of the intractably-large z.

I agree that Mz (= z*) does not require OOD generalization. My claim is that the amplified model using Mz involves an ML model which must generalize OOD. On D, our y-targets are $P_{H^A}(y \mid x, M_z)$ where $H^A$ is an amplified human. On D*, our y-targets are similarly $P_{H^A}(y^* \mid x^*, M_z)$. The key question for me is whether our y-targets on D* are good. If we use the distilled core assumption, they are - they're exactly the predictions the human makes after updating on D. Without it, our y-targets depend on $H^A$, which involves an ML model.

In particular, I'm assuming $H^A$ is something like human + policy $P_M(y \mid x, M_z)$, where $P_M$ was optimized to imitate H on D (with z sampled), but is making predictions on D* now. Maybe t
Some thoughts on risks from narrow, non-agentic AI

I agree that the core question is about how generalization occurs. My two stories involve kinds of generalization, and I think there are also ways generalization could work that could lead to good behavior.

It is important to my intuition that not only can we never train for the "good" generalization, we can't even evaluate techniques to figure out which generalize "well" (since both of the bad generalizations would lead to behavior that looks good over long horizons).

If there is a disagreement it is probably that I have a much higher probability of the... (read more)

Some thoughts on risks from narrow, non-agentic AI

I agree that this is probably the key point; my other comment ("I think this is the key point and it's glossed over...") feels very relevant to me.

Some thoughts on risks from narrow, non-agentic AI

I feel like a very natural version of "follow instructions" is "Do things that the instruction-giver would rate highly." (Which is the generalization I'm talking about.) I don't think any of the arguments about "long horizon versions of tasks are different from short versions" tell us anything about which of these generalizations would be learnt (since they are both equally alien over long horizons).

Other versions like "Follow instructions (without regard to what the training process cares about)" seem quite likely to perform significantly worse on ... (read more)

5 jon_crescent 2mo

The type 1 vs. type 2 feedback distinction here seems really central. I'm interested if this seems like a fair characterization to both of you.

Type 1: Feedback which we use for training (via gradient descent).
Type 2: Feedback which we use to decide whether to deploy the trained agent.

(There's a bit of gray between Type 1 and 2, since choosing whether to deploy is another form of selection, but I'm assuming we're okay stating that gradient descent and model selection operate in qualitatively distinct regimes.)

The key disagreement is whether we expect type 1 feedback will be closer to type 2 feedback, or whether type 2 feedback will be closer to our true goals. If the former, our agents generalizing from type 1 to type 2 is relatively uninformative, and we still have Goodhart. In the latter case, the agent is only very weakly optimizing the type 2 feedback, and so we don't need to worry much about Goodhart, and should expect type 2 feedback to continue to track our true goals well.

Main argument for type 1 ~ type 2: by definition, we design type 1 feedback (+ associated learning algorithm) so that resulting agents perform well under type 2.

Main argument for type 1 !~ type 2: type 2 feedback can be something like 1000-10000x more expensive, since we only have to evaluate it once, rather than enough times to be useful for gradient descent.

I'd also be interested to discuss this disagreement in particular, since I could definitely go either way on it. (I plan to think about it more myself.)
4 Richard_Ngo 3mo

Cool, thanks for the clarifications. To be clear, overall I'm much more sympathetic to the argument as I currently understand it, than when I originally thought you were trying to draw a distinction between "new forms of reasoning honed by trial-and-error" in part 1 (which I interpreted as talking about systems lacking sufficiently good models of the world to find solutions in any other way than trial and error) and "systems that have a detailed understanding of the world" in part 2.

Let me try to sum up the disagreement. The key questions are:

1. What training data will we realistically be able to train our agents on?
2. What types of generalisation should we expect from that training data?
3. How well will we be able to tell that these agents are doing the wrong thing?

On 1: you think long-horizon real-world data will play a significant role in training, because we'll need it to teach agents to do the most valuable tasks. This seems plausible to me; but I think that in order for this type of training to be useful, the agents will need to already have robust motivations (else they won't be able to find rewards that are given over long time horizons). And I don't think that this training will be extensive enough to reshape those motivations to a large degree (whereas I recall that in an earlier discussion on amplification, you argued that small amounts of training could potentially reshape motivations significantly). Our disagreement about question 1 affects questions 2 and 3, but it affects question 2 less than I previously thought, as I'll discuss.

On 2: previously I thought you were arguing that we should expect very task-specific generalisations like being trained on "reduce crime" and learning "reduce reported crime", which I was calling underspecified. However, based on your last comment it seems that you're actually mainly talking about broader generalisations, like being trained on "follow instructions" and learning "do things that the instruction
Some thoughts on risks from narrow, non-agentic AI

We do need to train them by trial and error, but it's very difficult to do so on real-world tasks which have long feedback loops, like most of the ones you discuss. Instead, we'll likely train them to have good reasoning skills on tasks which have short feedback loops, and then transfer them to real-world with long feedback loops. But in that case, I don't see much reason why systems that have a detailed understanding of the world will have a strong bias towards easily-measurable goals on real-world tasks with long feedback loops.

I think this is the key po... (read more)