As a result, the model learned heuristic H, which works in all the circumstances you did consider, but fails in circumstance C.
That's basically where I start, but then I want to try to tell some story about why it kills you, i.e. what is it about the heuristic H and circumstance C that causes it to kill you?
I agree this involves discretion, and indeed moving beyond the trivial story "The algorithm fails and then it turns out you die" requires discretion, since those stories are certainly plausible. The other extreme would be to require us to keep making the... (read more)
I'd say that every single machine in the story is misaligned, so hopefully that makes it easy :)
I'm basically always talking about intent alignment, as described in this post.
(I called the story an "outer" misalignment story because it focuses on the---somewhat improbable---case in which the intentions of the machines are all natural generalizations of their training objectives. I don't have a precise definition of inner or outer alignment and think they are even less well defined than intent alignment in general, but sometimes the meaning seems unambiguou... (read more)
In my other response to your comment I wrote:
I would also expect that e.g. if you were to describe almost any existing practical system with purported provable security, it would be straightforward for a layperson with theoretical background (e.g. me) to describe possible attacks that are not precluded by the security proof, and that it wouldn't even take that long.
I guess SSH itself would be an interesting test of this, e.g. comparing the theoretical model of this paper to a modern implementation. What is your view about that comparison? e.g. how do you t... (read more)
Why did you write "This post [Inaccessible Information] doesn't reflect me becoming more pessimistic about iterated amplification or alignment overall." just one month before publishing "Learning the prior"? (Is it because you were classifying "learning the prior" / imitative generalization under "iterated amplification" and now you consider it a different algorithm?)
I think that post is basically talking about the same kinds of hard cases as in Towards Formalizing Universality 1.5 years earlier (in section IV), so it's intended to be more about clarificat... (read more)
From my perspective, there is a core reason for worry, which is something like "you can't fully control what patterns of thought your algorithm learns, and how they'll behave in new circumstances", and it feels like you could always apply that as your step 2.
That doesn't seem like it has quite the type signature I'm looking for. I'm imagining a story as a description of how something bad happens, so I want the story to end with "and then something bad happens."
In some sense you could start from the trivial story "Your algorithm didn't work and then something ... (read more)
I think the upshot of those technologies (and similarly for ML assistants) is:
By an "out" I mean something like: (i) figuring out how to build competitive aligned optimizers, (ii) coordinating to avoid deploying unaligned AI.
Unfortunately I think  is a bit less impactful than it initially seems, at least if we live in a world of accelerating growth towards a singularity. For example, if the singularity is in 2045 and it's 2035, and you were ... (read more)
I don't think that, from the perspective of humans monitoring a single ML system running a concrete, quantifiable process - industry or mining or machine design - it will be unexplainable. Just like today, tech stacks are already enormously complex, but at each layer someone does know how they work, and we know what they do at the layers that matter.
This seems like the key question.
Ever more complex designs for, say, a mining robot might start to resemble more and more some mix of living creatures and artwork out of a fractal, but we'll sti
It seems like if Bob deploys an aligned AI, then it will ultimately yield control of all of its resources to Bob. It doesn't seem to me like this would result in a worthless future even if every single human deploys such an AI.
The attractor I'm pointing at with the Production Web is that entities with no plan for what to do with resources---other than "acquire more resources"---have a tendency to win out competitively over entities with non-instrumental terminal values like "humans having good relationships with their children".
Quantitatively I think that entities without instrumental resources win very, very slowly. For example, if the average savings rate is 99% and my personal savings rate is only 95%, then by the time that the economy grows 10,000x my share of the world... (read more)
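The arithmetic behind "win very, very slowly" can be sketched with a toy continuous-compounding model (an illustration I'm adding, not the commenter's exact model): if each agent's wealth compounds at a rate proportional to its savings rate, then by the time the economy grows by a factor G, the lower saver's relative share has shrunk by G ** ((s_me - s_avg) / s_avg).

```python
# Toy continuous-compounding sketch (my assumption, not from the comment):
# wealth grows at a rate proportional to the savings rate, so my relative
# share of the world after the economy grows by `economy_growth` is
# economy_growth ** ((s_me - s_avg) / s_avg).
def relative_share(s_me: float, s_avg: float, economy_growth: float) -> float:
    return economy_growth ** ((s_me - s_avg) / s_avg)

# A 95% saver in a 99%-saving economy, after 10,000x growth:
print(relative_share(0.95, 0.99, 10_000))  # ~0.69: share shrinks by only ~1/3
```

So even a four-point gap in savings rates costs only about a third of one's relative share over a 10,000x expansion, which is the sense in which the competitive pressure here is weak.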
Yes, you understand me here. I'm not (yet?) in the camp that we humans have "mostly" lost sight of our basic goals, but I do feel we are on a slippery slope in that regard. Certainly many people feel "used" by employers/institutions in ways that are disconnected from their values. People with more job options feel less this way, because they choose jobs that don't feel like that, but I think we are a minority in having that choice.
I think this is an indication of the system serving some people (e.g. capitalists, managers, high-skilled l... (read more)
If trillion-dollar tech companies stop trying to make their systems do what they want, I will update that marginal deep-thinking researchers should allocate themselves to making alignment (the scalar!) cheaper/easier/better instead of making bargaining/cooperation/mutual-governance cheaper/easier/better. I just don't see that happening given the structure of today's global economy and tech industry.
In your story, trillion-dollar tech companies are trying to make their systems do what they want and failing. My best understanding of your position is: "... (read more)
It seems to me you are using the word "alignment" as a boolean, whereas I'm using it to refer to either a scalar ("how aligned is the system?") or a process ("the system has been aligned, i.e., has undergone a process of increasing its alignment"). I prefer the scalar/process usage, because it seems to me that people who do alignment research (including yourself) are going to produce ways of increasing the "alignment scalar", rather than ways of guaranteeing the "perfect alignment" boolean. (I sometimes use "misaligned" as a boolean due to it
Overall, I think I agree with some of the most important high-level claims of the post:
How are you inferring this? From the fact that a negative outcome eventually obtained? Or from particular misaligned decisions each system made?
I also thought the story strongly suggested single-single misalignment, though it doesn't get into many of the concrete decisions made by any of the systems so it's hard to say whether particular decisions are in fact misaligned.
The objective of each company in the production web could loosely be described as "maximizing production" within its industry sector.
Why does any company have this goal, or eve... (read more)
> The objective of each company in the production web could loosely be described as "maximizing production" within its industry sector.
Why does any company have this goal, or even roughly this goal, if they are aligned with their shareholders?
It seems to me you are using the word "alignment" as a boolean, whereas I'm using it to refer to either a scalar ("how aligned is the system?") or a process ("the system has been aligned, i.e., has undergone a process of increasing its alignment"). I prefer the scalar/process usage, because it seems to me t... (read more)
I broadly think of this approach as "try to write down the 'right' universal prior." I don't think the bridge rules / importance-weighting consideration is the only way in which our universal prior is predictably bad. There are also issues like anthropic update and philosophical considerations about what kind of "programming language" to use and so on.
I'm kind of scared of this approach because I feel unless you really nail everything there is going to be a gap that an attacker can exploit. I guess you just need to get close enough that εδ is man... (read more)
High level point especially for folks with less context: I stopped doing theory for a while because I wanted to help get applied work going, and now I'm finally going back to doing theory for a variety of reasons; my story is definitely not that I'm transitioning back from applied work to theory because I now believe the algorithms aren't ready.
I think my main question is, how do you tell when a failure story is sufficiently compelling that you should switch back into algorithm-finding mode?
I feel like a story is basically plausible until proven implausibl... (read more)
I don't really think of 3 and 4 as very different, there's definitely a spectrum regarding "plausible" and I think we don't need to draw the line firmly---it's OK if over time your "most plausible" failure mode becomes increasingly implausible and the goal is just to make it obviously completely implausible. I think 5 is a further step (doesn't seem like a different methodology, but a qualitatively further-off stopping point, and the further off you go the more I expect this kind of theoretical research to get replaced by empirical research). I think of it... (read more)
OK. I found the analogy to insecure software helpful. Followup question: Do you feel the same way about "thinking about politics" or "breaking laws" etc.? Or do you think that those sorts of AI behaviors are less extreme, less strange failure modes?
I don't really understand how thinking about politics is a failure mode. For breaking laws it depends a lot on the nature of the law-breaking---law-breaking generically seems like a hard failure mode to avoid, but there are kinds of grossly negligent law-breaking that do seem similarly perverse/strange/avoidable... (read more)
If we expected increased outreach and proselytization from vegetarians to uniformly make further outreach harder, would we expect to see the rapid and exponential growth of vegetarianism (as it seems to be)?
Is this true? e.g. Gallup shows the fraction of US vegetarians at 6% in 2000 and 5% in 2020 (link), so if there is exponential growth it seems like either their numbers are wrong or the growth is very slow.
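A one-line check of the implied growth rate, using the Gallup figures cited above:

```python
# Implied average annual growth rate from 6% of US adults (2000) to 5% (2020),
# per the Gallup figures cited in the comment above.
def annual_growth_rate(start: float, end: float, years: int) -> float:
    return (end / start) ** (1 / years) - 1

rate = annual_growth_rate(0.06, 0.05, 20)
print(f"{rate:+.2%}")  # about -0.91% per year: shrinking, not growing exponentially
```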
The primary argument for convincing someone to not eat meat is that the long term costs outweigh the short term benefits, so I'm not sure that you can c
At a minimum they also impose harms on the people who you convinced not to eat meat (since you are assuming that eating meat was a benefit to you that you wanted to pay for). And of course they make further vegetarian outreach harder. And in most cases they also won't be as precise an offset, e.g. it will apply to different animal products or at different times or with unclear probability.
That said, I agree that I can offset "me eating an egg" by paying Alice enough that she's willing to skip eating an egg, and in some sense that's an even purer offset than the one in this post.
The first seems misleading: what we need is a universal quantification over plausible stories, which I would guess requires understanding the behavior.
You get to iterate fast until you find an algorithm where it's hard to think of failure stories. And you get to work on toy cases until you find an algorithm that actually works in all the toy cases. I think we're a long way from meeting those bars, so that we'll get to iterate fast for a while. After we meet those bars, it's an open question how close we'd be to something that actually works. My suspicion i... (read more)
I think I'm responding to a more basic intuition, that if I wrote some code and it's now searching over ingenious ways to kill me, then something has gone extremely wrong in a way that feels preventable. It may be the default in some sense, just as wildly insecure software (which would lead to my computer doing the same thing under certain conditions) is the default in some sense, but in both cases I have the intuition that the failure comes from having made an avoidable mistake in designing the software.
In some sense changing this view would change my bott... (read more)
Outside view #1: How biomimetics has always worked
It seems like ML is different from other domains in that it already relies on incredibly massive automated search, with massive changes in the quality of our inner algorithms despite very little change in our outer algorithms. None of the other domains have this property. So it wouldn't be too surprising if the only domain in which all the early successes have this property is also the only domain in which the later successes have this property.
Outside view #2: How learning algorithms have always been devel
Yeah, thanks for catching that.
Carl Shulman wrote a related post here.
Commenters pointed out two examples of this that are already done in practice:
If we lived in a different world then e.g. restaurants could still repackage them at the last mile, selling humane egg credits along with their omelette. But in practice this probably wouldn't check the same box for most consumers.
In retrospect I think I should have called this post "Demand offsetting" to highlight the fact that you are offsetting the demand for eggs that you create (and hence hopefully causing no/minimal harm) rather than causing some harm and then offsetting that harm (the more typical situation, which is not obviously morally acceptable once you are in the kind of non-consequentialist framework that cares a lot about offsetting per se).
I think that there are roughly two possibilities: either the laws of our universe happen to be strongly compressible when packed into a malign simulation hypothesis, or they don't. In the latter case, ϵδ shouldn't be large. In the former case, it means that we are overwhelmingly likely to actually be inside a malign simulation.
It seems like the simplest algorithm that makes good predictions and runs on your computer is going to involve e.g. reasoning about what aspects of reality are important to making good predictions and then attending to those.... (read more)
I agree that this settles the query complexity question for Bayesian predictors and deterministic humans.
I expect it can be generalized to have complexity O(ϵ²δ²) in the case with stochastic humans where treacherous behavior can take the form of small stochastic shifts.
I think that the big open problems for this kind of approach to inner alignment are:
That's what I have in mind. If all goes well you can think of it like "a human thinking a long time." We don't know if all will go well.
It's also not really clear what "a human thinking 10,000 years" means, HCH is kind of an operationalization of that, but there's a presumption of alignment in the human-thinking-a-long-time that we don't get for free here. (Of course you also wouldn't get it for free if you somehow let a human live for 10,000 years...)
So treacherous models won't be trying to avoid collisions in order to make queries be linear in p(treachery)/p(truth). If P(discovery) increases when multiple models are trying to be treacherous at the same time--which we could go on to discuss; it's not obvious to me either way as of yet--that will be balanced against the inherent variation in some timesteps being better for treachery than others.
If you are willing to query only a billion points, and there are a billion competent treacherous models, and if the game is zero-sum between you and the models, the... (read more)
The first treacherous model works by replacing the bad simplicity prior with a better prior, and then using the better prior to more quickly infer the true model. No reason for the same thing to happen a second time.
(Well, I guess the argument works if you push out to longer and longer sequence lengths---a treacherous model will beat the true model on sequence lengths a billion, and then for sequence lengths a trillion a different treacherous model will win, and for sequence lengths a quadrillion a still different treacherous model will win. Before even thinking about the fact that each particular treacherous model will in fact defect at some point and at that point drop out of the posterior.)
I don't follow. Can't races to the bottom destroy all value for the agents involved?
You are saying that a special moment is a particularly great one to be treacherous. But if P(discovery) is 99.99% during that period, and there is any other treachery-possible period where P(discovery) is small, then that other period would have been better after all. Right?
This doesn't seem analogous to producers driving down profits to zero, because those firms had no other opportunity to make a profit with their machine. It's like you saying: there are tons of countries ... (read more)
I don't think incompetence is the only reason to try to pull off a treacherous turn at the same time that other models do. Some timesteps are just more important, so there's a trade off. And what's traded off is a public good: among treacherous models, it is a public good for the models' moments of treachery to be spread out.
Trying to defect at time T is only a good idea if it's plausible that your mechanism isn't going to notice the uncertainty at time T and then query the human. So it seems to me like this argument can never drive P(successful treachery)... (read more)
There are a bunch of things that differ between part I and part II, I believe they are correlated with each other but not at all perfectly. In the post I'm intending to illustrate what I believe some plausible failures look like, in a way intended to capture a bunch of the probability space. I'm illustrating these kinds of bad generalizations and ways in which the resulting failures could be catastrophic. I don't really know what "making the claim" means, but I would say that any ways in which the story isn't realistic are interesting to me (and we've alre... (read more)
I think this is doable with this approach, but I haven't proven it can be done, let alone said anything about a dependence on epsilon. The closest bound I show not only has a constant factor of like 40; it depends on the prior on the truth too. I think (75% confidence) this is a weakness of the proof technique, not a weakness of the algorithm.
I just meant the dependence on epsilon, it seems like there are unavoidable additional factors (especially the linear dependence on p(treachery)). I guess it's not obvious if you can make these additive or if they are... (read more)
I understand that the practical bound is going to be logarithmic "for a while" but it seems like the theorem about runtime doesn't help as much if that's what we are (mostly) relying on, and there's some additional analysis we need to do. That seems worth formalizing if we want to have a theorem, since that's the step that we need to be correct.
There is at most a linear cost to this ratio, which I don't think screws us.
If our models are a trillion bits, then it doesn't seem that surprising to me if it takes 100 bits extra to specify an intended model... (read more)
I haven't read the paper yet, looking forward to it. Using something along these lines to run a sufficiently-faithful simulation of HCH seems like a plausible path to producing an aligned AI with a halting oracle. (I don't think that even solves the problem given a halting oracle, since HCH is probably not aligned, but I still think this would be noteworthy.)
First I'm curious to understand this main result so I know what to look for and how surprised to be. In particular, I have two questions about the quantitative behavior described here:
if an event would
Here's the sketch of a solution to the query complexity problem.
Simplifying technical assumptions:
I'm pretty sure removing those is mostly just a technical complication.
This doesn't seem right. We design type 1 feedback so that resulting agents perform well on our true goals. This only matches up with type 2 feedback insofar as type 2 feedback is closely related to our true goals.
But type 2 feedback is (by definition) our best attempt to estimate how well the model is doing what we really care about. So in practice any results-based selection for "does what we care about" goes via selecting based on type 2 feedback. The difference only comes up when we reason mechanically about the behavior of our agents and how they are ... (read more)
I think that by default we will search for ways to build systems that do well on type 2 feedback. We do likely have a large dataset of type-2-bad behaviors from the real world, across many applications, and can make related data in simulation. It also seems quite plausible that this is a very tiny delta, if we are dealing with models that have already learned everything they would need to know about the world and this is just a matter of selecting a motivation, so that you can potentially get good type 2 behavior using a very small amount of data. Relatedl... (read more)
I like the following example:
This seems like a nice relatable example to me---it's not uncommon for someone to offer to bet on a rock paper scissors game, or to offer slightly favorab... (read more)
Even if you were taking D as input and ignoring tractability, IDA still has to decide what to do with D, and that needs to be at least as useful as what ML does with D (and needs to not introduce alignment problems in the learned model). In the post I'm kind of vague about that and just wrapping it up into the philosophical assumption that HCH is good, but really we'd want to do work to figure out what to do with D, even if we were just trying to make HCH aligned (and I think even for HCH competitiveness matters because it's needed for HCH to be stable/aligned against internal optimization pressure).
PM was optimized to imitate H on D
It seems like you should either run separate models for D and D*, or jointly train the model on both D and D*, definitely you shouldn't train on D then run on D* (and you don't need to!).
I suppose this works, but then couldn't we just have run IDA on D* without access to Mz (which itself can still access superhuman performance)?
The goal is to be as good as an unaligned ML system though, not just to be better than humans. And the ML system updates on D, so we need to update on D too.
I think your description is correct.
The distilled core assumption seems right to me because the neural network weights are already a distilled representation of D, and we only need to compete with that representation. For that reason, I expect z* to have roughly the same size as the neural network parameters.
My main reservation is that this seems really hard (and maybe in some sense just a reframing of the original problem). We want z to be a representation of what the neural network learned that a human can manipulate in order to reason about what it impl... (read more)
I agree that the core question is about how generalization occurs. My two stories involve kinds of generalization, and I think there are also ways generalization could work that could lead to good behavior.
It is important to my intuition that not only can we never train for the "good" generalization, we can't even evaluate techniques to figure out which ones generalize "well" (since both of the bad generalizations would lead to behavior that looks good over long horizons).
If there is a disagreement it is probably that I have a much higher probability of the... (read more)
I agree that this is probably the key point; my other comment ("I think this is the key point and it's glossed over...") feels very relevant to me.
I feel like a very natural version of "follow instructions" is "Do things that the instruction-giver would rate highly." (Which is the generalization I'm talking about.) I don't think any of the arguments about "long horizon versions of tasks are different from short versions" tell us anything about which of these generalizations would be learnt (since they are both equally alien over long horizons).
Other versions like "Follow instructions (without regards to what the training process cares about)" seem quite likely to perform significantly worse on ... (read more)
We do need to train them by trial and error, but it's very difficult to do so on real-world tasks which have long feedback loops, like most of the ones you discuss. Instead, we'll likely train them to have good reasoning skills on tasks which have short feedback loops, and then transfer them to real-world with long feedback loops. But in that case, I don't see much reason why systems that have a detailed understanding of the world will have a strong bias towards easily-measurable goals on real-world tasks with long feedback loops.
I think this is the key po... (read more)