A friend (correctly) recommended this post to me as useful context, and I'm documenting my thoughts here for easy reference. What follows is not, strictly speaking, an objection to the post's headline claim; it's a claim that coherence will tend to emerge in practice.
That the agent knows in advance what trades they will be offered.
This assumption doesn't hold in real life. It's a bit like saying "If I know what moves my opponent will make, I can always beat them at chess." Well, yes. But in practice you don't. Agents in real life can't rely on perfect knowledge like this. Directionally, agents will be less exploitable and more efficient as their preferences grow more explicit and coherent. In actual practice, training a neural net to solve problems without getting stuck also trains it to have more explicit and coherent preferences.
(If the agent doesn’t know in advance what trades they will be offered or is incapable of backward induction, then their pursuit of a dominated strategy need not indicate any defect in their preferences. Their pursuit of a dominated strategy can instead be blamed on their lack of knowledge and/or reasoning ability.)
I blame it on both. The lack of knowledge in question is the fact that agents in practice aren't omni-prescient. The lack of reasoning ability in question is a refusal to assign an explicit preference ordering to outcomes.
If you don't know the whole decision tree in advance, then "if I previously turned down some option X, I will not choose any option that I strictly disprefer to X" will probably be violated at some point, even without adversarial exploitation: e.g. you reject X1 and X2, and later face a forced choice between strictly worse options X1- and X2-.
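A toy sketch of that failure mode (my own construction, not from the post): the agent sees menus one at a time, follows the rule where it can, and on the last menu every remaining option is strictly dispreferred to something it already turned down. The specific options and numeric preferences are illustrative assumptions.

```python
# Illustrative preferences; higher is better. X1- and X2- are slightly worse
# versions of X1 and X2.
prefs = {"X1": 10, "X2": 10, "Y": 11, "Z": 12, "X1-": 8, "X2-": 7}

def choose(menu, rejected):
    """Pick the most-preferred option that doesn't break the rule; report
    whether every option on the menu breaks it (forcing a violation)."""
    allowed = [o for o in menu
               if not any(prefs[o] < prefs[r] for r in rejected)]
    forced_violation = not allowed
    pick = max(allowed or menu, key=prefs.get)  # must pick something
    rejected.update(o for o in menu if o != pick)
    return pick, forced_violation

rejected = set()
for menu in [["X1", "Y"], ["X2", "Z"], ["X1-", "X2-"]]:  # menus arrive one at a time
    pick, forced = choose(menu, rejected)
    print(menu, "->", pick, "| forced to violate the rule:", forced)
# The last menu forces a violation: both X1- and X2- are strictly dispreferred
# to options (X1, X2) that were already turned down.
```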
Even if I grant the entire rest of the post, it still seems highly probable that sufficiently smart AIs grown using modern methods will end up with coherent preference orderings in most of the ways that matter.
I can mostly only speak to my own probabilities, and it depends how many years we count as coming. I'm less than 98% on ASI in the next five years, say. The ~98% is if anyone builds it (using anything remotely like current methods).
Thanks for clarifying. It still seems that we'd encounter the same sort of problem even in the short term, though? Take the case of a programmer hijacking the input medium. Does the AI care? It's still getting instructions to follow. To what extent is it modeling the real humans on the other end? You touch on this in Defining the Principal(s) and jailbreaking, but it seems like it should be even more of a Problem for the approach. Like, an AI that can robustly navigate that challenge, to the point of being more or less immune to intercepts, seems hard to distinguish from one that is either (a) long-term aligned as well, or (b) not long-term aligned but possessed of deadly competence at world-modeling. An AI that can't handle this problem...well, is it really intent-aligned? Where else does its understanding of the developers break down?
On the one hand, I...sort of agree about the intuitions. There exist formal arguments, but I can't always claim to understand them well.
On the other, one of my intuitions is that if you're trying to build a Moon rocket, and the rocket engineers keep saying things like "The arguments boil down to differing intuitions" and "I think it is quite accurate to say that we don't understand how [rockets] work" then the rocket will not land on the Moon. At no point in planning a Moon launch should the arguments boil down to different intuitions. The arguments should boil down to math and science that anyone with the right background can verify.
If they don't, I would claim the correct response is not "maybe it'll work, maybe it won't, maybe it'll get partway there," it's instead "wow that rocket is doomed."
I see the current science being leveled at making Claude "nice" and I go "wow that sure looks like a far-off target with lots of weird unknowns between us and it, and that sure does not look like a precise trajectory plotted according to known formulae; I don't see them sticking the landing this way."
It's really hard to shake this intuition.
Possibly a nitpick: So, I don't actually think HHH was the training target. It was the label attached to the training target. The actual training target is...much weirder and more complicated IMO. The training target for RLHF is more or less "get human to push button" and RLAIF is the same but with an AI. Sure, pushing the "this is better" button often involves a judgment according to some interpretation of a statement like "which of these is more harmless?", but the appearance of harmlessness is not the same as its reality, etc.
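To make the "get human to push button" point concrete, here is a generic sketch of the standard pairwise (Bradley-Terry style) preference loss used in RLHF reward modeling. It's not any particular lab's implementation; the names, shapes, and toy data are my own assumptions. The point is just that the label being fit is the rater's button press, not harmlessness itself.

```python
import torch

def preference_loss(reward_model, chosen, rejected):
    """Pairwise preference loss: push reward(chosen) above reward(rejected).
    'chosen' only means the rater pressed the button for it; nothing in the
    objective measures harmlessness directly."""
    r_chosen = reward_model(chosen).squeeze(-1)
    r_rejected = reward_model(rejected).squeeze(-1)
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: a linear "reward model" over made-up 8-dim response features.
reward_model = torch.nn.Linear(8, 1)
chosen, rejected = torch.randn(4, 8), torch.randn(4, 8)
loss = preference_loss(reward_model, chosen, rejected)
loss.backward()  # gradients flow toward "whatever gets the button pressed"
```

(RLAIF swaps the human rater for an AI rater, but the objective has the same shape.)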
Sometimes, people argue for doom by noting that it would be hard for humans to directly align wildly superhuman AIs. I agree, but think it might be much easier to align systems which are only just capable enough to hand off relevant cognitive labor. Correspondingly, I often note this. Minimally, in the comment you linked in this post, I wasn't trying to refer to systems which are misaligned but controlled, I was trying to refer to aligned systems.
...huh. It seems to me that the fundamental problem in machine learning, that no one has a way to engineer specific goals into AI, applies equally well to weak AIs as to powerful ones. So this might be a key crux.
To clarify, by "align systems..." did you mean the same thing I do, full-blown value alignment / human CEV? Is the theory in fact that we could get weak AIs who steer robustly and entirely towards human values, and would do so even on reflection; that we'd actually know how to do this reliably on purpose with practical engineering, but that such knowledge wouldn't generalize to scaled-up versions of the same system? (My impression is that aligning even a weak AI that thoroughly requires understanding cognition on such a deep and fundamental level that it mostly would generalize to superintelligence, though of course it'd still be foolish to rely on this generalization alone.)
If instead you meant something more like you described here, systems that are not "egregiously misaligned", then that's a different matter. But I get the sense it actually is the first thing, in this specific narrow case?
Sure, I was just supporting the claim that "less capable AI systems can make meaningful progress on improving the situation". You seemed to be implicitly arguing against this claim.
I don't think they can make meaningful progress on alignment without catastrophically dangerous levels of competence. That's the main intended thrust of this particular post. (Separately, I don't think the anticipation of possible second-order benefits, like using AIs for human augmentation so the humans can solve alignment, is worth letting labs continue either; I'd perhaps be in favor of narrow AIs for this purpose if such could be specified in a treaty without leaving enormous exploitable loopholes. Maybe it can.)
It looks like we do agree on quite a lot of things. Not a surprise, but glad to see it laid out.
Why are you assuming the AI has misaligned goals?
The short, somewhat trite answer is that it's baked into the premise. If we had a way of getting a powerful optimizer that didn't have misaligned goals, we wouldn't need said optimizer to solve alignment!
The more precise answer is that we can train for competence but not goodness, current LLMs have misaligned goals to the extent they have any at all, and this doesn't seem likely to change.
Perhaps you will argue for this in the next post.
Yup. (Well, I'll try; the whole conversation on that particular crux seems unusually muddled to me and it shows.)
I also think less capable AI systems can make meaningful progress on improving the situation, but this is partially tied up in thinking that it isn't intractable to use a ton of human labor to do a good enough job aligning systems which are capable enough to dominate top human experts at these domains. As in, the way less capable AI systems help is by allowing us to (in effect) apply much more cognitive labor to the problem of sufficiently aligning AIs we can use to fully automate safety R&D (aka deferring to AIs, aka handing off to AIs).
The cruxy points here are, I think, "good enough" and "sufficiently", and the underlying implication that partial progress on alignment can make capabilities much safer. I doubt this. A future post will touch on why.
To the extent you contest this, the remaining options are to use AI labor to buy much more time and/or to achieve substantial human augmentation.
Nitpick: Neither approach seems to require AI labor. I certainly use plenty of LLMs in my workflow, but maybe you'd have something more ambitious in mind.
More on several of these topics in the coming posts.
I cannot speak for their team, but my best guess is that they are envisioning an Agent-3 which possesses insufficient awareness of its misaligned goals or insufficient coherence to notice it is incentivized to scheme. This does seem consistent with Agent-3 being incompetent to align Agent-4. To quote:
The superorganism of Agent-3 copies (the corporation within a corporation) does actually sort of try to align Agent-4 to the Spec, but fails for similar reasons to why OpenBrain employees failed—insufficient ability to judge success from failure, insufficient willingness on the part of decision-makers to trade away capabilities or performance for safety.
In Rob's list of possible outcomes, this seems to fall under "AIs that are confidently wrong and lead you off a cliff just like the humans would." (Possibly at some point Agent-3 said "Yep, I'm stumped too" and then OpenBrain trained it not to do that.)
It sounds like we are indeed using very different meanings of "alignment" and should use other words instead.
I suspect our shared crux is the degree to which cooperative behavior can be predicted/extrapolated as models get more competent. To a reasonable first approximation, if e.g. Claude wants good things, improvements to Claude's epistemics are probably good for us; if Claude does not, they are not. Yes?
It may take a whole entire post to explain, but I'm curious why you believe Claude is likely to have any care for human wellbeing that would survive reflection. I don't think training methods are precise enough to have instilled such care in the first place; do you believe differently? Are you mostly taking the observed behavioral tendencies as strong evidence, or is it something else...? (Maybe you have written about this elsewhere already.)
To be clear, I do suspect any AI smart enough to solve alignment is also smart enough to escape control and kill us. I'm not planning to go into great detail on control until after a deeper dive on the subject, though. Thanks for the reading material!
Somewhat relatedly, "If I previously turned down some option X, I will not choose any option that I strictly disprefer to X" does feel to me like a grafted-on hack of a policy that breaks down in some adversarial edge case.
Maybe it's airtight, I'm not sure. But if it is, that just feels like coherence with extra steps? Like, sure, you can pursue a strategy of incoherence, which requires you to know the entire universe of possible trades you will make and then backchain inductively to make sure you are never, ever exploitable.
Or you could make your preferences explicit and be consistent in the first place. In a sense, I think that's the simple, elegant thing that the weird hack approximates.
If you have coherent preferences, you get the hack for free. I think an agent with coherent preferences performs at least as well under the same assumptions (prescience, backchaining) on the same decision tree, and performs better if you relax one or more of those assumptions.
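A toy continuation of the earlier sketch, under the same illustrative preferences: a plain maximizer needs no record of past rejections and picks the same options on every menu. This is only an illustration of the "you get the hack for free" intuition, not a proof of the "at least as well" claim.

```python
# Same toy preferences as in the earlier sketch (illustrative numbers).
prefs = {"X1": 10, "X2": 10, "Y": 11, "Z": 12, "X1-": 8, "X2-": 7}

def choose_coherent(menu):
    """A coherent maximizer: just take the most-preferred option on the current menu."""
    return max(menu, key=prefs.get)

for menu in [["X1", "Y"], ["X2", "Z"], ["X1-", "X2-"]]:
    print(menu, "->", choose_coherent(menu))  # Y, Z, X1-: same picks, no bookkeeping about rejections
```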
In practice, it pays to be the sort of entity that attempts to have consistent preferences about things whenever that's decision-relevant and computationally tractable.