Joe Rogero

Sequences
Against Muddling Through
Situational Awareness Summarized

Comments (sorted by newest)
We won’t get docile, brilliant AIs before we solve alignment
Joe Rogero · 2d

I can mostly only speak to my own probabilities, and it depends on how many years we count as coming. I'm less than 98% on ASI in the next five years, say. The ~98% is conditional on anyone building it (using anything remotely like current methods).

Intent alignment seems incoherent
Joe Rogero · 7d

Thanks for clarifying. It still seems that we'd encounter the same sort of problem even in the short term, though? Take the case of a programmer hijacking the input medium. Does the AI care? It's still getting instructions to follow. To what extent is it modeling the real humans on the other end? You touch on this in Defining the Principal(s) and jailbreaking, but it seems like it should be even more of a Problem for the approach. An AI that can robustly navigate that challenge, to the point of being more or less immune to intercepts, seems hard to distinguish from one that is (a) long-term aligned as well, or (b) if not long-term aligned, possessed of deadly competence at world-modeling. An AI that can't handle this problem...well, is it really intent-aligned? Where else does its understanding of the developers break down?

LLMs are badly misaligned
Joe Rogero · 8d

On the one hand, I...sort of agree about the intuitions. There exist formal arguments, but I can't always claim to understand them well. 

On the other, one of my intuitions is that if you're trying to build a Moon rocket, and the rocket engineers keep saying things like "The arguments boil down to differing intuitions" and "I think it is quite accurate to say that we don't understand how [rockets] work" then the rocket will not land on the Moon. At no point in planning a Moon launch should the arguments boil down to different intuitions. The arguments should boil down to math and science that anyone with the right background can verify. 

If they don't, I would claim the correct response is not "maybe it'll work, maybe it won't, maybe it'll get partway there," it's instead "wow that rocket is doomed." 

I see the current science being leveled at making Claude "nice" and I go "wow that sure looks like a far-off target with lots of weird unknowns between us and it, and that sure does not look like a precise trajectory plotted according to known formulae; I don't see them sticking the landing this way."

It's really hard to shake this intuition. 

 

Possibly a nitpick: So, I don't actually think HHH was the training target. It was the label attached to the training target. The actual training target is...much weirder and more complicated IMO. The training target for RLHF is more or less "get human to push button" and RLAIF is the same but with an AI. Sure, pushing the "this is better" button often involves a judgment according to some interpretation of a statement like "which of these is more harmless?", but the appearance of harmlessness is not the same as its reality, etc. 
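To make the "get human to push button" framing concrete, here is a minimal, hypothetical sketch (my own toy code, not anything from an actual lab pipeline) of how RLHF preference data is typically turned into a reward signal via a pairwise Bradley-Terry loss. The names (`ToyRewardModel`, the random "features") are illustrative assumptions; the point is that the loss only ever references which response got the "this is better" button, while "harmless" exists only in the instructions the labeler was given.

```python
# Illustrative sketch only: a toy reward model trained on pairwise
# "this is better" labels, the standard way RLHF preference data is used.
# Note what the loss actually sees: the button press, not harmlessness itself.
import torch
import torch.nn as nn

class ToyRewardModel(nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        # Stand-in for "LLM backbone + scalar reward head".
        self.score = nn.Linear(dim, 1)

    def forward(self, x):
        return self.score(x).squeeze(-1)

reward_model = ToyRewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Fake "response features" for pairs a labeler compared. The labeler may have
# been asked "which is more harmless?", but the dataset records only the choice.
chosen = torch.randn(64, 16)    # responses whose button got pushed
rejected = torch.randn(64, 16)  # responses that didn't

for _ in range(100):
    optimizer.zero_grad()
    # Bradley-Terry objective: maximize P(chosen is preferred over rejected).
    loss = -torch.nn.functional.logsigmoid(
        reward_model(chosen) - reward_model(rejected)
    ).mean()
    loss.backward()
    optimizer.step()
```

RLAIF would swap the human labeler for an AI judge, but the structure of the objective is the same: agreement with recorded preferences, not the underlying property the preferences were supposed to track.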

We won’t get AIs smart enough to solve alignment but too dumb to rebel
Joe Rogero · 8d

> Sometimes, people argue for doom by noting that it would be hard for humans to directly align wildly superhuman AIs. I agree, but think it might be much easier to align systems which are only just capable enough to hand off relevant cognitive labor. Correspondingly, I often note this. Minimally, in the comment you linked in this post, I wasn't trying to refer to systems which are misaligned but controlled, I was trying to refer to aligned systems.

...huh. It seems to me that the fundamental problem in machine learning, that no one has a way to engineer specific goals into AI, applies just as much to weak AIs as to powerful ones. So this might be a key crux.

To clarify, by "align systems..." did you mean the same thing I do, full-blown value alignment / human CEV? Is the theory in fact that we could get weak AIs who steer robustly and entirely towards human values, and would do so even on reflection; that we'd actually know how to do this reliably on purpose with practical engineering, but that such knowledge wouldn't generalize to scaled-up versions of the same system? (My impression is that aligning even a weak AI that thoroughly requires understanding cognition on such a deep and fundamental level that it mostly would generalize to superintelligence, though of course it'd still be foolish to rely on this generalization alone.) 

If instead you meant something more like you described here, systems that are not "egregiously misaligned", then that's a different matter. But I get the sense it actually is the first thing, in this specific narrow case? 

> Sure, I was just supporting the claim that "less capable AI systems can make meaningful progress on improving the situation". You seemed to be implicitly arguing against this claim.

I don't think they can make meaningful progress on alignment without catastrophically dangerous levels of competence.  That's the main intended thrust of this particular post. (Separately, I don't think the anticipation of possible second-order benefits, like using AIs for human augmentation so the humans can solve alignment, is worth letting labs continue either; I'd perhaps be in favor of narrow AIs for this purpose if such could be specified in a treaty without leaving enormous exploitable loopholes. Maybe it can.) 

We won’t get AIs smart enough to solve alignment but too dumb to rebel
Joe Rogero · 8d

It looks like we do agree on quite a lot of things. Not a surprise, but glad to see it laid out. 

> Why are you assuming the AI has misaligned goals?

The short, somewhat trite answer is that it's baked into the premise. If we had a way of getting a powerful optimizer that didn't have misaligned goals, we wouldn't need said optimizer to solve alignment! 

The more precise answer is that we can train for competence but not goodness, current LLMs have misaligned goals to the extent they have any at all, and this doesn't seem likely to change.

> Perhaps you will argue for this in the next post.

Yup. (Well, I'll try; the whole conversation on that particular crux seems unusually muddled to me and it shows.) 

> I also think less capable AI systems can make meaningful progress on improving the situation, but this is partially tied up in thinking that it isn't intractable to use a ton of human labor to do a good enough job aligning systems which are capable enough to dominate top human experts at these domains. As in, the way less capable AI systems help is by allowing us to (in effect) apply much more cognitive labor to the problem of sufficiently aligning AIs we can use to fully automate safety R&D (aka deferring to AIs, aka handing off to AIs).

The cruxy points here are, I think, "good enough" and "sufficiently", and the underlying implication that partial progress on alignment can make capabilities much safer. I doubt this. A future post will touch on why. 

> To the extent you contest this, the remaining options are to use AI labor to buy much more time and/or to achieve substantial human augmentation.

Nitpick: Neither approach seems to require AI labor. I certainly use plenty of LLMs in my workflow, but maybe you'd have something more ambitious in mind. 

More on several of these topics in the coming posts. 

We won’t get AIs smart enough to solve alignment but too dumb to rebel
Joe Rogero · 8d

I cannot speak for their team, but my best guess is that they are envisioning an Agent-3 which possesses insufficient awareness of its misaligned goals or insufficient coherence to notice it is incentivized to scheme. This does seem consistent with Agent-3 being incompetent to align Agent-4. To quote: 

> The superorganism of Agent-3 copies (the corporation within a corporation) does actually sort of try to align Agent-4 to the Spec, but fails for similar reasons to why OpenBrain employees failed—insufficient ability to judge success from failure, insufficient willingness on the part of decision-makers to trade away capabilities or performance for safety.

In Rob's list of possible outcomes, this seems to fall under "AIs that are confidently wrong and lead you off a cliff just like the humans would." (Possibly at some point Agent-3 said "Yep, I'm stumped too" and then OpenBrain trained it not to do that.) 

LLMs are badly misaligned
Joe Rogero · 8d

It sounds like we are indeed using very different meanings of "alignment" and should use other words instead. 

I suspect our shared crux is the degree to which cooperative behavior can be predicted/extrapolated as models get more competent. To a reasonable first approximation, if e.g. Claude wants good things, improvements to Claude's epistemics are probably good for us; if Claude does not, they are not. Yes? 

It may take a whole post to explain, but I'm curious why you believe Claude is likely to have any care for human wellbeing that would survive reflection. I don't think training methods are precise enough to have instilled such care in the first place; do you believe differently? Are you mostly taking the observed behavioral tendencies as strong evidence, or is it something else...? (Maybe you have written about this elsewhere already.)

We won’t get AIs smart enough to solve alignment but too dumb to rebel
Joe Rogero · 8d

To be clear, I do suspect any AI smart enough to solve alignment is also smart enough to escape control and kill us. I'm not planning to go into great detail on control until after a deeper dive on the subject, though. Thanks for the reading material! 

LLMs are badly misaligned
Joe Rogero · 9d

Yeah, I'm mostly trying to address the impression that LLMs are ~close to aligned already and thus the problem is keeping them that way, rather than, like, actually solving alignment for AIs in general. 

LLMs are badly misaligned
Joe Rogero · 9d

Friendly and unfriendly attractors might exist, but that doesn't make them equally likely; the unfriendly ones seem far more likely. I have in mind a mental image of a galaxy of value-stars, each with its own metaphorical gravity well. Somewhere in that galaxy is a star or a handful of stars labeled "cares about human wellbeing" or similar. Almost every other star is lethal. Landing on a safe star, and not getting snagged by any other gravity well, requires a very precise trajectory. The odds of landing it by accident are astronomically low.

(Absent caring, I don't think "granting us rights" is a particularly likely outcome; AIs far more powerful than humans would have no good reason to.)

I agree that an AI being too dumb to recognize when it's causing harm (vs e.g. co-writing fiction) screens off many inferences about its intent. I...would not describe any such interaction, with human or AI, as "revealing its CEV." I'd say current interactions seem to rule out the hypothesis that LLMs are already robustly orbiting the correct metaphorical star. They don't say much about which star or stars they are orbiting. 

Posts

We won’t get docile, brilliant AIs before we solve alignment (7 points, 5d, 3 comments)
Labs lack the tools to course-correct (4 points, 5d, 0 comments)
Alignment progress doesn’t compensate for higher capabilities (2 points, 6d, 0 comments)
Intent alignment seems incoherent (22 points, 8d, 2 comments)
We won’t get AIs smart enough to solve alignment but too dumb to rebel (28 points, 9d, 16 comments)
LLMs are badly misaligned (26 points, 10d, 25 comments)
Goodness is harder to achieve than competence (22 points, 12d, 0 comments)
Good is a smaller target than smart (22 points, 12d, 0 comments)
So You Want to Work at a Frontier AI Lab (39 points, 4mo, 12 comments)
Existing Safety Frameworks Imply Unreasonable Confidence (46 points, 6mo, 3 comments)