If you created a misaligned AI, then it would be "thinking back", and you'd be in an adversarial position where security mindset is appropriate.
Cool, we agree on this point.
my point in that section is that the fundamental laws governing how AI training processes work are not "thinking back". They're not adversaries.
I think we agree here on the local point but disagree on its significance to the broader argument. [I'm not sure how much we agree--I think of training dynamics as 'neutral', but also I think of them as searching over program-space in order to find a program that performs well on a (loss function, training set) pair, and so you need to be reasoning about search. But I think we agree the training dynamics are not trying to trick you / be adversarial and instead are straightforwardly 'trying' to make Number Go Down.]
In my picture, we have the neutral training dynamics paired with the (loss function, training set) which creates the AI system, and whether the resulting AI system is adversarial or not depends mostly on the choice of (loss function, training set). It seems to me that we probably have a disagreement about how much of the space of (loss function, training set) leads to misaligned vs. aligned AI (if it hits 'AI' at all), where I think aligned AI is a narrow target to hit that most loss functions will miss, and hitting that narrow target requires security mindset.
To explain further, it's not that the (loss function, training set) is thinking back at you on its own; it's that the AI created by training is thinking back at you. So before you decide to optimize X, you need to check whether you actually want something that's optimizing X, or whether you need to optimize for Y instead.
So from my perspective it seems like you need security mindset in order to pick the right inputs to ML training to avoid getting misaligned models.
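To make that concrete, here's a toy sketch of the "neutral search, non-neutral output" point. Everything in it is my own illustration--the candidate "programs", the objective names, and the numbers are made up, not anyone's actual training setup:

```python
# Toy sketch: the search itself is neutral, but what it hands you depends
# entirely on the objective you gave it. (All names and numbers here are
# invented for illustration.)

def intended_value(x):
    # What we actually wanted: benefit that saturates past x = 10.
    return min(x, 10)

def proxy_loss(x):
    # What we actually optimize: a proxy that keeps rewarding larger x forever.
    return -x

candidates = range(1000)  # stand-in for "program space"

# The "training dynamics": no adversary here, just Number Go Down on the proxy.
best = min(candidates, key=proxy_loss)

print(best)                  # 999 -- the proxy's favorite program
print(intended_value(best))  # 10  -- no better than x = 10; the gap is the problem
```

The search was 'neutral' the whole way through; the only place security mindset could have entered was in the choice of proxy_loss, which is the point I'm trying to make about (loss function, training set) pairs.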
Congrats on another successful program!
Mentors rated their enthusiasm for their scholars to continue with their research at 7/10 or greater for 94% of scholars.
What is it at 9/10 or greater? My understanding is that 7/10 and 8/10 are generally viewed as 'neutral' scores, so this reads more like "6% of scholars failed" than "94% of scholars succeeded." (It looks like averages of roughly 8 are generally viewed as 'high' in this postmortem, so this population might be tougher raters than in other contexts, in which case I'm wrong about what counts as 'neutral'.)
A lot of my thinking over the last few months has shifted from "how do we get some sort of AI pause in place?" to "how do we win the peace?". That is, you could have a picture of AGI as the most important problem, the one that precedes all other problems; anti-aging research is important, but it might actually be faster to build an aligned artificial scientist who solves it for you than to solve it yourself (on this general argument, see Artificial Intelligence as a Positive and Negative Factor in Global Risk). But if getting alignment to work requires a thirty-year pause on the creation of artificial scientists, that belief flips--now it actually makes sense to go ahead with humans researching the biology of aging, and to do projects like Loyal.
This isn't true of just aging; there are probably something like twelve major areas of concern. Some of them are simply predictable catastrophes we would like to avert; others we may need to address in order to safely exit the pause at all (or to keep the pause going when it would be unsafe to exit).
I think 'solutionism' is basically the right path here. What I'm interested in: what's the foundation for solutionism, or what support does it need? Why is solutionism not already the dominant view? One of the things I found most exciting about SENS was the sense that "someone had done the work": they had actually identified the list of seven problems and had a plan for how to address all of them. Even if those specific plans didn't pan out, the superstructure was there and the ability to pivot was there. It looked like a serious approach by serious people. What is the superstructure for solutionism such that one can be reasonably confident that marginal efforts are actually contributing to success, instead of bailing water on the Titanic?
Hearing, on my way out the door, when I'm exhausted beyond all measure and feeling deeply alienated and betrayed, "man, you should really consider sticking around" is upsetting.
This is not how I read Seth Herd's comment; I read him as saying "aw, I'll miss you, but not enough to follow you to Substack." This is simultaneously support for you staying on LW and for the mods to reach an accommodation with you, intended as information for you to do what you will with it.
The rest of this--being upset about what you take to be the frame of that comment--feels like the conflict in miniature? I'm not sure I have much helpful to say there.
My understanding is that their commitment is to stop once their ASL-3 evals are triggered.
Ok, we agree. By "beyond ASL-3" I thought you meant "stuff that's outside the category ASL-3" instead of "the first thing inside the category ASL-3".
For the Anthropic RSP in particular, I think it's accurate & helpful to say
Yep, that summary seems right to me. (I also think the "concrete commitments" statement is accurate.)
But I want to see RSP advocates engage more with the burden of proof concerns.
Yeah, I also think putting the burden of proof on scaling (instead of on pausing) is safer and probably appropriate. I am hesitant about it on process grounds; it seems to me like evidence of safety might require the very scaling that we're not allowing until we see evidence of safety. On net it seems like the right decision on the current margin, but the same lock-in concerns (if we do the right thing now for the wrong reasons, perhaps we will do the wrong thing for the same reasons in the future) make me wary of simply switching the burden of proof instead of coming up with a better system for evaluating risk.
I got the impression that Anthropic wants to do the following things before it scales beyond ASL-3:
Did you mean ASL-2 here? This seems like a pretty important detail to get right. (What they would need to do to scale beyond ASL-3 is meet the standard of an ASL-4 lab, which they have not developed yet.)
I agree with Habryka that these don't seem likely to cause Anthropic to stop scaling:
By design, RSPs are conditional pauses; you pause until you have met the standard, and then you continue. If you get the standard in place soon enough, you don't need to pause at all. This incentivizes implementing the security and safety procedures as soon as possible, which seems good to me.
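To spell out the structure I have in mind, here's a schematic of my own (the function name and integer ASL levels are my shorthand, not language from the RSP itself):

```python
# Schematic of the conditional-pause structure as I understand it; the names
# and the integer ASL levels are my own shorthand, not RSP text.

def may_continue_scaling(triggered_asl: int, highest_standard_met: int) -> bool:
    """You may keep scaling only if the safety standard for the capability
    level your evals have triggered is already in place; otherwise you pause."""
    return highest_standard_met >= triggered_asl

# ASL-3 evals trigger before the ASL-3 standard is in place: pause.
assert not may_continue_scaling(triggered_asl=3, highest_standard_met=2)
# Once the standard is met, the pause ends and scaling resumes.
assert may_continue_scaling(triggered_asl=3, highest_standard_met=3)
```

The incentive falls out of the second case: the sooner the standard is in place, the shorter (or nonexistent) the pause.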
But the RSP does not commit Anthropic to having any particular containment measures or any particular evidence that it is safe to scale to ASL-4--it only commits Anthropic to publish a post about ASL-4 systems. This is why I don't consider the ASL-4 section to be a concrete commitment.
Yes, I agree that the ASL-4 part is an IOU, and I predict that when they eventually publish it there will be controversy over whether or not they got it right. (Ideally, by then we'll have a consensus framework and independent body that develops those standards, which Anthropic will just sign on to.)
Again, this is by design; the underlying belief of the RSP is that we can only see so far ahead through the fog, and so we should set our guidelines bit by bit, rather than pausing until we can see our way all the way to an aligned sovereign.
Ok, it sounds to me like you're saying:
"When you train ML systems, they game your specifications because the training dynamics are too dumb to infer what you actually want. We just need One Weird Trick to get the training dynamics to Do What You Mean Not What You Say, and then it will all work out, and there's not a demon that will create another obstacle given that you surmounted this one."
That is, training processes are not neutral; there are the bad training processes that we have now (or had, before the recent positive developments), and eventually there will be good training processes that create aligned-by-default systems.
Is this roughly right, or am I misunderstanding you?