Okay, but why is it wrong, though? I still haven't seen a convincing case for that! It sure looks to me like, given an assumption (one I'm still not sure whether you share), the conclusion does in fact follow from the premises, even in metaphor form.
I am open to the case that it's a bad argument. If it is in fact a bad argument then that's a legitimate criticism. But from my perspective you have not adequately spelled out how "deep nets favor simple functions" implies it's a bad argument.
I notice I'm still confused about your argument.
It seems to me that the question of whether [safe or intended goalset] is [the simplest function] is extremely relevant to the question of whether the argument as stated is wrong.
As I understand things right now, we seem to generally agree that:
You said (emphasis added):
To the end of being rigorous and correct, I'm claiming that the "each of these black shapes is basically just as good at passing that particular test" story isn't a good explanation of why alignment is hard (notwithstanding that alignment is in fact hard), because of the story about deep net architectures being biased towards simple functions.
You seem to be saying "because deep learning privileges simple functions, the claim that many different AIs could pass our test is false." I don't see how that follows, because:
And if in fact we don't know how to draw the right metaphorical teal thing, then the metaphorical black thing could take on various shapes that appear weird and complicated to us, but that actually reflect an underlying simplicity of which we are unaware. So it doesn't seem wrong to claim that the black thing could take some (apparently) weird and (apparently) complex shape, given the assumption that we can't draw a sufficiently constraining teal thing.
It is still the case that many different [AIs / simple functions] could [pass the test / approximate the dataset]. When we move the argument one level deeper, the original claim still holds true. Maybe I'm still just misunderstanding, though.
...or maybe you are only saying that the explanation as written is bad at taking readers from (1) to (4) because it does not explicitly mention (2), i.e. not technically wrong but still a bad explanation. In that case it seems we'd agree that (2) seems like a relevant wrinkle, and that writing (3) with "selected data" instead of "test" adds useful and correct nuance. But I don't see how it makes Duncan's summary either untrue or misleading, because eliding it doesn't change (1) or (4).
...or maybe you are saying it was a bad explanation for you, and for readers with your level of sophistication and familiarity with the arguments, and thus a bad answer to your original question. Which is...kinda fair? In that case I suppose you'd be saying "Ah, I notice you are making an assumption there, and I agree with the assumption, but failing to address it is bad form and I'm worried about what that failure implies." (I'll hold off on addressing this argument until I know whether you'd actually endorse it.)
...or maybe you also flatly disagree with (3)? Like, you disagree with Duncan's claim here:
I want to be clear that I think the only sane prior is on "we don't know how to choose the right data." Like, I don't think this is reasonably an "if." I think the burden of proof is on "we've created a complete picture and constrained all the necessary axes," à la cybersecurity, and that the present state of affairs with regards to LLM misalignment (and all the various ways that it keeps persisting/that things keep squirting sideways) bears this out. The claim is not "impossible/hopeless," but "they haven't even begun to make a case that would be compelling to someone actually paying attention."
...and in that case, excellent! We have surfaced a true crux. And it makes perfect sense from that perspective to say "the metaphor is wrong" because, from that perspective, one of its key assumptions is false.
Importantly, though, that looks to me like an object-level disagreement, and not one that reflects bad epistemics, except insofar as one believes that any disagreement must be the result of bad epistemics.
I am in favor of learning more programming! During the two years I spent pivoting from reliability engineering, I did in fact attempt some hands-on machine learning code. My brain isn't shaped in such a way that reading textbooks confers meaningful coding skills - I have to Actually Do the Thing - but I did try Actually Doing the Thing, reading and all.
I later facilitated BlueDot's alignment and governance courses, and went through their reading material several times over in the process.
I now face a tradeoff between learning more ML, which is doable but extremely time-consuming, and efforts to convince policymakers not to let labs build ASI. It seems overwhelmingly overdetermined that my (marginal) time is best spent on the second thing. I see my primary comparative advantage as attempting to buy more time for developing solutions that might actually save us.
...which does unfortunately mean it's going to take me a while to properly digest your argument-from-dataset-approximation. Doesn't mean I won't try.
Even attempting to take it as given, though, I'm confused by your conclusion, because you seem to be simultaneously saying "[language models approximating known datasets we can squint at] is a reason we know current systems are safe" and "this reason will not generalize to ASI" and "this answers the quoted question of why [ASI] would converge on something conveniently safe for us".
At some point in the future, when you have AIs training other AIs on AI-generated data too fast for humans to monitor, you can't just eyeball the data and feel confident that it's not doing something you don't want to happen.
Isn't this the default path? Don't most labs' plans to build ASI run through massive use of AI-generated data? Even if I accept the premise that you can confidently assure safety by eyeballing data today, this doesn't do much to reassure me if you then agree that it doesn't generalize.
So I'm still not seeing how this supports the crux that "implement everyone's CEV" (or, whichever alternative goalset you consider safe) is likely the simplest [function that approximates the datasets that will be used to create ASI].[1]
(Also, at this point I kind of want to taboo 'dataset' because it feels like a very overloaded term.)
Brackets, because I'm not even sure this is representing you right. Possibly it should be [function reflected by the dataset] or some other thing.
I agree directionally and denotationally with this, but I feel the need to caution that "winning arguments" is itself a very dangerous epistemic frame to inhabit for long.
Also...
I think a big part of how safety team earns its dignity points is by being as specific as possible about exactly how capabilities team is being suicidal
We do that too! There's a lot of ground to cover.
I want to first acknowledge strongly that yep, we are mostly on the same side about getting a much better future than everyone dying to AIs.
But in the context of my own writing, everyone who's paying attention to me already knows about existential risk;
I note this is not necessarily true for MIRI; we are trying very hard on purpose to reach and inform more people.
I want my words to be focused on being rigorous and correct, not scaring policymakers and the public (notwithstanding that policymakers and the public should in fact be scared).
The two can be compatible!
I perceive at least two separate critiques, and I want to address them both without cross-contamination. (Please correct me if these miss the mark.)
Hypothesis 1: Maybe MIRI folks have wrong world-models (possibly due to insufficient engagement with sophisticated disagreement).
Hypothesis 2: Maybe MIRI folks are prioritizing their arguments badly for actually stopping the AI race.
Regarding Hypothesis 1, there's a tradeoff between refining and polishing one's world-model, and acting upon that world-model to try to accomplish things.
Speaking only for myself, there are many possible things I could be writing or saying, and only finite time to write or say them in. For the moment, I mostly want my words to be focused on (productively) scaring policymakers and the public, because they should in fact be scared.
This obviously does not preclude writing for and talking with the ingroup, nor continuing to refine and polish my own world-model.
But...well, I feel like I've mostly hit diminishing returns on that, both when it comes to updating my own models and when it comes to updating those of others like me. So the balance of time spent naturally tips towards outreach.
To borrow from your comment below, with regard to Hypothesis 2...
if we want that Pause treaty, we need to find the ironclad arguments that convince skeptical experts, not just appeal to intuition.
...for one thing, I'm not sure how true this is? Policymakers and the public can sometimes both be swayed by appeals to intuition. Skeptical experts can be really hard to convince. Especially after the Nth iteration of debate has passed and a lot of ideas have congealed.
Again, there's a tradeoff here, a matter of how much time one spends making cases to audiences of various levels of informed or uninformed skepticism. I'm not sure what the right balance is, but for myself at least, it's probably not a primary focus on convincing Paul Christiano of things. Tactical priorities can differ from person to person, of course.
Caveat 1: Again, I speak for myself here. I admittedly have much less context on the decades-long back-and-forth than some of my colleagues.
Caveat 2: No matter who I'm trying to convince, I do want my arguments to rest on a solid foundation. If an interlocutor digs deep, the argument-hole they unearth should hold water. To, uh, rather butcher a metaphor.
But this just rounds back to my response to Hypothesis 1 - thanks to the magic of the Internet (and Lightcone Infrastructure in particular) you can always find someone with a sophisticated critique to level at your supposedly solid foundation. At some point you do have to take your best guess about what's true and robust and correct according to your current world-model, then go and try to share it outside the crucible of LessWrong forums.
With all that being said, sure, let's talk world-models. (With, again, the caveat that this is all my own limited take as someone who spent most of the 2010s doing reliability engineering and not alignment research.)
To the end of being rigorous and correct, I'm claiming that the "each of these black shapes is basically just as good at passing that particular test" story isn't a good explanation of why alignment is hard (notwithstanding that alignment is in fact hard), because of the story about deep net architectures being biased towards simple functions.
I think I follow your argument that one might say "we don't know how to draw the teal thing" instead. But this seems more quibble than crux. I don't think you addressed Duncan's core point, which is that "we don't know how to draw the teal thing" is the correct prior? (i.e. we don't know how to select training data in a way that constrains an AI to learn to explicitly, primarily, and robustly value human flourishing.)
And if in fact we don't know how to draw the right metaphorical teal thing, then the metaphorical black thing could take on various shapes that appear weird and complicated to us, but that actually reflect an underlying simplicity of which we are unaware. So it doesn't seem wrong to claim that the black thing could take some (apparently) weird and (apparently) complex shape, given the assumption that we can't draw a sufficiently constraining teal thing.
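As a toy illustration of the underdetermination I'm gesturing at (entirely my own invented example with made-up data, not anything from the papers you cite): two functions can agree exactly on every point you tested for and still diverge wildly everywhere you didn't.

```python
# Toy sketch with hypothetical data: many functions "pass the same test."
import numpy as np
from functools import reduce

x_train = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_train = np.sin(x_train)  # stand-in for whatever behavior the "teal thing" constrains

# Hypothesis A: the degree-4 polynomial that interpolates the training points exactly.
base = np.poly1d(np.polyfit(x_train, y_train, deg=4))

# Hypothesis B: hypothesis A plus a term that vanishes on every training point
# but nowhere else -- a different "black shape" that fits the same constraint.
bump = reduce(lambda p, q: p * q, [np.poly1d([1.0, -xi]) for xi in x_train])

def weird(x):
    return base(x) + 0.5 * bump(x)

x_off = np.linspace(-3.0, 3.0, 7)  # points we never tested
print("disagreement on the test: ", np.max(np.abs(base(x_train) - weird(x_train))))
print("disagreement off the test:", np.max(np.abs(base(x_off) - weird(x_off))))
```

Both hypotheses fit the "test" equally well; which one the training process actually finds is exactly the thing the simplicity-bias story has to settle.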
More broadly, I think I'm missing some important context, or just failing to follow your logic. I don't see how a bias towards simple functions implies a convergence towards nonlethal aims. We don't know what would be the simplest functions that approximate current or future training data. Why believe they would converge on something conveniently safe for us? [1]
From the papers you cite, I can see how one would conclude that AIs will be efficient, but I don't see how they imply that AIs will be nice.
In the aforementioned spirit of rigor, I'm trying to avoid saying "human values" because those might not be good enough either. Many humans do not prioritize conscious flourishing! An ASI that doesn't hold conscious wellbeing as its highest priority likely kills everyone as a side effect of optimizing for other ends, etc. etc.
Whether the simplest function is misaligned (as posited by List of Lethalities #20) is the thing you have to explain!
Is this in fact a crux for you? If you were largely convinced that the simplest functions found by gradient descent in the current paradigm would not remotely approximate human values, to what extent would this shift your odds of the current paradigm getting everyone killed?
I chiefly advise against work that brings us closer to superintelligence. I aim this advice primarily at those who want to make sure AI goes well. For careers that do other things, and for those who aren't aiming their careers for impact, this post mostly doesn't apply. One can argue about secondary effects and such, but in general, mundane utility is a good thing and it's fine for people to get paid for providing it.
Somewhat relatedly, "If I previously turned down some option X, I will not choose any option that I strictly disprefer to X" does feel to me like a grafted-on hack of a policy that breaks down in some adversarial edge case.
Maybe it's airtight, I'm not sure. But if it is, that just feels like coherence with extra steps? Like, sure, you can pursue a strategy of incoherence that requires you to know the entire universe of possible trades you will make and then backchain inductively to make sure you are never, ever exploitable about this.
Or you could make your preferences explicit and be consistent in the first place. In a sense, I think that's the simple, elegant thing that the weird hack approximates.
If you have coherent preferences, you get the hack for free. I think an agent with coherent preferences performs at least as well with the same assumptions (prescience, backchaining) on the same decision tree, and performs better if you relax one or more of those assumptions.
In practice, it pays to be the sort of entity that attempts to have consistent preferences about things whenever that's decision-relevant and computationally tractable.
A friend (correctly) recommended this post to me as useful context, and I'm documenting my thoughts here for easy reference. This is not, strictly speaking, an objection to the headline claim of the post. It's a claim that coherence will tend to emerge in practice.
That the agent knows in advance what trades they will be offered.
This assumption doesn't hold in real life. It's a bit like saying "If I know what moves my opponent will make, I can always beat them at chess." Well, yes. But in practice you don't. Agents in real life can't rely on perfect knowledge like this. Directionally, agents will be less exploitable and more efficient as their preferences grow more explicit and coherent. In actual practice, training a neural net to solve problems without getting stuck also trains it to have more explicit and coherent preferences.
(If the agent doesn’t know in advance what trades they will be offered or is incapable of backward induction, then their pursuit of a dominated strategy need not indicate any defect in their preferences. Their pursuit of a dominated strategy can instead be blamed on their lack of knowledge and/or reasoning ability.)
I blame it on both. The lack of knowledge in question is the fact that agents in practice aren't omni-prescient. The lack of reasoning ability in question is a refusal to assign an explicit preference ordering to outcomes.
If you don't know the whole decision tree in advance, then "if I previously turned down some option X, I will not choose any option that I strictly disprefer to X" will probably be violated at some point by e.g. having rejected X1 and X2 and later having to choose between X1- and X2-, even without adversarial exploitation.
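To make that concrete, here's a minimal toy sketch (the options and utilities are hypothetical, invented purely for illustration): the agent turns down X1 and X2 while holding out for something better, the tree then only offers degraded variants, and whatever it picks violates the policy without any adversary in the picture.

```python
# Minimal toy with hypothetical utilities: the policy "never choose an option I
# strictly disprefer to one I previously turned down" gets violated without
# any adversary, just because the agent didn't foresee the whole tree.
utility = {"X1": 10, "X2": 9, "X1-": 4, "X2-": 3, "hold": 12}

rejected = []

# Round 1: the agent turns down X1 and X2 because holding out looks better.
for option in ["X1", "X2"]:
    if utility[option] < utility["hold"]:
        rejected.append(option)

# Round 2 (unforeseen): "hold" is no longer on the table; only the degraded
# variants remain, and the agent must pick one of them.
choice = max(["X1-", "X2-"], key=lambda o: utility[o])

violated = [r for r in rejected if utility[choice] < utility[r]]
print(f"chose {choice}; strictly dispreferred to previously rejected {violated}")
# -> chose X1-; strictly dispreferred to previously rejected ['X1', 'X2']
```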
Even if I grant the entire rest of the post, it still seems highly probable that sufficiently smart AIs grown using modern methods end up likely to have coherent preference orderings in most ways that matter.
Thank you for attempting to spell this out more explicitly. If I understand correctly, you are saying singular learning theory suggests that AIs with different architectures will converge on a narrow range of similar functions that best approximate the training data.
With less confidence, I understand you to be claiming that this convergence implies that (in the context of the metaphor) a given [teal thing / dataset] may reliably produce a particular shape of [black thing / AI].
So (my nascent Zack model says) the summary is incorrect to analogize the black thing to "architectures" instead of "parametrizations" or "functions", and more importantly incorrect to claim that the black shape's many degrees of freedom imply it will take a form its developers did not intend. (Because, by SLT, most shapes converge to some relatively simple function approximator.)
But...it does seem to me like an AI trained using modern methods, e.g. constitutional AI, is insufficiently constrained to embody human-compatible values even given the stated interpretation of SLT. Or in other words, the black shape is still basically unpredictable from the perspective of the teal-shape drawer. I'm not sure you disagree with that?
As an exercise in inferential gap-crossing, I want to try to figure out what minimum change to the summary / metaphor would make it relatively unobjectionable to you.
Attempting to update the analogy in my own model, it would go something like: You draw a [teal thing / dataset]. You use it to train the [black thing / AI]. There are underlying regularities in your dataset, some of which are legible to you as a human and some of which are not. The black thing conforms to all the regularities. This does not by coincidence happen to cause it to occupy the shape you hoped for; you do not see all the non-robust / illegible features constraining that shape. You end up with [weird shape] instead of [simple shape you were aiming for].
A more skeptical Zack-model in my head says "No, actually, you don't end up with [weird shape] at all. SLT says you can get [shape which robustly includes the entire spectrum of reflectively consistent human values] because that's the function being approximated, the underlying structure of the data." I dunno if this is an accurate Zack-model.
(I am running into the limited bandwidth of text here, and will also DM you a link to schedule a conversation if you're so inclined).