Vaniver

Comments

Then the model can safely scale.

If there are experiences which will change the model but don't lead to less of the initial good values, then yeah, for an approximate definition of safety. You're resting everything on the continued strength of this model as capabilities increase, and so if it fails before you top out the scaling, I think you probably lose.

FWIW I don't really see your description as, like, a specific alignment strategy so much as the strategy of "have an alignment strategy at all". The meat is all in 1) how you identify the core of human values and 2) how you identify which experiences will change the system to have less of the initial good values, but, like, figuring out the two of those would actually solve the problem! 

I think that's a pretty simplistic view of the post, but given that view, I agree that's the right update to make.

Why does it seem simplistic? Like, one of the central points of the post you link is that we should think about the specific technical features of proposals, instead of focusing on marketing questions of which camp a proposal falls into. And Eliezer saying he's "no fan of neurons" is in the context of him responding to a comment by someone with the username Marvin Minsky defending the book Perceptrons (the post is from the Overcoming Bias era, when comments did not have threading or explicit parents).

I basically read this as Eliezer making fun of low-nuance people, not people excited about NNs; in that very post he excitedly describes an NN-based robotics project!

So I definitely think there's something weirdly unspoken about the argument; I would characterize it as Eliezer saying "suppose I'm right and they're wrong; all this requires is for things to be harder than people think, which is usual. Suppose instead that I'm wrong and they're right; this requires things to be easier than people think, which is unusual." But the equation of "people" and "Eliezer" is sort of strange; as Quintin notes, it isn't that unusual for outside observers to overestimate difficulty, and so I wish he had centrally addressed the reference class tennis game: is the relevant expertise "getting AI systems to be capable" or "getting AI systems to do what you want"?

I agree with this heuristic in general, but will observe that Quintin's first post here was over two years ago and that he commented on A List of Lethalities; I do think it'd be fair for him to respond with "what do you think this post was?".

FWIW, I thought the bit about manifolds in The difficulty of alignment was the strongest foot forward, because it paints a detailed picture that differs from the description of yours that it's responding to.

That said, I don't think Quintin's picture obviously disagrees with yours (as discussed in my response over here), and I think you'd find it disappointing that he calls your description extremely misleading while not seeming to correctly identify the argument's structure or check whether a related argument goes through on his model.

During this process, I don’t think it’s particularly unusual for the person to notice a technical problem but overlook a clever way to solve that problem.

I think this isn't the claim; I think the claim is that it would be particularly unusual for someone to overlook that they're accidentally solving a technical problem. (It would be surprising for Edison to pick tungsten without thinking hard about which filament to use; in actual history, it took decades for that change to be made.)

BTW I do agree with you that Eliezer’s interview response seems to suggest that he thinks aligning an AGI to “basic notions of morality” is harder and aligning an AGI to “strawberry problem” is easier. If that’s what he thinks, it’s at least not obvious to me.

My sense (which I expect Eliezer would agree with) is that it's relatively easy to get an AI system to imitate the true underlying 'basic notions of morality', to the extent humans agree on that, but that this doesn't protect you at all as soon as you want to start making large changes, or as soon as you start trying to replace specialist sectors of the economy. (A lot of ethics for doctors has to do with the challenges of simultaneously being a doctor and a human; those ethics will not necessarily be relevant for docbots, and the question of what they should be instead is potentially hard to figure out.)

So if you're mostly interested in getting out of the acute risk period, you probably need to aim for a harder target.

seem very implausible when considered in the context of the human learning process (could a human's visual cortex become "deceptively aligned" to the objective of modeling their visual field?).

I think it would probably be strange for the visual cortex to do this. But I think it's not that uncommon for other parts of the brain to do this: the higher-level, more abstract / "psychological" parts that have a sense of how things will affect their relevance to future decision-making. I think there are lots of self-perpetuating narratives that it might be fair to call 'deceptively aligned' when they're maladaptive. The idea of metacognitive blindspots also seems related.
