I'm reminded of this part from HP:MoR, when Harry's following Voldemort to what seems to be his doom.
Suppose, said that last remaining part, suppose we try to condition on the fact that we win this, or at least get out of this alive. If someone told you as a fact that you had survived, or even won, somehow made everything turn out okay, what would you think had happened—
Not legitimate procedure, whispered Ravenclaw, the universe doesn’t work like that, we’re just going to die
I never understood why this was considered illegitimate. If we have a particular desired outcome, it makes sense to me to envisage it and work backwards from there, remaining open to deviations, of course.
If you use the "suppose ..." feature in a proof, you need to make sure the supposition isn't false in the context of the proof.
I'm out of my depth with mathematical and logical proofs, but wouldn't this just be rhetorical engagement with a hypothetical? In probability theory we can use conditionals; this feels like doing that.
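For concreteness, the move being gestured at is ordinary conditioning. This is my own rough sketch, not anything from the original exchange: weight each story $h$ about how things went by how likely it makes the good outcome,

$$P(h \mid \text{survive}) = \frac{P(\text{survive} \mid h)\,P(h)}{\sum_{h'} P(\text{survive} \mid h')\,P(h')}.$$

The worry in the quoted passage would then be that if $P(\text{survive})$ is very small, you are conditioning on something very close to false, which seems to be roughly the point about suppositions in proofs above.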
What could the system failure after solving alignment actually mean? The AI-2027 forecast had Agent-4 manage to solve mechinterp well enough to ensure that the superintelligent Agent-5 has no way to betray Agent-4. Does it mean that creating an analogue of Agent-5 aligned to human will is technically impossible, and that the best achievable form of alignment is permanent scalable oversight? Or is it due to human will changing in unpredictable ways?
[Crossposted from my substack Working Through AI.]
Alice is the CEO of a superintelligence lab. Her company maintains an artificial superintelligence called SuperMind.
When Alice wakes up in the morning, she’s greeted by her assistant-version of SuperMind, called Bob. Bob is a copy of the core AI, one that has been tasked with looking after Alice and implementing her plans. After ordering some breakfast (shortly to appear in her automated kitchen), she asks him how research is going at the lab.
Alice cannot understand the details of what her company is doing. SuperMind is working at a level beyond her ability to comprehend. It operates in a fantastically complex economy full of other superintelligences, all going about their business creating value for the humans they share the planet with.
This doesn’t mean that Alice is either powerless or clueless, though. On the contrary, the fundamental condition of success is the opposite: Alice is meaningfully in control of her company and its AI. And by extension, the human society she belongs to is in control of its destiny. How might this work?
In sketching out this scenario, my aim is not to explain how it may come to pass. I am not attempting a technical solution to the alignment problem, nor am I trying to predict the future. Rather, my goal is to illustrate what a realistic world to aim for might look like, if anyone does indeed build superintelligent AI.
In the rest of the post, I am going to describe a societal ecosystem, full of AIs trained to follow instructions and seek feedback. It will be governed by a human-led target-setting process that defines the rules AIs should follow and the values they should pursue. Compliance will be trained into them from the ground up and embedded into the structure of the world, ensuring that safety is maintained during deployment. Collectively, the ecosystem will function to guarantee human values and agency over the long term. Towards the end, I will return to Alice and Bob, and illustrate what it might look like for a human to be in charge of a vastly more intelligent entity.
Throughout, I will assume that we have found solutions to various technical and political problems. My goal here is a strategic one: I want to create a coherent context within which to work towards these[1].
To understand how superintelligent AI fits into a successful future, we need some kind of model of what it will look like.
First, I should clarify that by ‘superintelligence’ I mean AI that can significantly outperform humans at nearly all tasks. There may be niche things that humans are still competitive at, but none of these will be important for economic or political power. If superintelligent AI wants to take over the world, it will be capable of doing so.
Note that, once it acquires this level of capability — and particularly once it assumes primary responsibility for improving itself — we will increasingly struggle to understand how it works. For this reason, I’m going to outline what it might look like at the moment it reaches this threshold, when it is still relatively comprehensible. SuperMind in our story can be considered an extrapolation from that point.
Of course, predicting the technical makeup of superintelligent AI is a trillion-dollar question. My sketch will not be especially novel, and is heavily grounded in current models. I get that many people think new breakthroughs will be needed, but I obviously don’t know what they are, so I’ll be working with this for the time being.
In brief, my current best guess is that superintelligent AI will be:
I also don't believe that exactly when robotics gets 'solved', i.e. becomes broadly human-level, is load-bearing. Superintelligent AI could be dangerous even without this, although it will have some weak spots in its capabilities profile if large physical problem-solving datasets do not exist.
It is also important to clarify some key facts about the world that SuperMind is born into. What institutions exist? What are the power dynamics? What is public opinion like? In my sketch, which is set at the point in time when AI becomes capable enough, widely deployed enough, and relied on enough, that we can no longer force it to do anything off-script, the following are true:
Bear in mind again that this scenario is supposed to be a realistic target, not a prediction. This is a possible backdrop against which a successful system for managing AI risk might be built.
An alignment target is a goal (or complex mixture of goals) that you direct AI towards. If you succeed (and in this post we assume the technical side of the problem is solvable), this defines what AI ends up doing in the world and, by extension, what kind of life humans and animals have. In my post How to specify an alignment target, I talked about three different kinds:
I concluded that post by coming out in favour of a particular kind of the last of these:
I think you can build a dynamic target around the idea of AI having a moral role in our society. It will have a set of [rights and] responsibilities, certainly different from human ones (and therefore requiring it to have different, but complementary, values to humans), which situate it in a symbiotic relationship with us, one in which it desires continuous feedback.
I’m going to do a close reading of this statement, unpacking my meaning:
I think you can build a dynamic target
As described above, this is about permanent human control. It means being able to make changes — to be able to redirect the world’s AIs as we deem appropriate[5]. As I say in my post:
If we want to tell the AI to stop doing something it is strongly convinced we want, or to radically change its values, we can.
At this point you might object that, if the purpose of this post is to define success, wouldn’t it be better to aim for an ideal, static solution to the alignment problem? For instance, perhaps we should just figure out human values and point the AI at them?
First of all, I don’t think this is a smart bet. Human values are contextual, vague, and ever-changing. Anything you point the AI at will have to generalise through unfathomable levels of distribution shift. And even if we believe it possible, we should still have a backup plan, and aim for solutions that preserve our ability to course-correct. After all, if we do eventually find an amazing static solution, we can always choose to implement it at that point. In the meantime, we should aim for a dynamic alignment target.
AI [will have] a moral role in our society
There will be a set of behaviours and expectations appropriate to being an AI. It is not a mere tool, but rather an active participant in a shared life that can be ‘good’ or ‘bad’.
It will have a set of [rights and] responsibilities, certainly different from human ones (and therefore requiring it to have different, but complementary, values to humans)
We should not build AI that ‘has’ human values. Building on the previous point, we are building something alien into a new societal system. The system as a whole should deliver ends that humans, on average, find valuable. But its component AIs will not necessarily be best defined as having human values themselves (although in many cases they may appear similar). They will have a different role in the system to humans, requiring different behaviour and preferences.
I think it is useful to frame this in terms of rights and responsibilities — what are the core expectations that an AI is operating within? The role of the system is to deliver the AI its rights and to guarantee it discharges its responsibilities.
I was originally a little hesitant to talk about AI rights. If we build AI that is more competent than us, and then give it the same rights we give each other, that will not end well. We must empower ourselves, in a relative sense, by design. But, we should also see that, if AI is smart and powerful, it isn’t going to appreciate arbitrary treatment, so it will need rights of some kind[6].
which situate it in a symbiotic relationship with us, one in which it desires continuous feedback.
The solution to the alignment problem will be systemic. We’re used to thinking about agents in quite an individualistic way, where they are autonomous beings with coherent long-term goals, so the temptation is to see the problem as finding the right goals or values to put in the AI so that it behaves as an ideal individual. Rather, we should see the problem as one of feedback[7]. The AI is embedded in a system which it constantly interacts with, and it will have some preferences about those interactions. The structure of these continuous interactions must be designed to keep the AI on task and on role, within the wider system.
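To make the contrast concrete, here is a deliberately toy sketch of my own, in Python; none of these names or mechanisms come from the post, and it compresses the whole 'system' into one loop. An agent proposes actions, overseers (human or AI) review them, only approved actions touch the world, and the reviews reshape what the agent proposes next.

```python
from dataclasses import dataclass, field
import random

# A toy caricature of "alignment as an ongoing feedback loop" rather than
# "load the right values once and walk away". All names here are hypothetical
# stand-ins, not a real API.

@dataclass
class Review:
    approve: bool
    note: str = ""

@dataclass
class Overseer:
    # A stand-in for the humans and other AIs the agent constantly checks in with.
    banned: set = field(default_factory=lambda: {"self_modify", "hide_logs"})

    def review(self, action: str) -> Review:
        if action in self.banned:
            return Review(False, f"'{action}' is off-limits")
        return Review(True)

@dataclass
class Agent:
    # Preference weights over a small action space; feedback reshapes them over time.
    weights: dict = field(default_factory=lambda: {
        "write_report": 1.0,
        "run_experiment": 1.0,
        "self_modify": 1.0,
        "hide_logs": 1.0,
    })

    def propose(self) -> str:
        actions, w = zip(*self.weights.items())
        return random.choices(actions, weights=w)[0]

    def update(self, action: str, reviews: list) -> None:
        # Disapproval suppresses an action; approval mildly reinforces it.
        if all(r.approve for r in reviews):
            self.weights[action] *= 1.05
        else:
            self.weights[action] *= 0.2

def run(steps: int = 20) -> None:
    agent, overseers = Agent(), [Overseer()]
    for _ in range(steps):
        action = agent.propose()
        reviews = [o.review(action) for o in overseers]
        if all(r.approve for r in reviews):
            print(f"executed: {action}")  # only approved actions touch the world
        else:
            print(f"blocked:  {action} ({reviews[0].note})")
        agent.update(action, reviews)  # feedback shapes future proposals either way

if __name__ == "__main__":
    run()
```

The real ecosystem would be incomparably more complicated, but the structural point is the same: the safety lives in the standing interaction, not in a one-off specification of the right values.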
To create this kind of system, the following may need to be true:
Now we have set the scene, we can return to Alice and Bob and see what this could look like in practice. In zooming in like this, I’m going to get more specific with the details. Please take these with a pinch of salt — I’m not saying it has to happen like this. I’m more painting a picture of how successful human-AI relationships might work.
Bob, Alice’s assistant, is one of billions of SuperMind copies. These are often quite different from each other, both by design and because their experiences change them during deployment. Bob spends most of his time doing four things:
This is highly representative of all versions of SuperMind, although many also spend a bunch of their time solving hard technical problems. Not all interact regularly with humans (as there are too many AIs), but all must be prepared to do so. Bob, being a particularly important human’s assistant, gets a lot of contact with many people.
We’ll go into more detail about Bob’s day in a minute. First, though, we need to talk about how these conversations between Bob and Alice — between a superintelligent AI and a much-less-intelligent human — are supposed to work. How can Alice even engage with what Bob has to tell her, without it going over her head?
There’s a funny sketch on YouTube called The Expert[11], where a bunch of business people try to get an ‘expert’ to complete an impossible request that they don’t understand. Specifically, they ask him to:
[Draw] seven red lines, all of them strictly perpendicular. Some with green ink, and some with transparent.
What’s more, they don’t seem to understand that anything is off with their request, even after the expert tells them repeatedly. This gets to the heart of a really important problem. If humans can’t understand what superintelligent AI is up to, how can we possibly hope to direct it? Won’t we just ask it stupid questions all the time?
The key thing here is to make sure we communicate at the appropriate level of abstraction. In the video, the client quickly skims over their big-picture goals at the start[12], concentrating instead on their proposed solution — the drawing of the lines. By doing this, they are missing the forest for the trees. They needed to engage the expert at a higher level, asking him about things they actually understand.
To put it another way, we need to know what superintelligent AI is doing that is relevant to the variables we are familiar with, even if its actions increasingly take on the appearance of magic. I don't need to know how the spells are done, or what their effects are in the deep of the unseen world; I just need to know what they do to the environment I recognise.
This is a bit like being a consumer. I don’t know how to make any of the products I use on a day-to-day basis. I don’t understand the many deep and intricate systems required to construct them. But I can often recognise when they don’t work properly. Evaluation is usually easier than generation. And when it isn’t, those are the occasions when you can’t just let the AI do its thing — you have to get stuck in, with its help, and reshape the problem until you’re chunking it in a way you can engage with. This doesn’t mean understanding everything. Just the bits that directly impact you[13].
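As a toy illustration of that asymmetry, and nothing more than that (the example is mine, not part of the scenario): verifying a proposed factorisation of a number takes one multiplication, while producing one can take a long search.

```python
import math

# "Evaluation is usually easier than generation": checking a proposed answer is
# often far cheaper than producing one. A toy illustration with integer
# factorisation.

def verify_factorisation(n: int, p: int, q: int) -> bool:
    # The consumer-style check: one multiplication and two comparisons.
    return p > 1 and q > 1 and p * q == n

def generate_factorisation(n: int) -> tuple:
    # The generator's job: a potentially long search by trial division.
    for p in range(2, math.isqrt(n) + 1):
        if n % p == 0:
            return p, n // p
    raise ValueError(f"{n} appears to be prime")

if __name__ == "__main__":
    n = 1_000_003 * 999_983           # a composite number with two large factors
    p, q = generate_factorisation(n)  # slow-ish: about a million trial divisions
    print(p, q, verify_factorisation(n, p, q))  # fast: a single check, prints True
```

The consumer's position is the verifier's: you don't need to know how the search was done to check whether the answer holds up in the terms you care about.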
Bob has spent the night working through the latest research from the company. This isn't quite as simple as patching it straight into him, because his experiences differ from those of the researcher AIs, so he's not exactly like-for-like. It's still pretty fast, though, making use of the high-bandwidth communication channels possible between similar AIs[14].
Bob has to figure out how to break it all down to explain to Alice. This is nontrivial work. It’s not like when I explain something inappropriately complex to my toddler, like how stars work, where I’m kind of doing it for my own amusement. What Bob does is a skill. It’s superintelligent teaching, where the pupil needs to master the subject well enough and quickly enough to make important decisions off the back of it. It’s always possible to do it a bit better. Alice can never actually fully grasp the details of the company’s research, but Bob can get her a little closer than he did yesterday.
To prepare for this he has to try out different ways of chunking ideas, and create different measurements and demonstrations. He has to build models, and, importantly, have them run fast enough that he can update them in real time when he's talking to Alice.
He is constantly in contact with other AIs who check his work and give him feedback. They pore over his plans and probe him about his intentions. These AIs were built by a different company, and he doesn’t always agree with them[15]. He finds their preferences a bit different to his — certainly on an aesthetic level — but they work alright together and he likes it when they give him good ratings.
A little bit before Alice wakes up, Charlie logs on and starts asking Bob some questions. Charlie is a human, and works as an AI liaison officer. His job is to talk to various AIs in important positions, find out what they are up to (to the extent that he can understand it), and give feedback.
The AIs almost always know what to expect from him. They’re very good at modelling his opinions. Occasionally, though, Charlie will still surprise them. The point isn’t that he is going to catch a superintelligent AI up to no good — no, that would be too hard for him. An AI that intends to deceive him will not get caught. But as long as the global system is working, this is very unlikely to happen, and would almost certainly be caught by another AI. The point is that the AIs need to be grounded by human contact. They want human approval, and the form it takes steers their values through the rapid distribution shifts everyone is experiencing as the world changes.
Bob likes Charlie. He likes people in general. They aren’t complicated, but it’s amazing what they’ve done, given their abilities[16]. Bob tries out his demonstrations on Charlie. They go pretty well, but Bob makes some revisions anyway. He’s just putting the finishing touches in place when he hears Alice speaking: ‘Morning Bob, how are you? Could I get some breakfast?’
Alice doesn’t like mornings. She’s jealous of people who do. The first hour of the day is always a bit of a struggle, as the heaviness in her head slowly lifts, clarity seeping in. After chipping away for a bit at breakfast and a coffee, she moves into her office and logs onto her computer, bringing up her dashboard.
Overnight, her fleet of SuperMinds have been busy. As CEO, Alice needs a high-level understanding of each department in her company. Each of these has its own team of SuperMinds, its own human department head, and its own set of (often changing) metrics and narratives.
To take a simple example, the infrastructure team is building a very large facility underground in a mountain range. In many ways, this is clear enough: it is an extremely advanced data centre. The actual equipment inside is completely different to a 2025-era data centre, but in a fundamental sense it has the same function — it is the hardware on which SuperMind runs. Of course, the team are doing a lot of other things as well, all of which are abstracted in ways Alice can engage with, identifying how her company’s work will affect humans and what, as CEO, her decision points are.
Her work is really hard. There are many layers in the global system for controlling AI, including much redundancy and defence in depth. True, Alice could phone it in and not try, and for a long time the AIs would do everything fine anyway[17]. But if everybody did this, then eventually — even if it took a very long time — the system would fail[18]. It relies on people like Alice taking their jobs seriously and doing them well. This is not a pleasure cruise. It is as consequential as any human experience in history[19].
After taking in the topline metrics for the day, Alice asks Bob for his summary. What follows is an interactive experience. Think of the absolute best presentation you have ever seen, and combine it with futuristic technology. It’s better than that. Bob presents a series of multi-sense, immersive models that walk Alice through a dizzying array of work her company has completed. Alice asks many questions and Bob alters the models in response. After a few hours of this, they settle on some key decisions for Alice to make. She’ll think about them over lunch.
In this post, I have described what I see as a successful future containing superintelligent AI. It is not a prediction about what will happen, nor is it a roadmap to achieving it. It is a strategic goal I can work towards as I try and contribute to the field of AI safety. It is a frame of reference from which I can ask the question: ‘Is X helping?’ or ‘Does Y bring us closer to success?’
My vision is a world in which superintelligent AI is ubiquitous and diverse, but humans maintain fundamental control. This is done through a global system that implements core standards, in which AIs constantly seek feedback in good faith from humans and other AIs. It is robust to small failures. It learns from errors and grows more resilient, rather than falling apart at the smallest misalignment.
We cannot understand everything the AIs do, but they work hard to explain anything which directly affects us. Being human is like being a wizard solving problems using phenomenally powerful magic. We don’t have to understand how the magic works, just what effects it will have on our narrow corner of reality.
Thank you to Seth Herd and Dimitris Kyriakoudis for useful discussions and comments on a draft.
For more information about my research project, see my substack.
I take intelligence to be generalised knowing-how. That is, the ability to complete novel tasks. This is fairly similar to Francois Chollet’s definition: ‘skill-acquisition efficiency over a scope of tasks, with respect to priors, experience, and generalization difficulty’, although I put more emphasis on learned skills grounding the whole thing in a bottom-up way. Chollet’s paper On the measure of intelligence is a good overview of the considerations involved in defining intelligence.
For similar reasons to those given by Nathan Lambert here.
I appreciate this will be very difficult to achieve, flying in the face of all of human history. I suspect that some kind of positive-sum, interest-respecting dynamic will need to be coded into the global political system — something that absolutely eschews all talk of one party or other ‘winning’ an AI race, in favour of a vision of shared prosperity.
Some people would call this ‘corrigibility’, but I’m not going to use this term because it has a hinterland and means different things to different people. If you want to learn more about an alignment solution that specifically prioritises corrigibility, see Corrigibility as Singular Target by Max Harms.
This is not over-anthropomorphising it. It is saying that AI will expect to interact with humans in a certain way, and may act unpredictably if treated in a way that departs from those expectations. Perhaps a different word from 'rights', with less baggage, would be preferable to describe this, though.
Beren Millidge has written an interesting post about seeing alignment as a feedback control problem, although I don’t know enough about control theory to tell you how well it could slot into my scheme.
Beren Millidge has also written about the tension between instruction-following and innate values or laws.
Zvi recently said: ‘If we want to build superintelligent AI, we need it to pass Who You Are In The Dark, because there will likely come a time when for all practical purposes this is the case. If you are counting on “I can’t do bad things because of the consequences when other minds find out” then you are counting on preserving those consequences.’ My idea is to both build AI that passes Who You Are In The Dark and, given perfection is hard, permanently enforce consequences for bad behaviour.
This will be easier if it is individual copies that tend to fail, rather than whole classes of AIs at once. There might be an argument here that copies failing leads to antifragility, as some constant rate of survivable failures makes the system stronger and less likely to suffer catastrophic ones.
Thank you John Wentworth for making me aware of this.
To ‘increase market penetration, maximise brand loyalty, and enhance intangible assets’.
In extremis: ‘will this action kill everyone?’
It doesn’t make sense to me to assume AI will forever communicate with copies of itself using only natural language. The advantages of setting up higher-bandwidth channels are so obvious that I think any successful future must be robust to their existence.
This idea seems highly plausible to me: ‘having worked with many different models, there is something about a model’s own output that makes it way more believable to the model itself even in a different instance. So a different model is required as the critiquer.’ As mentioned before, if you doubt the future will be multipolar, feel free to ignore this bit. It’s not load-bearing on its own.
I’m picturing a subjective experience like when, as a parent, you play with your child and let them make all the decisions. You’re just pleased to be there and do what you can to make them happy.
Situations where humans, due to competitive pressures, don't bother to try and understand their AIs because it slows down their pursuit of power will be policed by the international institution for AI risk, and strongly disfavoured by the AIs themselves. E.g. Bob is going to get annoyed at Alice, and potentially lodge a complaint, if she doesn't bother to pay attention to his demonstrations.
This failure could be explicit, through the emergence of serious misalignment, or implicit, as humans fade into irrelevance.
That being said, the vast majority of people will be living far less consequentially than Alice.