Sam Clarke

Wiki Contributions


Inner Alignment: Explain like I'm 12 Edition

Part of me thinks: I was trying to push on whether it has a world model or rather has just memorised loads of stuff on the internet and learned a bunch of heuristics for how to produce compelling internet-like text. For me, "world model" evokes some object that has a map-territory relationship with the world. It's not clear to me that GPT-3 has that.

Another part of me thinks: I'm confused. It seems just as reasonable to claim that it obviously has a world model that's just not very smart. I'm probably using bad concepts and should think about this more.

Inner Alignment: Explain like I'm 12 Edition

It looks good to me!

This is already true for GPT-3

Idk, maybe...?

Inner Alignment: Explain like I'm 12 Edition

Re the argument for "Why internalization might be difficult", I asked Evan Hubinger for his take on your rendition of the argument, and he thinks it's not right.

Rather, the argument that Risks from Learned Optimization makes that internalization would be difficult is that:

  • ~all models with good performance on a diverse training set probably have to have a complex world model already, which likely includes a model of the base objective,
  • so having the base objective re-encoded in a separate part of the model that represents its objective is just a waste of space/complexity.

Especially since this post is now (rightly!) cited in several introductory AI risk syllabi, it might be worth correcting this, if you agree it's an error.

Inner Alignment: Explain like I'm 12 Edition

Edit: or do you just mean that even though you take the same steps, the two feel different because retreating =/= going further along the wall

Yeah, this — I now see what you were getting at!

Late 2021 MIRI Conversations: AMA / Discussion

One argument for alignment difficulty is that corrigibility is "anti-natural" in a certain sense. I've tried to write out my understanding of this argument, and would be curious if anyone could add or improve anything about it.

I'd be equally interested in any attempts at succinctly stating other arguments for/against alignment difficulty.

Inner Alignment: Explain like I'm 12 Edition

Instead of "always go left", how about "always go along one wall"?

Yeah, maybe better, though it still doesn't quite capture the "backing up" part of the algorithm. Maybe "I explore all paths through the maze, taking left-hand turns first, backing up if I reach a dead end"... that's a bit verbose, though.
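For what it's worth, the "explore all paths, left-hand turns first, backing up at dead ends" strategy is just depth-first search with a fixed neighbour ordering. A minimal sketch (the grid encoding, function name, and example maze are my own invention for illustration, not from the original discussion):

```python
def solve_maze(maze, start, goal):
    """Depth-first search over a grid maze, where '#' marks walls.

    Neighbours are tried in a fixed order (left, up, right, down),
    and we back up (pop the path) when a cell has no unvisited open
    neighbour. Returns the path found as a list of cells, or None.
    """
    rows, cols = len(maze), len(maze[0])
    visited = set()
    path = []

    def dfs(cell):
        r, c = cell
        if (cell in visited
                or not (0 <= r < rows and 0 <= c < cols)
                or maze[r][c] == "#"):
            return False
        visited.add(cell)
        path.append(cell)
        if cell == goal:
            return True
        # Fixed ordering: "left-hand turns first"
        for dr, dc in [(0, -1), (-1, 0), (0, 1), (1, 0)]:
            if dfs((r + dr, c + dc)):
                return True
        path.pop()  # dead end: back up
        return False

    return path if dfs(start) else None
```

The fixed neighbour ordering is what makes this deterministic; the recursion stack plus the `path.pop()` is the "backing up" part that the shorter slogans leave out.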

I don't think there is a difference.


Inner Alignment: Explain like I'm 12 Edition

Another small nitpick: the difference, if any, between proxy alignment and corrigibility isn't explained. The concept of proxy alignment is introduced in subsection "The concept" without first defining it.

You are probably underestimating how good self-love can be

I've since been told about Tasshin Fogleman's guided metta meditations, and have found their aesthetic to be much more up my alley than the others I've tried. I'd expect others who prefer a more rationalist-y aesthetic to feel similarly.

The one called 'Loving our parts' seems particularly good for self-love practice.

Inner Alignment: Explain like I'm 12 Edition

I still find that the arguments that inner misalignment is plausible rely on intuitions that feel quite uncertain to me (though I'm convinced that inner misalignment is possible).

So, I currently tend to prefer the following as the strongest "solid, specific reason to expect dangerous misalignment":

We don't yet have training setups that incentivise agents to do what their operators want, once they are sufficiently powerful.

Instead, the best we can currently do is naive reward modelling, and agents trained in this way are obviously incentivised to seize control of the memory cell where their reward is implemented (and eliminate anyone who might try to interfere with this) once they're sufficiently powerful, because that will allow them to get much higher scores, much more easily, than actually bringing about complicated changes to the world.

Meanwhile, AI capabilities are marching on scarily fast, so we probably don't have that much time to find a solution. And it's plausible that a solution will be very difficult because corrigibility seems "anti-natural" in a certain sense.

Curious what you think about this?

Comments on Carlsmith's “Is power-seeking AI an existential risk?”

Re: corrigibility being "anti-natural" in a certain sense - I think I have a better understanding of this now:

  • Eventually, we need to train an AI system capable enough to enable a pivotal act (in particular, actions that prevent the world from being destroyed by any other future AGI)
  • AI systems that are capable enough to enable a pivotal act must be (what Eliezer calls) a “consequentialist”: a system that “searches paths through time and selects high-scoring ones for output”
  • Training an aligned/corrigible/obedient consequentialist is something that Eliezer can’t currently see a way of doing, because it seems like a very unnatural sort of system. This makes him pessimistic about our current trajectory. The argument here seems kinda like a more subtle version of the instrumental convergence thesis. We want to train a system that:
    • (1) searches for (and tries to bring about) paths through time that are robust enough to hit a narrow target (enabling a pivotal act and a great future in general)
    • but also (2) is happy for certain human-initiated attempts to change that target (modify its goals, shut it down, etc.)
  • This seems unnatural and Eliezer can’t see how to do it currently.
  • An exacerbating factor is that even if top labs pursue alignment/corrigibility/obedience, they will either be mistaken in having achieved it (because it's hard), or honestly panic about not having achieved it and halt, by which point a runner-up who doesn't understand the importance of alignment/corrigibility/obedience deploys their system, which destroys the world.
  • (This is partly based on this summary)