It'd be interesting to see whether it performs worse if it only plays one side and the other side is played by a human. (I'd expect so.)

I want to preregister my prediction that Sydney will be significantly worse in significantly longer games (e.g. I'd expect it to often make illegal or nonsensical moves by around move 50), though I'm already surprised that it apparently works up to 30 moves. Unfortunately I don't have time to test it, but it'd be interesting to learn whether I am correct.

Likewise, for some specific programs we can verify that they halt.

(Not sure if I'm missing something, but my initial reaction:)

There's a big difference between being able to verify for some specific programs if they have a property, and being able to check it for all programs.

For an arbitrary TM, we cannot check whether it outputs a correct solution to a specific NP-complete problem. We cannot even check that it halts! (Rice's theorem etc.)
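To make the distinction concrete, here is a minimal sketch (my own illustration, not something from the parent comment). It models a "program" as a Python generator that yields once per step, and the names `halts_within`, `countdown` and `loop_forever` are hypothetical. Running a program under a step budget can confirm that it halts, but when the budget runs out the only honest answer is "unknown", which is exactly why this does not scale to a check for all programs.

```python
from typing import Callable, Iterator, Optional


def halts_within(program: Callable[[], Iterator[None]], max_steps: int) -> Optional[bool]:
    """Return True if `program` finishes within `max_steps` steps.

    Returns None (unknown) if the budget runs out: the program might
    halt later or run forever, and this check cannot tell which.
    """
    steps = program()
    for _ in range(max_steps):
        try:
            next(steps)  # execute one step of the program
        except StopIteration:
            return True  # the program finished: halting is verified
    return None  # budget exhausted: no conclusion either way


def countdown() -> Iterator[None]:
    """A specific program that obviously halts (after ten steps)."""
    n = 10
    while n > 0:
        n -= 1
        yield


def loop_forever() -> Iterator[None]:
    """A specific program that never halts."""
    while True:
        yield


if __name__ == "__main__":
    print(halts_within(countdown, 100))      # halting verified
    print(halts_within(loop_forever, 1000))  # unknown, not "does not halt"
```

Note that the function never returns False: no finite budget can establish non-halting for an arbitrary program.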

Not sure what alignment-relevant claim you wanted to make, but I doubt this is a valid argument for it.

Thank you! I'll likely read your paper and get back to you. (Hopefully within a week.)

From reading your comment, my guess is that the main disagreement may be that I think powerful AGI will need to be consequentialist. E.g., to achieve something that humans cannot do yet, you need to search for that target in some way, i.e. have some consequentialist cognition, i.e. do some optimization. (So what I mean by consequentialism is just having some goal to search for / update toward, in contrast to just executing fixed patterns. I think that's how Yudkowsky means it, but I'm not sure if that's what most people mean when they use the term.) (Though note that this doesn't imply that you need so much consequentialism that we won't be able to shut down the AGI. But as I see it, a theoretical solution to corrigibility needs to deal with consequentialism. I haven't looked into your paper yet, so it's quite possible that my comment here will appear misguided.)

E.g. if we just built a gigantic transformer and trained it on all human knowledge (and say we have a higher sample efficiency or so), it is possible that it could do almost everything humans can do. But it won't be able to just one-shot solve quantum gravity or so when we give it the prompt "solve quantum gravity". There is no runtime updating/optimization going on, i.e. the transformer is non-consequentialist. All optimization happened through the training data or gradient descent. Either the human training data was already sufficient to encode a solution to quantum gravity in the patterns of the transformer, or it wasn't.

It is theoretically possible that the transformer learns somewhat deeper underlying patterns than humans have (though I do not expect that from something like the transformer architecture), and is thus able to generalize a bit further than humans. But it seems extremely unlikely that it learned such a deep understanding that it already has the solution to quantum gravity encoded, although it was never explicitly trained to learn that and just read physics papers.

The transformer might be able to solve quantum gravity if it can recursively query itself to engineer better prompts, or if it can give itself feedback which is then somehow converted into gradient descent updates and then try multiple times. But in those cases there is consequentialist reasoning again. The key point: Consequentialism becomes necessary when you go beyond human level.

Out of interest, how much do you agree with what I just wrote?

Hi Koen, thank you very much for writing this list!

I must say I'm skeptical that the technical problem of corrigibility as I see it is really solved already. I see the problem of corrigibility as shaping consequentialist optimization in a corrigible way. (Yeah, that's not at all a clear definition yet. I'm still deconfusing myself about that, and I'll likely publish a post clarifying the problem as I see it within the next month.)

So e.g. corrigibility from non-agenthood is not a possible solution to what I see as the core problem. I'd expect that the other solutions here may likewise only give you corrigible agents that cannot do new, very impressive things (or if they can, they might still kill us all).

But I may be wrong. I probably only have time to read one paper. So: What would you say is the strongest result we have here? If I looked at one paper/post and explained why it isn't a solution to corrigibility as I see it, for which paper would it be most interesting for you to see what I write? (I guess I'll do it sometime this week if you write me back, but no promises.)

Also, from your perspective, how big is the alignment tax for implementing corrigibility? E.g. is it mostly just more effort implementing and supervising? Or does it also take more compute to get the same impressive result done? If so, how much? (Best take an example task that is preferably a bit too hard for humans to do. That makes it harder to reason about it, but I think this is where the difficulty is.)

Huh, interesting. Could you give some examples of people who seem to claim this, and if Eliezer is among them, where he seems to claim it? (Would just interest me.)

In case some people relatively new to LessWrong aren't aware of it (and because I wish I had found this out earlier): "Rationality: From AI to Zombies" does not nearly cover all of the posts Eliezer published between 2006 and 2010.

Here's how it is:

  • "Rationality: From AI to Zombies" probably contains like 60% of the words EY has written in that timeframe and the most important rationality content.
  • The original sequences are basically the old version of the collection that is now "Rationality: A-Z", containing a bit more content. In particular a longer quantum physics sequence and sequences on fun theory and metaethics.
  • All EY posts from that timeframe (or here for all EY posts until 2020 I guess) (also can be found on lesswrong, but not in any collection I think).

So a sizeable fraction of EY's posts are not in a collection.

I just recently started reading the rest.

I strongly recommend reading:

And generally a lot of posts on AI (i.e. primarily posts in the AI foom debate) are not in the sequences. Some of them were pretty good.

I feel like many people look at AI alignment like they think the main problem is being careful enough when we train the AI so that no bugs cause the objective to misgeneralize.

This is not the main problem. The main problem is that it is likely significantly easier to build an AGI than to build an aligned AI or a corrigible AI. Even if it's relatively obvious that AGI design X destroys the world, and all the wise actors don't deploy it, we cannot prevent unwise actors from deploying it a bit later.

We currently don't have any approach to alignment that would work even if we managed to implement everything correctly and had perfect datasets.

I'd guess we can reliably identify some classes of pivotal acts where we cannot be fooled easily, and would only accept suggestions from those classes, and I'd still intuitively expect that there are doable pivotal acts in those classes.

