> Or maybe you’re saying that the second bullet could happen, but it’s irrelevant to AGI risk because of “nearest unblocked strategy problems”?
I mean, nearest unblocked strategies are mainly a problem in the optimistic case where the AI does learn "don't be misleading", but given that, yeah, sort of (though I wouldn't say irrelevant, only that even if you have a "don't be misleading" preference it's not a robust solution). Not that it's impossible to get it right so that the AI behaves as desired, but I think current proposals aren't specified concretely enough for us to say they don't run into undesirable nearest unblocked strategies.
One particular problem is that preferences which aren't over world trajectories aren't robust:
Preferences over world trajectories are robust in the sense that if you imagine a plan that changes that preference, the plan ranks poorly according to the preference itself.
Myopic preferences that just trigger given a context aren't robust in that sense - they don't assign negative value to suggestions of removing that preference for future occasions.
Say I need to walk to work, but the fastest route goes through a passage that smells really bad, so it's unpleasant to walk through. When I then think of a plan like "I can wear a mask that filters the air so I don't smell anything bad", this plan doesn't get rejected by the myopic preference.
A preference over world trajectories, which yields significant negative utility for every time I walk through a passage that smells bad, would be more robust in this sense.
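To make the distinction concrete, here's a tiny toy sketch in Python (my own framing of the example above, with made-up variable names, not anything from the post): the trajectory preference evaluates the whole plan and still penalizes it, while the context-triggered preference has nothing to object to.

```python
# Toy illustration: a preference over world trajectories evaluates whole plans,
# so a plan that circumvents the preference still ranks poorly; a context-triggered
# preference only reacts to the current situation and cannot veto such plans.

def trajectory_utility(trajectory):
    # -1 for every timestep in which I walk through the bad-smelling passage,
    # regardless of whether I still notice or mind the smell at that time.
    return sum(-1 for step in trajectory if step["walks_through_smelly_passage"])

def myopic_aversion(current_context):
    # Fires only on unpleasantness present right now; it never looks at future
    # timesteps, so it has nothing to say about plans that stop it from triggering later.
    return -1 if current_context["currently_smells_something_bad"] else 0

# Plan: "wear the mask (or drop the aversion), then take the smelly shortcut
# every day for 100 days".
plan = [{"walks_through_smelly_passage": True,
         "currently_smells_something_bad": False}] * 100

print(trajectory_utility(plan))   # -100: the plan still ranks poorly
print(myopic_aversion(plan[0]))   # 0: nothing objects
```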
So I currently think the relevant preferences are preferences over world trajectories, and other more general kinds of preferences are better modeled as constraints for the world-trajectories-valuing part to optimize around. I know humans often have short-term preferences that get triggered myopically, but behind humans' very impressive accomplishments there was probably a more long-term, coherent goal being aimed at.
I don't know exactly how you imagine a "don't be misleading" preference manifesting, but I imagined it more like the myopic smell preference, in which case there's optimization pressure from the more long-term coherent parts to remove this myopic preference / prevent it from triggering. (To be clear, it's not that this would be useless; it could still suffice to make the first working plan in the search ordering a desirable one, especially if the task we want the AI to do isn't absurdly difficult.)
(But even if it takes the more world-trajectory form of "I value not being misleading" - which would be good because it would incentivize planning to maintain that preference - there may still be problems, because "not being misleading" is a fuzzy concept that has to be rebound to a more precise concept in order to evaluate plans, and it might not rebind in a desirable way. And we haven't yet specified how to trade off the "not being misleading" value against other goals.)
I do think there's optimization pressure toward not being caught being misleading, but I think it comes from planning how to achieve other goals while modelling reality accurately, rather than from the AI learning to directly value "don't get caught being misleading" in its learned value function.
Possibly the AI could still learn to value this (or alternatively to value "don't be misleading"), but in that case these value shards seem more like heuristic value estimators applied to particular situations than a deeply coherent utility specification over universe-trajectories. And I think such preferences probably stop mattering much when you crank up intelligence past the human level, because they will be treated as constraints to be optimized around by the more coherent value parts, and you run into nearest unblocked strategy problems. (I mean, you could have a preference over universe trajectories that at no timestep you be misleading, but given the learning setup I would expect a shallower version of that preference to be learned. Though it's also conceivable that the AI rebinds its intuitive preference to yield that kind of coherent preference.)
So basically I think it's not enough to get the AI to learn some "don't be misleading" value shard: (1) it might be outvoted by other shards in cases where being misleading would be very beneficial, and (2) the optimization for other goals might find edge instantiations that are basically still misleading but don't get classified as such. So we'd need it to be learned in exactly the right way.
(I have an open discussion thread with Steve on his "Consequentialism and Corrigibility" post, where I mainly argue that Steve is wrong about Yud's consequentialism being just about future states, and that it is instead about values over universe trajectories like in the corrigibility paper. IIUC Steve thinks that one can have "other kinds of preferences" as a way to get corrigibility. He unfortunately didn't manage to make it clear to me what such a preference might concretely look like, but one possibility is that he is thinking of such "accessor of the current situation" kinds of preferences, since humans have such short-term preferences in addition to their consequentialist goals. But I think when one cranks up intelligence, the short-term values don't matter that much. E.g. the AI might do some kind of exposure therapy to cause the short-term value shards to update to intervene less. Or maybe he just means we can have a coherent utility over universe trajectories whose optimum is indeed a non-deceptive strategy, which is true but not really a solution, because such a utility function may be complex and he didn't specify how exactly the tradeoffs should be made.)
Great post! The over- vs. undersculpting distinction currently seems a lot nicer to me than I ever found the outer- vs. inner-alignment distinction.
Some comments:
1:
The "over-/undersculpting" terminology seems a bit imperfect because it seems like there might be a golden middle, whereas actually we have both problems simultaneously. But maybe it's fine because we sorta want sth in the middle, it's just that hitting a good middle isn't enough. And it does capture well that having more of one problem might lead to having less of the other problem.
2:
> The human world offers an existence proof. We’re often skeptical of desire-changes—hence words like “brainwashing” or “indoctrination”, or radical teens telling their friends to shoot them if they become conservative in their old age. But we’re also frequently happy to see our desires change over the decades, and think of the changes as being for the better. We’re getting older and wiser, right? Well, cynics might suggest that “older and wiser” is cope, because we’re painting the target around the arrow, and anyway we’re just rationalizing the fact that we don’t have a choice in the matter. But regardless, this example shows that the instrumental convergence force for desire-update-prevention is not completely 100% inevitable—not even for smart, ambitious, and self-aware AGIs.
This might not generalize to super-von-Neumann AGIs. Normal humans are legit not optimizing hard enough to come up with the strategy of trying to preserve their goals in order to accomplish their goals.
Finding a reflectively stable motivation system that doesn't run into the goal-preservation instrumental incentive is what MIRI tried in their corrigibility agenda. They failed because it turned out to be unexpectedly hard. I'd say that makes it unlikely that an AGI will fall into such a reflectively-stable corrigibility basin when scaling up intelligence a lot, even when we try to make it think in corrigible ways. (Though there's still hope for keeping the AI correctable if we keep it limited and unreflective in some ways etc.)
3:
> As an example (borrowing from my post “Behaviorist” RL reward functions lead to scheming), I’m skeptical that “don’t be misleading” is really simpler (in the relevant sense) than “don’t get caught being misleading”. Among other things, both equally require modeling the belief-state of the other person. I’ll go further: I’m pretty sure that the latter (bad) concept would be learned first, since it’s directly connected to the other person’s immediate behavior (i.e., they get annoyed).
I (tentatively) disagree with the frame here, because "don't get caught being misleading" isn't a utility-shard over world-trajectories, but rather just a myopic value accessor on the model of the current situation (IIUC). I think it's probably correct that humans usually act based on such myopic value accessors, but in cases where very hard problems need to be solved, what matters are the more coherent, situation-independent values. So my story for why the AI would be misleading is rather that it plans how to best achieve something, and being misleading without getting caught is a good strategy for that.
I mean, there might still be myopic value accessor patterns, though my cached reply would be that these would just be constraints being optimized around by the more coherent value parts, e.g. by finding a plan representation where the myopic pattern doesn't trigger - aka the nearest unblocked strategy problem. (This doesn't matter here because we agree it would learn "don't get caught", but it's possible that we still have a disagreement here, like in the case of your corrigibility proposal.)
I listened to it via Speechify (though you need the Pro version for an acceptable listening speed). If you want something better, you could try asking AskWhoCastsAI (possibly offering to pay him).
Seems like a fine time to share my speculations about yet-unresolved easter eggs from the story. I'm not overly confident in either of these.
I present some hints first in case you want to try to think about it yourself.
From chapter 122:
> Harry took the Elder Wand out of his robes, gazed again at the dark-grey wood that Dumbledore had passed down to him. Harry had tried to think faster this time, he'd tried to complete the pattern implied by the Cloak of Invisibility and the Resurrection Stone. The Cloak of Invisibility had possessed the legendary power of hiding the wearer, and the hidden power of allowing the wearer to hide from Death itself in the form of Dementors. The Resurrection Stone had the legendary power of summoning an image of the dead, and then Voldemort had incorporated it into his horcrux system to allow his spirit to move freely. The second Deathly Hallow was a potential component of a system of true immortality that Cadmus Peverell had never completed, maybe due to his having ethics.
>
> And then there was the third Deathly Hallow, the Elder Wand of Antioch Peverell, that legend said passed from wizard to stronger wizard, and made its holder invincible against ordinary attacks; that was the known and overt characteristic...
>
> The Elder Wand that had belonged to Dumbledore, who'd been trying to prevent the Death of the world itself.
>
> The purpose of the Elder Wand always going to the victor might be to find the strongest living wizard and empower them still further, in case there was any threat to their entire species; it could secretly be a tool to defeat Death in its form as the destroyer of worlds.
>
> But if there was some higher power locked within the Elder Wand, it had not presented itself to Harry based on that guess. Harry had raised up the Elder Wand and spoken to it, named himself a descendant of Peverell who accepted his family's quest; he'd promised the Elder Wand that he would do his best to save the world from Death, and take up Dumbledore's duty. And the Elder Wand had answered no more strongly to his hand than before, refusing his attempt to jump ahead in the story. Maybe Harry needed to strike his first true blow against the Death of worlds before the Elder Wand would acknowledge him; as the heir of Ignotus Peverell had already defeated Death's shadow, and the heir of Cadmus Peverell had already survived the Death of his body, when their respective Deathly Hallows had revealed their secrets.
>
> At least Harry had managed to guess that, contrary to legend, the Elder Wand didn't contain a core of 'Thestral hair'. Harry had seen Thestrals, and they were skeletal horses with smooth skin and no visible mane on their skull-like heads, nor tufts on their bony tails. But what core was truly inside the Elder Wand, Harry hadn't yet felt himself knowing; nor had he been able to find, anywhere on the Elder Wand, the circle-triangle-line of the Deathly Hallows that should have been present.
Previously, in the Azkaban arc, it was also mentioned that the sign of the Deathly Hallows on the invisibility cloak was drawn in thestral blood, binding that part of the thestral's power into the cloak, to make the wearer as invisible to Death's shadow as thestrals are to the unknowing.
Suppose there's some structure to it, try to fill out this table:
| Hallow | Creature | Form of Death |
|---|---|---|
| Invisibility Cloak | Thestral | Death's shadow (= Dementors) |
| Resurrection Stone | ? | Personal Death |
| Elder Wand | ? | maybe Death of Worlds (?) |
My guess
| Hallow | Creature | Form of Death |
|---|---|---|
| Invisibility Cloak | Thestral | Death's shadow (= Dementors) |
| Resurrection Stone | Unicorn | Personal Death |
| Elder Wand | Centaur | Death of Worlds |
So the second power of the Elder Wand may be some divination power. That would fit well with preventing the Death of Worlds, although it's a bit unclean to have two explanations for Dumbledore's divination power.
From chapter 86 (emphasis mine):
"The Hall of Prophecy," Minerva whispered. She'd read about that place, said to be a great room of shelves filled with glowing orbs, one after another appearing over the years. Merlin himself had wrought it, it was said; the greatest wizard's final slap to the face of Fate. Not all prophecies conduced to the good; and Merlin had wished for at least those spoken of in prophecy, to know what had been spoken of them. That was the respect Merlin had given to their free will, that Destiny might not control them from the outside, unwitting. Those mentioned within a prophecy would have an glowing orb float to their hand, and then hear the prophet's true voice speaking. Others who tried to touch an orb, it was said, would be driven mad - or possibly just have their heads explode, the legends were unclear on this point. Whatever Merlin's original intention, the Unspeakables hadn't let anyone enter in centuries, so far as she'd heard. Works of the Ancient Wizards had stated that later Unspeakables had discovered that tipping off the subjects of prophecies could interfere with seers releasing whatever temporal pressures they released; and so the heirs of Merlin had sealed his Hall.
From chapter 119:
> During the First Wizarding War, there came a time when I realised that Voldemort was winning, that he would soon hold all within his hand.
>
> In that extremity, I went into the Department of Mysteries and I invoked a password which had never been spoken in the history of the Line of Merlin Unbroken, did a thing forbidden and yet not utterly forbidden.
>
> I listened to every prophecy that had ever been recorded.
Confusion: Accessing the Hall of Prophecy doesn't sound like something that would be happening for the first time in the history of the Line of Merlin Unbroken.
Notice: Dumbledore's letter does not strictly say that the forbidden thing Dumbledore did was listening to all the prophecies. Those statements could refer to separate events.
Another useful excerpt from ch 80 (emphasis mine):
> This is the Hall of the Wizengamot; there are older places, but they are hidden. Legend holds that the walls of dark stone were conjured, created, willed into existence by Merlin, when he gathered the most powerful wizards left in the world and awed them into accepting him as their chief. And when (the legend continues) the Seers continued to foretell that not enough had yet been done to prevent the end of the world and its magic, then (the story goes) Merlin sacrificed his life, and his wizardry, and his time, to lay in force the Interdict of Merlin.
From chapter 110:
"Distraction? " roared Dumbledore, his sapphire eyes tight with fury. "You killed Master Flamel for a distraction? "
Professor Quirrell looked dismayed. "I am wounded by the injustice of your accusation. I did not kill the one you know as Flamel. I simply commanded another to do so."
"How could you? Even you, how could you? He was the library of all our lore! Secrets you have forever lost to wizardry! "
Confusion: Dumbledore seems a bit more magically powerful than Voldemort, so even without the Elder Wand he should probably still be nearly as powerful as Voldemort. Magical power comes mostly from lore, so if Dumbledore's lore comes from Flamel, then it's a bit surprising that Voldemort was able to just order someone to kill Flamel.
So how would you resolve those confusions given the hints I dropped here?
Last hint:
The method to trap objects or people in a timeless space in the mirror is called "Merlin's method".
My guess
Merlin trapped himself in the mirror. The forbidden password Dumbledore spoke allowed him to talk to Merlin through the mirror. Merlin gave Dumbledore additional lore to fight Voldemort. Voldemort likely figured this out while he was trapped for 9 years.
(This also means that once Harry figures this out, he can read the forbidden letter in the Department of Mysteries and use the technique to (at least temporarily) retrieve Dumbledore from the mirror. (Yeah, I know Dumbledore said he couldn't retrieve Voldemort, but I think that's just because Dumbledore doesn't want to, and wanting to is a requirement for the mirror.))
Thanks, will edit!
Other Eliezerfics that come to mind are:
I'd be interested in trying thinking assistants to help me with my work. The main time window where I'd want that would probably be 10:30am-2:30pm CEST (with a 1h break in the middle), but I'm slightly flexible. (Feel free to PM me about this even if you're reading this a year or so after I posted it.)
I'm working on a long-term non-ML alignment agenda (and also on leveling up my rationality) for which I'm currently doing introspection and concrete analyses of how I solve problems.
> There are two ways confirmation bias works. One is that it's easier to think of confirming evidence than disconfirming evidence. The associative links tend to be stronger. When you're thinking of a hypothesis you tend to believe, it's easy to think of evidence that supports it.
>
> The stronger one is that there's a miniature Ugh field[1] surrounding thinking about evidence and arguments that would disprove a belief you care about. It only takes a flicker of a thought to make the accurate prediction about where considering that evidence could lead: admitting you were wrong, and doing a bunch of work re-evaluating all of your related beliefs. Then there's a little unconscious yuck feeling when you try to pay attention to that evidence.
I usually like to reserve "confirmation bias" for the first and "motivated reasoning" for the second.
Also, I'd rather phrase the first one like: our expectations influence our information processing in a way that makes confirming evidence more salient, so we update on it more.
I'm still a bit confused about why this is the case. Your explanation (the associative links to disconfirming evidence are weaker) seems quite plausible, but if so, I'd still like to understand why those links are weaker.
On priors I would rather have expected the brain to use surprise-propagation algorithms that promote to attention information which doesn't fit our existing models, since that is the most relevant information to update on.
I'd be interested in more precise models of confirmation bias.
It's not at all obvious to me that motivated reasoning is worse than the first kind of confirmation bias (they might both be really devastating).
Can you make "sort by magic" the default sort for comments under a post? Here's why:
The problem: Commenting late on a post (after the main reading peak) is disincentivized, not only because fewer people will read the post and look over the comments, but also because most people only look over the top-scoring comments and won't scroll down far enough to read your new comment. This also causes early good comments to continue to accumulate karma because more people read them, so the usual equilibrium is that early good comments stay on top and late good comments don't really get noticed.
Also, what one cares about for sorting is the quality of a comment, and the natural estimator for that would be upvotes per view. I don't know how you calculate magic, but it seems very likely to be a better proxy for this than sorting by top score. (If magic doesn't seem adequate and you track page viewcounts, you could also get a more principled new magic sort, though you'd have to track, for each comment, the page's viewcount at the time the comment was posted. For example, if the site-wide average ratio of upvotes per view is a/b, you could assign each comment a score of `(upvotes + a) / (page_views_since_comment_was_posted + b)` and sort descending by that score.)
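As a rough sketch of what I mean (the field names like `views_since_posted` and the prior numbers are made up; I don't know what data you actually track):

```python
from dataclasses import dataclass

@dataclass
class Comment:
    id: str
    upvotes: int
    views_since_posted: int  # hypothetical: page views of the post since this comment was posted

def smoothed_score(c: Comment, prior_upvotes: float, prior_views: float) -> float:
    """Upvotes-per-view estimate, shrunk toward the site-wide average ratio
    prior_upvotes / prior_views, so comments with few views so far start
    near the average instead of at 0."""
    return (c.upvotes + prior_upvotes) / (c.views_since_posted + prior_views)

def sort_comments(comments: list[Comment],
                  prior_upvotes: float = 3.0,
                  prior_views: float = 100.0) -> list[Comment]:
    # Sort descending by the smoothed upvotes-per-view score.
    return sorted(comments,
                  key=lambda c: smoothed_score(c, prior_upvotes, prior_views),
                  reverse=True)

comments = [
    Comment("early_good", upvotes=40, views_since_posted=2000),  # high karma, many views
    Comment("late_good", upvotes=5, views_since_posted=80),      # better per-view ratio
    Comment("brand_new", upvotes=0, views_since_posted=5),       # basically no signal yet
]
print([c.id for c in sort_comments(comments)])
# ['late_good', 'brand_new', 'early_good']
```

The priors here are placeholder numbers; the point is just that a late comment with a good upvotes-per-view ratio can outrank an early one that accumulated karma mostly through exposure.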