A new version of “Intro to Brain-Like-AGI Safety” is out!
Things that have not changed
Same links as before:
…And same abstract as before:
Highlights from the changelog
So what’s new? Well, I went through the whole thing and made a bunch of edits and additions, based on what I’ve (hopefully) learned since my last big round of edits 18 months ago. Here are some highlights:
Post 1: What's the problem & Why work on it now?
What is AGI?
I updated my “what is AGI” chart, and added more elaboration on why some (not all!) LLM enthusiasts have a blind spot about just how much headroom there still is beyond today’s AI.
More responses to intelligence denialists
I added my response to another flavor of “intelligence denialism”:
Post 2: “Learning from scratch” in the brain
Better overview of the discourse
Having learned more about the range of opinions on my “learning from scratch” hypothesis, I wrote a better overview of the state of the discourse:
Plasticity
I revamped my discussion of brain plasticity, relating it to “mutable variables” in computer science:
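(If the analogy isn’t obvious, here’s a rough generic sketch of what a “mutable variable” means in this context — not an excerpt from the post, and the names below are made up purely for illustration: the update rule is fixed “source code”, while only the stored value changes with experience, loosely like a synaptic weight changing while the learning algorithm itself stays put.)

```python
# Illustrative sketch only (not from the post): a fixed update rule acting on a
# mutable stored value, as a loose analogy for plasticity.
LEARNING_RATE = 0.1               # part of the fixed "source code"

def update(weight, error):
    """Fixed rule; only the stored value it returns changes over time."""
    return weight + LEARNING_RATE * error

weight = 0.0                      # the "plastic" (mutable) variable
for error in [1.0, 0.5, -0.2]:    # a stream of experience
    weight = update(weight, error)
# weight is now ~0.13: same code as before, different stored state
```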
Interpretability
I’ve talked to a number of people who think that interpretability is a silver bullet for brain-like-AGI safety, so I added this subsection in response:
Post 3: Two subsystems: Learning & Steering
My timelines prediction
I added my extraordinarily bold prediction on when brain-like AGI will appear. In case you’re wondering, yes, I am willing to put my money where my mouth is and take bets on “between zero and infinity years”, at 1:1 odds 😉
Responses to bad takes on acting under uncertainty
I rewrote my round-up of bad takes on AGI uncertainty, including “the insane bizarro-world reversal of Pascal’s Wager”:
…And the section now has even more silly pictures! The first of these is new, the other two are old.
Post 5: The “long-term predictor”, and TD learning
More pedagogy on the toy model
Much of this post discusses a certain toy model, and many readers have struggled to follow what I was saying. I added a new three-part “preliminaries” section that will hopefully provide helpful pointers & intuitions.
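(For orientation, and not as a substitute for those preliminaries: the mechanic underneath the toy model is temporal-difference learning, where a prediction gets nudged toward a “bootstrapped” target. Here’s a generic textbook-style TD(0) sketch — not the post’s specific toy model, and the function and variable names are made up for illustration.)

```python
# Generic TD(0) value update (textbook version, not the post's toy model):
# nudge the prediction V(s) toward the bootstrapped target r + gamma * V(s').
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One temporal-difference step on a dict of state-value estimates."""
    td_error = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error

V = {}                                         # value estimates, initially empty
td_update(V, s="cue", r=1.0, s_next="after")   # V["cue"] moves toward the target
```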
Post 6: Big picture of motivation, decision-making, and RL
More on why ego-syntonic goals are in the hypothalamus & brainstem
I added a brief summary of why we should believe that desires like friendship and justice come ultimately from little hypothalamic cell groups, just like hunger and pain do, as opposed to purely from “reason” (as one high-level AI safety funder once confidently told me).
Post 10: The technical alignment problem
LLMs
LLMs are officially off-topic, but in order to keep up with the times, I keep having to mention them in more and more places. One of the changes this time was to add a subsection “Didn’t LLMs solve the Goodhart’s Law problem?”
Instrumental convergence & consequentialist preferences
I rewrote my discussion of instrumental convergence, its relation to consequentialist preferences, and the surrounding strategic situation:
What about RL today?
I added a discussion of an apparent tension: I say all this stuff about how RL agents are scary and we don’t know how to control them … and yet RL research seems to be going fine today! How do I reconcile that?
What do I mean by (technical) alignment?
Posts 10 & 11 now have a clearer discussion of what I mean by (technical) alignment:
Post 12: Two paths forward: “Controlled AGI” and “Social-instinct AGI”
What exactly is the RL training environment?
I added a subsection clarifying that, unlike in normal RL practice, we get to choose the source code for AGI but we don’t really get to choose the training environment:
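(To make the contrast vivid: in today’s RL practice the researcher picks both the agent’s source code and the training environment — here’s a minimal sketch using the Gymnasium API, purely as an illustration of that standard setup, whereas for brain-like AGI the “environment” is just whatever slice of the real world the AGI finds itself in.)

```python
# Minimal sketch of standard RL practice (Gymnasium API): the researcher chooses
# *both* the agent's source code and the training environment. For brain-like
# AGI, only the first choice is really ours.
import gymnasium as gym

env = gym.make("CartPole-v1")            # researcher-chosen environment
obs, info = env.reset(seed=0)
for _ in range(200):
    action = env.action_space.sample()   # stand-in for the researcher-chosen "source code"
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```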
Post 15: Conclusion: Open problems, how to help, AMA
“Reward Function Design”
I added the RL subfield of “reward function design” as an 8th concrete research program that I endorse people working on:
Conclusion
Those were just highlights; there were many other small changes and corrections. The blog version has more detailed changelogs after each post. I’m happy for any feedback, either as blog comments or by email, DM, etc.!