MATS 8.1 scholar, mentored by Micah Carroll. Former SWE at Google Gemini. Founding president of Cornell Effective Altruism.
I have not signed any contracts that I can't mention exist, as of December 20, 2025. I'll try to update this statement at least once a year, so long as it's true. I added this statement thanks to the one in the gears to ascension's bio.
Maybe this is partially because the AI is biased towards finishing its task quickly? During training, it would eventually be cut off if it took too long, so it's motivated to stop early and assert that it's finished even when that isn't true.
Was it definitely necessary to give it a lot of hints about what to do, or do you think it could've succeeded if you just repeatedly said "you missed something, try again"? If hints were really needed, what kinds of things did it need hints for?
I wonder if it would help to break down the task into multiple pieces. You could try:
Why... can't we take them back?
Well, now that I think about it, I'm not sure what scenario I should be imagining here.
Scenario 1: if genetic interventions became popular enough that the entire world were getting 10 IQ points smarter each generation, as you say, then it seems obvious to me that you'd be unable to take it back. Surely the first generation of superbabies would want future generations to be like them. If their parents say "actually, y'all were a mistake, let our grandchildren mean-regress and be normal please," they'd simply refuse.
Scenario 2: more realistically IMO, we start with a generation of a few thousand superbabies, who are the children of rationalist-type people who really care about intelligence. Maybe these people grow up very smart but very weird, and they are unable to shape society to their weird preferences because there aren't that many of them.
But wait, many people view these genetic interventions as our best hope to save the world... Do we expect that the superbabies are going to be smart enough to make the critical difference in solving AI alignment, but we don't expect they'll gain enough influence to significantly affect society's future values? Seems unlikely to me.
For better or for worse, the second scenario is basically already playing out - you have people like Elon Musk and Mark Zuckerberg who got their power by being very smart, who now get to shape the world in their own weird ways. Powerful people are already optimized for being intelligent via selection effects; genetic optimization would just be another layer on top of that.
People who want genius superbabies: how worried are you about unintended side effects of genetic interventions on personality?
Even if we assume genetically modified babies will all be very healthy and smart on paper, genes that are correlated with intelligence might affect hard-to-measure but important traits. For example, they might alter aesthetic taste, emotional capacity, or moral/philosophical intuitions. From the subjective perspective of an unmodified human, these changes are likely to be "for the worse."
If you pick your child's genes to maximize their IQ (or any other easily-measurable metric), you might end up with the human equivalent of a benchmaxxed LLM with amazing test scores but terrible vibes.
I'd be hesitant to hand off the future to any successors that are super far off-distribution from baseline humans. Once they exist, we obviously can't just take them back. And in the case of superbabies, we'd have to wait decades to find out what they're like once they've grown up.
Using AI can help you get things done faster, even if it's worse than you at coding
Yeah, I guess it is ambiguous. I agree people should be more careful about this.
For what it's worth, this is a bullet point in the description of the "woo stuff" meetup that Eliezer was responding to:
Rats are particularly drawn to certain woo practices (jhanas and meditation, circling and authentic relating, psychedelics) while rejecting others (astrology, reiki, palm reading). What principles do you think determine which practices get adopted? Can we characterize this selection process as rational, meta-rational, level three midwittery, or some other thing?
So if the meetup was about these "certain woo practices" and Eliezer doesn't think they're all bad, there's his answer.
Many "woo" things have nothing to do with holding false beliefs. Even CFAR does stuff like Focusing, which I'd consider pretty solidly woo.
The good kinds of woo are about learning to use your emotions / System 1. Those things are very useful, which I found surprising. In retrospect, it should've been pretty obvious that evolution gave us complex emotions for good reason.
For cerebral people with a very developed System 2, there's often a ton of low-hanging fruit in using their System 1, and woo can teach them to pick it. Applying a bit of System 1 can often solve your problems more effectively than "just use System 2 even harder!"
You can get a lot of use out of pretending to push emotions around your body in the form of "energy" without literally believing it, probably similar to how you can push a pseudo-prediction to your brain to get yourself out of bed. And in fact, I think meditation-style practices are a promising avenue for anti-akrasia.
Yeah, all I meant was that it seems like MIRI is not that close to reaching $1.6 million in donations. If they were going to make $1.6 million anyway, then a marginal donation would not cause SFF to donate more.
The first $1.6M will be matched 1:1 by Survival and Flourishing Fund. It seems plausible that donations right now could actually cause counterfactual matching, which is good if you think MIRI is better than whatever SFF otherwise would have funded.
Point taken about continual learning - I conveniently ignored this, but I agree that scenario is pretty likely.
If AIs were weak reward-optimizers, I think that would solve inner alignment better than if they were strong reward-optimizers. "Inner alignment asks the question: How can we robustly aim our AI optimizers at any objective function at all?" I.e. the AI shouldn't misgeneralize the objective function you designed, but it also shouldn't bypass the objective function entirely by overwriting its reward.
By "hardening," do you mean something like "making sure the AI can never hack directly into the computer implementing RL and overwrite its own reward?" I guess you're right. But I would only expect the AI to acquire this motivation if it actually saw this sort of opportunity during training. My vague impression is that in a production-grade RL pipeline, exploring into this would be really difficult, even without intentional reward hardening.
Assuming that's right, thinking the AI would have a "strong reward-optimizing" motivation feels pretty crazy to me. Maybe that's my true objection.
Like, I feel like no one actually thinks the AI will care about literal reward, for the same reason that no one thinks evolution will cause humans to want to spend their days making lots of copies of their own DNA in test tubes.
But maybe people do think this. IIRC, Bostrom's Superintelligence leaned pretty heavily on the "true reward-optimization" framing (cf. wireheading). Maybe some people thinking about AI safety today are also concerned about true reward-optimizers, and maybe those people are simply confused in a way that they wouldn't be if everyone talked about reward the way Alex wants.
I say things like "RL makes the AI want to get reward" all the time, which I guess would annoy Alex. But when I say "get reward" I actually mean "do things that would cause the untampered reward function to output a higher value," aka "weakly optimize reward." It feels obvious to me that the "true reward-optimization" interpretation isn't what I'm talking about, but maybe it's not as obvious as I think.
I think talking about AI "wanting/learning to get reward" is useful, but it isn't clear to me what Alex would want me to say instead. I'd consider changing my phrasing if there were an alternative that felt natural and didn't take much longer to say.
Did you see OpenAI's work on "confession training"? Would you say the main difference between your method and theirs is that you train the honesty behavior into a LoRA that's separate from the main model?