Former AI safety research engineer, now AI governance researcher at OpenAI.


The thing I'm picturing here is a futures contract where charizard-shirt-guy is obligated to deliver 3 trillion paperclips in exchange for one soul. And, assuming a reasonable discount rate, this is a better deal than only receiving a handful of paperclips now in exchange for the same soul. (I agree that you wouldn't want to invest in a current-market-price paperclip futures contract.)

Damn, MMDoom is a good one. New lore: it won the 2055 technique award.

Judges' ratings:

Technique: 5/10

The training techniques used here are in general very standard ones (although the dissonance filters were a nice touch). For a higher score on this metric, we would have expected more careful work to increase the stability of self-evaluation and/or the accuracy of the judgments.

Novelty: 7/10

While the initial premise was a novel one to us, we thought that more new ideas could have been incorporated into this entry in order for it to score more highly on this metric. For example, the "outliers" in the entry's predictions were a missed opportunity to communicate an underlying pattern. Similarly, the instability of the self-evaluation could have been incorporated into the entry in some clearer way.

Artistry: 9/10

We consider the piece a fascinating concept—one which forces the judges to confront the automatability of their own labors. Holding a mirror to the faces of viewers is certainly a classic artistic endeavor. We also appreciated the artistic irony of the entry's inability to perceive itself.

I think we have failed, thus far. I'm sad about that. When I began posting in 2018, I assumed that the community was careful and trustworthy, and that undeserved connotations would not easily sneak into our work and discourse. I no longer believe that, and no longer hold that trust.

I empathize with this, and have complained similarly (e.g. here).

I have also been trying to figure out why I feel quite a strong urge to push back on posts like this one. E.g. in this case I do in fact agree that only a handful of people actually understand AI risk arguments well enough to avoid falling into "suggestive names" traps. But I think there's a kind of weak man effect where if you point out enough examples of people making these mistakes, it discredits even those people who avoid the trap.

Maybe another way of saying this: of course most people are wrong about a bunch of this stuff. But the jump from that to claiming the community or field has failed isn't a valid one, because the success of a field is much more dependent on max performance than mean performance.

In this particular case, Ajeya does seem to lean on the word "reward" pretty heavily when reasoning about how an AI will generalize. Without that word, it's harder to justify privileging specific hypotheses about what long-term goals an agent will pursue in deployment. I've previously complained about this here.

Ryan, curious if you agree with my take here.

Copying over a response I wrote on Twitter to Emmett Shear, who argued that "it's just a bad way to solve the problem. An ever more powerful and sophisticated enemy? ... If the process continues you just lose eventually".

I think there are (at least) two strong reasons to like this approach:

1. It’s complementary to alignment.

2. It’s iterative and incremental. The frame where you need to just “solve” alignment is often counterproductive. When thinking about control you can focus on gradually ramping up from setups that would control human-level AGIs, to setups that would control slightly superhuman AGIs, to…

As one example of this: as you get increasingly powerful AGI you can use it to identify more and more vulnerabilities in your code. Eventually you’ll get a system that can write provably secure code. Of course that’s still not a perfect guarantee, but if it happens before the level at which AGI gets really dangerous, that would be super helpful.

This is related to a more general criticism I have of the P(doom) framing: that it’s hard to optimize because it’s a nonlocal criterion. The effects of your actions will depend on how everyone responds to them, how they affect the deployment of the next generation of AIs, etc. An alternative framing I’ve been thinking about: the g(doom) framing. That is, as individuals we should each be trying to raise the general intelligence threshold at which bad things happen.

This is much more tractable to optimize! If I make my servers 10% more secure, then maybe an AGI needs to be 1% more intelligent in order to escape. If I make my alignment techniques 10% better, then maybe the AGI becomes misaligned 1% later in the training process.

You might say: “well, what happens after that”? But my point is that, as individuals, it’s counterproductive to each try to solve the whole problem ourselves. We need to make contributions that add up (across thousands of people) to decreasing P(doom), and I think approaches like AI control significantly increase g(doom) (the level of general intelligence at which you get doom), thereby buying more time for automated alignment, governance efforts, etc.

I originally found this comment helpful, but have now found other comments pushing back against it to be more helpful. Upon reflection, I don't think the comparison to MATS is very useful (a healthy field will have a bunch of intro programs), the criticism of Remmelt is less important given that Linda is responsible for most of the projects, the independence of the impact assessment is not crucial, and the lack of papers is relatively unsurprising given that it's targeting earlier-stage researchers/serving as a more introductory funnel than MATS.

my guess is the brain is highly redundant and works on ion channels that would require actually a quite substantial amount of matter to be displaced (comparatively)

Neurons are very small, though, compared with the size of a hole in a gas pipe that would be necessary to cause an explosive gas leak. (Especially because you then can't control where the gas goes after leaking, so it could take a lot of intervention to give the person a bunch of away-from-building momentum.)

I would probably agree with you if the building happened to have a ton of TNT sitting around in the basement.

The resulting probability distribution of events will reflect the shape of the wave-function, not your prior probability distribution, so I think Thomas' argument still doesn't go through.

This is a good point. But I don't think "particles being moved the minimum necessary distance to achieve the outcome" actually favors explosions. I think it probably favors the sensor hardware getting corrupted, or it might actually favor messing with the firemen's brains to make them decide to come earlier (or messing with your mother's brain to make her jump out of the building)—because both of these are highly sensitive systems where small changes can have large effects.

Does this undermine the parable? Kinda, I think. If you built a machine that samples from some bizarre inhuman distribution, and then you get bizarre outcomes, then the problem is not really about your wish any more, the problem is that you built a weirdly-sampling machine. (And then we can debate about the extent to which NNs are weirdly-sampling machines, I guess.)

The outcome pump is defined in a way that excludes the possibility of active subversion: it literally just keeps rerunning until the outcome is satisfied, which is a way of sampling based on (some kind of) prior probability. Yudkowsky is arguing that this is equivalent to a malicious genie. But this is a claim that can be false.
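To make the "keeps rerunning until the outcome is satisfied" mechanism concrete, here is a minimal rejection-sampling sketch. The names (`sample_prior`, `outcome_satisfied`, `outcome_pump`) are hypothetical stand-ins for the physics; the point is that the output distribution is just the prior conditioned on the outcome set, with no adversarial optimization anywhere in the loop.

```python
import random

def outcome_pump(sample_prior, outcome_satisfied, max_resets=1_000_000):
    """Resample trajectories until the outcome predicate holds.

    The returned trajectory is distributed according to the prior
    restricted to the outcome set — conditioning, not optimization.
    """
    for _ in range(max_resets):
        trajectory = sample_prior()
        if outcome_satisfied(trajectory):
            return trajectory
    raise RuntimeError("outcome too improbable under the prior")

# Toy example: condition a fair die roll on landing >= 5. The result
# is uniform over {5, 6}, i.e. the prior restricted to the outcome set.
roll = outcome_pump(lambda: random.randint(1, 6), lambda x: x >= 5)
```

Whether this behaves like a malicious genie then depends entirely on what the prior concentrates on within the outcome set, which is the empirical question about the sampling machine.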

In this specific case, I agree with Thomas that whether or not it's actually false will depend on the details of the function: "The further she gets from the building's center, the less the time machine's reset probability." But there's probably some not-too-complicated way to define it which would render the pump safe-ish (since this was a user-defined function).
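The dependence on the user-defined function can be sketched the same way: a pump whose reset probability varies over trajectories samples from the prior reweighted by (1 - reset probability), so its safety is exactly as good as the shape of that function. The names below (`soft_outcome_pump`, `reset_prob`) are hypothetical illustrations, not anything from the original parable.

```python
import random

def soft_outcome_pump(sample_prior, reset_prob, max_resets=1_000_000):
    """Keep a sampled trajectory with probability 1 - reset_prob(trajectory).

    The output distribution is the prior reweighted by (1 - reset_prob),
    so whether the pump is safe depends entirely on how reset_prob is
    shaped over outcomes — a badly chosen function recreates the genie.
    """
    for _ in range(max_resets):
        t = sample_prior()
        if random.random() >= reset_prob(t):
            return t
    raise RuntimeError("reset probability too close to 1 everywhere")

# A hard cutoff: always reset die rolls below 5, never reset otherwise.
# The output is then uniform over {5, 6}.
roll = soft_outcome_pump(
    lambda: random.randint(1, 6),
    lambda x: 0.0 if x >= 5 else 1.0,
)
```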
