Yudkowsky and Christiano discuss "Takeoff Speeds"

Is GPT-3 perhaps some sort of discontinuity for how single-language text generation w/ neural networks is monetized? Have there been other companies that sold text completion as a service, metered out per token, before GPT-3?

Obviously, this isn’t a purely technical discontinuity, but I haven’t heard of any companies monetizing language models in this way in the past.

EDIT: see also Gwern’s comment for why Penn Treebank perplexity isn’t a good metric for discontinuities in language models.

Discussion with Eliezer Yudkowsky on AGI interventions

Thanks for the detailed response.

On reflection, I agree with what you said - I think the amount of work it takes to translate a nice sounding idea into anything that actually works on an experimental domain is significant, and what exact work you need is generally not predictable in advance. In particular, I resonated a lot with this paragraph:

I'm also not actually sure that I would have predicted the Overcooked results when writing down the first algorithm; the conceptual story felt strong but there are several other papers where the conceptual story felt strong but nonetheless the first thing we tried didn't work.

At least from my vantage point, “having a strong story for why a result should be X” is insufficient for ex ante predictions of what exactly the results would be. (Once you condition on that being the story told in a paper, however, the prediction task does become trivial.)

I’m now curious what the MIRI response is, as well as how well their intuitive judgments of the results are calibrated.

EDIT: Here’s another toy model I came up with: you might imagine there are two regimes for science - an experiment-driven regime, and a theory-driven regime. In the former, it’s easy to generate many “plausible sounding” ideas and hard to be justified in holding on to any of them without experiments. The role of scientists is to be (low credence) idea generators and idea testers, and the purpose of experimentation is primarily to discover new facts that are surprising to the scientist finding them. In the second regime, the key is to come up with the right theory/deep model of AI that predicts lots of facts correctly ex ante, and then the purpose of experiments is to convince other scientists of the correctness of your idea. Good scientists in the second regime are those who discover the right deep models much faster than others. Obviously this is an oversimplification, and no one believes it’s only one or the other, but I suspect both MIRI and Stuart Russell lie more on the “have the right idea, and the paper experiments are there to convince others/apply the idea in a useful domain” view, while most ML researchers hold the more experimentalist view of research?

Discussion with Eliezer Yudkowsky on AGI interventions

I actually think this particular view is worth fleshing out, since it seems to come up over and over again in discussions of what AI alignment work is valuable (versus not).

For example, it does seem to me that >80% of the work in actually writing a published paper (at least amongst papers at CHAI) (EDIT: no longer believe this on reflection, see Rohin’s comment below) involves doing work with results that are predictable to the author after the concept (for example, actually getting your algorithm to run, writing code for experiments, running said experiments, writing up the results into a paper, etc.)

Discussion with Eliezer Yudkowsky on AGI interventions

Yudkowsky mentions this briefly in the middle of the dialogue:

 I don't know however if I should be explaining at this point why "manipulate humans" is convergent, why "conceal that you are manipulating humans" is convergent, why you have to train in safe regimes in order to get safety in dangerous regimes (because if you try to "train" at a sufficiently unsafe level, the output of the unaligned system deceives you into labeling it incorrectly and/or kills you before you can label the outputs), or why attempts to teach corrigibility in safe regimes are unlikely to generalize well to higher levels of intelligence and unsafe regimes (qualitatively new thought processes, things being way out of training distribution, and, the hardest part to explain, corrigibility being "anti-natural" in a certain sense that makes it incredibly hard to, eg, exhibit any coherent planning behavior ("consistent utility function") which corresponds to being willing to let somebody else shut you off, without incentivizing you to actively manipulate them to shut you off).

Basically, there are reasons to expect that alignment techniques that work in smaller, safe regimes fail in larger, unsafe regimes. For example, an alignment technique that requires your system to demonstrate undesirable behavior while running could remain safe while your system is weak, but then become dangerous when undesirable behavior from your system becomes powerful.

That being said, Ajeya's "Case for Aligning Narrowly Superhuman Models" does flesh out the case for trying to align existing systems (as capabilities scale).

Discussion with Eliezer Yudkowsky on AGI interventions

Thanks! Relevant parts of the comment:

 Eliezer thinks that while corrigibility probably has a core which is of lower algorithmic complexity than all of human value, this core is liable to be very hard to find or reproduce by supervised learning of human-labeled data, because deference is an unusually anti-natural shape for cognition, in a way that a simple utility function would not be an anti-natural shape for cognition.


The central reasoning behind this intuition of anti-naturalness is roughly, "Non-deference converges really hard as a consequence of almost any detailed shape that cognition can take", with a side order of "categories over behavior that don't simply reduce to utility functions or meta-utility functions are hard to make robustly scalable".


What I imagine Paul is imagining is that it seems to him like it would in some sense be not that hard for a human who wanted to be very corrigible toward an alien, to be very corrigible toward that alien; so you ought to be able to use gradient-descent-class technology to produce a base-case alien that wants to be very corrigible to us, the same way that natural selection sculpted humans to have a bunch of other desires, and then you apply induction on it building more corrigible things.

My class of objections in (1) is that natural selection was actually selecting for inclusive fitness when it got us, so much for going from the loss function to the cognition; and I have problems with both the base case and the induction step of what I imagine to be Paul's concept of solving this using recursive optimization bootstrapping itself; and even more so do I have trouble imagining it working on the first, second, or tenth try over the course of the first six months.

My class of objections in (2) is that it's not a coincidence that humans didn't end up deferring to natural selection, or that in real life if we were faced with a very bizarre alien we would be unlikely to want to defer to it. Our lack of scalable desire to defer in all ways to an extremely bizarre alien that ate babies, is not something that you could fix just by giving us an emotion of great deference or respect toward that very bizarre alien. We would have our own thought processes that were unlike its thought processes, and if we scaled up our intelligence and reflection to further see the consequences implied by our own thought processes, they wouldn't imply deference to the alien even if we had great respect toward it and had been trained hard in childhood to act corrigibly towards it.

Discussion with Eliezer Yudkowsky on AGI interventions

Yeah, we've also spent a while (maybe ~5 hours total?) in various CHAI meetings (some of which you've attended) trying to figure out the various definitions of corrigibility to no avail, but those notes are obviously not public. :( 

That being said I don't think failing in several hours of meetings/a few unpublished attempts is that much evidence of the difficulty?

Discussion with Eliezer Yudkowsky on AGI interventions

I'm not sure what you mean by 'philosophically' simple?

Do you agree that other problems in AI Alignment don't have 'philosophically' simple cores? It seems to me that, say, scaling human supervision to a powerful AI or getting an AI that's robust to 'turning up the dial' are much harder and more intractable problems than corrigibility.

Discussion with Eliezer Yudkowsky on AGI interventions

+1 to both points.

For an even more extreme example, Linear Regression is a large "vector of floating points" that's easy enough to prove things about that proofs are assigned as homework questions for introductory Linear Algebra/Statistics classes.
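
To make the homework-level flavor concrete, here is a minimal sketch in plain Python. The data and the helper names `fit_line` and `residuals` are made up for illustration. It fits a one-feature least-squares line and numerically checks two textbook facts that follow from the normal equations: the residuals sum to zero and are orthogonal to the feature.

```python
# Illustrative only: one-feature ordinary least squares in pure Python.
# The data and helper names below are hypothetical, chosen for this example.

def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def residuals(xs, ys, slope, intercept):
    """Return y minus the fitted value for every data point."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 2.9, 5.2, 6.8, 9.1]

slope, intercept = fit_line(xs, ys)
res = residuals(xs, ys, slope, intercept)

# Two standard facts provable from the normal equations:
sum_res = sum(res)                               # residuals sum to zero
dot_res_x = sum(r * x for r, x in zip(res, xs))  # residuals are orthogonal to x

print(abs(sum_res) < 1e-9, abs(dot_res_x) < 1e-9)
```

The check is numerical rather than a proof, but both properties hold exactly in theory for any dataset with non-constant `xs`, which is the sense in which this "vector of floats" is easy to reason about.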

I also think that we've had significantly more theoretical progress on ANNs than I would've predicted ~5 years ago. For example, the theory of wide two layer neural networks has basically been worked out, as has the theory of infinitely deep neural networks, and the field has made significant progress understanding why we might expect the critical points/local minima gradient descent finds in ReLU networks to generalize. (Though none of this work is currently good enough to inform practice, I think?)

Discussion with Eliezer Yudkowsky on AGI interventions

I'd also add that having a system that uses abstractions that are close to humans is insufficient for safety, because you're putting those abstractions under stress by optimizing them. 

I do think it's plausible that any AI modelling humans will model humans as having preferences, but 1) I'd imagine these preference models as calibrated on normal world states, and not extending properly off distribution (i.e., as soon as the AI starts doing things with nanomachines that humans can't reason properly about) and 2) "pointing" at the right part of the AI's world model to yield preferences, instead of a proxy that's a better model of your human feedback mechanism, is still an unsolved problem. (The latter point is outlined in the post in some detail, I think?) I also think that 3) there's a possibility that there is no simple, natural core of human values, simpler than "model the biases of people in detail", for an AI to find.
