Former AI safety research engineer, now AI governance researcher at OpenAI. Blog:


Shaping safer goals
AGI safety from first principles


The 2020 Review [Updated Review Dashboard]

If you think incentivizing excellent LessWrong posts is highly effective, then it would be better to publicly promise to donate [that much] to the authors of your favorite posts in 2022.

Why is that the case? Is it just that people can't see how much you've donated via donation buttons? I assume that some aggregate donation figures will be made public later on, though, so making those figures higher seems pretty similar to you announcing donations personally.

Biology-Inspired AGI Timelines: The Trick That Never Works

The two extracts from this post that I found most interesting/helpful:

The problem is that the resource gets consumed differently, so base-rate arguments from resource consumption end up utterly unhelpful in real life.  The human brain consumes around 20 watts of power.  Can we thereby conclude that an AGI should consume around 20 watts of power, and that, when technology advances to the point of being able to supply around 20 watts of power to computers, we'll get AGI?

I'm saying that Moravec's "argument from comparable resource consumption" must be in general invalid, because it Proves Too Much.  If it's in general valid to reason about comparable resource consumption, then it should be equally valid to reason from energy consumed as from computation consumed, and pick energy consumption instead to call the basis of your median estimate.

You say that AIs consume energy in a very different way from brains?  Well, they'll also consume computations in a very different way from brains!  The only difference between these two cases is that you know something about how humans eat food and break it down in their stomachs and convert it into ATP that gets consumed by neurons to pump ions back out of dendrites and axons, while computer chips consume electricity whose flow gets interrupted by transistors to transmit information.  Since you know anything whatsoever about how AGIs and humans consume energy, you can see that the consumption is so vastly different as to obviate all comparisons entirely.

You are ignorant of how the brain consumes computation, you are ignorant of how the first AGIs built would consume computation, but "an unknown key does not open an unknown lock" and these two ignorant distributions should not assert much internal correlation between them.

Even without knowing the specifics of how brains and future AGIs consume computing operations, you ought to be able to reason abstractly about a directional update that you would make, if you knew any specifics instead of none.  If you did know how both kinds of entity consumed computations, if you knew about specific machinery for human brains, and specific machinery for AGIs, you'd then be able to see the enormous vast specific differences between them, and go, "Wow, what a futile resource-consumption comparison to try to use for forecasting."


You can think of there as being two biological estimates to anchor on, not just one.  You can imagine there being a balance that shifts over time from "the computational cost for evolutionary biology to invent brains" to "the computational cost to run one biological brain".

In 1960, maybe, they knew so little about how brains worked that, if you gave them a hypercomputer, the cheapest way they could quickly get AGI out of the hypercomputer using just their current knowledge, would be to run a massive evolutionary tournament over computer programs until they found smart ones, using 10^43 operations.

Today, you know about gradient descent, which finds programs more efficiently than genetic hill-climbing does; so the balance of how much hypercomputation you'd need to use to get general intelligence using just your own personal knowledge, has shifted ten orders of magnitude away from the computational cost of evolutionary history and towards the lower bound of the computation used by one brain.  In the future, this balance will predictably swing even further towards Moravec's biological anchor, further away from Somebody on the Internet's biological anchor.


Richard Ngo's Shortform

I've been reading Eliezer's recent stories with protagonists from dath ilan (his fictional utopia). Partly due to the style, I found myself bouncing off a lot of the interesting claims that he made (although it still helped give me a feel for his overall worldview). The part I found most useful was this page about the history of dath ilan, which can be read without much background context. I'm referring mostly to the exposition on the first 2/3 of the page, although the rest of the story from there is also interesting. One key quote from the remainder of the story:

"The next most critical fact about Earth is that from a dath ilani perspective their civilization is made entirely out of coordination failure.  Coordination that fails on every scale recursively, where uncoordinated individuals assemble into groups that don't express their preferences, and then those groups also fail to coordinate with each other, forming governments that offend all of their component factions, which governments then close off their borders from other governments.  The entirety of Earth is one gigantic failure fractal.  It's so far below the multi-agent-optimal-boundary, only their professional economists have a five-syllable phrase for describing what a 'Pareto frontier' is, since they've never seen one in real life.  Individuals sort of act in locally optimal equilibrium with their local incentives, but all of the local incentives are weird and insane, meaning that the local best strategy is also insane from any larger perspective.  I cannot overemphasize how much you cannot predict Earth by reasoning that most features will have already been optimized into a not-much-further-improvable equilibrium.  The closest thing you can do to optimality-based analysis is to think in terms of individually incentive-following responses to incredibly weird local situations.  And the weird local situations cannot themselves be derived from first principles, because they are the bizarrely harmful equilibria of other weird incentives in other parts of the system.  Or at least I can't derive the weird situations from first principles, after two years of exposure and getting over the shock and trying to adapt.  I would've been much better off if I'd tried to understand it as an alien society instead of a human one, in retrospect; and I expect the same would hold for an Earthling trying to understand dath ilan."

My main update is that Eliezer has a very deep-rooted belief that the world is Lawful, in that it makes sense to talk about real-world intelligence, coordination, ethics, etc., as (very imperfect) approximations to their idealised mathematically-definable forms. (Note though that these are conclusions I've extrapolated from his fiction, which is a fairly unreliable method of inferring people's beliefs.)

Ngo and Yudkowsky on AI capability gains

My recommended policy in cases where this applies is "trust your intuitions and operate on the assumption that you're not a crackpot." 

Oh, certainly Eliezer should trust his intuitions and believe that he's not a crackpot. But I'm not arguing about what the person with the theory should believe, I'm arguing about what outside observers should believe, if they don't have enough time to fully download and evaluate the relevant intuitions. Asking the person with the theory to give evidence that their intuitions track reality isn't modest epistemology.

AI Safety Needs Great Engineers

I wonder if some subset of the people who weren't accepted to the Redwood thing could organise a remote self-taught version. They note that "the curriculum emphasises collaborative problem solving and pair programming", so I think that the supervision Redwood provides would be helpful but not crucial. Probably the biggest bottleneck here would be someone stepping up to organise it (assuming Redwood would be happy to share their curriculum for this version).

Ngo and Yudkowsky on AI capability gains

the easiest way to point out why they are dumb is with counterexamples. We can quickly "see" the counterexamples. E.g., if you're trying to see AGI as the next step in capitalism, you'll be able to find counterexamples where things become altogether different (misaligned AI killing everything; singleton that brings an end to the need to compete).

I'm not sure how this would actually work. The proponent of the AGI-capitalism analogy might say "ah yes, AGI killing everyone is another data point on the trend of capitalism becoming increasingly destructive". Or they might say (as Marx did) that capitalism contains the seeds of its own destruction. Or they might just deny that AGI will play out the way you claim, because their analogy to capitalism is more persuasive than your analogy to humans (or whatever other reasoning you're using). How do you then classify this as a counterexample rather than a "non-central (but still valid) manifestation of the theory"?

My broader point is that these types of theories are usually sufficiently flexible that they can "predict" most outcomes, which is why it's so important to pin them down by forcing them to make advance predictions.

On the rest of your comment, +1. I think that one of the weakest parts of Eliezer's argument was when he appealed to the difference between von Neumann and the village idiot in trying to explain why the next step above humans will be much more consequentialist than most humans (although unfortunately I failed to pursue this point much in the dialogue).

Ngo and Yudkowsky on AI capability gains

Your comment is phrased as if the object-level refutations have been tried, while conveying the meta-level intuitions hasn't been tried. If anything, it's the opposite: the sequences (and to some extent HPMOR) are practically all content about how to think, whereas Yudkowsky hasn't written anywhere near as extensively on object-level AI safety.

This has been valuable for community-building, but less so for making intellectual progress - because in almost all domains, the most important way to make progress is to grapple with many object-level problems, until you've developed very good intuitions for how those problems work. In the case of alignment, it's hard to learn things from grappling with most of these problems, because we don't have signals of when we're going in the right direction. Insofar as Eliezer has correct intuitions about when and why attempted solutions are wrong, those intuitions are important training data.

By contrast, trying to first agree on very high-level epistemological principles, and then do the object-level work, has a very poor track record. See how philosophy of science has done very little to improve how science works; and how reading the sequences doesn't improve people's object-level rationality very much.

I model you as having a strong tendency to abstract towards higher-level discussion of epistemology in order to understand things. (I also have a strong tendency to do this, but I think yours is significantly stronger than mine.) I expect that there's just a strong clash of intuitions here, which would be hard to resolve. But one prompt which might be useful: why aren't epistemologists making breakthroughs in all sorts of other domains?

Ngo and Yudkowsky on AI capability gains

I don't expect such a sequence to be particularly useful, compared with focusing on more object-level arguments. Eliezer says that the largest mistake he made in writing his original sequences was that he "didn’t realize that the big problem in learning this valuable way of thinking was figuring out how to practice it, not knowing the theory". Better, I expect, to correct the specific mistakes alignment researchers are currently making, until people have enough data points to generalise better.

Ngo and Yudkowsky on AI capability gains

it seems to me that you want properly to be asking "How do we know this empirical thing ends up looking like it's close to the abstraction?" and not "Can you show me that this abstraction is a very powerful one?"

I agree that "powerful" is probably not the best term here, so I'll stop using it going forward (note, though, that I didn't use it in my previous comment, which I endorse more than my claims in the original debate).

But before I ask "How do we know this empirical thing ends up looking like it's close to the abstraction?", I need to ask "Does the abstraction even make sense?" Because you have the abstraction in your head, and I don't, and so whenever you tell me that X is a (non-advance) prediction of your theory of consequentialism, I end up in much the same epistemic state as if George Soros had told me that X is a prediction of the theory of reflexivity, or if a complexity theorist had told me that X is a prediction of the theory of self-organisation. The problem in those two cases is less that the abstraction is a bad fit for this specific domain, and more that the abstraction is not sufficiently well-defined (outside very special cases) to even be the type of thing that can robustly make predictions.

Perhaps another way of saying it is that they're not crisp/robust/coherent concepts (I'm open to other terms, though; I don't think these ones are particularly good). And it would be useful for me to have evidence that the abstraction of consequentialism you're using is a crisper concept than Soros' theory of reflexivity or the theory of self-organisation. If you could explain the full abstraction to me, that'd be the most reliable way - but given the difficulties of doing so, my backup plan was to ask for impressive advance predictions, which are the type of evidence that I don't think Soros could come up with.

I also think that, when you talk about me being raised to hold certain standards of praiseworthiness, you're still ascribing too much modest epistemology to me. I mainly care about novel predictions or applications insofar as they help me distinguish crisp abstractions from evocative metaphors. To me it's the same type of rationality technique as asking people to make bets, to help distinguish post-hoc confabulations from actual predictions.

Of course there's a social component to both, but that's not what I'm primarily interested in. And of course there's a strand of naive science-worship which thinks you have to follow the Rules in order to get anywhere, but I'd thank you to assume I'm at least making a more interesting error than that.

Lastly, on probability theory and Newtonian mechanics: I agree that you shouldn't question how much sense it makes to use calculus in the way that you described, but that's because the application of calculus to mechanics is so clearly-defined that it'd be very hard for the type of confusion I talked about above to sneak in. I'd put evolutionary theory halfway between them: it's partly a novel abstraction, and partly a novel empirical truth. And in this case I do think you have to be very careful in applying the core abstraction of evolution to things like cultural evolution, because it's easy to do so in a confused way.

Ngo and Yudkowsky on AI capability gains

I'm still trying to understand the scope of expected utility theory, so examples like this are very helpful! I'd need to think much more about it before I had a strong opinion about how much they support Eliezer's applications of the theory, though.
