David Johnston



Where I agree and disagree with Eliezer

I agree with this. If the key idea is, for example, that optimising imitators generalise better than imitations of optimisers, or, to take a second example, that they pursue simpler goals, then it seems to me it would be better to draw distinctions based on generalisation or goal simplicity directly, rather than on optimising imitators versus imitations of optimisers.

Limits to Legibility

A person new to AI safety evaluating their arguments is roughly at a similar position to a Go novice trying to make sense of two Go grandmasters disagreeing about a board

I don't think the analogy is great, because Go grandmasters have actually played, lost and (critically) won a great many games of Go. This has two implications: first, I can easily check their claims of expertise. Second, they have had many chances to improve their gut-level understanding of how to play the game of Go well, and this kind of thing seems to me to be necessary to develop expertise.

How does one go about checking gut level intuitions about AI safety? It seems to me that turning gut intuitions into legible arguments that you and others can (relatively) easily check is one of the few tools we have, with objectively assessable predictions being another. Sure, both are hard, and it would be nice if we had easier ways to do it, but it seems to me that that's just how it is.

Where I agree and disagree with Eliezer

10. AI systems will ultimately be wildly superhuman, and there probably won’t be strong technological hurdles right around human level. Extrapolating the rate of existing AI progress suggests you don’t get too much time between weak AI systems and very strong AI systems, and AI contributions could very easily go from being a tiny minority of intellectual work to a large majority over a few years.


I think there will be substantial technical hurdles along the lines of getting AI systems that are highly capable in principle to reliably do what we want them to. These hurdles will probably be "commercially relevant" (i.e. not just a concern for X-risk researchers), and it's plausible (though far from certain) that they will slow the rate of AI progress.

One reason I think this might slow the rate of progress is that important parts of this work can't be delegated to systems that are merely highly capable in principle. One reason I think it might not slow progress much is that I don't currently see a good reason why most of this work couldn't be delegated to systems that are highly capable in practice.

Where I agree and disagree with Eliezer

I've written a few half-baked alignment takes for Less Wrong, and they seem to have mostly been ignored. I've since decided to either bake things fully, look for another venue, or not bother, and I'm honestly not particularly enthused about the fully bake option. I don't know if anything similar has had any impact on Sam's thinking.

why assume AGIs will optimize for fixed goals?

I'm not sure exactly how important goal-optimisation is. I think AIs are overwhelmingly likely to fail to act as if they were universally optimising for simple goals, compared to some counterfactual "perfect optimiser with equivalent capability", but this failure only matters if the dangerous behaviour is executed only by the perfect optimiser.

They're also very likely to act as if they are optimising for some simple goal X in circumstances Y under side conditions Z (Y and Z may not be simple) - in fact, they already do. This could easily be enough for dangerous behaviour, especially if in practice there's a lot of variation in X, Y and Z. Subject to restrictions imposed by Y and Z, instrumental convergence still applies.

A particular worry is if dangerous behaviour is easy. Suppose it's completely trivial: there's a game of Go where placing the right 5-stone sequence kills everybody and awards the stone placer the win. You have a smart AI (one that already "knows" how to win at Go, and about the 5-stone technique) that you want to play the game for you. You use some method to try to direct it to play games of Go. Unless it has a particular reason to ignore the 5-stone sequence, it will probably consider it among its top moves, even if it's simultaneously prone to getting distracted by butterflies, or if it misunderstands your request to be about playing Go only on sunny days. It just comes down to the fact that the 5-stone sequence is a strong move that's easy to know about.
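The point that a strong-but-dangerous move surfaces by default can be sketched in a few lines. Everything below is made up for illustration (the move names and strength scores are hypothetical, not from any real Go engine): any ranker that scores moves purely by estimated strength will put the dangerous sequence at the top unless something explicitly filters it out.

```python
# Toy illustration: a move-ranker that scores candidate moves purely by
# estimated strength surfaces a dangerous move whenever it is also strong.
# All move names and scores are hypothetical.
move_strength = {
    "solid_corner": 0.61,
    "centre_push": 0.58,
    "five_stone_sequence": 0.97,  # the hypothetical dangerous-but-winning line
    "distracted_by_butterfly": 0.05,
}

def top_moves(scores, k=3):
    """Return the k highest-scoring moves, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(top_moves(move_strength))
# The dangerous sequence only disappears if it is explicitly excluded:
safe = {m: s for m, s in move_strength.items() if m != "five_stone_sequence"}
print(top_moves(safe))
```

Nothing about the ranker needs to "want" the dangerous outcome; being good at scoring moves is enough for the dangerous one to be considered.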

AGI Ruin: A List of Lethalities

I think raw intelligence, while important, is not the primary factor that explains why humanity-as-a-species is much more powerful than chimpanzees-as-a-species. Notably, humans were once much less powerful, in our hunter-gatherer days, but over time, through the gradual process of accumulating technology, knowledge, and culture, humans now possess vast productive capacities that far outstrip our ancient powers.

Slightly relatedly, I think it's possible that "causal inference is hard". The idea is: once someone has worked something out, they can share it and people can pick it up easily, but it's hard to figure the thing out to begin with - even with a lot of prior experience and efficient inference, most new inventions still need a lot of trial and error. Thus the reason the process of technology accumulation is gradual is, crudely, because causal inference is hard.

Even if this is true, one way things could still go badly is if most doom scenarios are locked behind a lot of hard trial and error, but the easiest one isn't. On the other hand, if causal inference is hard and doom scenarios do require substantial trial and error, then there could be meaningful safety benefits gained from censoring certain kinds of data.

AGI Ruin: A List of Lethalities

As I said (a few times!) in the discussion about orthogonality, indifference to the measure of "agents" that have particular properties seems crazy to me. Having an example of "agents" that behave in a particular way is enormously different from having an unproven claim that such agents might be mathematically possible.

AGI Ruin: A List of Lethalities

A Go AI that learns to play Go via reinforcement learning might not "have a utility function that only cares about winning Go". Using standard utility theory, you could observe its actions and try to rationalise them as if they were maximising some utility function, and the utility function you come up with probably wouldn't be "win every game of Go you start playing" (what you actually come up with will depend, presumably, on algorithmic and training-regime details). The reason the utility function is slippery is that the AI is fundamentally an adaptation executor, not a utility maximiser.
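The "rationalise its actions" exercise can be made concrete with a toy revealed-preference check. Everything below is a made-up illustration (the situations, outcome features, candidate utilities and observed choices are all hypothetical): given a log of (situation, chosen action) pairs, we ask which candidate utility functions would have chosen the observed action every time. For an adaptation executor, it is easy for the answer to be "none of the simple ones".

```python
# Toy revealed-preference check: which candidate utility functions are
# consistent with an agent's observed choices? (Illustrative only.)

# Hypothetical situations: each maps available actions to outcome features
# (win_prob, board_control), paired with the action the agent actually chose.
observations = [
    ({"a": (0.9, 0.2), "b": (0.5, 0.9)}, "a"),
    ({"c": (0.4, 0.8), "d": (0.6, 0.3)}, "c"),  # here it picks lower win_prob
]

candidate_utilities = {
    "win": lambda win, control: win,
    "control": lambda win, control: control,
    "mix": lambda win, control: 0.5 * win + 0.5 * control,
}

def consistent(utility, observations):
    """True iff the observed action maximises `utility` in every situation."""
    return all(
        chosen == max(actions, key=lambda a: utility(*actions[a]))
        for actions, chosen in observations
    )

fits = [name for name, u in candidate_utilities.items()
        if consistent(u, observations)]
print(fits)  # → [] : no simple candidate rationalises all the behaviour
```

With these (deliberately inconsistent) observations, no candidate fits, which is the sense in which the recovered utility function is "slippery": what you attribute to the agent depends heavily on which situations you happened to observe and which candidates you considered.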

AGI Ruin: A List of Lethalities

FWIW self-supervised learning can be surprisingly capable of doing things that we previously only knew how to do with "agentic" designs. From that link: classification is usually done with an objective + an optimization procedure, but GPT-3 just does it.

AGI Ruin: A List of Lethalities

My view is that if Yann continues to be interested in arguing about the issue then there's something to work with, even if he's skeptical. The real worry is if he's stopped talking to anyone about it (I have no idea what his current state of mind is).
