My original background is in mathematics (analysis, topology, Banach spaces) and game theory (imperfect information games). Nowadays, I do AI alignment research (mostly systemic risks, sometimes pondering "consequentialist reasoning").




Quick reaction:

  • I didn't want to use the ">1 billion people" formulation, because that is compatible with scenarios where a catastrophe or an accident happens, but we still end up controlling the future in the end.
  • I didn't want to use "existential risk", because that includes scenarios where humanity survives but has net-negative effects (say, bad versions of Age of Em or humanity spreading factory farming across the stars).
  • And for the purpose of this sequence, I wanted to look at the narrower class of scenarios where a single misaligned AI/optimiser/whatever takes over and does its thing. Which probably includes getting rid of literally everyone, modulo some important (but probably not decision-relevant?) questions about anthropics and negotiating with aliens.

I think literal extinction from AI is a somewhat odd outcome to study, as it depends heavily on difficult-to-reason-about properties of the world (e.g. the probability that aliens would trade substantial sums of resources for emulated human minds, and the way acausal trade works in practice).

What would you suggest instead? Something like [50% chance the AI kills > 99% of people]?

(My current take is that for the majority of readers, sticking to "literal extinction" is the better tradeoff between avoiding confusion/verbosity and accuracy. But perhaps it deserves at least a footnote or some other qualification.)

I think literal extinction from AI is a somewhat odd outcome to study, as it depends heavily on difficult-to-reason-about properties of the world (e.g. the probability that aliens would trade substantial sums of resources for emulated human minds, and the way acausal trade works in practice).

That seems fair. For what it's worth, I think the ideas described in the sequence are not sensitive to what you choose here. The point is not so much to figure out whether the particular arguments go through as to ask which properties your model must have if you want to evaluate those arguments rigorously.

A key claim here is that if you actually are able to explain a high fraction of loss in a human understandable way, you must have done something actually pretty impressive at least on non-algorithmic tasks. So, even if you haven't solved everything, you must have made a bunch of progress.

Right, I agree. I didn't realise the bolded statement was a poor/misleading summary of the non-bolded text below. I guess it would be more accurate to say something like "[% of loss explained] is a good metric for tracking intellectual progress in interpretability. However, it is somewhat misleading in that 100% loss explained does not mean you understand what is going on inside the system."

I rephrased that now. Would be curious to hear whether you still have objections to the updated phrasing.

[% of loss explained] isn't a good interpretability metric [edit: isn't enough to get guarantees].
In interpretability, people use [% of loss explained] as a measure of the quality of an explanation. However, unless you replace the system-being-explained by its explanation, this measure has a fatal flaw.
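As a minimal sketch of how this metric is often operationalised (the function name and the exact normalisation here are my assumption, not something specified in the discussion above): take the loss of a trivial baseline, the loss of the full model, and the loss you get when you run the proposed explanation instead of the model, and ask what fraction of the baseline-to-model loss gap the explanation recovers.

```python
def loss_explained(loss_full, loss_explanation, loss_baseline):
    """Fraction of the loss gap (baseline -> full model) recovered
    by the explanation.

    1.0 means the explanation matches the model's loss exactly;
    0.0 means it does no better than the trivial baseline.
    """
    return (loss_baseline - loss_explanation) / (loss_baseline - loss_full)


# Hypothetical numbers: baseline loss 4.0, full model loss 2.0,
# explanation loss 2.5 -> the explanation recovers 75% of the gap.
print(loss_explained(2.0, 2.5, 4.0))  # 0.75
```

Note that this is an average over some evaluation distribution, which is exactly what the objection below exploits: an explanation can recover ~100% of the loss on that distribution while saying nothing about the rare inputs that matter.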

Suppose you have misaligned superintelligence X pretending to be a helpful assistant A --- that is, acting as A in all situations except those where it could take over the world. Then the explanation "X is behaving as A" will explain 100% of loss, but actually using X will still kill you.

For [% of loss explained] to be a useful metric [edit: robust for detecting misalignment], the explanation would need to account for most of the loss on the inputs that actually matter. And since we fundamentally can't tell which inputs those are, the metric will only be useful (for detecting misaligned superintelligences) if we can explain 100% of loss on all possible inputs.

I think the relative difficulty of hacking AI(x-1) and AI(x-2) will be sensitive to how much emphasis you put on the "distribute AI(x-1) quickly" part. I.e., if you rush it, you might make it worse, even if AI(x-1) has the potential to be more secure. (Also, there is the "single point of failure" effect, though it seems unclear how large.)

To clarify: The question about improving Steps 1-2 was meant specifically for [improving things that resemble Steps 1-2], rather than [improving alignment stuff in general]. And the things you mention seem only tangentially related to that, to me.

But that complaint aside: sure, all else being equal, all of the points you mention seem better to have than not to have.

Might be obvious, but perhaps worth noting anyway: ensuring that our boundaries are respected is, at least under a straightforward understanding of "boundaries", not sufficient for being safe.
For example:

  • If I take away all the food from your local supermarkets (etc.), you will die of starvation --- but I haven't done anything to your boundaries.
  • On a higher level, you can wipe out humanity without messing with our boundaries, by blocking out the sun.

An aspect that I would not take into account is the expected impact of your children.

Most importantly, it just seems wrong to make personal-happiness decisions subservient to impact.
But even if you did want to optimise impact through others, betting on your children seems riskier and less effective than, for example, engaging with interested students. (And even if you wanted to optimise impact at all costs, the key factors might not be your impact through others, but instead (i) your opportunity costs, (ii) second-order effects, where having kids makes you more or less happy, and this changes the impact of your work, and (iii) negative second-order effects that "sacrificing personal happiness for impact" has on the perception of the community.)

In fact it's hard to find probable worlds where having kids is a really bad idea, IMO.

One scenario where you might want to have kids in general, but not if timelines are short, is if you feel positive about having kids but view the first few years as a chore (i.e., they cost you time, sleep, and money). So if you view kids as an investment of the form "take a hit to your happiness now, get more happiness back later", then not having kids now seems justifiable. But I think this sort of reasoning requires pretty short timelines (which I have), with high confidence (which I don't have), and high confidence that the first few years of having kids are net-negative happiness for you (which I don't have).

(But overall I endorse the claim that, mostly, if you would have otherwise wanted kids, you should still have them.)
