paulfchristiano

Discussing the application of heuristic estimators to adversarial training:

Suppose you have a trusted specification C for catastrophic behavior, and you want to train a model that never violates C. For simplicity, I’ll assume that your model M maps observations x to actions y, and C takes an (x, y) pair and determines whether it’s catastrophic. So we want a model for which C(x, M(x)) is very rarely true on the deployment distribution.

You could care about this if you have a trusted weak model which can check for catastrophic behavior given enough time (but which is too expensive to continuously oversee your model at inference time), as in iterated amplification. You might also care if you’ve gotten a specification by solving ELK, but want to train a model that never does something bad (rather than merely detecting bad behavior at runtime, when pausing could be catastrophic). In general I think that "we can detect bad behavior but the model does a treacherous turn anyway" is a plausible failure mode to address.

A heuristic estimator lets you assess the probability that a given model M violates C on any distribution D, i.e. estimate P_{x ~ D}[C(x, M(x))]. You can produce estimates even when (i) the probability is very small, and (ii) you can’t efficiently draw samples from D.
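To make the regime concrete, here is a toy Python sketch (all names are illustrative, not from the original) contrasting this with naive sampling: a Monte Carlo estimate of the catastrophe probability returns exactly zero once the true probability falls below roughly 1/(number of samples), which is precisely the regime where a heuristic estimator is supposed to say something informative.

```python
import random

def monte_carlo_catastrophe_prob(model, catastrophe, sample_x, n=10_000):
    """Naive baseline: estimate P_{x~D}[C(x, M(x))] by sampling.

    This fails in exactly the regime we care about: if the true
    probability is, say, 1e-9, then with n = 10_000 samples we will
    almost surely observe zero catastrophes and estimate exactly 0.
    """
    hits = sum(catastrophe(x, model(x)) for x in (sample_x() for _ in range(n)))
    return hits / n

# Toy setup: M echoes its input, and C fires only on a rare trigger
# input that the sampler for D never happens to produce.
model = lambda x: x
catastrophe = lambda x, y: y == "trigger"          # C(x, y)
sample_x = lambda: random.choice(["a", "b", "c"])  # D never samples the trigger

est = monte_carlo_catastrophe_prob(model, catastrophe, sample_x)
print(est)  # 0.0 -- the rare failure mode is invisible to sampling
```

A heuristic estimator would instead reason about the structure of M and C to produce an informative (possibly extremely small) probability estimate without ever sampling a catastrophe.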

So we search for a distribution D on which M is particularly likely to behave catastrophically, estimate the log of the catastrophe probability under D, and then take a gradient descent step on M to reduce that probability (assuming that the estimator is differentiable, which is true for examples we have considered and appears to follow from natural coherence properties). This approach can hope to achieve error probabilities much lower than (1 / training set size), and we can ask about how M would behave given observations that we can recognize but can’t synthesize in the lab (which would otherwise be natural triggers for takeover attempts). In theory this overcomes the big limitations of adversarial training.

If you actually had a heuristic estimator you could immediately test this application. Detecting small catastrophe probabilities is particularly straightforward. In this setting D can be a neural net adversary---you can initialize with an LM asked to produce cases where M behaves badly, and then fine-tune D to optimize the catastrophe probability.
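The alternating search-and-descend loop above can be sketched in a few lines of toy Python. A simple smooth surrogate stands in for a real heuristic estimator (which does not yet exist), and single scalars stand in for the parameters of M and the adversary D; everything here is an illustrative assumption, not an implementation. The adversary's parameter d does gradient ascent on the estimated log catastrophe probability, while the model's parameter theta does gradient descent on it.

```python
import math

def log_catastrophe_est(theta, d):
    """Toy stand-in for a heuristic estimate of log P_{x ~ D_d}[C(x, M_theta(x))].

    theta is the model's single parameter and d parameterizes the
    adversarial input distribution. Catastrophe is most likely when the
    adversary hits the "trigger" region d ~= 3.0, and the model can
    suppress it (with diminishing returns) by growing theta.
    """
    return -5.0 * math.tanh(theta) - (d - 3.0) ** 2

def grad(f, x, eps=1e-6):
    # Central finite difference; good enough for this toy example.
    return (f(x + eps) - f(x - eps)) / (2 * eps)

theta, d, lr = 0.5, 0.0, 0.1
for _ in range(100):
    # Adversary step: gradient *ascent* on d, seeking the distribution
    # where M_theta is most likely to behave catastrophically.
    d += lr * grad(lambda dd: log_catastrophe_est(theta, dd), d)
    # Model step: gradient *descent* on theta, reducing the estimated
    # catastrophe probability under the current worst-case D.
    theta -= lr * grad(lambda t: log_catastrophe_est(t, d), theta)

print(f"d = {d:.3f}, theta = {theta:.2f}")
# The adversary converges to the worst-case trigger near d = 3.0,
# and the model moves to reduce catastrophe probability there.
```

In the real proposal, d would be the parameters of a neural-net adversary (e.g. a fine-tuned LM proposing inputs) and theta the parameters of M, with the heuristic estimator supplying the differentiable objective.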

I think this is a reasonable perception and opinion. We’ve written a little bit about how heuristic estimators might help with ELK (MAD and ELK and finding gliders), but that writing is not particularly clear and doesn’t present a complete picture.

We’ve mostly been focused on finding heuristic estimators, because I am fairly convinced they would be helpful and think that designing them is our key technical risk. But now that we are hiring again I think it’s important for us to explain publicly why they would be valuable, and to generally motivate and situate our work.

I hope to write up a reasonable pitch sometime over the next few weeks.

In the original document we also mention a non-ELK application, namely using a heuristic estimator for adversarial training, which is significantly more straightforward. I think this is helpful for validating the intuitive story that heuristic estimators would overcome limitations of black box training, and in some sense I think that ELK and adversarial training together are the two halves of the alignment problem, so solving both is very exciting. That said, I've considered this in less detail than the ELK application. I'll try to give a bit more detail on this in the child comment.

Sorry, I meant "scope-insensitive," and really I just meant an even broader category of like "doesn't care 10x as much about getting 10x as much stuff."  I think discount rates or any other terminal desire to move fast would count (though for options like "survive in an unpleasant environment for a while" or "freeze and revive later" the required levels of kindness may still be small).

(A month seems roughly right to me as the cost of not trashing Earth's environment to the point of uninhabitability.)

I'd guess "most humans survive" vs. "most humans die" probabilities don't correspond super closely to "presence of small pseudo-kindness". Because of how other preferences could outweigh that, and because cooperation/bargaining is a big reason for why humans might survive aside from intrinsic preferences.

Yeah, I think that:

  •  "AI doesn't care about humans at all so kills them incidentally" is not most of the reason that AIs may kill humans, and my bottom line 50% probability of AI killing us also includes the other paths (AI caring a bit but failing to coordinate to avoid killing humans, conflict during takeover leading to killing lots of humans, AI having scope-sensitive preferences for which not killing humans is a meaningful cost, preserving humans being surprisingly costly, AI having preferences about humans like spite for which human survival is a cost...).
  • To the extent that it's possible to distinguish "intrinsic pseudokindness" from decision-theoretic considerations leading to pseudokindness, I think that decision-theoretic considerations are more important. (I don't have a strong view on relative importance of ECL and acausal trade, and I think these are hard to disentangle from fuzzier psychological considerations and it all tends to interact.)

Yeah, I think "no control over the future, 50% you die" is like 70% as alarming as "no control over the future, 90% you die." Even if it were only 50% as concerning, all of these differences seem tiny in practice compared to other sources of variation in "do people really believe this could happen?" or other inputs into decision-making. I think it's correct to summarize as "practically as alarming."

I'm not sure what you want engagement with. I don't think the much worse outcomes are closely related to unaligned AI so I don't think they seem super relevant to my comment or Nate's post. Similarly for lots of other reasons the future could be scary or disorienting. I do explicitly flag the loss of control over the future in that same sentence. I think the 50% chance of death is probably in the right ballpark from the perspective of selfish concern about misalignment.

Note that the 50% probability of death includes the possibility of AI having preferences about humans incompatible with our survival. I think the selection pressure for things like spite is radically weaker for the kinds of AI systems produced by ML than for humans (for simple reasons---where is the upside to the AI from spite during training? seems like if you get stuff like threats it will primarily be instrumental rather than a learned instinct) but didn't really want to get into that in the post.

I would also call this one for Eliezer. I think we mostly just retrain AI systems without reusing anything. I think that's what you'd guess on Eliezer's model, and very surprising on Robin's model. The extent to which we throw things away is surprising even to a very simple common-sense observer.

I would have called "Human content is unimportant" for Robin---it seems like the existing ML systems that are driving current excitement (and are closest to being useful) lean extremely heavily on imitation of human experts and mostly don't make new knowledge themselves. So far game-playing AI has been an exception rather than the rule (and this special case was already mostly established by the time of the debate).

That said, I think it would be reasonable to postpone judgment on most of these questions since we're not yet in the end of days (Robin thinks it's still fairly far, and Eliezer thinks it's close but things will change a lot by the intelligence explosion). The main ones I'd be prepared to call unambiguously already are:

  • Short AI timelines and very general AI architectures: obvious advantage to Eliezer.
  • Importance of compute, massive capital investment, and large projects selling their output to the world: obvious advantage to Robin.

These aren't literally settled, but market odds have moved really far since the debate, and they both seem like defining features of the current world. In each case I'd say that one of the two participants was clearly super wrong and the other was basically right.

My objection is that the simplified message is wrong, not that it's too alarming. I think "misaligned AI has a 50% chance of killing everyone" is practically as alarming as "misaligned AI has a 95% chance of killing everyone," while being a much more reasonable best guess. I think being wrong is bad for a variety of reasons. It's unclear if you should ever be in the business of telling lies-told-to-children to adults, but you certainly shouldn't be doubling down on them when pressed in an argument.

I don't think misaligned AI drives the majority of s-risk (I'm not even sure that s-risk is higher conditioned on misaligned AI), so I'm not convinced that it's a super relevant communication consideration here. The future can be scary in plenty of ways other than misaligned AI, and it's worth discussing those as part of "how excited should we be for faster technological change."

As I said:

I’m not talking about whether the AI has spite or other strong preferences that are incompatible with human survival; I’m engaging specifically with the claim that AI is likely to care so little one way or the other that it would prefer to just use the humans for atoms.

I think it's totally plausible for the AI to care about what happens with humans in a way that conflicts with our own preferences. I just don't believe it's because AI doesn't care at all one way or the other (such that you should make predictions based on instrumental reasoning like "the AI will kill humans because it's the easiest way to avoid future conflict" or other relatively small considerations).

To the extent the second parable has this kind of intuitive force I think it comes from: (i) the fact that the resulting values still sound really silly and simple (which I think is mostly deliberate hyperbole), (ii) the fact that the AI kills everyone along the way.

I think a closer summary is:

Humans and AI systems probably want different things. From the human perspective, it would be better if the universe was determined by what the humans wanted. But we shouldn't be willing to pay huge costs, and shouldn't attempt to create a slave society where AI systems do humans' bidding forever, just to ensure that human values win out. After all, we really wouldn't want that outcome if our situations had been reversed. And indeed we are the beneficiary of similar values-turnover in the past, as our ancestors have been open (perhaps by necessity rather than choice) to values changes that they would sometimes prefer hadn't happened.

We can imagine really sterile outcomes, like replicators colonizing space with an identical pattern repeated endlessly, or AI systems that want to maximize the number of paperclips. And considering those outcomes can help undermine the cosmopolitan intuition that we should respect the AI we build. But in fact that intuition pump relies crucially on its wildly unrealistic premises, that the kind of thing brought about by AI systems will be sterile and uninteresting. If we instead treat "paperclip" as an analog for some crazy weird shit that is alien and valence-less to humans, drawn from the same barrel of arbitrary and diverse desires that can be produced by selection processes, then the intuition pump loses all force. I'm back to feeling like our situations could have been reversed, and we shouldn't be total assholes to the AI.

I don't think that requires anything at all about AI systems converging to cosmopolitan values in the sense you are discussing here. I do think it is much more compelling if you accept some kind of analogy between the sorts of processes shaping human values and the processes shaping AI values, but this post (and the references you cite and other discussions you've had) don't actually engage with the substance of that analogy and the kinds of issues raised in my comment are much closer to getting at the meat of the issue.

I also think the "not for free" part doesn't contradict the views of Rich Sutton. I asked him this question and he agrees that all else equal it would be better if we handed off to human uploads instead of powerful AI.  I think his view is that the proposed course of action from the alignment community is morally horrifying (since in practice he thinks the alternative is "attempt to have a slave society," not "slow down AI progress for decades"---I think he might also believe that stagnation is much worse than a handoff but haven't heard his view on this specifically) and that even if you are losing something in expectation by handing the universe off to AI systems it's not as bad as the alternative.
