Nora Belrose


This seems to entirely ignore the actual point that is being made in the post. The point is that "IGF" is not a stable and contentful loss function, it is a misleadingly simple shorthand for "whatever traits are increasing their own frequency at the moment." Once you see this, you notice two things:

  1. In some weak sense, we are fairly well "aligned" to the "traits" that were selected for in the ancestral environment, in particular our social instincts.
  2. All of the ways in which ML is disanalogous with evolution indicate that alignment will be dramatically easier and better for ML models. For starters, we don't randomly change the objective function for ML models throughout training. See Quintin's post for many more disanalogies.

If you want to discuss the other contents of my AI pause essay, it's probably best for you to comment over on the EA forum post, not here.


If evolution had ever stumbled upon some kind of magical genetic mutation that resulted in individuals directly caring about their IGF (and improved or at least didn't damage their general reasoning abilities and other positive traits) it would have surely reached fixation rather quickly.

CRISPR gene drives reach fixation even faster, even if they seriously harm IGF.

Oh. Perhaps it's nontrivial that humans were selected to value a lot of stuff, and (different, modern) humans still value a lot of stuff, even in today's different environment? Is that the point?

Sort of, but I think it is more specific than that. As I point out in my AI pause essay:

An anthropologist looking at humans 100,000 years ago would not have said humans are aligned to evolution, or to making as many babies as possible. They would have said we have some fairly universal tendencies, like empathy, parenting instinct, and revenge. They might have predicted these values will persist across time and cultural change, because they’re produced by ingrained biological reward systems. And they would have been right.

I take this post to be mostly negative, in that it shows that "IGF" is not a unified loss function; its content is entirely dependent on the environmental context, in ways that ML loss functions are not.

If this saturates, it would probably saturate very far above human level...

Foom is a much stronger claim than this. It's saying that there will be an incredibly fast, localized intelligence explosion involving literally one single AI system improving itself. Your scenario of an "ecosystem" of independent AI researchers working together sounds more like the "slow" takeoff of Christiano or Hanson than EY-style fast takeoff.

I expect that we’d see all sorts of coincidences and hacks that make the thing run, and we’d be able to see in much more detail how, when we ask the system to achieve some target, it’s not doing anything close to “caring about that target” in a manner that would work out well for us, if we could scale up the system’s optimization power to the point where it could achieve great technological or scientific feats (like designing Drexlerian nanofactories or what-have-you).

I think this counterfactual is literally incoherent: it does not make sense to talk about what an individual neural network would do if its "optimization power" were scaled up. It's a category error. You instead need to ask what would happen if the training procedure were scaled up, and there are always many different ways that you can scale it up. For example, you can keep the data fixed while parameters increase, or scale both in lockstep; you can keep the capability of the graders fixed, or invest in more capable graders and scalable oversight techniques. So I deny that there is any fact of the matter about whether current LLMs "care about the target" in your sense. I think there probably are sensible ways of cashing out what it means for a 2023 LLM to "care about" something, but this is not it.

especially once it is generalised to other kinds of AI beside LLMs, which I imagine should be possible

The method actually already is highly general, and in fact isn't specific to deep learning at all. More work does need to be done to see how well it can steer neural net behavior in real-world scenarios, though.

then one way to do this is to make sure they have the same mean

Yep, although we actually go a bit further than that and show that making the means equal is necessary, at least if you want your method to work for general convex loss functions.
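
To make the link between class means and linear probes concrete, here is a minimal sketch for the squared-loss case only. It assumes a binary concept label and synthetic Gaussian features. The helper names `equalize_class_means` and `linear_probe_weights` are hypothetical, and shifting each class onto the overall mean is a crude stand-in for illustration, not the method discussed above. Before the shift a least-squares probe has clearly nonzero weights; after it, the weights vanish up to floating-point error.

```python
import numpy as np

def equalize_class_means(x, z):
    """Shift each class so its mean equals the overall mean.

    Crude illustration only (an assumption of this sketch), not the
    erasure method discussed in the comment.
    """
    x = x.astype(float).copy()
    overall = x.mean(axis=0)
    for label in np.unique(z):
        mask = z == label
        x[mask] += overall - x[mask].mean(axis=0)
    return x

def linear_probe_weights(x, z):
    """Least-squares linear probe with an intercept; returns the weights only."""
    design = np.hstack([x, np.ones((len(x), 1))])
    coef, *_ = np.linalg.lstsq(design, z.astype(float), rcond=None)
    return coef[:-1]

# Synthetic data (hypothetical): the label shifts every feature by 2.
rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=500)
x = rng.normal(size=(500, 4)) + 2.0 * z[:, None]

w_before = linear_probe_weights(x, z)
w_after = linear_probe_weights(equalize_class_means(x, z), z)

print("max |weight| before:", np.abs(w_before).max())
print("max |weight| after: ", np.abs(w_after).max())
```

The weights vanish because, for a binary label, the sample cross-covariance between features and label is proportional to the difference of the class means, so equal means leave the least-squares probe with nothing but its intercept.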

If a LLM outputs A rather than B, and you ask me why, then it might take me decades of work to give you a reasonable & intuitive answer.

I don’t think any complete description of the LLM is going to be intuitive to a human, because it’s just too complex to fit in your head all at once. The best we can do is to come up with interpretations for selected components of the network. Just like a book or a poem, there’s not going to be a unique correct interpretation: different interpretations are useful for different purposes.

There's also no guarantee that any of these mechanistic interpretations will be the most useful tool for what you’re trying to do (e.g. make sure the model doesn’t kill you, or whatever). The track record of mech interp for alignment is quite poor, especially compared to gradient-based methods like RLHF. We should accept the Bitter Lesson: SGD is better than you at alignment.

Using probes to measure mutual information

This is definitely wrong though, because of the data processing inequality. For any concept X, the raw input to the model always carries at least as much mutual information about X as any of its activations do.

What probing actually measures is some kind of "usable" information, see here.
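
The data processing inequality point can be checked on a toy example. This is a sketch under stated assumptions: the joint table `joint_xz`, the "activation" map `z // 2`, and the helper names are all hypothetical, chosen only so that the activation is a deterministic, lossy function of the input.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information in bits for a joint probability table p(row, col)."""
    joint = np.asarray(joint, dtype=float)
    p_row = joint.sum(axis=1, keepdims=True)
    p_col = joint.sum(axis=0, keepdims=True)
    nonzero = joint > 0
    ratio = joint[nonzero] / (p_row @ p_col)[nonzero]
    return float((joint[nonzero] * np.log2(ratio)).sum())

def push_forward(joint_xz, f, n_out):
    """Joint table of (X, f(Z)) for a deterministic map f of the input Z."""
    joint_xa = np.zeros((joint_xz.shape[0], n_out))
    for z_value in range(joint_xz.shape[1]):
        joint_xa[:, f(z_value)] += joint_xz[:, z_value]
    return joint_xa

# Hypothetical toy: concept X in {0, 1}, raw input Z in {0, 1, 2, 3}.
joint_xz = np.array([
    [0.30, 0.10, 0.05, 0.05],
    [0.05, 0.05, 0.10, 0.30],
])

# A lossy "activation" that merges inputs {0, 1} and {2, 3}.
joint_xa = push_forward(joint_xz, lambda z_value: z_value // 2, n_out=2)

info_input = mutual_information(joint_xz)
info_activation = mutual_information(joint_xa)

print(f"I(X; input)      = {info_input:.3f} bits")
print(f"I(X; activation) = {info_activation:.3f} bits")
```

However the map is chosen, the second number cannot exceed the first, which is why a probe's accuracy cannot be read as a measurement of mutual information in the activations.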
