Mark Xu

I do alignment research at the Alignment Research Center. Learn more about me at markxu.com/about

Sequences

Intermittent Distillations
Training Regime

Wiki Contributions

Comments

I think competitiveness matters a lot even if there are only moderate amounts of competitive pressure. The gaps in efficiency I'm imagining are less "10x worse" and more like "I only had support vector machines and you had SGD."

Humans, despite being fully general, have vastly varying ability to do various tasks, e.g. they're much better at climbing mountains than playing Go, it seems. Humans also routinely construct entire technology bases to enable them to do tasks that they cannot do themselves. This is, in some sense, a core human economic activity: the construction of artifacts that can do tasks better/faster/more efficiently than humans can do themselves. It seems like by default, you should expect a similar dynamic with "fully general" AIs. That is, AIs trained to do semiconductor manufacturing will create their own technology bases, specialized predictive artifacts, etc. and not just "think really hard" and "optimize within their own head." This also suggests a recursive form of the alignment problem, where an AI that wants to optimize human values is in a similar situation to us, where it's easy to construct powerful artifacts with SGD that optimize measurable rewards, but it doesn't know how to do that for human values/things that can't be measured.

Even if you're selecting reasonably hard for "ability to generalize" by default, the tasks you're selecting for aren't all going to be "equally difficult", and you're going to get an AI that is much better at some tasks than other tasks, has heuristics that enable it to accurately predict key intermediates across many tasks, heuristics that enable it to rapidly determine which portions of the action space are even feasible, etc. Asking that your AI can also generalize to "optimize human values" as well as the best available combination of skills that it has otherwise seems like a huge ask. Humans, despite being fully general, find it much harder to optimize for some things than others, e.g. constructing large cubes of iron versus status seeking, despite being able to in theory optimize for constructing large cubes of iron.

Not literally the best, but retargetable algorithms are on the far end of the spectrum of "fully specialized" to "fully general", and I expect most tasks we train AIs to do to have heuristics that enable solving the tasks much faster than "fully general" algorithms, so there's decently strong pressure to be towards the "specialized" side.

I also think that heuristics are going to be closer to multiplicative speed ups than additive, so it's going to be closer to "general algorithms just can't compete" than "it's just a little worse". E.g. random search is terrible compared to anything exploiting non-trivial structure (random sorting vs quicksort is, I think, a representative example, where you can go from exponential to quasilinear if you are specialized to your domain).
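To make the random sorting vs quicksort comparison concrete, here is a minimal sketch. The function names, the input size of 7, and the fixed seed are all assumptions chosen for illustration; the point is only that the structure-blind method needs on the order of n! shuffles in expectation, while quicksort needs on the order of n log n comparisons on typical inputs.

```python
import random


def random_sort(items, rng):
    """Shuffle until sorted. Exploits no structure; expected shuffles grow like n!."""
    items = list(items)
    shuffles = 0
    while any(a > b for a, b in zip(items, items[1:])):
        rng.shuffle(items)
        shuffles += 1
    return items, shuffles


def quicksort(items):
    """Simple first-element-pivot quicksort that also counts pivot comparisons."""
    comparisons = 0

    def sort(xs):
        nonlocal comparisons
        if len(xs) <= 1:
            return xs
        pivot, rest = xs[0], xs[1:]
        comparisons += len(rest)  # each remaining element is compared to the pivot
        left = [x for x in rest if x < pivot]
        right = [x for x in rest if x >= pivot]
        return sort(left) + [pivot] + sort(right)

    return sort(list(items)), comparisons


rng = random.Random(0)              # fixed seed, purely for reproducibility
data = rng.sample(range(100), 7)    # 7 distinct values (illustrative size)

sorted_random, shuffles = random_sort(data, rng)
sorted_quick, comparisons = quicksort(data)

print(f"random sort: {shuffles} shuffles")
print(f"quicksort:   {comparisons} comparisons")
```

Even at 7 elements the expected number of shuffles is 7! = 5040, while quicksort never needs more than 21 pivot comparisons here, and the gap widens very quickly as the input grows.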


One of the main reasons I expect this to not work is because optimization algorithms that are the best at optimizing some objective given a fixed compute budget seem like they basically can't be generally-retargetable. E.g. if you consider something like Stockfish, it's a combination of search (which is retargetable), sped up by a series of very specialized heuristics that only work for winning. If you wanted to retarget Stockfish to "maximize the max number of pawns you ever have", you would not be able to use [specialized for telling whether a move is likely going to win the game] heuristics to speed up your search for moves. A more extreme example is that the entire endgame table is useless for you, and you probably have to recompute the entire thing.
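A toy illustration of the same split between retargetable search and objective-specific heuristics, as a sketch only. This is A* on an open grid, not anything resembling Stockfish; the grid size, coordinates, and function names are assumptions made up for the example. The search loop accepts any goal, but the heuristic that makes it fast is tuned to one goal and stops helping when the goal changes.

```python
import heapq


def best_first_search(start, goal, heuristic, size=21):
    """A* on an open size x size grid. Returns (path_cost, nodes_expanded)."""
    frontier = [(heuristic(start), 0, start)]
    best_g = {start: 0}
    expanded = 0
    while frontier:
        _, g, node = heapq.heappop(frontier)
        if g > best_g.get(node, float("inf")):
            continue  # stale queue entry, a cheaper route was already found
        if node == goal:
            return g, expanded
        expanded += 1
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if not (0 <= nxt[0] < size and 0 <= nxt[1] < size):
                continue
            new_g = g + 1
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                heapq.heappush(frontier, (new_g + heuristic(nxt), new_g, nxt))
    return None, expanded


def manhattan_to(target):
    """Heuristic specialized to one particular goal."""
    return lambda node: abs(node[0] - target[0]) + abs(node[1] - target[1])


def no_heuristic(node):
    """Fully general: knows nothing about the objective."""
    return 0


start, goal_a, goal_b = (10, 10), (15, 10), (10, 3)  # illustrative coordinates

# Original objective: general search vs. search with the specialized heuristic.
cost_plain, expanded_plain = best_first_search(start, goal_a, no_heuristic)
cost_tuned, expanded_tuned = best_first_search(start, goal_a, manhattan_to(goal_a))

# Retargeted objective: reuse the old heuristic vs. learn a new one for goal_b.
cost_stale, expanded_stale = best_first_search(start, goal_b, manhattan_to(goal_a))
cost_fresh, expanded_fresh = best_first_search(start, goal_b, manhattan_to(goal_b))

print("goal A:", expanded_plain, "expanded (general) vs", expanded_tuned, "(specialized)")
print("goal B:", expanded_stale, "expanded (old heuristic) vs", expanded_fresh, "(new heuristic)")
```

The search code is reused unchanged for both goals, but the speedup only returns once a heuristic specific to the new goal is supplied, which in a realistic setting would have to be found by a separate learning process.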

Something like [the strategy stealing assumption](https://ai-alignment.com/the-strategy-stealing-assumption-a26b8b1ed334) is needed to even obtain the existence of a set of heuristics just as good for speeding up search for moves that "maximize the max number of pawns you ever have" compared to [telling whether a move will win the game]. Actually finding that set of heuristics is probably going to require an entirely parallel learning process.

This also implies that even if your AI has the concept of "human values" in its ontology, you still have to do a bunch of work to get an AI that can actually estimate the long-run consequences of any action on "human values", or else it won't be competitive with AIs that have more specialized optimization algorithms.

Flagging that I don't think your description of what ELK is trying to do is that accurate, e.g. we explicitly don't think that you can rely on using ELK to ask your AI if it's being deceptive, because it might just not know. In general, we're currently quite comfortable with not understanding a lot of what our AI is "thinking", as long as we can get answers to a particular set of "narrow" questions we think is sufficient to determine how good the consequences of an action are. More in “Narrow” elicitation and why it might be sufficient.

Separately, I think that ELK isn't intended to address the problem you refer to as a "sharp left turn" as I understand it. Vaguely, ELK is intended to be an ingredient in an outer-alignment solution, while it seems like the problem you describe falls roughly into the "inner alignment" camp. More specifically, but still at a high level of gloss, the way I currently see things is:

  • If you want to train a powerful AI, currently the set of tasks you can train your AI on will, by default, result in your AI murdering you.
  • Because we currently cannot teach our AIs to be powerful by doing anything except rewarding them for doing things that straightforwardly imply that they should disempower humans, you don't need a "sharp left turn" in order for humanity to end up disempowered.
  • Given this, it seems like there's still a substantial part of the difficulty of alignment that remains to be solved even if we knew how to cope with the "sharp left turn." That is, even if capabilities were continuous in SGD steps, training powerful AIs would still result in catastrophe.
  • ELK is intended to be an ingredient in tackling this difficulty, which has been traditionally referred to as "outer alignment."

Even more separately, it currently seems to me like it's very hard to work on the problem you describe while treating other components [like your loss function] like a black box, because my guess is that "outer alignment" solutions need to do non-trivial amounts of "reaching inside the model's head" to be plausible, and a lot of how to ensure capabilities and alignment generalize together is going to depend on details about how you would have prevented the AI from murdering you in [capabilities continuous with SGD] world.

ELK for learned optimizers has some more details.

If powerful AIs are deployed in worlds mostly shaped by slightly less powerful AIs, you basically need competitiveness to be able to take any "pivotal action" because all the free energy will have been eaten by less powerful AIs.

The humans presumably have access to the documents being summarized.

Here's a conversation that I think is vaguely analogous:

Alice: Suppose we had a one-way function, then we could make passwords better by...

Bob: What do you want your system to do?

Alice: Well, I want passwords to be more robust to...

Bob: Don't tell me about the mechanics of the system. Tell me what you want the system to do.

Alice: I want people to be able to authenticate their identity more securely?

Bob: But what will they do with this authentication? Will they do good things? Will they do bad things?

Alice: IDK I just think the world is likely to be generically a better place if we can better authenticate users.

Bob: Oh OK, we're just going to create this user authentication technology and hope people use it for good?

Alice: Yes? And that seems totally reasonable?

It seems to me like you don't actually have to have a specific story about what you want your AI to do in order for alignment work to be helpful. People in general do not want to die, so probably generic work on being able to more precisely specify what you want out of your AIs, e.g. for them not to be mesa-optimizers, is likely to be helpful.

This is related to complaints I have with [pivotal-act based] framings, but probably that's a longer post.

Isn't there an equilibrium where people assume other people's militaries are as strong as they can demonstrate, and people just fully disclose their military strength?
