As a writing exercise, I'm writing an AI Alignment Hot Take Advent Calendar - one new hot take, written every day for 25 days. Or until I run out of hot takes.
Natural abstractions are patterns in the environment that are so convenient and so useful that most right-thinking agents will learn to take advantage of them. But what if humans and modern ML are too lazy to be right-thinking?
One way of framing this point is in terms of gradient starvation. The reason neural networks don't explore all possible abstractions (aside from the expense) is that once they find the first way of solving a problem, they don't really have an incentive to find a second way - it doesn't give them a higher score, so they don't. When gradient starvation is strong, the loss landscape has a lot of local minima that the network can roll into and that aren't easily connected to the global minimum, so which abstractions the network ends up using will depend strongly on the initial conditions.
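To make the dependence on initial conditions concrete, here is a minimal sketch. It is not a faithful model of gradient starvation in a real network. It is plain gradient descent on a made-up one-dimensional loss with two basins, and the function names, step size, and starting points are all assumptions chosen for illustration.

```python
# Toy illustration (all names and constants are hypothetical):
# a 1-D loss with a shallow basin near x = +1 and a deeper basin near x = -1.

def loss(x):
    # Double well tilted by a small linear term, so the left basin is lower.
    return (x * x - 1.0) ** 2 + 0.3 * x

def grad(x):
    # Derivative of the loss above with respect to x.
    return 4.0 * x * (x * x - 1.0) + 0.3

def descend(x, lr=0.05, steps=500):
    # Plain gradient descent: purely local search, no exploration.
    for _ in range(steps):
        x -= lr * grad(x)
    return x

right_end = descend(0.5)   # starts in the shallow basin
left_end = descend(-0.5)   # starts in the deeper basin

print("start  0.5 -> x = %.3f, loss = %.3f" % (right_end, loss(right_end)))
print("start -0.5 -> x = %.3f, loss = %.3f" % (left_end, loss(left_end)))
```

Both runs stop where the gradient vanishes, but only the run started on the left reaches the lower basin. The run started on the right has no local signal telling it that a better solution exists elsewhere.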
Regularization and exploration can help ameliorate this problem, but often come with catastrophic forgetting - if a neural net finds a strictly better way to solve the problem it's faced with, it might forget all about the previous way. When we imagine a right-thinking agent that learns natural abstractions, we often imagine something that's intrinsically motivated to learn lots of different ways of solving a problem, and that doesn't erase its memory of interesting methods just because they're not on the Pareto frontier.
So that's what I mean by "lazy"/"not lazy", here. Neural networks, or humans, are lazy if they're parochial in solution-space, doing local search in a way that sees them get stuck in what a less-lazy optimizer might consider to be ruts.
It's important to note that laziness is not an unambiguously bad property. First, it's usually more efficient. Second, maybe we don't want our neural net to actually search through the weird and adversarial parts of parameter-space, and local search prevents it from doing so. Alex Turner et al. have recently been making arguments like this fairly forcefully. Still, we don't want maximal laziness, especially not if we want to find natural abstractions like the various meanings of "human values."
I might be attacking a strawman or a bailey here, I'm not totally sure. I've been using "natural abstraction" here as if it just means an abstraction that would be useful for a wide variety of agents to have in their toolbox. But we might also use "natural abstractions" to denote the vital abstractions, those that aren't merely nice to have, but that you literally can't complete certain tasks without using. In that second sense, neural networks are always highly incentivized to learn relevant natural abstractions, and you can easily tell when they do so by measuring their loss.
But as per yesterday, there are often multiple similarly-powerful ways to model the world, in particular when modeling humans and human values. There might be hard-core vital abstractions for various human-interaction tasks, but I suspect they're abstractions like "discrete object," not anything nearly so far into the leaves of the tree as "human values." And when I see informal speculation about natural abstractions, it usually strikes me as being about the less strict "useful for most agents" abstractions.
Ultimately, I expect laziness to cause both artificial neural nets and humans to miss out on some sizeable fraction of abstractions that most agents would find useful. What to do? There are options:
- Build an AI that isn't lazy. But laziness is useful, and anyhow maybe we don't want an AI to explore all the extrema. So build an AI that's less lazy in a controlled way. Requires research on AI architectures, and might have to sneakily borrow from the other options to specify the shape of the remaining laziness.
- Redefine "right-thinking" to involve a human-like local search. This moves you closer to shard theory or other even-more-anthropomorphic alignment schemes. This may give up some nice properties of universally-natural abstractions, but you do keep working on basically the same technology for picking out abstractions learned by an AI. Requires a really good picture of what "human-like local search" means.
- Use information about humans in the optimization process itself. This might look like Stuart's picture of concept extrapolation, or maybe it would look like a self-reflective AI that tries to direct its own learning process. Really gives up on neutrality, and instead tries to be efficient by learning the concepts humans want learned. Requires research on architectures and human-computer interaction.
The thing I usually have in mind these days is stronger than the first but weaker than the second. Roughly speaking: natural abstractions should be convergent for distributed systems produced by local selection pressures. That's a stronger condition than "useful to have in toolbox" (since not all useful structures are accessible to local search), but weaker than "literally can't complete certain tasks without it".
I think there's probably a continuous spectrum of usefulness of abstractions, all the way from actively unhelpful and confusing up to extremely helpful for a realistically compute- and data-limited real-world agent. Like, having the right abstractions enables this limited agent to do things and learn things it otherwise couldn't with its limited resources. Being able to unlearn/overwrite/forget bad, unhelpful abstractions and heuristics is probably a very useful ability. My guess is that this is going to become an increasingly important and discussed area of research.
Are you imagining the training process doing this, or humans doing it after training?
I was imagining humans deliberately designing and testing a training process that did this automatically. I haven't thought about how to define, much less automatically detect, unhelpful abstractions or heuristics though.