I'm a machine learning engineer in Silicon Valley with an interest in AI alignment and safety.
The most effective way for an AI to get humans to shut it down would be for it to do something extremely nasty. For example, arranging to kill thousands of humans would get it shut down for sure.
Humans are normally agentic (sadly they can also quite often be selfish, power-seeking, deceitful, bad-tempered, untrustworthy, and/or generally unaligned). Standard unsupervised LLM foundation model training teaches LLMs how to emulate humans as text-generation processes. This will inevitably include modelling many aspects of human psychology, including the agentic ones and the unsavory ones. So LLMs have trained-in agentic behavior before any RL is applied, or even if you use entirely non-RL means to attempt to make them helpful/honest/harmless (e.g. how Google did this to LaMDA). They have been trained on a great many examples of deceit, power-seeking, and every other kind of nasty human behavior, so RL is not the primary source of the problem.
The alignment problem is about producing something that we are significantly more certain is aligned than a typical randomly-selected human. Handing a randomly-selected human absolute power over all of society is unlikely to end well. What we need to train is a selfless altruist who (platonically or parentally) loves all humanity. For lack of better terminology: we need to create a saint or an angel.
This is very interesting: thanks for plotting it.
However, there is something that's likely to happen that might perturb this extrapolation. Companies building large foundation models are likely soon going to start building multimodal models (indeed, GPT-4 is already multimodal, since it understands images as well as text). This will happen for at least three inter-related reasons:
The question then is, does a thousand tokens' worth of text, video, or image data teach the model the same net amount? It seems plausible that video or image data might require more input to learn the same amount (depending on details of compression and tokenization), in which case training compute requirements might increase, which could throw the trend lines off. Even if not, the set of skills the model is learning will be larger, and while some of those skills overlap between modalities, others don't, which could also alter the trend lines.
Existing large tech companies are using approaches like this, training or fine-tuning small models on data generated by large ones.
For example, it's helpful for the cold start problem, where you don't yet have user input to train/fine-tune your small model on because the product the model is intended for hasn't been launched yet: have a large model create some simulated user input, train the small model on that, launch a beta test, and then retrain your small model with real user input as soon as you have some.
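Here's a minimal sketch of that recipe, purely as an illustration: I'm assuming GPT-2 variants as stand-ins for the "large" and "small" models and a stock Hugging Face fine-tuning loop, none of which comes from any particular company's actual setup.

```python
# Cold-start sketch: synthesise user queries with a "large" model, fine-tune a "small"
# model on them, then swap in real user data after launch. Model names and the
# cooking-assistant framing are illustrative assumptions only.
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, pipeline)

# 1. Have the large model invent plausible user queries for the unlaunched product.
large = pipeline("text-generation", model="gpt2-large")
prompt = "Example question a user might ask a cooking assistant:\n"
synthetic = [large(prompt, max_new_tokens=40, do_sample=True)[0]["generated_text"]
             for _ in range(256)]

# 2. Fine-tune the small model on the synthetic queries.
tok = AutoTokenizer.from_pretrained("distilgpt2")
tok.pad_token = tok.eos_token
small = AutoModelForCausalLM.from_pretrained("distilgpt2")

class SyntheticDataset(torch.utils.data.Dataset):
    """Tokenised synthetic queries, with labels set up for causal-LM training."""
    def __init__(self, texts):
        self.enc = tok(texts, truncation=True, padding="max_length",
                       max_length=64, return_tensors="pt")
        self.enc["labels"] = self.enc["input_ids"].clone()
    def __len__(self):
        return self.enc["input_ids"].shape[0]
    def __getitem__(self, i):
        return {k: v[i] for k, v in self.enc.items()}

trainer = Trainer(
    model=small,
    args=TrainingArguments(output_dir="cold_start_ckpt", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=SyntheticDataset(synthetic),
)
trainer.train()
# 3. Once the beta test is producing real user queries, rerun step 2 with those
#    in place of `synthetic`.
```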
I've been thinking for a while that one could do syllabus learning for LLMs. It's fairly easy to classify text by reading age. So start training the LLM on only text with a low reading age, and then increase the ceiling on reading age until it's training on the full distribution of text. (https://arxiv.org/pdf/2108.02170.pdf experimented with curriculum learning in early LLMs, with little effect, but oddly didn't test reading age.)
To avoid distorting the final training distribution by much, you would need to be able to raise the reading-age limit fairly fast, so that by the time it reaches the maximum you've only used up, say, ten percent of the text with low reading ages, meaning that in the final training distribution those texts are only about ten percent underrepresented. So the LLM is still capable of generating children's stories if needed (just slightly less likely to do so randomly).
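To make the scheduling concrete, here's a rough sketch. It assumes the `textstat` package's Flesch-Kincaid grade is a good-enough proxy for reading age and that the corpus fits in a simple Python list, both of which are simplifications on my part rather than anything tested:

```python
# Reading-age syllabus sketch: cap the reading age of sampled documents early in
# training, raising the cap to "anything goes" over the first ~10% of steps.
import random
import textstat

def reading_age(doc: str) -> float:
    # Flesch-Kincaid grade roughly tracks US school grade (reading age minus ~5).
    return textstat.flesch_kincaid_grade(doc)

def syllabus_sampler(corpus, total_steps, ramp_fraction=0.1):
    """Yield one training document per step, with a reading-age cap that rises
    linearly from the easiest material to no cap over `ramp_fraction` of training,
    so only a small slice of the easy text gets used up early."""
    graded = [(reading_age(d), d) for d in corpus]
    min_g = min(g for g, _ in graded)
    max_g = max(g for g, _ in graded)
    for step in range(total_steps):
        progress = min(1.0, step / (ramp_fraction * total_steps))
        cap = min_g + progress * (max_g - min_g)
        eligible = [d for g, d in graded if g <= cap]
        yield random.choice(eligible)
```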
The hope is that this would improve quality faster early in the training run, getting the LLM sooner to a level where it can extract more benefit even from the more difficult texts, and so hopefully reach a slightly higher final quality from the same amount of training data and compute. Otherwise, for those really difficult texts that happen to be used early in the training run, the LLM presumably gets less value from them than if they'd come later in the training. I'd expect any resulting improvement to be fairly small, but then this isn't very hard to do.
A more challenging approach would be to do the early training on low-reading-age material in a smaller LLM, and then do something like add more layers near the middle, or distill the behavior of the small LLM into a larger one, before continuing the training. Here the aim would be to also save some compute during the early parts of the training run. Potential issues would be if the distillation process, or the loss of quality from adding new randomly-initialized layers, ended up costing more compute or quality than we'd saved or gained.
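For the "add more layers near the middle" option, here's a rough illustration of the surgery involved. It's my own construction, with a plain PyTorch encoder stack standing in for the small LLM, not a description of how anyone actually does this:

```python
# Grow a trained stack of transformer layers by inserting fresh layers at the midpoint,
# then continue training on the full-reading-age distribution.
import torch.nn as nn

d_model, n_heads = 512, 8
small_layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True) for _ in range(6)]
)
# ... train `small_layers` on low-reading-age text here ...

def grow_in_middle(layers: nn.ModuleList, n_new: int) -> nn.ModuleList:
    """Insert freshly initialised layers at the midpoint of a trained stack."""
    mid = len(layers) // 2
    new = [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
           for _ in range(n_new)]
    return nn.ModuleList(list(layers[:mid]) + new + list(layers[mid:]))

large_layers = grow_in_middle(small_layers, n_new=6)
# Continue training `large_layers` on the full distribution. One common mitigation
# (an assumption here, not something from the text above) is to initialise the new
# layers to be close to the identity, e.g. by zeroing their output projections, so
# the quality drop immediately after the surgery is small.
```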
[In general, the Bitter Lesson suggests that sadly the time and engineering effort spent on these sorts of small tweaks might be better spent on just scaling up more.]
I'd really like to have a better solution to alignment than one that relied entirely on something comparable to sensor hardening.
What are your thoughts on how value learning interacts with E.L.K.? Obviously the issue with value learning is that it only helps with outer alignment, not inner alignment: you're transforming the problem from "How do we know the machine isn't lying to us?" to "How do we know that the machine is actually trying to learn what we want (which includes not being lied to)?" It also explicitly requires the machine to build a model of "what humans want", and then the complexity level and latent knowledge content required is fairly similar between "figure out what the humans want and then do that" and "figure out what the humans want and then show them a video of what doing that would look like".
Maybe we should just figure out some way to do surprise inspections on the vault? :-)
If we can solve enough of the alignment problem, the rest gets solved for us.
If we can get a half-assed approximate solution to the alignment problem, sufficient to semi-align a STEM-capable AGI value learner of about smart-human level well enough to not kill everyone, then it will be strongly motivated to solve the rest of the alignment problem for us, just as the 'sharp left turn' is happening, especially if it's also going Foom. So with value learning, there is a region of convergence around alignment.
Or to reuse one of Eliezer's metaphors: if we can point the rocket along approximately the right trajectory, it will automatically lock on and course-correct from there.
Subproblem 1.2/2.1: Traps
Allowing traps in the environment creates two different problems:
- (Subproblem 1.2) Bayes-optimality becomes intractable in a very strong sense (even for a small number of deterministic MDP hypotheses with small number of states).
- (Subproblem 2.1) It's not clear how to talk about learnability and learning rates.
It makes some sense to consider these problems together, but different directions emphasize different sides.
Evolved organisms (such as humans) are good at dealing with traps: getting eaten is always a possibility. At the simplest level they do this by having multiple members of the species die, and using an evolutionary learning mechanism to evolve detectors for potential trap situations and some trap-avoiding behavior for this to trigger. An example of this might be the human instinct of vertigo near cliff edges — it's hard not to step back. The cost of this is that some number of individuals die from the traps before the species evolves a way of avoiding the trap.
As a sapient species using the scientific method, we have more sophisticated ways to detect traps. Often we may have a well-supported model of the world that lets us predict and avoid a trap ("nuclear war could well wipe out the human race, let's not do that"). Or we may have an unproven theory that predicts a possible trap, but that also predicts some less dangerous phenomenon. So rather than treating the universe like a multi-armed bandit and jumping into the potential trap to find out what happens and test our theory, we perform the lowest risk/cost experiment that will get us a good Bayesian update on the support for our unproven theory, hopefully at no cost to life or limb. If that raises the theory's support, then we become more cautious about the predicted trap, or if it lowers it, we become less. Repeat until your Bayesian updates converge on either 100% or 0%.
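As a toy numerical version of that loop (my illustration, nothing more): suppose the unproven theory predicts both the trap and some mild, cheaply observable phenomenon, and we keep updating our credence in the theory from cheap experiments on the mild phenomenon.

```python
# Bayesian updating on a trap-predicting theory from low-cost experiments.
def bayes_update(prior, p_obs_if_true, p_obs_if_false, observed):
    """Posterior P(theory) after seeing (or not seeing) the mild predicted phenomenon."""
    like_true = p_obs_if_true if observed else 1 - p_obs_if_true
    like_false = p_obs_if_false if observed else 1 - p_obs_if_false
    num = like_true * prior
    return num / (num + like_false * (1 - prior))

p_theory = 0.5                               # initial credence in the theory
for observed in [True, True, False, True]:   # outcomes of four cheap experiments
    p_theory = bayes_update(p_theory, p_obs_if_true=0.9, p_obs_if_false=0.2,
                            observed=observed)
    print(f"credence in the trap-predicting theory: {p_theory:.2f}")
# High credence: treat the predicted trap as real and steer well clear of it.
# Low credence: relax a little, but keep running cheap experiments.
```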
An evolved primate heuristic for this is "if nervous of an unidentified object, poke it with a stick and see what happens". This of course works better on, say, live/dead snakes than on some other perils that modern technology has exposed us to.
The basic trick here is to have a world model sophisticated enough that it can predict traps in advance, and we can find hopefully non-fatal ways of testing them that don't require us to jump into the trap. This requires that the universe has some regularities strong enough to admit models like this, as ours does. Likely most universes that didn't would be uninhabitable and life wouldn't evolve in them.
My exposure to the AI safety and ethics community's thinking has primarily been via LW/EA and papers, so it's entirely possible that I have a biased sample.
I had another thought on this. Existing deontological rules are intended for humans. Humans are optimizing agents, and they're all of about the same capacity (members of a species that seems, judging by the history of stone tool development, to have been sapient for maybe a quarter million years, so possibly only just over the threshold for sapience). So there is another way in which deontological rules reduce cognitive load: generally we're thinking about our own benefit and that of close family and friends. It's 'not our responsibility' to benefit everyone in the society — all of them are already doing that, looking out for themselves. So that might well explain why standard deontological rules concentrate on avoiding harm to others, rather than doing good to others.
AGI, on the other hand, firstly may well be smarter than all the humans, possibly far smarter, so may have the capacity to do for humans things they can't do for themselves, possibly even for a great many humans. Secondly, its ethical role is not to help itself and its friends, but to help humans: all humans. It ought to be acting selflessly. So its duty to humans isn't just to avoid harming them and let them go about their business, but to actively help them. So I think deontological rules for an AI, if you tried to construct them, should be quite different in this respect than deontological rules for a human, and should probably focus just as much on helping as on not harming.
I think this is extremely hard, if not impossible, in Conway's Life, if the remaining space is full of ash (if it's empty, then it's basically trivial, just a matter of building a lot of large logic circuits, so all you need is a suitable compiler, and Life enthusiasts have some pretty good ones). The problem is that there is no way in Life to probe an area short of sending out an influence to probe it (e.g. fire some pattern of colliding gliders at it and see what gliders you get back). Establishing whether it contains empty space or ash is easy enough. But if it contains ash, probing this will perturb it, and generally also cause it to grow, and it's highly unpredictable how far the effect of any probe spreads or how long it lasts. Meanwhile, the active patches you're creating in the ash are randomly firing unexpected gliders and spaceships back at you, which you need to shield against, or else stay out of their line of fire. I think in practice it's going to be somewhere between impossible and astoundingly difficult to probe random ash well enough to identify what it is so you can figure out how to do a two-sided disassembly on it, because in probing it you make it mutate and grow. So I think clearing a large area of random ash to make space is an insoluble problem in Life.
Fundamentally, Conway's Life is a hostile environment for replicators unless it's completely empty, or at least has extremely predictable contents. Like most cellular automata, it doesn't have an equivalent of "low energy physics".
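As a small demonstration of how badly even a minimal probe perturbs things (my toy experiment, with a simple numpy Life step rather than a proper simulator): flip one cell in a random soup and count how far the two runs have diverged a couple of hundred generations later.

```python
# Flip a single cell of a random soup and watch the perturbation spread.
import numpy as np

def life_step(grid):
    """One Conway's Life update on a toroidal grid of 0s and 1s."""
    neighbours = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)
    )
    return ((neighbours == 3) | ((grid == 1) & (neighbours == 2))).astype(np.uint8)

rng = np.random.default_rng(0)
soup = rng.integers(0, 2, size=(64, 64)).astype(np.uint8)
probed = soup.copy()
probed[32, 32] ^= 1                          # the "probe": a one-cell perturbation

a, b = soup, probed
for _ in range(200):
    a, b = life_step(a), life_step(b)
print("cells that differ after 200 generations:", int((a != b).sum()))
```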