I'm a machine learning engineer in Silicon Valley with an interest in AI alignment and safety.
Having done some small-scale experiments with the state-of-the-art models GPT-4, Claude, and (PaLM 2-era) Bard, my impressions are:
My impression is that, for use cases such as using an LLM in a scaffolded agent to do things like planning, listing subtasks, idea generation, critical feedback, and so forth, it would be vitally important to avoid it generating text from the viewpoint of someone feeling various negative/unaligned emotions (or indeed, from the viewpoint of a sociopath). The shoggoth can emulate the full range of human emotions and mindsets, even non-neurotypical ones, so be very sure that you get it to put the right mask on before using it to power an agent.
In practice, just prompting the LLM into emulating a suitably caring, honest, and helpful emotional state is likely to usually be an acceptable solution to this (at least unless handling large amounts of text that might, for an LLM's simulation of a human's emotions, induce a switch to a different emotion). However, to get a reliable emotional state that couldn't be accidentally 'jail-broken' by a lot of annoying or depressing text following the prompt, I'd feel a lot more confident using something more thorough, like RL or something built on using interpretability to locate the parts of the emotional model.
Note that for entertainment uses, like powering Character.AI's characters, you wouldn't want or need this: there it's desirable if they can accurately portray a full range of human emotions. I just don't think we want a powerful AGI that has an emulated temper it could lose, or that can emulate human deceit, greed, ambition, or lust.
Obviously, an AGI could still reinvent deceit, greed, or ambition, as convergent instrumental strategies. But currently we're handing it a fully functional library of all of these preinstalled inside our LLM. This seems deeply unwise. Ideally I'd like to power an agent with an LLM that intellectually understood the concept of deceit, but was incapable of accurately emulating a human being deceitful, because it was just too darned honest.
Intuitively, the best way to do this would be to build “sensors” and “effectors” to have inputs and outputs and then have some program decide what the effectors should do based on the input from the sensors.
I think this is extremely hard to impossible in Conway's Life, if the remaining space is full of ash (if it's empty, then it's basically trivial, just a matter of building a lot of large logic circuits, so basically all you need is a suitable compiler, and Life enthusiasts have some pretty good ones). The problem with this is that there is no way in Life to probe an area short of sending out an influence to probe it (e.g. fire some pattern of colliding gliders at it and see what gliders you get back). Establishing whether it contains empty space or ash is easy enough. But if it contains ash, probing this will perturb it, and generally also cause it to grow, and it's highly unpredictable how far the effect of any probe spreads or how long it lasts. Meanwhile, the active patches you're creating in the ash are randomly firing unexpected gliders and spaceships back at you, which you need to shield against or avoid being in the line of fire of. I think in practice it's going to be somewhere between impossible and astoundingly difficult to probe random ash well enough to identify what it is so you can figure out how to do a two-sided disassembly on it, because in probing it you make it mutate and grow. So I think clearing a large area of random ash to make space is an insoluble problem in Life.
Fundamentally, Conway's Life is a hostile environment for replicators unless it's completely empty, or at least has extremely predictable contents. Like most cellular automata, it doesn't have an equivalent of "low energy physics".
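To make the probing problem concrete, here is a minimal Python sketch of a Life simulator. The names `step`, `run`, `translate`, and `probe`, and the choice of a single block as the target, are mine and purely illustrative. The point is that the only general way to learn what a probe does to a patch of cells is to run the rules forward and look at what comes out.

```python
from collections import Counter

def step(live):
    """Advance one generation; `live` is a set of (x, y) live cells."""
    # Count, for every cell, how many live neighbours it has.
    counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # Birth on exactly 3 neighbours, survival on 2 or 3.
    return {c for c, n in counts.items() if n == 3 or (n == 2 and c in live)}

def run(live, generations):
    """Apply `step` repeatedly and return the final set of live cells."""
    for _ in range(generations):
        live = step(live)
    return live

def translate(cells, dx, dy):
    """Shift a pattern by (dx, dy)."""
    return {(x + dx, y + dy) for (x, y) in cells}

# A glider heading toward increasing x and y, and a 2x2 block (a still life).
GLIDER = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
BLOCK = {(0, 0), (1, 0), (0, 1), (1, 1)}

def probe(target, glider_offset, generations):
    """Fire one glider at `target`; return whatever is alive afterwards."""
    dx, dy = glider_offset
    return run(target | translate(GLIDER, dx, dy), generations)

if __name__ == "__main__":
    target = translate(BLOCK, 10, 10)  # stand-in for a tiny piece of ash
    after = probe(target, (0, 0), 60)
    print("cells before:", len(target), "cells after:", len(after))
```

Even for a target this simple, the outcome depends on the exact offset and phase of the incoming glider. With real ash, made of many interacting objects, you would have to run every candidate probe forward like this against contents you don't yet know, which is exactly the difficulty described above.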
The most effective way for an AI to get humans to shut it down would be for it to do something extremely nasty. For example, arranging to kill thousands of humans would get it shut down for sure.
Humans are normally agentic (sadly they can also quite often be selfish, power-seeking, deceitful, bad-tempered, untrustworthy, and/or generally unaligned). Standard unsupervised LLM foundation model training teaches LLMs how to emulate humans as text-generation processes. This will inevitably include modelling many aspects of human psychology, including the agentic ones, and the unsavory ones. So LLMs have trained-in agentic behavior before any RL is applied, or even if you use entirely non-RL means to attempt to make them helpful/honest/harmless (e.g. how Google did this to LaMDA). They have been trained on a great many examples of deceit, power-seeking, and every other kind of nasty human behavior, so RL is not the primary source of the problem.
The alignment problem is about producing something that we are significantly more certain is aligned than a typical randomly-selected human. Handing a randomly-selected human absolute power over all of society is unlikely to end well. What we need to train is a selfless altruist who (platonically or parentally) loves all humanity. For lack of better terminology: we need to create a saint or an angel.
This is very interesting: thanks for plotting it.
However, there is something that's likely to happen that might perturb this extrapolation. Companies building large foundation models are likely soon going to start building multimodal models (indeed, GPT-4 is already multimodal, since it understands images as well as text). This will happen for at least three inter-related reasons:
The question then is, does a thousand tokens-worth of text, video, and image data teach the model the same net amount? It seems plausible that video or image data might require more input to learn the same amount (depending on details of compression and tokenization), in which case training compute requirements might increase, which could throw the trend lines off. Even if not, the set of skills the model is learning will be larger, and while some things it's learning overlap between these, others don't, which could also alter the trend lines.
Existing large tech companies are using approaches like this, training or fine-tuning small models on data generated by large ones.
For example, it's helpful for the cold start problem, where you don't yet have user input to train/fine-tune your small model on because the product the model is intended for hasn't been launched yet: have a large model create some simulated user input, train the small model on that, launch a beta test, and then retrain your small model with real user input as soon as you have some.
I've been thinking for a while that one could do syllabus learning for LLMs. It's fairly easy to classify text by reading age. So start training the LLM on only text with a low reading age, and then increase the ceiling on reading age until it's training on the full distribution of text. (https://arxiv.org/pdf/2108.02170.pdf experimented with curriculum learning in early LLMs, with little effect, but oddly didn't test reading age.)
To avoid distorting the final training distribution by much, you would need to be able to raise the reading age limit fairly fast, so that by the time it's reached the maximum you've only used up, say, ten percent of the text with low reading ages, and in the final training distribution those texts are only about ten percent underrepresented. So the LLM is still capable of generating children's stories if needed (just slightly less likely to do so randomly).
The hope is that this would improve quality faster early in the training run, to sooner get the LLM to a level where it can extract more benefit from even the more difficult texts, so hopefully reach a slightly higher final quality from the same amount of training data and compute. Otherwise for those really difficult texts that happen to be used early on in the training run, the LLM presumably gets less value from them than if they'd been later in the training. I'd expect any resulting improvement to be fairly small, but then this isn't very hard to do.
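As a rough illustration of the schedule described above, here is a minimal Python sketch. Everything in it is an assumption of mine rather than a tested recipe: the linear ramp, the function names, and the representation of documents as `(text, reading_age)` pairs with reading ages already assigned by some external classifier.

```python
import random

def reading_age_ceiling(step, ramp_steps, min_age, max_age):
    """Raise the ceiling linearly from min_age to max_age, then hold it."""
    if step >= ramp_steps:
        return max_age
    return min_age + (max_age - min_age) * step / ramp_steps

def curriculum_batches(docs, batch_size, ramp_steps, min_age, max_age, seed=0):
    """Yield (step, ceiling, batch), using each document at most once.

    `docs` is a list of (text, reading_age) pairs. A batch only contains
    unused documents whose reading age is at or below the current ceiling.
    """
    rng = random.Random(seed)
    unused = list(docs)
    rng.shuffle(unused)
    step = 0
    while unused:
        ceiling = reading_age_ceiling(step, ramp_steps, min_age, max_age)
        batch, rest = [], []
        for doc in unused:
            if len(batch) < batch_size and doc[1] <= ceiling:
                batch.append(doc)
            else:
                rest.append(doc)
        if not batch and ceiling >= max_age:
            break  # whatever is left is above max_age and is never used
        unused = rest
        if batch:
            yield step, ceiling, batch
        step += 1

def low_age_fraction_used_in_ramp(docs, batch_size, ramp_steps,
                                  min_age, max_age, low_age_cutoff, seed=0):
    """Fraction of documents at or below low_age_cutoff consumed during the ramp."""
    total_low = sum(1 for _, age in docs if age <= low_age_cutoff)
    used_low = 0
    for step, _, batch in curriculum_batches(
            docs, batch_size, ramp_steps, min_age, max_age, seed):
        if step >= ramp_steps:
            break
        used_low += sum(1 for _, age in batch if age <= low_age_cutoff)
    return used_low / total_low if total_low else 0.0
```

The second helper is there to check the "only about ten percent used up" condition: for a given corpus you could shorten `ramp_steps` until the fraction it reports is acceptably small. A real pipeline would stream from disk rather than rescanning a list on every step, but the schedule logic would be the same.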
A more challenging approach would be to do the early training on low-reading-age material in a smaller LLM, potentially saving compute, and then do something like add more layers near the middle, or distill the behavior of the small LLM into a larger one, before continuing the training. Here the aim would be to also save some compute during the early parts of the training run. Potential issues would be if the distillation process or loss of quality from adding new randomly-initialized layers ended up costing more compute/quality than we'd saved/gained.
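To illustrate the "add more layers near the middle" option and why initialization matters, here is a toy Python sketch. It is not a real LLM: the "model" is a stack of scalar residual layers, each computing x + w * tanh(x), and all names are hypothetical. It also swaps in a different technique from the randomly-initialized layers mentioned above, namely zero-initializing the new residual layers so the grown model starts out computing exactly the same function as the small one. That is my assumption about one way to reduce the quality loss, not something I have tested at scale.

```python
import math
import random

def forward(weights, x):
    """Toy residual stack: each layer computes x + w * tanh(x)."""
    for w in weights:
        x = x + w * math.tanh(x)
    return x

def grow_middle(weights, n_new, init="zero", scale=0.1, seed=0):
    """Return a deeper stack with n_new layers inserted at the midpoint.

    init="zero"   -> new layers start as the identity (w = 0), so the
                     grown stack initially matches the original exactly.
    init="random" -> new layers get Gaussian weights, which generally
                     perturbs the function the small stack had learned.
    """
    rng = random.Random(seed)
    if init == "zero":
        new_layers = [0.0] * n_new
    elif init == "random":
        new_layers = [rng.gauss(0.0, scale) for _ in range(n_new)]
    else:
        raise ValueError("init must be 'zero' or 'random'")
    mid = len(weights) // 2
    return weights[:mid] + new_layers + weights[mid:]

if __name__ == "__main__":
    small = [0.5, -0.3, 0.2, 0.1]  # stand-in for the trained small model
    for init in ("zero", "random"):
        big = grow_middle(small, 2, init=init)
        drift = abs(forward(big, 1.0) - forward(small, 1.0))
        print(init, "init, output drift after growing:", drift)
```

With zero initialization the drift is exactly zero, so no compute is spent recovering lost quality, though the new layers then have to learn everything from scratch. Whether that trade is worth it compared with distillation is an empirical question.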
[In general, the Bitter Lesson suggests that sadly the time and engineering effort spent on these sorts of small tweaks might be better spent on just scaling up more.]
I'd really like to have a better solution to alignment than one that relied entirely on something comparable to sensor hardening.
What are your thoughts on how value learning interacts with E.L.K.? Obviously the issue with value learning is that it only helps with outer alignment, not inner alignment: you're transforming the problem from "How do we know the machine isn't lying to us?" to "How do we know that the machine is actually trying to learn what we want (which includes not being lied to)?" It also explicitly requires the machine to build a model of "what humans want", and then the complexity level and latent knowledge content required is fairly similar between "figure out what the humans want and then do that" and "figure out what the humans want and then show them a video of what doing that would look like".
Maybe we should just figure out some way to do surprise inspections on the vault? :-)
If we can solve enough of the alignment problem, the rest gets solved for us.
If we can get a half-assed approximate solution to the alignment problem, sufficient to semi-align a STEM-capable AGI value learner of about smart-human level well enough to not kill everyone, then it will be strongly motivated to solve the rest of the alignment problem for us, just as the 'sharp left turn' is happening, especially if it's also going Foom. So with value learning, there is a region of convergence around alignment.
Or, to reuse one of Eliezer's metaphors, if we can point the rocket on approximately the right trajectory, it will automatically lock on and course-correct from there.
Attempting to summarize the above, the author's projection (which looks very reasonable to me) is that for a reasonable interpretation of the phrase 'Transformative Artificial Intelligence' (TAI), i.e. an AI that could have a major transformative effect on human society, we should expect it by around 2030. So we should expect accelerating economic, social, and political upheavals in around that time-frame.
For the phrase Artificial General Intelligence (AGI), things are a little less clear, depending on what exactly you mean by AGI: the author's projection is that we should expect AI that matches or exceeds the usual range of human ability in many respects, and is clearly superhuman in many respects, but where there do still exist certain mental tasks that some humans are superior to AI at (and quite possibly also many physical tasks humans are still superior at, thanks to Moravec's Paradox).
The consequences of this analysis for alignment timelines are less clear, but my impression is that if any significant fraction of the AI capabilities the author is forecasting for 2030 are badly misaligned, we could easily be in deep trouble. Similarly, if any significant fraction of this set of capabilities were being actively misused under human direction, that could also cause serious trouble (where my definition of 'misuse' includes things like natural progressions from what the advertising industry or political consultants already do, but carried out much more effectively using much greater persuasion abilities enabled by AI).