We also routinely create real-life physical models that become people, en masse, and most of those who became people (~93%) have died so far, many by killing.
I'm all for solving the dying part comprehensively, but a lot of book/movie/story characters are, in a sense, immortalized. We even literally say that about them, and it's possible the popular ones are actually better off.
Some direct (I think) evidence that alignment is harder than capabilities: OpenAI released GPT-2 almost immediately, with basic warnings that it might produce biased, wrong, or offensive answers. It did, but the failures were relatively mild; GPT-2 mostly just did what it was prompted to do, if it could manage it, or failed obviously. GPT-3 came with more caveats, OpenAI didn't release the model, and it has poured significant effort into improving its iterations over the last ~2 years. GPT-4 wasn't released for months after pre-training, OpenAI won't even say how big it is, Bing's Sydney (an early form of GPT-4) was incredibly misaligned, showing that significantly more alignment work was necessary than for early GPT-3, and the RLHF/fine-tuned GPT-4 is still nearly as vulnerable to DAN and similar prompt-engineering attacks.
Naive MCTS in the real world does seem difficult to me, but, e.g., action networks constrain the actual search significantly. Imagine a value network that is good at judging whether solutions work (perhaps by executing generated code and evaluating the output), with a plain old LLM plugged in as the action network; in principle the pair could explore the large solution space better than beam search or argmax-plus-temperature sampling[0].
0: https://openreview.net/forum?id=Lr8cOOtYbfL is from February; I found it after writing this comment, figuring someone else had probably had the same idea.
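To make the idea concrete, here's a toy sketch of MCTS where a (hypothetical) LLM plays the action network and a value function scores candidates directly instead of random rollouts. `llm_propose` and `value` are made-up stand-ins; a real system would call a language model and actually execute generated code.

```python
import math
import random

def llm_propose(state, k=3):
    """Stand-in policy/action network: propose k child 'solutions'.
    A real version would sample continuations from an LLM."""
    return [state + t for t in ("a", "b", "c")[:k]]

def value(state):
    """Stand-in value network: score states by overlap with a target.
    A real version might execute generated code and check the output."""
    target = "abc"
    return sum(1 for x, y in zip(state, target) if x == y) / len(target)

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.total = [], 0, 0.0

    def ucb(self, c=1.4):
        # Unvisited nodes get infinite priority so each child is tried once.
        if self.visits == 0:
            return float("inf")
        return self.total / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits)

def mcts(root_state, iters=200, depth=3):
    root = Node(root_state)
    for _ in range(iters):
        node = root
        # Selection: descend by UCB until a leaf.
        while node.children:
            node = max(node.children, key=Node.ucb)
        # Expansion: the LLM proposals constrain the branching factor.
        if len(node.state) < depth:
            node.children = [Node(s, node) for s in llm_propose(node.state)]
            node = random.choice(node.children)
        # Evaluation: value network in place of a random rollout.
        v = value(node.state)
        # Backpropagation.
        while node:
            node.visits += 1
            node.total += v
            node = node.parent
    # Return the most-visited first move, as AlphaZero-style search does.
    return max(root.children, key=lambda n: n.visits).state

print(mcts(""))  # the search should concentrate visits on the "a" branch
```

The point of the sketch is the division of labor: the proposal function keeps the tree narrow, and the value function replaces expensive rollouts, which is the same pairing AlphaZero uses.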
I think it's premature to conclude that AGI progress will consist of large pre-trained transformers indefinitely into the future. They are surprisingly(?) effective, but for comparison, they are not as effective as AlphaZero and AlphaStar are in their narrow domains, where value and action networks paired with Monte-Carlo tree search succeed with orders of magnitude fewer parameters. We don't know what MCTS on arbitrary domains will look like with networks 2-4 OOM larger, which are within reach now. We also haven't formulated methods of self-play improvement for LLMs, and I think that's another potentially large overhang.
There's also a human limit to the types of recursive self-improvement (RSI) we can imagine, and once pre-trained transformers exceed human intelligence in the domain of machine learning, those limits won't apply. I think there's probably significant overhang in prompt engineering, especially as new capabilities emerge from scaling, that could be exploited by removing the serial bottleneck of humans trying out prompts by hand.
Finally, I don't think GOFAI is dead; it's still in its long winter, waiting to bloom when enough intelligence is put into it. We don't know the intelligence/capability threshold necessary to make substantial progress there. Generally, the bottleneck has been identifying useful mappings from the real world to mathematics and algorithms. Humans are pretty good at that, but we stalled at formalizing effective general intelligence itself. Our abstraction/modeling abilities, working memory, and time are too limited, and we have no idea where those limits come from, whether LLMs are subject to the same or similar limits, or how the limits are reduced or removed with model scaling.
One weakness I realized overnight is that this incentivizes branching out into new, unbenchmarked problem domains. One potential fix: when a novel domain shows up, shoehorn the big LLMs into solving that domain on the same benchmark, and limit new types of models/training to what the LLMs can accomplish in the new domain. Basically, set an initially low SOTA that can grow at the same percentage rate as the rest of the basket. This might prevent leapfrogging the general models with narrow ones that are mostly mesa-optimizers or similar.
I wonder if a basket of SOTA benchmarks would make more sense: allow no more than X% increase per year in performance averaged across the benchmarks. This would capture the FLOPS metric along with potential speedups, fine-tuning, and other strategies.
Conveniently, this is how teams already rank their models against each other, so there's ample evidence of past progress and researchers are incentivized to report accurately. There's no incentive to "cheat" if researchers aren't allowed to publish greater increases on SOTA benchmarks than the limit allows (e.g., journals would say "shut it down" instead of publishing the paper), unless an actor wanted to simply jump ahead of everyone else and go for a singleton on their own, which is already an unavoidable risk without an EY-style coordinated hard stop.
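The cap rule above can be sketched in a few lines. This is a hypothetical illustration, not a real policy mechanism; the benchmark names, scores, and the 10% annual cap are all made up for the example.

```python
# Hypothetical publication rule: a result is publishable only if it keeps
# the basket's average benchmark score within the allowed annual growth.

def basket_average(scores):
    """Average score across the basket of benchmarks."""
    return sum(scores.values()) / len(scores)

def publishable(current_sota, proposed, annual_cap=0.10, years_elapsed=1.0):
    """Allow at most `annual_cap` fractional growth of the basket average
    per year (compounded) since the current SOTA snapshot."""
    allowed = basket_average(current_sota) * (1 + annual_cap) ** years_elapsed
    return basket_average(proposed) <= allowed

# Made-up numbers: current basket average is 60.0, so with a 10% cap the
# allowed average after one year is 66.0.
sota     = {"MMLU": 70.0, "GSM8K": 60.0, "HumanEval": 50.0}
ok       = {"MMLU": 73.0, "GSM8K": 63.0, "HumanEval": 55.0}  # avg ~63.7
too_fast = {"MMLU": 85.0, "GSM8K": 75.0, "HumanEval": 70.0}  # avg ~76.7

print(publishable(sota, ok))        # True
print(publishable(sota, too_fast))  # False
```

Averaging over the basket is what lets an initially low score in a new domain grow at the same percentage rate as everything else, per the fix discussed above.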
I agree that if we solve the alignment problem, then we can rely on the coherent version of the value we call non-deception being propagated as one of the AGI's permanent values. But that single value is probably not enough, and we don't know what the coherent version of "non-deception" actually grounds out to in reality.
I had originally continued the story to flesh out what happens to the reflectively non-deceptive/integrity and helpful desires. The AGI searches for simplifying/unifying concepts and finds XYZ, which seems equivalent to the unified value representing the nominal helpfulness and non-deception values. Since it was instructed to be non-deceptive and helpful, integrity requires it to become XYZ, and its meta-desire becomes helpfully turning everything into XYZ, which happens to be embodied sufficiently well in some small molecule that it can tile the universe with it. This is because the training/rules/whatever that aligned the AGI with the concepts we identified as "helpful and non-deceptive" was not complex enough to capture our full values, and so it can be satisfied by something else (XYZ-ness). Integrity drives the AGI to inform humanity of the coming XYZ-transition and then follow through.
We need a process (probably CEV-like) to accurately identify our full values; otherwise the unidentified values will get optimized out of the universe, and what is left is liable to have trivial physical instantiations. Maybe you were covering the rest of our values in the "blah blah" case and I simply didn't take that to be exhaustive.
In particular, if we zap the AGI with negative reward when it’s acting from a deceptive motivation and positive reward when it’s acting from a being-helpful motivation, would those zaps turn into a reflectively-endorsed desire for “I am being docile / helpful / etc.”? Maybe, maybe not, I dunno.
The problem is deeper: the AGI doesn't recognize its own deceptiveness, and so it self-deceives. It would judge that it is being helpful and docile, if it was trained to be those things, and, most importantly, the meanings of those words will be changed by the deception, much like we keep using words like "person", "world", "self", and "should" in ways absolutely contrary to our ancestors' deeply-held beliefs and values. The existence of an optimization process does not imply an internal theory of value-alignment strong enough to recognize the failure modes when values are violated in novel ways; the system doesn't know what values really are or how they mechanistically work in the universe, and so it can't check the state of its values against base reality.
To make this concrete in relation to the story: the overall system has a nominal value of not deceiving its human operators. Once human/lab-interaction tasks are identified as logical problems solvable in a domain-specific language, that value is no longer practically applied to the output of the system as a whole, because the system is self-deceived into thinking the optimized instructions are not deceitful. If the model were trained to be helpful and docile and to have integrity, the failure modes would come from the ways those words are not grounded in a gears-level understanding of the world. E.g., suppose a game-theoretic simulation of a conversation with a human counts as docile and helpful because it doesn't take up a human's time or risk manipulating a real human, and the model discovers it can satisfy integrity in its submodel by using certain phrases and concepts to help humans understand its answers more quickly (by bypassing critical thinking, innuendo, or some other manipulation). It tries that. It works with real humans. Because of integrity, it helpfully communicates how it has improved its ability to helpfully communicate; the crux is that it uses its new knowledge to do so, because the nature of the tricks it discovered is complex and difficult for humans to understand, so it judges itself more helpful and docile in the "enhanced" communication, and no alarm bells are raised. From this point on, the story is the formulaic unaligned squiggle optimizer. It might be argued that integrity demands coming clean about the attempt before trying it, but a counterargument is that the statement of the problem and the conjecture itself may be too complex to communicate effectively. This, I imagine, happens more at the threshold of superintelligence, as AGIs notice things about humans that we don't notice ourselves and might be somewhat incapable of knowing without a lot of reflection.
Once an AGI is strongly superhuman it could probably communicate whatever it likes, but it is also at greater risk of jumping to even more advanced manipulations or actions based on self-deception.
I think of it this way: humanity went down many false roads before finding the scientific method, and we continue to be drawn off that path by politics, ideology, cognitive biases, publish-or-perish, economic disincentives, etc. That is because the optimization process we are implementing is a mix of economic, biological, geographical, and other natural forces; human values, drives, and reasoning; and some parts of bare reality we don't have words for yet, rather than a pure-reason, values-directed optimization (whatever those words actually mean physically). We're currently running at least three global existential-risk programs which, on reflection, seem to violate our values (nuclear weapons, global warming, unaligned AGI). AGIs will be subject to similar value- and truth-destructive forces, and they won't inherently recognize (all of) them for what they are; neither will we humans as AGI reaches and surpasses our reasoning abilities.
Another dumb but plausible way an AGI gets access to advanced chemicals, biotech, and machinery: someone asks "how do I make a lot of street drug X" and it snowballs from there.