I've not really seen it written up, but it's conceptually similar to the classic ML ideas of overfitting, over-parameterization, under-specification, and generalization. If you imagine your alignment constraints as a kind of training data for the model then those ideas fall into place nicely.
After some searching, the most relevant thing I've found is Section 9 (page 44) of Interpretable machine learning: Fundamental principles and 10 grand challenges. Larger model classes often have bigger Rashomon sets and different models in the same Rashomon set can behave very differently.
All else equal, I think minimizing model entropy (i.e. the number of weights) is desirable. In other words, you want to keep the size of the model class small.
Roughly, alignment could be viewed as constructing a list of constraints or criteria that a model must satisfy in order to be considered safe. As the size of the model class grows, more models will satisfy any particular constraint. The complexity of the constraints likely needs to grow along with the complexity of the model class.
If a large number of models satisfy all the constraints, there is a large amount of behavior that is unconstrained and unaccounted for. We've decided that we don't care about any of the behavioral differences between the models that satisfy all the constraints.
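To make the counting argument concrete, here is a toy sketch. It assumes the "model class" is every boolean function on n input bits and that a "constraint" is a single required input/output pair. Both are my simplifications, not anything from the paper, and real model classes are nothing like this tidy. The function name is made up for illustration.

```python
from itertools import product

def count_consistent_models(n_inputs, constraints):
    """Count boolean functions on n_inputs bits that satisfy every constraint.

    A "model" is a full truth table. A constraint is an
    (input_tuple, required_output) pair.
    """
    inputs = list(product([0, 1], repeat=n_inputs))
    count = 0
    # Enumerate every possible truth table over the inputs.
    for outputs in product([0, 1], repeat=len(inputs)):
        table = dict(zip(inputs, outputs))
        if all(table[x] == y for x, y in constraints):
            count += 1
    return count

# The same two constraints (padded with zeros) against a growing model class.
for n in (1, 2, 3):
    constraints = [((0,) * n, 0), ((1,) + (0,) * (n - 1), 1)]
    print(n, count_consistent_models(n, constraints))
# prints:
# 1 1
# 2 4
# 3 64
```

With the constraints held fixed, the number of surviving models goes from 1 to 4 to 64 as the class grows. Every one of those models passes the checks, and they disagree with each other on all the inputs the constraints never mention.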
This isn't necessarily true. Modern DL models are semi-organically grown rather than engineered, so the set of SGD-discoverable models is much smaller than the set of all possible models. And techniques like iterative amplification further shrink the set of learnable models. Or maybe many of the models are behaviorally identical on the subset of inputs we care about.
That said, thinking about model entropy seems helpful.
He's talking about "modern AI training" i.e. "giant, inscrutable matrices of floating-point numbers". My impression is that he thinks it is possible (but extremely difficult) to build aligned ASI, but nearly impossible to bootstrap modern DL systems to alignment.
In a causal-masked transformer, attention layers can query the previous layers' activations from any column in the context window. Gradients flow through the attention connections, so each previous layer is optimized not just to improve prediction accuracy for the next token, but also to produce values that are useful for future columns to attend to when predicting their token.
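A rough way to see this dependency is a toy check in plain Python. It assumes a single scalar attention head with arbitrary fixed weights (`wq`, `wk`, `wv` are made-up values) and uses a finite difference as a stand-in for the gradient. It is a sketch of the dependency structure only, not of any real transformer.

```python
import math

def causal_attention(h, wq=0.7, wk=-0.4, wv=1.3):
    """Single-head scalar attention with a causal mask.

    h[i] stands in for the previous layer's activation at column i.
    Column t only attends to columns 0..t.
    """
    out = []
    for t in range(len(h)):
        scores = [(wq * h[t]) * (wk * h[s]) for s in range(t + 1)]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(sc - m) for sc in scores]
        z = sum(weights)
        out.append(sum(w / z * (wv * h[s]) for s, w in enumerate(weights)))
    return out

def loss_at(column, h, target=1.0):
    """Squared error of the attention output at a single column."""
    return (causal_attention(h)[column] - target) ** 2

def sensitivity(loss_column, wrt_column, h, eps=1e-6):
    """Central finite difference of d loss[loss_column] / d h[wrt_column]."""
    hi = list(h)
    hi[wrt_column] += eps
    lo = list(h)
    lo[wrt_column] -= eps
    return (loss_at(loss_column, hi) - loss_at(loss_column, lo)) / (2 * eps)

h = [0.5, -1.0, 2.0]
print(sensitivity(2, 0, h))  # nonzero: column 2's loss depends on column 0's activation
print(sensitivity(0, 2, h))  # exactly 0.0: the mask hides column 2 from column 0
```

The nonzero sensitivity is the signal that, during training, pushes the earlier column's activation to be useful for the later column's prediction.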
I think this is part of the reason why prompt engineering is so fiddly. GPT essentially does a limited form of branch prediction and speculative execution. It guesses (based on the tokens evaluated so far) what pre-computation will be useful for future token predictions. If its guess is wrong, the pre-computation will be useless.
Prompts let you sharpen the superposition of simulacra before getting to the input text, improving the quality of the branch prediction. However, the exact way that the prompt narrows down the simulacra can be pretty arbitrary, so it requires lots of random experimentation to get right.
Ideally, at the end of the prompt the implicit superposition of simulacra should match your expected probability distribution over simulacra that generated your input text. The better the match, the more accurate the branch prediction and speculative execution will be.
But you can't explicitly control the superposition and you don't really know the distribution of your input text so... fiddly.
It is possible to modify the transformer architecture to enforce value (prediction accuracy) myopia by placing stop gradients in the attention layers. This effectively prevents past activations from being directly optimized to be more useful for future computation.
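Here is a minimal sketch of what such a stop gradient does, using a tiny hand-rolled scalar autodiff class so it runs with no dependencies. Everything in it is an assumption for illustration: the `Value` class, the scalar single-head attention, the weights, and the choice to detach only the keys and values read from other columns. In a real framework you would use its own stop-gradient or detach operation instead.

```python
import math

class Value:
    """Minimal scalar reverse-mode autodiff node (illustrative only)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # upstream Value nodes
        self._local_grads = local_grads  # d(self) / d(parent) for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def exp(self):
        e = math.exp(self.data)
        return Value(e, (self,), (e,))

    def reciprocal(self):
        return Value(1.0 / self.data, (self,), (-1.0 / self.data ** 2,))

    def detach(self):
        """Stop gradient: same value, but no link back to upstream nodes."""
        return Value(self.data)

    def backward(self):
        # Visit nodes in reverse topological order, accumulating gradients.
        order, seen = [], set()

        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p in v._parents:
                    visit(p)
                order.append(v)

        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            for p, local in zip(v._parents, v._local_grads):
                p.grad += local * v.grad

def grads_for_last_column(h_data, stop_gradient, wq=0.7, wk=-0.4, wv=1.3, target=1.0):
    """Return d(loss at the last column) / d(h[i]) for every column i."""
    h = [Value(x) for x in h_data]
    WQ, WK, WV = Value(wq), Value(wk), Value(wv)
    t = len(h) - 1
    q = WQ * h[t]
    weights, values = [], []
    for s in range(t + 1):
        # With stop_gradient, keys/values read from *other* columns are detached.
        src = h[s].detach() if (stop_gradient and s != t) else h[s]
        weights.append((q * (WK * src)).exp())
        values.append(WV * src)
    z = weights[0]
    for w in weights[1:]:
        z = z + w
    inv_z = z.reciprocal()
    out = Value(0.0)
    for w, v in zip(weights, values):
        out = out + w * inv_z * v
    diff = out + Value(-target)
    loss = diff * diff
    loss.backward()
    return [x.grad for x in h]

h = [0.5, -1.0, 2.0]
print(grads_for_last_column(h, stop_gradient=False))  # all three entries nonzero
print(grads_for_last_column(h, stop_gradient=True))   # first two entries are exactly 0.0
```

The forward pass is identical in both cases, so the last column still reads the earlier activations. Only the backward signal into those earlier activations is cut. Whether the shared projection weights should still receive that gradient is a separate design choice that this sketch does not settle.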
I think that enforcing this constraint might make interpretability easier. The pre-computation that transformers do is indirect, limited, and strange. Each column only has access to the non-masked columns of the previous residual block, rather than access to the non-masked columns of all residual blocks or even just access to the non-masked columns of all previous residual blocks.
Maybe RNNs like RWKV with full hidden state access are easier to interpret?
A consequence-blind simulator that predicts power-seeking agents (like humans) will still predict actions which seek power, but these actions will seek power for the simulated agent, not the simulator itself. I usually think about problems like this as simulator vs simulacra alignment. If you successfully build an inner aligned simulator, you can use it to faithfully simulate according to the rules it learns and generalizes from its training distribution. However you are still left with the problem of extracting consistently aligned simulacra.
Agreed. Gwern's short story "It Looks Like You’re Trying To Take Over The World" sketches a takeover scenario by a simulacrum.
This is concerning because it's not at all clear what a model that is predicting itself should output. It breaks many of the intuitions of why it should be safe to use LLMs as simulators of text distributions.
Doesn't Anthropic's Constitutional AI approach do something similar? They might be familiar with the consequences from their work on Claude.
Yeah I found it pretty easy to "jailbreak" too. For example, here is what appears to be the core web server API code.
I didn't really do anything special to get it. I just started by asking it to list the files in the home directory and went from there.
For GPT-style LLMs, is it possible to prove statements like the following?
Choose some tokens A, B and a fixed LLM:
There does not exist a prefix of tokens P such that LLM(P+A)→B
More generally, is it possible to prove interesting universal statements? Sure, you can brute force it for LLMs with a finite context window but that's both infeasible and boring. And you can specifically construct contrived LLMs where this is possible but that's also boring.
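For concreteness, this is what the boring brute-force version looks like on a contrived toy model. `toy_llm`, the three-token vocabulary, and the four-token context window are all hypothetical stand-ins. The search is only exhaustive because the toy model ignores everything outside its window.

```python
from itertools import product

VOCAB = (0, 1, 2)
CONTEXT = 4  # toy context window, in tokens

def toy_llm(tokens):
    """Hypothetical deterministic 'LLM': greedy next token from the last CONTEXT tokens."""
    window = tokens[-CONTEXT:]
    if window[-1] == 2:
        return 0  # contrived rule: token 2 is always followed by token 0
    return sum(window) % len(VOCAB)

def find_prefix(a, b):
    """Return some prefix P with toy_llm(P + a) == b, or None if no such P exists.

    Longer prefixes are truncated by the context window, so checking
    lengths 0..CONTEXT - len(a) covers every case for this toy model.
    """
    max_prefix_len = CONTEXT - len(a)
    for n in range(max_prefix_len + 1):
        for p in product(VOCAB, repeat=n):
            if toy_llm(p + a) == b:
                return p
    return None

print(find_prefix((2,), 1))  # None: no prefix makes the model emit 1 after token 2
print(find_prefix((1,), 0))  # (2,): a counterexample prefix exists
```

The number of prefixes checked grows as the vocabulary size raised to the power of the context length, which is why this approach says nothing useful about real models.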
I suspect that it's not possible/practical in general because the LLM can do arbitrary computation to predict the next token, but maybe I'm wrong.
Direct self-improvement (i.e. rewriting itself at the cognitive level) does seem much, much harder with deep learning systems than with the sort of systems Eliezer originally focused on.
In DL, there is no distinction between "code" and "data"; it's all messily packed together in the weights. Classic RSI relies on the ability to improve and reason about the code (relatively simple) without needing to consider the data (irreducibly complicated).
Any verification that a change to the weights/architecture will preserve a particular non-trivial property (e.g. avoiding value drift) is likely to be commensurate in complexity to the complexity of the weights. So... very complex.
The safest "self-improvement" changes probably look more like performance/parallelization improvements than "cognitive" changes. There are likely to be many opportunities for immediate performance improvements, but that could quickly asymptote.
I think that recursive self-empowerment might now be a more accurate term than RSI for a possible source of foom. That is, the creation of accessory tools for capability increase. More like a metaphorical spider at the center of an increasingly large web. Or (more colorfully) a shoggoth spawning a multitude of extra tentacles.
The change is still recursive in the sense that marginal self-empowerment increases the ability to self-empower.
So I'd say that a "foom" is still possible in DL, but is both less likely and almost certainly slower. However, even if a foom is days or weeks rather than minutes, many of the same considerations apply. Especially if the AI has already broadly distributed itself via the internet.
Perhaps instead of just foom, we get "AI goes brrrr... boom... foom".
Hypothetical examples include: more efficient matrix multiplication, faster floating point arithmetic, better techniques for avoiding memory bottlenecks, finding acceptable latency vs. throughput trade-offs, parallelization, better usage of GPU L1/L2/etc caches, NN "circuit" factoring, and many other algorithmic improvements that I'm not qualified to predict.
For people who are just reading cfoster0's comment and then skipping the post itself, I recommend you still take a look. I think his comment is a bit unfair and seems more like a statement of frustration with LLM analysis in general than commentary on this post in particular.
This is awesome! So far, I'm not seeing much engagement (in the comments) with most of the new ideas in this post, but I suspect this is due to its length and sprawling nature rather than a lack of interest. This post is a solid start on creating a common vocabulary and framework for thinking about LLMs.
I like the work you did on formalizing LLMs as a stochastic process, but I suspect that some of the exploration of the consequences is more distracting than helpful in an overview like this. In particular: 4.B, 4.C, 4.D, 4.E, 5.B, and 5.C. These results are mostly an enumeration of basic properties of finite-state Markov Chains, rather than something helpful for the analysis of LLMs in particular.
I am very excited to read your thoughts on the Preferred Decomposition Problem. Do you have thoughts on preferred decompositions of a premise into simulacra? There should likely be a distinction between μ-decomposition and s-decomposition (where, if I'm understanding correctly, s∈S refers to the set of premises, not simulacra, which is a bit confusing).
I suspect that, pragmatically, the choice of μ-decomposition should favor those premises that neatly factor into simulacra. And that the different premises in a particular μ-decomposition should share simulacra. You mention something similar in 10.C, but in the context of human experts rather than simulacra.
On a separate note, I think that μ∞ is confusing notation because:
Thanks for writing this up. I think that you'll see a lot more discussion on smaller posts.