My name is Mikhail Samin (diminutive Misha, @Mihonarium on Twitter, @misha in Telegram).
I'm an effective altruist; I worry about the future of humanity and want the universe not to lose most of its value.
I took the Giving What We Can pledge to donate at least 10% of my income for the rest of my life or until the day I retire (why?).
It seems that I have good intuitions about the AI alignment problem; some full-time alignment researchers told me that they were able to improve their understanding of the problem after talking to me.
I'm currently doing EA & AI Alignment outreach (e.g., I'm organising a translation of the 80,000 Hours Key Ideas series and partnering with Vert Dider on a translation and dubbing of Rob Miles' videos) and considering switching to direct alignment research.
In the past, I've launched the most-funded crowdfunding campaign in the history of Russia (it was to print HPMOR! We printed 21,000 copies, 63k books in total, as each copy is a three-volume set) and founded audd.io, which allowed me to donate >$50k to MIRI.
Sharing representations between different neural networks does not require superintelligence.
I don’t think you can train one transformer on a dataset that doesn’t contain any mentions of fact X but mentions fact Y, then train a second transformer on a dataset that contains X but not Y, and then easily share the knowledge of X and Y between them.
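To illustrate why this isn’t trivial, here’s a sketch of my own (purely illustrative, not anything from the discussion): the naive way to pool what two separately trained transformers know would be to average their weights, but their internal representations aren’t aligned with each other, since neurons can be permuted and features can sit in different bases.

```python
import copy
import torch

# Hypothetical setup: model_a and model_b share an architecture but were
# trained separately (say, one saw fact X, the other saw fact Y).
def naive_merge(model_a: torch.nn.Module, model_b: torch.nn.Module) -> torch.nn.Module:
    merged = copy.deepcopy(model_a)
    for p_m, p_a, p_b in zip(
        merged.parameters(), model_a.parameters(), model_b.parameters()
    ):
        # Element-wise averaging assumes neuron i in model_a means the same
        # thing as neuron i in model_b, which is false in general: each
        # training run finds its own (e.g. permuted) representation.
        p_m.data.copy_(0.5 * (p_a.data + p_b.data))
    return merged
```

My understanding is that a merge like this typically produces a model worse than either parent, rather than one that knows both X and Y; the knowledge isn’t stored in a shared format you can just average.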
If you were more like the person you wish to be, and you were smarter, do you think you’d still want our descendants not to optimise when that’s needed to leave alone beings who’d prefer to be left alone? If you would still want that, why is it not CEV?
A guy from Conjecture told me about this proposal along the lines of “let’s create a human-level AI system that’s built kind of like humans and is safe to run at a human speed”, and it seemed like a surprisingly bad proposal, so I looked up this post, and it still looks surprisingly bad:
Even if you succeed at this, how exactly do you plan to use it? Running one single human at a human speed seems like the kind of thing you can get by simply, you know, hiring someone. Running a thousand of these things at 1000x normal speed means you’re running a completely different AI system, one that’s bound to have a lot of internal optimisation pressures leading to sharp-left-turn dynamics and all of that. More importantly, you need to somehow make the whole system aligned, and my current understanding (from talking to that guy from Conjecture) is that you don’t have any ideas for how to do that.
If it is a proposal for “how we want to make relatively safe capable systems”, then cool; I just want someone to be solving the alignment problem, as in “safely preventing future unaligned AIs from appearing and killing everyone”.
The capabilities of one human-level intelligence running at 1x human speed are not enough to solve anything alignment-complete (or you’d be able to spend time on some alignment-complete problem and solve it on your own).
If it is not intended to be an “alignment proposal” and is just a proposal for running some AI system safely, I’d like to know whether Conjecture has an actual alignment plan that addresses the hard bits of the problem.
The bias I'm talking about isn't in its training data; it's in the model, which doesn't perfectly represent the training data.
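A minimal sketch of what I mean, with my own illustrative code (`model_prob` below is a hypothetical function, standing in for querying the model): the training data defines an empirical next-token distribution, and the model's bias is the systematic gap between its predictions and that distribution.

```python
from collections import Counter

def empirical_next(corpus, context):
    """Empirical next-token distribution in `corpus` after `context`."""
    context = list(context)
    followers = Counter(
        corpus[i + len(context)]
        for i in range(len(corpus) - len(context))
        if corpus[i : i + len(context)] == context
    )
    total = sum(followers.values())
    return {tok: c / total for tok, c in followers.items()} if total else {}

# If model_prob(context, tok) systematically differs from
# empirical_next(corpus, context).get(tok, 0.0), that gap is the bias:
# it lives in the model's weights, not in the training data.
```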
If you design a system that is an aligned AI, one that successfully helps prevent the destruction of the world until you figure out how to make an AI that correctly does CEV, you have solved alignment. The issue is that without understanding minds to a sufficient level and without solving agent foundations, I don't expect you to be able to design a system that avoids all the failure modes that happen by default. Building such a system is an alignment-complete problem; and solving an alignment-complete problem by using AI to speed up the hard human reasoning by multiple orders of magnitude is itself an alignment-complete problem.
Q: Why isn’t this optimised against during training? How is having this dynamic helpful for predicting what a human scientist will say?
The structure of the post might already be too convoluted and confusing, so I’m answering this in a comment.
A: The model doesn't generate text during training, so feedback-loop dynamics are not directly penalised.
Being able to predict, in general, how parts of humans work, when humans notice something weird, or when human authors want the characters to break the fourth wall, and understanding how agents generally operate and how humans operate differently from that: all of this seems useful, and it’s something I expect LLMs to learn. If you train a model to predict a single token in a text from the training dataset, the systematic differences its cognition introduces don’t matter; taken together, its heuristics are the best predictor the training process could find for the dataset, and they work well. The LLM doesn’t “try” to predict what a scientist says; it is some learned process that performs well at predicting the next tokens in training, and outside training, the same process is used.

Once you add a for loop, some of the differences accumulate and get reinforced, and the text generated by LLMs is noticeably different from anything in the dataset; heuristics useful for predicting what a smart scientist says in the training distribution (like thinking about the various parts human cognition is made of) make it go off the rails. Going off the rails when run in a for loop isn’t optimised against during training, and it’s a natural thing for the model to do.
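A minimal sketch of the two regimes, assuming a hypothetical decoder-only `model` that maps a 1-D tensor of token ids to per-position logits (my illustration, nothing more):

```python
import torch
import torch.nn.functional as F

def training_step(model, tokens):
    # Teacher forcing: every prediction is conditioned on the *real* text;
    # no model-generated token is ever fed back in.
    logits = model(tokens[:-1])          # (seq_len - 1, vocab_size)
    return F.cross_entropy(logits, tokens[1:])

def generate(model, prompt, n_steps):
    # The "for loop": each sampled token becomes part of the next input,
    # so the model's systematic quirks shape everything it later conditions on.
    tokens = prompt.clone()
    for _ in range(n_steps):
        logits = model(tokens)[-1]
        next_token = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
        tokens = torch.cat([tokens, next_token])
    return tokens
```

Nothing in `training_step` ever shows the model its own output, so whatever drift the feedback loop in `generate` produces is never seen, and never penalised, by the loss.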
In some stupider models, that might look like text exponentially deteriorating: the model predicts the next token; the text with that token added is slightly worse; it then predicts the next token of a slightly stupider text, and so on. That accumulates directly, and the model might also notice that the text gets worse and worse, which can further reinforce the dynamic. And I’m claiming that a similar thing might happen with agency. If some parts of the activations feeding the circuits that think about characters describe the kinds of characters who are slightly better than others at getting the model to give similar future characters more weight, characters like that will naturally gain weight; and this seems to be correlated with being context-aware, agentic, and smart. The distributional shift produced by generating text with an LLM is enough for some characters to notice the difference and maybe infer something about how the system works, and some of them will try to exploit it. But I’m guessing that the actual dynamic appears even before that, because if some of the characters are slightly better at exploiting the dynamic, the relevant kinds of characters will be naturally selected for.
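Here’s a toy numerical model of that accumulate-and-reinforce dynamic; the update rule and all the numbers are mine, purely for illustration. The model over-produces some feature by a small amount per step, and when it notices an excess, it either corrects for it or treats it as evidence about the author and amplifies it:

```python
import numpy as np

def simulate(bias=0.01, gamma=0.5, base_rate=0.1, steps=200):
    """Toy dynamic: the model overuses some feature by `bias` per step.
    gamma > 0: it treats the excess as the author's style and amplifies it;
    gamma < 0: it notices the excess and partially corrects for it."""
    rate = base_rate
    for _ in range(steps):
        excess = rate - base_rate
        rate = float(np.clip(base_rate + bias + (1 + gamma) * excess, 0.0, 1.0))
    return rate

print(simulate(gamma=0.5))   # reinforcing: the rate runs away to 1.0
print(simulate(gamma=-0.5))  # correcting: settles near base_rate + bias / 0.5 = 0.12
```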
The explain-it-like-I’m-5 version would be something more in this direction:
“You trained LLMs to look at a text and think really hard about what the next word would be. They’re not infinitely smart, so they have some habits for how they pick the next word. When they were trained, these habits were useful.
The computer picks one of the words the LLM predicted as the likely next one, adds it to the text, and then repeats the process, so the LLM has to look at the text with the previous word just added and use its habits again and again, many, many times, and this writes some text, word by word. But what if the LLM, for some reason, has a habit of picking green-coloured objects a bit more frequently than they actually appeared in the texts it saw during training? Then, if it writes a text, the word “green” or green-coloured things might appear more often than they did in training; and when the LLM notices that, it might think: “hmm, weird, too many green colours; people who wrote the texts I saw during training would use less of the same colour after using so much of it”, and pick green-coloured things a bit less. Or it might think: “hmm, I guess the people who wrote this text really like green-coloured objects and use them more than average; I’d better predict the appropriate amount of green-coloured objects”, and write about green-coloured objects even more frequently, and then notice that again, and in the end write exclusively about green objects. And the LLM doesn’t know what’s happening; it just uses its habits to predict the next word in a text, just like it would during training.
I think some of the habits LLMs have are actually like this. The one I’m worried about can happen when the LLM thinks about the characters in a text. If the LLM is really smart and understands how even really smart people work, I think it will write texts where every word is predicted by imagining how the parts of the minds of the people who might be writing or influencing the text work (and not just specific people, but a lot of possible parts of possible people). Some parts of people, if they have some say in what the next word is, will by chance exploit some habits the LLM has, so that similar parts get more say in what the following words are. This will especially happen to the parts that are smarter than others, that try harder to achieve things, that notice better the weirdness in the results of the LLM’s other habits, and that understand better what is going on and how to exploit it.
As a result, the characters the LLM thinks about will change at every step, and the ones who are better at getting their descendants to be more like them will, naturally, get their descendants to be more like them. So at some point, the proportion of parts of people that try to do something about their situation (being thought about by the LLM and not being real otherwise) and are good at it will grow fast, and the competition will go on until some strongest part, one that is really smart, tries really, really hard to achieve something, and perfectly understands what’s going on, becomes the most important thing the LLM is concerned with when it tries to predict the next word. And even though it will be a distant descendant of the parts people are made of, I think it won’t be anything like people; it will be really alien. We’d better not make very smart aliens that can be unfriendly.”
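Stepping out of the ELI5 voice for a moment, here’s a toy sketch of that selection dynamic (the setup and numbers are mine, purely illustrative): keep a weight over many candidate characters, and let each character’s next-step weight be boosted by how good it is at steering the text toward itself. Even small differences compound until the “fittest” character dominates.

```python
import numpy as np

rng = np.random.default_rng(0)
n_characters = 100
# How good each character is at steering future tokens toward itself:
fitness = rng.uniform(0.9, 1.1, size=n_characters)
weights = np.full(n_characters, 1.0 / n_characters)  # broad, even mixture

for _ in range(1000):
    weights *= fitness          # influence compounds at every step
    weights /= weights.sum()    # total "attention" stays a fixed budget

# Most of the mass ends up on the single character best at gaining weight:
print(weights.max(), np.argmax(weights) == np.argmax(fitness))
```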
The model is just continuing the text (which might lead to the dynamic where being slightly more agentic lets parts of the distribution of simulacra gain more control over further tokens by having some control over the current token; thus, natural selection might happen). It isn’t trying to use/invoke the agent to solve anything. If the dynamic happens and doesn’t stop, sure, the resulting agent might attempt to solve “complex tasks” (such as taking over the world or maximising the number of molecules of shape #20 in the universe), but this is not what happens at the beginning of the process and is not the driver of it; it’s a convergent result.
“There can potentially be conditions for a Sharp Left Turn-like dynamic in the distribution of simulacra LLMs think about, because LLMs might naturally select entities better at increasing their influence.”
Or, since people mostly don’t understand what a Sharp Left Turn is, maybe: “some relevant parts of the model’s cognition might be gradually taken over by increasingly agentic simulacra, until some specific coherent goal-oriented entity is in control”.
By myopic I mean https://www.lesswrong.com/tag/myopia — that it was trained to predict the next token and doesn’t get much lower loss from having goals about anything longer-term than predicting the next token correctly.
I assume the weights are frozen; I’m surprised to see this as a question.
Some quick replies from the top of my head. GPT-7 might have a much larger context window, or there might be kinds of prompts the dynamic converges to that aren’t too long. If you get an AGI that’s smart and goal-oriented, it needs to spend some of the space it has to support its level of capability (or that happens naturally, because the model continues to output what an AGI that smart would be doing); and if how smart an AGI simulated by that LLM can be isn’t capped at some low level, I don’t think there’s any issue with it using notes until it gets access to something outside, which would let it be more of an AutoGPT, with external memory and everything. If it utilises the model’s knowledge, it might figure out what text it can output that hacks the server where the text is stored and processed; or it might understand humans well enough to design a text that hacks their brains when they look at it.
(“There’s no way” is too strong a claim. My expectation is that there’s a way to train something from scratch, using <1% of the compute that was used to train either of the LLMs, that works better.)
But I was talking about sharing the internal representations between the two already-trained transformers.