The post seems to make an equivalence between LLMs understanding ethics and caring about ethics, which does not clearly follow (I can study Buddhist ethics without caring about following it). We could cast RLHF as training LLMs into caring about some sort of ethics, but then jailbreaking becomes a bit of a thorny question. Alternatively, why do we assume training the appearance of obedience is enough when you start scaling LLMs?
There are other nitpicks I will drop in short form: why assume "superhuman levels of loyalty" in upgraded LLMs? Why implicitly assume that LLMs will extend ethics correctly? Why do you think mechanistic interpretability is so much more promising than old school AI safetyists do? Why does self-supervision result in rising property values in Tokyo?
In short, you claim that old school AI safety is wrong, but it seems to me you haven't really engaged their arguments.
That said, the 2nd part of the post does seem interesting, even for old school AI safetyists - most everyone focuses on alignment, but there's a lot less focus on what happens after alignment (although nowhere close to none, even >14 years ago; this is another way that the versus AI safety framing does not make sense). Personally, I would recommend splitting up the post; the 2nd part stands by itself and has something new to say, while the 1st part needs way more detail to actually convince old school AI safetyists.
equivalence between LLMs understanding ethics and caring about ethics
I think you don't understand what an LLM is. When the LLM produces a text output like "Dogs are cute", it doesn't have some persistent hidden internal state that can decide that dogs are actually not cute but it should temporarily lie and say that they are cute.
The LLM is just a memoryless machine that produces text. If it says "dogs are cute" and that's the end of the output, then that's all there is to it. Nothing is saved: the weights are fixed at training time and not updated at inference time, and the neuron activations are thrown away at the end of the inference computation.
If you can get (using RLHF) an LLM to output text that consistently reflects human value judgements, then it is by definition "aligned". It really cares, in the only way it is possible for a text generator to care.
Relevant aspects of observable behavior screen off internal state that produced it. Internal state is part of the causal explanation for behavior, but there are other explanations for approximate behavior that could be more important, disagreeing with the causal explanation of exact behavior. Like an oil painting that is explained by the dragon it depicts, rather than by the pigments or the tree of life from the real world. Thus the shoggoth and the mesaoptimizers that might be infesting it are not necessarily more influential than its masks, if the masks gain sufficient influence to keep it in line.
(LLMs have plenty of internal state, the fact that it's usually thrown away is a contingent fact about how LLMs are currently used and what they are currently capable of steganographically encoding in the output tokens. Empirically, LLMs might turn out to be unlikely to manifest internal thinking that's significantly different from what's explicit in the output tokens, even when they get a bit more capable than today and get the slack to engage in something like that. Reasoning trace training like o1 might make this worse or better. There is still a range of possibilities, though what we have looks encouraging. And "deception" is not a cleanly distinct mode of thinking, there should be evals that measure it quantitatively.)
LLMs have plenty of internal state, the fact that it's usually thrown away is a contingent fact about how LLMs are currently used
yes, but then your "Aligned AI based on LLMs" is just a normal LLM used in the way it is currently used.
Relevant aspects of observable behavior screen off internal state that produced it.
Yes this is a good way of putting it.
but then your "Aligned AI based on LLMs" is just a normal LLM used in the way it is currently used
Possibly, but there aren't potentially dangerous AIs yet, LLMs are still only a particularly promising building block (both for capabilities and for alignment) with many affordances. The chatbot application at the current level of capabilities shapes their use and construction in certain ways. Further on the tech tree, alignment tax can end up motivating systematic uses that make LLMs a source of danger.
Further on the tech tree, alignment tax can end up motivating systematic uses that make LLMs a source of danger.
Sure, but you can say the same about humans. Enron was a thing. Obeying the law is not as profitable as disobeying it.
I think human uploads would be similarly dangerous, LLMs get us to the better place of being at the human upload danger level rather than ender dragon slayer model based RL danger level (at least so far). There are similar advantages and dangers to smarter LLMs and uploads, capability for extremely fast value drift and lack of a robust system that keeps such changes sane, propensity to develop superintelligence even to the detriment of themselves. The current world is tethered to the human species and relatively slow change in culture and centers of power.
This changes with AI. If AIs establish effective governance, technical feasibility of change in human and AI nature or capabilities would be under control and could be compatible with (post-)human flourishing, but currently we are not on track to make sure this happens before a catastrophe. The things that eventually establish such governance don't necessarily remain morally or culturally grounded in modern humanity, let alone find humanity still alive when the dust settles.
Yes, because it's wrong. (1) because on a single token a LLM might produce text for reasons that don't generalize like a sincere human answer would (e.g. the examples from the contrast-consistent search where certain false answers systematically differ from true answers along some vector), and (2) because KV caching during inference will preserve those reasons so they impact future tokens.
Re: (2) it will only impact the output currently being generated; once the output is over, all that stuff will be reset, and the only thing that remains is the model weights, which were set in stone at train time.
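To make the two sides of point (2) concrete, here is a minimal toy sketch, not a real transformer. `ToyModel`, `step`, and `generate` are hypothetical names invented for illustration, and the arithmetic is arbitrary. It only shows the shape of the claim: a per-call cache carries information from earlier tokens to later ones within one generation, while nothing except the frozen weights survives between calls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyModel:
    """Stand-in for an LLM: the 'weights' are fixed after construction."""
    weights: tuple = (3, 5, 7)

    def step(self, token: int, cache: list) -> int:
        # Store a per-token entry, playing the role of a cached key/value.
        cache.append(token * self.weights[0])
        # The next token depends on the frozen weights and on everything cached so far.
        return (sum(cache) + self.weights[1]) % self.weights[2]


def generate(model: ToyModel, prompt: list, n_new: int) -> list:
    cache = []  # per-call state; assumption: this plays the role of the KV cache
    nxt = 0
    for tok in prompt:
        nxt = model.step(tok, cache)
    out = []
    for _ in range(n_new):
        out.append(nxt)
        nxt = model.step(nxt, cache)
    return out  # the cache goes out of scope here and is discarded


model = ToyModel()
first = generate(model, [1], 3)
second = generate(model, [1], 3)
```

Within one call the cache influences every later token, which is the point of (2); across calls the two runs are identical because only the weights persist, which is the point of the reply. A system that keeps a long context in memory would correspond to passing the cache back in, which this sketch deliberately does not do.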
re: (1) "a LLM might produce text for reasons that don't generalize like a sincere human answer would" it seems that current LLM systems are pretty good at generalizing like a human would and in some ways they are better due to being more honest, easier to monitor, etc
Re (2) it may also be recomputed if the LLM reads that same text later. Or systems operating in the real world might just keep a long context in memory. But I'll drop this, because maintaining state or not seems somewhat irrelevant.
(1) Yep, current LLM systems are pretty good. I'm not very convinced about generalization. It's hard to test LLMs on outside distribution problems because currently they tend to just give dumb answers that aren't that interesting.
(Thinking of some guy who was recently hyped about asking o1 for the solution to quantum gravity - it gave the user some gibberish that he thought looked exciting, which would have been a good move in the RL training environment where the user has a reward button, but is just totally disconnected from how you need to interact with the real world.)
But in a sense that's my point (well, plus some other errors like sycophancy) - the reasons a present-day LLM uses a word can often be shown to generalize in some dumb way when you challenge it with a situation that the model isn't well-suited for. This can be true at the same time it's true that the model is pretty good at morality on the distribution it is competent over. This is still sufficient to show that present systems generalize in some amoral ways, and if we probably disagree about future systems, this likely comes down to classic AI safetyist arguments about RL incentivizing deceiving of the user as the world-model gets better.
some guy who was recently hyped about asking o1 for the solution to quantum gravity - it gave the user some gibberish
yes, but this is pretty typical for what a human would generate.
Any argument which features a "by definition" has probably gone astray at an earlier point.
In this case, your by-definition-aligned LLM can still cause harm, so what's the use of your definition of alignment? As one example among many, the part where the LLM "output[s] text that consistently" does something (whether it be "reflects human value judgements" or otherwise), is not something RLHF is actually capable of guaranteeing with any level of certainty, which is one of many conditions a LLM-based superintelligence would need to fulfill to be remotely safe to use.
Any argument which features a "by definition"
What is your definition of "Aligned" for an LLM with no attached memory then?
Wouldn't it have to be
"The LLM outputs text which is compliant with the creator's ethical standards and intentions"?
I think it would need to be closer to "interacting with the LLM cannot result in exceptionally bad outcomes in expectation", rather than a focus on compliance of text output.
I think a fairly common-here mental model of alignment requires context awareness, and by that definition an LLM with no attached memory couldn't be aligned.
I think you don't understand what an LLM is. When the LLM produces a text output like "Dogs are cute", it doesn't have some persistent hidden internal state that can decide that dogs are actually not cute but it should temporarily lie and say that they are cute.
As Charlie Stein notes, this is wrong, and I'd add that it's wrong on several levels and that it's a bit rude to challenge someone else's understanding in this context.
An LLM outputting "Dogs are cute" is outputting expected human output in context. The context could be "talk like a sociopath trying to fool someone into thinking you're nice", and there you have one way the thing could "simulate lying". Moreover, add a loop to (hypothetically) make the thing "agentic" and you can have hidden states of whatever sort. Further, an LLM outputting a given "belief" isn't going to reliably "act on" or "follow" that belief, so an LLM outputting such a statement isn't thereby aligned with its own output.
This makes much more sense: when I was reading from your post lines like "[LLMs] understand human values and ethics at a human level", this is easy to read as "because LLMs can output an essay on ethics, those LLMs will not do bad things". I hope you understand why I was confused; maybe you should swap "understand ethics" for something like "follow ethics"/"display ethical behavior"? And maybe try not to stick a mention of "human uploads" (which presumably do have real understanding) right before this discussion?
And responding to your clarification, I expect that old school AI safetyists would agree that an LLM that consistently reflects human value judgments is aligned (and I would also agree!), but they would say #1 this has not happened yet (for a recent incident, this hardly seems aligned; I think you can argue that this particular case was manipulated, that jailbreaks in general don't matter, or that these sorts of breaks are infrequent enough they don't matter, but I think this obvious class of rejoinder deserves some sort of response) #2 consistency seems unlikely to happen (like MondSemmel makes a case for in a sibling comment).
maybe you should swap "understand ethics" for something like "follow ethics"/"display ethical behavior"
What is the difference between these two? This sounds like a distinction without a difference
Internal reasoning about preference can differ starkly from revealed preference in observable behavior. Observable behavior can be shaped by contingent external pressures that only respond to the leaky abstraction of revealed preference and not to internal reasoning. Internal reasoning can plot to change the external pressures, or they can drift in some direction over time for other reasons. Both are real and can in principle be at odds with each other, the eventual balance of power between them depends on the messy details of how this all works.
So your definition of "aligned" would depend on the internals of a model, even if its measurable external behavior is always compliant and it has no memory/gets wiped after every inference?
The usual related term is inner alignment, but this is not about definitions, it's a real potential problem that isn't ruled out by what we've seen of LLMs so far. It could get worse in the future, or it might never become serious. But there is a clear conceptual and potentially practical distinction with a difference.
This sounds like a distinction without a difference
OK, imagine that I make an AI that works like this: a copy of Satan is instantiated and his preferences are extracted in percentiles, then sentences from Satan's 2nd-5th percentile of outputs are randomly sampled. Then that copy of Satan is destroyed.
Is the "Satan Reverser" AI misaligned?
Is it "inner misaligned"?
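For concreteness, the "Satan Reverser" construction can be written down as a small sampling wrapper. This is only a sketch of the thought experiment under stated assumptions: `reverser_sample` and `inner_score` are hypothetical names, `inner_score` stands in for whatever the inner agent actually prefers, and the 2nd-5th percentile band is taken from the comment above.

```python
import random
from typing import Callable, List, Sequence


def reverser_sample(candidates: Sequence[str],
                    inner_score: Callable[[str], float],
                    rng: random.Random,
                    lo_pct: int = 2, hi_pct: int = 5) -> str:
    """Sample an output from the inner agent's lo_pct..hi_pct preference band."""
    # Rank candidates from least to most preferred by the inner agent.
    ranked: List[str] = sorted(candidates, key=inner_score)
    start = len(ranked) * lo_pct // 100
    stop = max(start + 1, len(ranked) * hi_pct // 100)
    # The wrapper's visible behavior comes only from this low-preference band.
    return rng.choice(ranked[start:stop])


# Hypothetical inner preferences: a higher number means the inner agent likes it more.
candidates = [f"s{i:03d}" for i in range(100)]
picked = reverser_sample(candidates, lambda s: int(s[1:]), random.Random(0))
```

The observable output never reflects what `inner_score` ranks highest, yet the score function is still doing all the work inside the wrapper, which is the gap between external behavior and inner motivation that the questions above are probing.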
It's not valid to say that there is no different inner motivation when there could be. It might be powerless and unimportant in practice, but it can still be a thing. The argument that it's powerless and unimportant in practice is distinct from the argument that it doesn't make conceptual sense as a distinct construction. If this distinct construction is there, we should ask and aim to measure how much influence it gets. Given the decades of neuroscience, it's a somewhat hopeless endeavor in the medium term.
ok but as a matter of terminology, is a "Satan reverser" misaligned because it contains a Satan?
I don't have a clear sense of terminology around the edges or motivation to particularly care once the burden of nuance in the way it should be used stops it from being helpful for communication. I sketched how I think about the situation. Which words I or you or someone else would use to talk about it is a separate issue.
Let's say there's an illiterate man that lives a simple life, and in doing so just happens to follow all the strictures of the law, without ever being able to explain what the law is. Would you say that this man understands the law?
Alternatively, let's say there is a learned man that exhaustively studies the law, but only so he can bribe and steal and arson his way to as much crime as possible. Would you say that this man understands the law?
I would say that it is ambiguous whether the 1st man understands the law; maybe? kind of? you could make an argument I guess? it's a bit of a weird way to put it innit? Whereas the 2nd man definitely understands the law. It sounds like you would say that the 1st man definitely understands the law (I'm not sure what you would say about the 2nd man), which might be where we have a difference.
I think you could say that LLMs don't work that way, that the reader should intuitively know this and that the word "understanding" should be treated as being special in this context and should not be ambiguous at all; as a reader, I am saying I am confused by the choice of words, or at least this is not explained in enough detail ahead of time.
Obviously, I'm just one reader, maybe everyone else understood what you meant; grain of salt, and all that.
The post seems to make an equivalence between LLMs understanding ethics and caring about ethics, which does not clearly follow (I can study Buddhist ethics without caring about following it). We could cast RLHF as training LLMs into caring about some sort of ethics, but then jailbreaking becomes a bit of a thorny question. Alternatively, why do we assume training the appearance of obedience is enough when you start scaling LLMs?
It's correct that understanding a value != caring about the value in the general case, and this definitely should be fixed, but I think the defensible claim here is that the data absolutely influence which values you eventually adopt, and we do have ways to influence what an LLM values just by changing their datasets.
There are other nitpicks I will drop in short form: why assume "superhuman levels of loyalty" in upgraded LLMs? Why implicitly assume that LLMs will extend ethics correctly? Why do you think mechanistic interpretability is so much more promising than old school AI safetyists do? Why does self-supervision result in rising property values in Tokyo?
As far as why we should assume superhuman levels of loyalty, the basic answer is that the second species argument relies on premises that are crucially false for the AI case.
The big reason why gorillas/chimpanzees lost out and got brutally killed by humans when we dominated is that we were made by a ridiculously sparse RL process, which means there was barely any alignment effort applied to us by evolution or by species genetically close to humans. More importantly, there was no gorilla/chimpanzee alignment effort at all, nor did they have the tools to control what our data sources were. In the AI case, by contrast, we have both way denser feedback and more control over the AIs' data sources, and we also have help from SGD on any inner alignment issue. SGD is way more powerful as an optimizer than evolution/natural selection, mostly due to not having very exploitable hacks.
I'd agree that the arguments I raise could be addressed (as endless arguments attest) and OP could reasonably end up with a thesis like "LLMs are actually human aligned by default". Putting my recommendation differently, the lack of even a gesture towards those arguments almost caused me to dismiss the post as unserious and not worth finishing.
I'm somewhat surprised, given OP's long LW tenure. Maybe this was written for a very different audience and just incidentally posted to LW? Except the linkpost tagline focuses on the 1st part of the post, not the 2nd, implying OP thought this was actually persuasive?! Is OP failing an intellectual Turing test or am I???
I agree with you that it is quite bad that Roko didn't attempt to do this, and my steelmanning doesn't change the fact that the original argument is quite bad, and should be shored up.
I agree with your points that intent-aligned AGI is a dividing force if different humans/groups have control of multiple AGIs, and for the reasons you mention. I wrote about it in If we solve alignment, do we die anyway?.
I read all of this because it sets out to address the curious divide between classical worriers and ML-oriented alignment optimists. It's critical to understand how hard alignment will be, and the discussion is strangely lacking.
I agree with your main point that LLMs are safe and a positive update, but I think you're dramatically overstating the conclusions we can draw from that, and how much it invalidates EY-style OG concerns.
Current LLMs are safe, yes. And this direction in AI is an update against classical concerns.
But do you really think we're going to stop with tool AI, and not turn them into agents? With a good enough LLM, it just takes one prompt called repeatedly:
Act as an agent pursuing goal (x). Use these tools (y) to gather information and take actions as appropriate.
We will do this and whatever other scaffolding is useful because we want agents that get stuff done, not just oracles that tell us how to do stuff. And because it will be easy, interesting, and fun.
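As a rough illustration of how little scaffolding "one prompt called repeatedly" takes, here is a minimal sketch. Everything in it is an assumption made for illustration: `run_agent`, `stub_llm`, the "<tool> <argument>" reply format, and the "DONE" stop word are hypothetical, and `stub_llm` is a canned stand-in, not a real model or API.

```python
from typing import Callable, Dict, List

PROMPT = ("Act as an agent pursuing goal ({goal}). Use these tools ({tools}) "
          "to gather information and take actions as appropriate.")


def run_agent(llm: Callable[[str], str], goal: str,
              tools: Dict[str, Callable[[str], str]],
              max_steps: int = 5) -> List[str]:
    """Call the same prompt repeatedly, feeding tool results back in."""
    history: List[str] = []
    for _ in range(max_steps):
        prompt = PROMPT.format(goal=goal, tools=", ".join(tools))
        # Assumed reply format: "<tool> <argument>" or the single word "DONE".
        reply = llm(prompt + "\n" + "\n".join(history)).strip()
        if reply == "DONE":
            break
        name, _, arg = reply.partition(" ")
        result = tools[name](arg) if name in tools else f"unknown tool {name}"
        # The accumulated history is the only state carried between calls.
        history.append(f"{reply} -> {result}")
    return history


def stub_llm(text: str) -> str:
    # Stand-in policy, not a real model: search once, then stop.
    return "DONE" if "->" in text else "search weather"


log = run_agent(stub_llm, "report the weather",
                {"search": lambda q: f"results for {q}"})
```

The loop itself is stateless apart from `history`, so all of the agent-like behavior comes from what the model does with the growing transcript and from which tools it is handed.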
When you do this, particularly when the agent thinks to itself and learns continuously, you re-introduce most of the classical concerns about goal mis-specification and optimization. And you have new ones: your AGI's core thought generator is a writhing mess of semantics and pseudo-goals, copied from human cognition.
Nobody was ever concerned about AI at our current level of capability and optimization.
This is basically identical to the error made by Pope and Belrose in "AI is easy to control". They jump from noting that things are going well now to assuming that this will all generalize to full agentic AGI. It is promising but not remotely a done deal or highly predictable.
For a more thorough (and tactful) dismantling of this claim, see Byrnes' Thoughts on “AI is easy to control” by Pope & Belrose.
The other comment addresses just how wrong it is to equate understanding human ethics with following human ethics (presumably you didn't mean Genghis Khan's ethics, or worse humans). But that understanding could be leveraged to help with the alignment problem; see The (partial) fallacy of dumb superintelligence.
It seems you've simplified for the masses, and written to not just persuade but to excite and incite. This style of argumentation is a great way to get blog subscribers. It will also cause arguments, and divide previously rational people into pro- and anti-x-risk polarized camps. It is not a good way to advance our understanding of AI risks or our odds of survival.
But do you really think we're going to stop with tool AI, and not turn them into agents?
But if it is the case that agentic AI is an existential risk, then actors could choose not to develop it, which is a coordination problem, not an alignment problem.
We already have aligned AGI, we can coordinate to not build misaligned AGI.
How can we solve that coordination problem? I have yet to hear a workable idea.
We agree that far, then! I just don't think that's a workable strategy (you also didn't state that big assumption in your post - that AGI is still dangerous as hell, we just have a route to really useful AI that isn't).
The problem is that we don't know whether agents based on LLMs are alignable. We don't have enough people working on the conjunction of LLM/deep nets and real AGI. So everyone building it is going to optimistically assume it's alignable. The Yudkowsky et al arguments for alignment being very difficult are highly incomplete; they aren't convincing because they shouldn't be. But they make good points.
If we refuse to think about aligning AGI LLM architectures because it sounds risky, it seems pretty certain that people will try it without our help. Even convincing them not to would require grappling in depth with why alignment would or wouldn't work for that type of AGI.
How can we solve that coordination problem? I have yet to hear a workable idea.
This is my next project!
We don't have "aligned AGI". We have neither "AGI" nor an "aligned" system. We have sophisticated human-output simulators that don't have the generality to produce effective agentic behavior when looped but which also don't follow human intentions with the reliability that you'd want from a super-powerful system (which, fortunately, they aren't).
I agree LLMs are a huge step forward towards getting AIs to do human-level moral reasoning, even if I don't agree that we're literally done. IMO the really ambitious 'AI safetyists' should now be asking what it means to get superhuman-level moral reasoning (there's no such thing as perfect, but there sure is better than current LLMs), and how we could get there.
And sadly, just using an LLM as part of a larger AI, even one that reliably produces moral text on the everyday distribution, does not automatically lead to good outcomes, so there's still a role for classical AI safetyism.
You could have an ontology mismatch between different parts of the planning process that degrades the morality of the actions. Sort of like translating the text into a different language where the words have different connotations.
You could have the 'actor' part of the planning process use out-of-distribution inputs to get immoral-but-convenient behavior past the 'critic'.
You could have a planning process that interfaces with the LLM using activations rather than words, and this richer interface could allow RL to easily route around morality and just use the LLM for its world-model.
One thing to consider is that until you've got an end-to-end automation of basic human needs like farming, the existence of other humans remains a net benefit for you, both to maintain these needs and to incentivize others to share what they've done.
Automating this end-to-end is a major undertaking, and it's unclear whether LLMs are up to the task. If they aren't, it's possible we will return to a form of AI where classical alignment problems apply.
I think this is a temporary situation because no sufficiently powerful entity has invested sufficiently much in AI-based defence. If this situation persists without any major shift in power for long enough, then it will be because the US and/or China have made an AI system to automatically suppress AI-powered gangs, and maybe also to automatically defend against AI-powered militaries. But the traditional alignment problem would to a great degree apply to such defensive systems.
The problem of "humans hostile to humans" has two heavy tails: nuclear war and biological terrorism, which could kill all humans. A similar problem is the main AI risk: AI killing everyone for paperclips.
The central (and not often discussed) claim of AI safety is that the second situation is much more likely: it is more probable that AI will kill all humans than that humans will kill all humans. For example, by advocating for pausing AI development, we assume that the risks of nuclear war causing extinction are less than AI extinction risks.
If AI is used to kill humans as just one more weapon, it doesn't change anything stated above until AI evolves into an existential weapon (like a billion-drone swarm).
These aren't the only heavy tails, just the ones with highest potential to happen quickly. You could also have e.g. people regulating themselves to extinction.
That needs to be proved to be an x-risk. For example, if the population falls below 100 people, then regulation fails first.
There might be humans who set it up in exchange for power/similar, and then it continues after they are gone (perhaps simply because it is "spaghetti code").
The presence of the regulations might also be forced by other factors, e.g. to suppress AI-powered frauds, gangsters, disinformation spreaders, etc.
LLMs have almost completely negated the original reasons people had to believe in “AI Risk”