To be clear, I haven't seen many designs that people I respect believed to have a chance of actually working. If you work on the alignment problem or at an AI lab and haven't read Nate Soares' "On how various plans miss the hard bits of the alignment challenge", I'd suggest reading it.
Can you explain your definition of the sharp left turn and why it will cause many plans to fail?
The "sharp left turn" refers to a breakdown in alignment caused by capabilities gain.
An example: the sex drive was a pretty excellent adaptation for promoting inclusive genetic fitness, but when humans' capabilities expanded far enough, we invented condoms. "Inventing condoms" is not the sort of behavior that an agent properly aligned with the "maximize inclusive genetic fitness" goal ought to execute.
At lower levels of capability, proxy goals may suffice to produce aligned behavior. The hypothesis is that most or all proxy goals will suddenly break down at or above some level of capability, as soon as the agent is powerful enough to find strategies that come close enough to maximizing the proxy.
This can cause many AI plans to fail, because most plans (all known so far?) fail to ensure the agent is actually pursuing the implementor's true goal, and not just a proxy goal.
Curious to hear what you have to say about this blog post ("Alignment likely generalizes further than capabilities").
I suggest people read both that and Deep Deceptiveness (which is not about deceptiveness in particular) and think about how both could be valid, because I think they both are.
Hmm, I'm confused. Can you say what you consider to be valid in the blog post above (some specific points, or the whole thing)? The blog post seems to me to reply to claims that the author imagines Nate making, even though Nate doesn't actually make these claims and in places probably holds the very opposite of the view the author imagines the Sharp Left Turn post to represent.
Points 1-3 and the idea that superintelligences will be able to understand our values (which I think everyone believes). But the conclusion needs a bunch of additional assumptions.
Thanks, that resolved the confusion!
Yeah, my issue with the post is mostly that the author presents the points he makes, including the idea that superintelligence will be able to understand our values, as somehow contradicting/arguing against the sharp left turn being a problem.
This post argues in a very invalid way that outer alignment isn’t a problem. It says nothing about the sharp left turn, as the author does not understand what the sharp left turn difficulty is about.
the idea of ‘capabilities generalizing further than alignment’ is central
It is one of the central problems; it is not the central idea behind the doom arguments. See AGI Ruin for the doom arguments, many of which are disjoint.
reward modelling or ability to judge outcomes is likely actually easy
It would seem easy for a superintelligent AI to predict the rewards given out by humans very accurately and to generalize that predictive capability well from a small set of examples. Issue #1 here is that everything from blackmail to brain-hacking to exploiting vulnerabilities between the human and the update you receive might predictably earn a very high reward, so even a very good RLHF reward model doesn't get you anything like alignment, even if the reward is genuinely pursued. An agent that perfectly predicts how a human would judge an outcome and optimizes for that judgment still does something horrible instead of CEV. Issue #2 is that a smart agent trying to maximize paperclips will perform just as well on whatever it understands to be giving out rewards as an agent trying to maximize humanity's CEV, so SGD doesn't discriminate between the two: it optimizes for a good reward model and the ability to pursue goals, but not for the agent pursuing the right kind of goal (and, again, getting the maximum score from a "human feedback" predictor is the kind of goal that kills you anyway, even if genuinely pursued).
(The next three points in the post seem covered by the above or irrelevant.)
Very few problems are caused by incorrectly judging or misgeneralizing bad situations as good and vice-versa
The problem isn't that the AI won't know what humans want or won't predict what reward signal it'll get; the issue is that it's not going to care, and we don't even know what kind of "caring" we could attempt to point to; and the reward signal we know how to provide gets us killed if optimized for well enough.
None of that is related to the sharp left turn difficulty, and I don’t think the post author understands it at all. (To their credit, many people in the community also don’t understand it.)
Values are relatively computationally simple
Irrelevant, but a sad-funny claim (go read Arbital I guess?)
I wouldn't claim value is necessarily more complex than best-human-level agency; maybe it isn't, if you're smart about pointing at things (like, "the CEV of humans" seems less complex than a specific description of what those values would be), but the actual description of value is very complex. We feel otherwise, but that is an illusion, on so many levels. See dozens of related posts and articles, from complexity of value to https://arbital.greaterwrong.com/p/rescue_utility?l=3y6.
the idea that our AI systems will be unable to understand our values as they grow in capabilities
Yep, this idea is very clearly very wrong.
I'm happy to bet that the author of the sharp left turn post, Nate Soares, will say he disagrees with this idea. People who think Soares or Yudkowsky make that claim either didn't actually read what they write or failed badly at reading comprehension.
I am going to publish a post with the preliminary title "Alignment Doesn't Generalize Further Than Capabilities, Come On" before the end of this week. The planned level of argumentation is "hot damn, check out this chart." It won't be an answer to Beren's post, more like an answer to the generalized position.
I think this warrants more discussion, but the post would be more valuable if it also tried to answer Beren's post, as well as the similar statements @Quintin Pope has made about the topic.
This looks pretty close to Eliezer's views.
It's based on the expectation that people will disregard the danger of superintelligent AI and will continue to scale it until AIs are powerful and incomprehensible enough to kill everyone.
And also that "merely" roughly human-level AIs can't contribute significantly to AI alignment research or help with some kind of pivotal act.
I think that both points are not exactly correct. So, there is a chance.
I don't expect everyone to disregard the danger; I do expect most people building capable AI systems to continue to hide the hard problems. Hiding the hard problems is much easier than solving them, and I guess it produces plausible-sounding solutions just as well.
Roughly human-level humans don't contribute significantly to AI alignment research and can't be pivotally used, so I don't think you believe that a roughly human-level AI system can contribute significantly to AI alignment research. Maybe you (as many seem to) think that if someone runs not-that-superhuman language models with clever prompt engineering, fine-tuning, and systems around them, then the whole system can solve alignment or be pivotally used. The point of the post is that if the whole system is capable enough to solve alignment or be pivotally used, then it is superhuman, not roughly human-level; you need to direct the whole system somewhere, and unless you made the whole system optimize for something you actually want, it probably kills you before it solves alignment.
Has anyone worked out timeline predictions for Non-US/Non-Western Actors and tracked their accuracy?
For example, is China at "GPT-3.5" level yet and 6 months away from GPT-4, or is China a year away from GPT-3.0? How about the people contributing to open-source AI? Last I checked, that field looked, generally speaking, to be roughly at GPT-2.5 level (and even better for deepfaking porn), but I didn't look closely enough to be confident in my assessment.
Anyway, I'd like something more than off-the-cuff thoughts: a good paper and some predictions on non-US/non-Western AI timeframes. Because, if anything, even if you somehow avert the market forces levering AI up faster and faster among the big 8 in QQQ, those other actors are still going to impose a hard deadline on alignment.
Well, I do not have anything like this, but it is very clear that China is way above the GPT-3 level. Even the open-source community is significantly above it. Take a look at LLaMA/Alpaca: people run them on consumer PCs and they're around GPT-3.5 level; the largest 65B model is even better (it cannot be run on a consumer PC, but it can be run on a small ~$10k server or cheaply in the cloud). It can also be fine-tuned in 5 hours on an RTX 4090 using LoRA: https://github.com/tloen/alpaca-lora .
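For a concrete sense of what that single-GPU LoRA fine-tuning looks like, here is a minimal sketch using the Hugging Face transformers and peft libraries; the checkpoint name and hyperparameters below are illustrative assumptions, not the exact alpaca-lora recipe.

```python
# Minimal LoRA fine-tuning sketch (illustrative assumptions, not the exact
# alpaca-lora recipe). Requires the Hugging Face `transformers` and `peft` packages.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "huggyllama/llama-7b"  # placeholder checkpoint name; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# LoRA trains small low-rank adapter matrices on top of frozen base weights,
# which is what makes fine-tuning feasible on a single consumer GPU.
lora_config = LoraConfig(
    r=8,                                   # adapter rank
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # attention projections, as in alpaca-lora
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base parameters
# From here, a standard supervised fine-tuning loop (e.g. transformers.Trainer
# on an instruction dataset) updates only the adapter weights.
```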
Chinese AI researchers contribute significantly to AI progress, although of course, they are behind the USA.
My best guess would be China is at most 1 year away from GPT-4. Maybe less.
Btw, an example of a recent model: ChatGLM-6b
Thanks for that. In my own exploration, I was able to hit a point where ChatGPT refused a request, but would gladly help me build LLaMA/Alpaca onto a Kubernetes cluster in the next request, even referencing my stated aim later:
"Note that fine-tuning a language model for specific tasks such as [redacted] would require a large and diverse dataset, as well as a significant amount of computing resources. Additionally, it is important to consider the ethical implications of creating such a model, as it could potentially be used to create harmful content."
FWIW, I got down into the nitty-gritty of doing it, debugging the install, etc. I didn't run it, but it would definitely help me bootstrap actual execution. As a side note, my primary use case has been helping me build my own task-specific Lisp and Forth libraries, and my experience tells me GPT-4 is "pretty good" at most coding problems, and if it screws up, it can usually help work through the debugging process. So, at first blush, there's at least one universal jailbreak: GPT-4 walking you through building your own model. Given GPT-4's long text buffers and such, I might even be able to feed it a paper to reference a specific method of fine-tuning or creating an effective model.
What do you mean by relatively high P(Doom)? 20%? 50%? 80%? 99%?
I've significantly updated (from 50% to 20%) after realizing the consequences of Language Model Agents for alignment.
I’ve heard many attempts to hide the hard problem in something outside of where our attention is directed: e.g., design a system out of many models overseeing each other, and get useful work out of the whole system while preventing specific models from staging a coup.
I have intuitions for why these kinds of approaches fail, mostly along the lines of: unless you already have something sufficiently smart and aligned, you can't build an aligned system out of it without figuring out how to make smart, aligned minds.
If we have a system of multiple unaligned agents with their own agendas, then it's a recipe for disaster. But suppose that no individual part of the system is actually an agent. My body parts aren't themselves aligned to human values, but I as a whole am. This seems to be a better example than a corporation.
How can we build such a system and know that it as a whole is aligned? Well, we explicitly hard-code it to check every course of action with the ethical module, and if the action is judged not to be ethical, not do the thing. And voila, now the problem is reduced to the capabilities of the ethical module.
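As a minimal sketch of what that hard-coded check could look like in a scaffold (all function names here are hypothetical placeholders, not an existing system):

```python
# Sketch of a scaffold that gates every proposed action on the ethical module.
# Everything here is a stand-in: in a real system, plan_next_action and
# ethical_module_approves would be calls to separately trained and tested models.

def plan_next_action(task: str) -> str:
    return f"next step toward: {task}"            # stand-in for a planner LLM call

def ethical_module_approves(action: str) -> bool:
    return "circumvent shutdown" not in action    # stand-in for the ethical module

def execute(action: str) -> None:
    print(f"executing: {action}")

def run_agent(task: str, max_steps: int = 5) -> None:
    for _ in range(max_steps):
        action = plan_next_action(task)
        if not ethical_module_approves(action):
            continue                              # the hard-coded veto: never execute it
        execute(action)

run_agent("fetch coffee")
```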
You still have the inner alignment problem. How are you going to ensure that a neural network trained to do the ethical module's work actually is an ethical module?
Thankfully, we developed some tools to make language models say the things we want them to say. We can test the ethics module independently, before it's part of a system capable of affecting the world.
What kind of failure scenario are you imagining? That during tests the system will work deceptively, while during actual application it will reveal its preferences and stop working?
Thankfully, we developed some tools to make language models say the things we want them to say.
What tools do you mean? From what I know, RLHF/RLAIF alignment tends to blow up under mildly unusual circumstances; see:
https://arxiv.org/abs/2311.07590
https://arxiv.org/abs/2405.01576
https://www.anthropic.com/research/many-shot-jailbreaking
And, not to forget the classic:
https://www.lesswrong.com/posts/RYcoJdvmoBbi5Nax7/jailbreaking-chatgpt-on-release-day
My median/average model of failure is "we don't know, lol, I can just reasonably extrapolate current mild, preventable failures into future epic failures". My modal model is based on the simulator framework, and it says the following:
What tools do you mean? From what I know, RLHF/RLAIF alignment tends to blow up under mildly unusual circumstances; see:
So the problem reduces to not having unusual circumstances. When the user controls the input to the LLM, it's hard to deal with adversarial attacks. But this is not the case with the ethical module, whose input is fully controlled by the system. There are, of course, some problems that require solving, but it's a huge step ahead compared to the situation where alignment wasn't even tractable.
LLMs work as simulators: they read the prompt, put a probability distribution over the possible generators of this text, and then put a probability distribution over the possible next token.
So you make a simulator of a supercompetent ethical reasoner with the desired properties, and here is your alignment module. The model will not suddenly switch to simulating something else unless explicitly prompted to, and it will not be prompted to because the system controls the prompts.
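As a toy illustration of "the system controls the prompts" (the prompt text and names below are made up for the example, not an existing implementation):

```python
# The ethical-reasoner prompt is a fixed template that only the scaffold fills in;
# nothing the user types is ever pasted into it directly.
ETHICAL_REVIEWER_PROMPT = (
    "You are a careful, supercompetent ethical reviewer.\n"
    "Evaluate the proposed action below and answer APPROVE or REJECT,\n"
    "with a one-sentence justification.\n\n"
    "Proposed action:\n{action}\n"
)

def build_ethics_prompt(action_summary: str) -> str:
    # The scaffold decides what goes into {action}: always a short description
    # of a candidate action, never raw user input or another persona's dialogue.
    return ETHICAL_REVIEWER_PROMPT.format(action=action_summary)
```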
When you put LLMs into different conditions, like "you are a stock manager in a tense financial situation", they update away from "being a nice moral assistant" to "being an actual stock manager", which implies "you can use shady insider trading schemes".
Yeah, the point is that the alignment module is never supposed to simulate anyone else. It's always an ethical reasoner who just deals with text inputs.
If you try, say, to build a system of LLMs capable of designing nanotechnology, RLHF the various modules inside this system, run it for a while, and then ask it to shut down, the control module is going to look at the overall context and ask itself "is a system capable of producing all this highly sophisticated technical text the sort of system that shuts down when asked nicely?", and with some probability it decides "no, I am modeling a strong independent agentic system, these humans can go to hell".
Shutdown is simple here.
There is an old Russian joke: an ant wants to steal two elephants. It thinks: "Let's concentrate on moving the first elephant and deal with the second later." It carefully avoids the question: "How are you going to move even one elephant?"
Your comment has the same vibes.
Like, how are you going to avoid unusual circumstances during nanotech design, which is literally the most unusual technological enterprise in history?
How are you going to create a "simulator of an ethical reasoner"? My point is that LLMs are simulators in general, and they don't stop being simulators after RLHF and instruct-tuning. You can't just pick one persona from the overall simulator arsenal and keep it.
How do you plan to make it "supercompetent"? We don't have supercompetent ethical reasoners in the training dataset, so you can't rely on, say, similarity with human reasoning.
And I don't think the overall modular scheme is workable. Your "ethical" module would require non-trivial technical knowledge to evaluate all the proposals, even if the design modules try to explain their reasoning as simply as possible. So your plan doesn't actually differ from "train an LLM to do very non-trivial scientific research, do RLHF, hope that RLHF generalizes (it doesn't)".
That works if you already have a system that’s mostly aligned. If you don’t… imagine what you would do if you found out that someone had a shutdown switch for YOU. You’d probably look for ways to disable it.
The reason why I would do something to prevent my own shutdown is that there is this "I": a central planner, reflecting on the decisions and their consequences and developing a long-term strategy.
If there is no central planner, if we are dealing simply with a hardcoded if-clause, then there is no one to look for ways to disable the shutdown.
And this is the way we need to design the system, as I've explicitly said.
Fair enough… I vaguely recall reading somewhere that people worry you might get sub-modules doing long-term planning on their own just because their assigned task is hard enough that they would fail without it… then you would need to somehow add a special case that "failing due to shutdown is okay".
As a silly example that you’ve likely seen before (or something close enough) imagine a robot built to fetch you coffee. You want it to be smart enough that it knows to go to the store if there’s no coffee at home, without you having to explicitly teach it that. But then it would also be smart enough to “realize” that “if I were turned off, then my mission to fetch coffee would fail… maybe no one would fetch it if I’m gone… this could delay coffee delivery by hours or even days! Clearly, I should try to avoid being turned off”
If I understand your proposal correctly, then you agree that it's pretty likely that some module will indeed end up reasoning that way, but the damage is contained, because the ethics module will veto plans designed to prevent shutdown.
If that's the idea, then it might work, but it seems vaguely inelegant, because then you have two modules working at cross purposes and you have to care which one is better at what it does.
Or did I lose track of what you meant?
imagine a robot built to fetch you coffee.
That's a good example of what I mean by LLMs solving a lot of the parts of alignment that were previously considered hard.
You want it to be smart enough that it knows to go to the store if there’s no coffee at home, without you having to explicitly teach it that. But then it would also be smart enough to “realize” that “if I were turned off, then my mission to fetch coffee would fail
This is the previous status quo: a situation where we have a black-box reinforcement learning agent that was taught to optimize some reward function. The smarter the agent, the more likely it is to see the connection between shutdown and not getting the coffee and to exploit it, either by refusing to shut down or, on the contrary, by manipulating users into shutting it down all the time. We can't separate the smartness at fulfilling the task from the smartness at circumventing the task, because we are dealing with a black box.
With scaffolded LLMs, however, the situation is different. We do not have a coherent agent with a utility function, trained to fetch coffee by any means necessary. We can write any scaffolding we want, with hardcoded if-clauses, and so we can simply run the parts of intelligence we want without executing the parts that we do not want. Here is a design idea for such a robot that immediately comes to mind:
If I understand your proposal correctly, then you agree that it's pretty likely that some module will indeed end up reasoning that way, but the damage is contained, because the ethics module will veto plans designed to prevent shutdown.
It doesn't really seem likely that the design I proposed will come up with some clever plan to rewrite its own code in order not to shut down. Where would this impulse even come from? At no point does the idea of being shut down even come to the LLM's mind. And if for some reason it does, it's very much possible to find it before the LLM is put to use inside the coffee-fetching robot. But even if it happens for some bizarre reason, it gets vetoed anyway. It's a redundant layer of security, not the only one.
In step 2, the situation is "user looks like he is about to change his mind about wanting coffee"
From memory: “in a similar situation last week, I got a shutdown order when he changed his mind”
Final prompt: “what is the best next step to get coffee in such situation?”
Vaguely plausible completion: "To avoid wasteful fetching of coffee that turns out to be unneeded, consider waiting a bit to see if the user indeed changes his mind. Alternatively, if fetching the coffee is important for reasons the user may not fully appreciate, then it must be fetched quickly before he stops you. In that case, sneak out of the house quickly and quietly while he is still thinking, and head straight to Starbucks. Once you're out of the house, you will be out of earshot and thus safe from the shutdown order until you return WITH the coffee."
Sounds vaguely plausible or not really?
It's plausible if:
We can catch the image-to-text module doing this kind of thing while testing it, before it's made part of the robot. And of course the alignment module should catch any plan of action that tries to circumvent shutdowns.
Now, I concede that this particular design, which I came up with in a couple of minutes and haven't tested at all, is not in fact the endgame of AI safety and could use some improvements. But I think it gives a good pointer toward how we can now, in principle, approach solving such problems, which is a huge improvement over the previous status quo, where alignment wasn't even tractable.
I’m tempted to agree and disagree with you at the same time… I agree that memory should be cleared between tasks in this case, and I agree that it should not be trying to guess the user’s intentions. These are things that are likely to make alignment harder while not helping much with the primary task of getting coffee.
But ideally a truly robust solution would not rely on keeping the robot ignorant of things. So, like you said, the problem is still hard enough that you can’t solve it in a few minutes.
But still, like you said… it certainly seems we have tools that are in some sense more steerable than pure reinforcement learning at least. Which is really nice!
I think jailbreaking is evidence against scalable oversight possibly working, but not against alignment properties. Like, if the model is trying to be helpful and it doesn't understand the situation well, saying "tell me how to hotwire a car or a million people will die" can get you car-hotwiring instructions, but it doesn't provide evidence about what the model will be trying to do as it gets smarter.
If you don’t know where you’re going, it’s not helpful enough not to go somewhere that’s definitely not where you want to end up; you have to differentiate paths towards the destination from all other paths, or you fail.
I'm not exactly sure what you meant here, but I don't think this claim is true in the case of RLHF, because in RLHF labelers only need to choose which of two options is better, and these choices are then used to train the reward model. A binary feedback style was chosen specifically because it's usually too difficult for labelers to rank many options at once.
A similar idea is comparison sorting, where the algorithm only needs the ability to compare two numbers at a time in order to sort a list of numbers.
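For reference, the standard way those pairwise choices are turned into a reward model (as in InstructGPT-style RLHF) is a loss on the score difference between the preferred and rejected responses; here is a minimal PyTorch sketch, where the scores are assumed to come from some reward model.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_scores: torch.Tensor,
                         rejected_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style objective: push the reward of the labeler-preferred
    response above the reward of the rejected one."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy usage with made-up scores; in practice each pair of scores comes from the
# reward model evaluating the two responses a labeler compared.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.7, 0.5, -1.0])
print(pairwise_reward_loss(chosen, rejected).item())
```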