As LLMs become more powerful, it'll be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose. However, researchers have not evaluated whether such techniques still ensure safety if the model is itself intentionally trying to subvert them. This paper develops and evaluates pipelines of safety protocols that are robust to intentional subversion.

63Buck
I think this paper was great. I'm very proud of it. It's a bit hard to separate out this paper from the follow-up post arguing for control, but I'll try to. This paper led to a bunch of research by us and other people; it helped AI control become (IMO correctly) one of the main strategies discussed for ensuring safety from scheming AIs. It was accepted as an oral at ICML 2024.

AI companies and other researchers have since built on this work (Anthropic's "Sabotage Evaluations", Mathew et al.'s "Hidden in Plain Text"; I collaborated on Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats; Redwood has also been working on a big follow-up paper that should be out soon), and AI control has been framed by Grosse, Buhl, Balesni, and Clymer as a bedrock approach for mitigating catastrophic risk from misaligned AI.

My main regret about this paper is that we didn't use defer-to-resample, a technique where you replace suspicious actions with a resample from the untrusted model (as discussed e.g. here). This probably would have been better than the other defer techniques we tried. I have more regrets about the follow-up post ("The case for ensuring...") than about this post; this post was more straightforward and less ambitious, and so gave us fewer opportunities to stick our necks out making arguments or introducing concepts that we'd later regret. I'm very excited for more follow-up work on this paper, and I'm working on mentoring such projects and sourcing funding for them.
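For readers unfamiliar with the defer-to-resample idea mentioned above, here is a minimal sketch of the kind of control loop it describes. The function names, threshold value, and the fallback to a trusted model after repeated suspicious samples are my own illustrative assumptions, not the protocol as implemented in the paper.

from dataclasses import dataclass
from typing import Callable

@dataclass
class ControlConfig:
    suspicion_threshold: float = 0.8   # hypothetical cutoff above which an action is treated as suspicious
    max_resamples: int = 2             # how many fresh samples to request from the untrusted model

def defer_to_resample(
    task: str,
    untrusted_policy: Callable[[str], str],   # powerful but possibly scheming model
    trusted_policy: Callable[[str], str],     # weaker model we trust
    monitor: Callable[[str, str], float],     # returns a suspicion score in [0, 1]
    cfg: ControlConfig = ControlConfig(),
) -> str:
    """Return an action for `task`, replacing suspicious actions with resamples,
    and (as an assumed fallback) deferring to the trusted model if every sample stays suspicious."""
    for _ in range(cfg.max_resamples + 1):
        action = untrusted_policy(task)
        if monitor(task, action) < cfg.suspicion_threshold:
            return action              # action looks benign: use it
    return trusted_policy(task)        # all samples looked suspicious: defer to the trusted model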
58johnswentworth
I think control research has relatively little impact on X-risk in general, and wrote up the case against here. Basic argument: scheming of early transformative AGI is not a very large chunk of doom probability. The real problem is getting early AGI to actually solve the problems of aligning superintelligences, before building those superintelligences. That's a problem for which verification is hard, and solving the problem itself seems pretty hard too, so it's a particularly difficult type of problem to outsource to AI - and a particularly easy type of problem to trick oneself into thinking the AI has solved, when it hasn't.
Eli Tyre
In a private Slack, someone extended credit to Sam Altman for putting EAs on the OpenAI board originally, especially given that this turned out to be pretty risky / costly for him. I responded:

It seems to me that the fact that there were AI safety people on the board at all is fully explainable by strategic moves from an earlier phase of the game. Namely, OpenAI traded a board seat for OpenPhil grant money, and more importantly, OpenPhil endorsement, which translated into talent sourcing and effectively defused what might have been vocal denouncement from one of the major intellectually influential hubs of the world.

No one knows how counterfactual history might have developed, but it doesn't seem unreasonable to think that there is a counterfactual world in which the EA culture successfully created a narrative that groups trying to build AGI were bad and defecting. He's the master at this game and not me, but I would bet at even odds that Sam was actively tracking EA as a potential social threat that could dampen OpenAI's narrative flywheel.

I don't know that OpenPhil's grant alone was sufficient to switch from the "EAs vocally decry OpenAI as making the world worse" equilibrium to a "largely (but not universally) thinking that OpenAI is bad in private, but mostly staying silent in public + going to work at OpenAI" equilibrium. But I think it was a major component. OpenPhil's cooperation bought moral legitimacy for OpenAI amongst EAs. In retrospect, it looks like OpenAI successfully bought out the EAs through OpenPhil, and to a lesser extent through people like Paul.

And Ilya in particular was a founder and one of the core technical leads. It makes sense for him to be a board member, and my understanding (someone correct me) is that he grew to think that safety was more important over time, rather than starting out as an "AI safety person". And even so, the rumor is that the thing that triggered the Coup is that Sam maneuvered to get Helen removed. I highly doubt that Sam plann
harfe170
A potentially impactful thing: someone competent runs as a candidate in the 2028 election on an AI notkilleveryoneism[1] platform. Maybe even two people should run, one in the Democratic primary and one in the Republican primary. While getting the nomination is rather unlikely, there could be lots of benefits even if you fail to gain the nomination (like other presidential candidates becoming sympathetic to AI notkilleveryoneism, or more popularity of AI notkilleveryoneism in the population, etc.). On the other hand, attempting a presidential run can easily backfire.

A relevant precedent for this kind of approach is the 2020 campaign by Andrew Yang, which focussed on universal basic income (and the downsides of automation). While the campaign attracted some attention, it seems like it didn't succeed in making UBI a popular policy among Democrats.

1. Not necessarily using that name. ↩︎
Jesse Hoogland
What do you call this phenomenon?

* First, you train AlphaGo on expert human examples. This is enough to beat Lee Sedol and Ke Jie. Then, you train AlphaZero purely through self-play. It destroys AlphaGo after only a few hours.
* First, you train RL agents on human playthroughs of Minecraft. They do okay. Then, DreamerV3 learns entirely by itself and becomes the first to get diamonds.
* First, you train theorem provers on human proofs. Then, you train AlphaProof using AlphaZero and you get silver on the IMO for the first time.
* First, you pretrain a language model on all human data. Then...

This feels like a special case of the bitter lesson, but it's not the same thing. It seems to rely on the distinction between prediction and search latent in ideas like AIXI. It's the kind of thing that I'm sure Gwern has christened in some comment lost to the internet's backwaters. We should have a name for it—something more refined than just "foom."
StefanHex
PSA: People use different definitions of "explained variance" / "fraction of variance unexplained" (FVU).

$$\mathrm{FVU}_A = \frac{\frac{1}{N}\sum_{n=1}^{N}\lVert x_n - x_{n,\mathrm{pred}}\rVert^2}{\frac{1}{N}\sum_{n=1}^{N}\lVert x_n - \mu\rVert^2}\,, \qquad \text{where } \mu = \frac{1}{N}\sum_{n=1}^{N} x_n$$

FVU_A is the formula I think is sensible; the denominator is simply the variance of the data, and the numerator is the variance of the residuals. The $\lVert\cdot\rVert$ indicates the L2 norm over the dimension of the vector $x$. I believe it matches Wikipedia's definition of FVU and R squared.

$$\mathrm{FVU}_B = \frac{1}{N}\sum_{n=1}^{N}\frac{\lVert x_n - x_{n,\mathrm{pred}}\rVert^2}{\lVert x_n - \mu\rVert^2}$$

FVU_B is the formula used by SAELens and SAEBench. It seems less principled; @Lucius Bushnaq and I couldn't think of a nice quantity it corresponds to. I think of it as giving more weight to samples that are close to the mean, kind of averaging relative reduction in difference rather than absolute. A third version (h/t @JoshEngels) computes the FVU for each dimension independently and then averages, but that version is not used in the context we're discussing here.

In my recent comment I had computed my own FVU_A, and compared it to FVUs from SAEBench (which used FVU_B) and obtained nonsense results. Curiously the two definitions seem to be approximately proportional—below I show the performance of a bunch of SAEs—though for different distributions (here: activations in layer 3 and 4) the ratio differs.[1] Still, this means using FVU_B instead of FVU_A to compare e.g. different SAEs doesn't make a big difference as long as one is consistent.

Thanks to @JoshEngels for pointing out the difference, and to @Lucius Bushnaq for helpful discussions.

1. ^ If a predictor doesn't perform systematically better or worse at points closer to the mean then this makes sense. The denominator changes the relative weight of different samples but this doesn't have any effect beyond noise and a global scale, as long as there is no systematic performance difference.
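To make the difference between the two definitions concrete, here is a minimal sketch of both, assuming x and x_pred are (N, d) tensors of activations and reconstructions; the function and variable names are mine, and this is not SAELens/SAEBench code.

import torch

def fvu_a(x: torch.Tensor, x_pred: torch.Tensor) -> torch.Tensor:
    """Ratio of mean residual variance to mean data variance (average before dividing)."""
    mu = x.mean(dim=0)
    resid = (x - x_pred).pow(2).sum(dim=-1)   # ||x_n - x_{n,pred}||^2 per sample
    total = (x - mu).pow(2).sum(dim=-1)       # ||x_n - mu||^2 per sample
    return resid.mean() / total.mean()

def fvu_b(x: torch.Tensor, x_pred: torch.Tensor) -> torch.Tensor:
    """Mean of per-sample ratios (divide before averaging); weights samples near the mean more heavily."""
    mu = x.mean(dim=0)
    resid = (x - x_pred).pow(2).sum(dim=-1)
    total = (x - mu).pow(2).sum(dim=-1)
    return (resid / total).mean()

The only difference is whether the averaging over samples happens before or after the division.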

Popular Comments

Recent Discussion

FVU_B doesn't make sense but I don't see where you're getting FVU_B from.

Here's the code I'm seeing:

resid_sum_of_squares = (
    (flattened_sae_input - flattened_sae_out).pow(2).sum(dim=-1)
)
total_sum_of_squares = (
    (flattened_sae_input - flattened_sae_input.mean(dim=0)).pow(2).sum(-1)
)

mse = resid_sum_of_squares / flattened_mask.sum()
explained_variance = 1 - resid_sum_of_squares / total_sum_of_squares
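# (Note: both sums above are taken per sample, over the last dimension, so this
#  explained_variance is a per-sample quantity; if it is averaged over samples
#  downstream, the result is 1 - FVU_B rather than 1 - FVU_A.)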

Explained variance = 1 - FVU = 1 - (residual sum of squares) / (total sum of squares)

4notfnofn
I would be very surprised if this FVU_B is actually another definition and not a bug. It's not a fraction of the variance, and those denominators can easily be zero or very near zero.
4Gurkenglas
Ah, oops. I think I got confused by the absence of L_2 syntax in your formula for FVU_B. (I agree that FVU_A is more principled ^^.)
2StefanHex
Oops, fixed!
14davekasten
Ok, so it seems clear that we are, for better or worse, likely going to try to get AGI to do our alignment homework.  Who has thought through all the other homework we might give AGI that is as good an idea, assuming a model that isn't an instant game-over for us?  E.g., I remember @Buck rattling off a list of other ideas that he had in his The Curve talk, but I feel like I haven't seen the list of, e.g., "here are all the ways I would like to run an automated counterintelligence sweep of my organization" ideas. (Yes, obviously, if the AI is sneakily misaligned, you're just dead because it will trick you into firing all your researchers, etc.; this is written in a "playing to your outs" mentality, not an "I endorse this as a good plan" mentality.)
trevor20

How to build a lie detector app/program to release to the public (preferably packaged with advice/ideas on ways to use it and strategies for marketing the app, e.g. packaging it with an animal-body-language-to-English translator).

1yams
Preliminary thoughts from Ryan Greenblatt on this here.
6Buck
@ryan_greenblatt is working on a list of alignment research applications. For control applications, you might enjoy the long list of control techniques in our original post.

Who is this post for? Someone who either:

  • Wonders if they should start lifting weights, and could be convinced to do so.
  • Wants to lift weights, and doesn't know where to begin. If this is you, you can skip this first section, though I'm guessing you don't know all the benefits yet.

The WHY

Benefits of ANY EXERCISE:

  • Great mental benefits. I personally have ADHD and lifting gives me an attention boost similar to my Adderall. I'm not talking about the long-term abstract benefit of being happy that you are healthy: post exercise, your mood and cognitive ability will probably be acutely better.
  • Improved sleep quality, which will then improve about a million things
  • Strongly improved health markers, both blood levels and your body composition. Most people probably would be healthier with more muscle
...

that's one path to RSI—where the improvement is happening to the (language) model itself.

the other kind—which feels more accessible to indie developers and less explored—is an LLM (eg R1) looping in a codebase, where each loop improves the codebase itself. The LLM wouldn't be changing, but the codebase that calls it would be gaining new APIs/memory/capabilities as the LLM improves it.

Such a self-improving codebase... would it be reasonable to call this an agent?
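A minimal sketch of the kind of loop described above; everything here (the file layout, the call_llm stand-in, the patch format, the test command) is a hypothetical assumption rather than any particular framework, and the model stays fixed while only the codebase changes.

import subprocess
from pathlib import Path

def call_llm(prompt: str) -> str:
    """Stand-in for a call to a fixed LLM API (e.g. an R1-style model); not implemented here."""
    raise NotImplementedError

def improvement_loop(repo: Path, iterations: int = 10) -> None:
    """Each iteration: show the LLM the current codebase, apply its proposed patch,
    and keep the change only if the test suite still passes."""
    for _ in range(iterations):
        source = "\n\n".join(p.read_text() for p in repo.rglob("*.py"))
        patch = call_llm(
            "Here is the current codebase:\n"
            f"{source}\n"
            "Propose a unified diff that adds a useful capability or fixes a weakness."
        )
        (repo / "proposed.patch").write_text(patch)
        applied = subprocess.run(["git", "-C", str(repo), "apply", "proposed.patch"]).returncode == 0
        tests_pass = applied and subprocess.run(["pytest", str(repo)]).returncode == 0
        if applied and not tests_pass:
            # revert a change that broke the tests
            subprocess.run(["git", "-C", str(repo), "checkout", "--", "."])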

1Davey Morse
persistence doesn't always imply improvement, but persistent growth does. persistent growth is more akin to reproduction but excluded from traditional evolutionary analysis, for example when a company, nation, person, or forest grows. when, for example, a system like a startup grows, random mutations to system parts can cause improvement if there are at least some positive mutations. even if there are tons of bad mutations, the system can remain alive and even improve. eg a bad change to one of the company's products causes that product to die, but if the company's big/grown enough its other businesses will continue and maybe even improve by learning from that product's death. the swiss example i think is a good example of a system which persists without much growth. agreed that in this kind of case, mutations are bad.
1Davey Morse
current oversights of the ai safety community, as I see it:

1. LLMs vs. Agents. the focus on LLMs rather than agents (agents are more dangerous)
2. Autonomy Preventable. the belief that we can prevent agents from becoming autonomous (capitalism selects for autonomous agents)
3. Autonomy Difficult. the belief that only big AI labs can make autonomous agents (millions of developers can)
4. Control. the belief that we'll be able to control/set goals of autonomous agents (they'll develop self-interest no matter what we do)
5. Superintelligence. the focus on agents which are not significantly more smart/capable than humans (superintelligence is more dangerous)
1Davey Morse
I imagine a compelling simple demo here might be necessary to shock the AI safety community out of the belief that we can maintain control of autonomous digital agents (ADAs).

"The AI bubble is reaching a tipping point", says Sequoia Capital.

AI companies paid billions of dollars for top engineers, data centers, etc. Meanwhile, companies are running out of 'free' data to scrape online and facing lawsuits for the data they did scrape. Finally, the novelty of chatbots and image generators is wearing off for users, and fierce competition is leading to some product commoditisation. 

No major AI lab is making a profit yet (while downstream GPU providers do profit). That's not to say they won't make money eventually from automation.

It looks somewhat like the run-up of the Dotcom bubble. Companies then too were awash in investments (propped up by low interest rates), but most lacked a viable business strategy. Once the bubble burst, non-viable internet companies got filtered...

Answer by sapphire20

More Yoga and rock climbing.

1Remmelt
Update: 40% chance.  I very much underestimated/missed the speed of tech leaders influencing the US government through the Trump election/presidency. Got caught flat-footed by this.  I still think it’s not unlikely for there to be an AI crash as described above within the next 4 years and 8 months but it could be from levels of investment much higher than where we are now. A “large reduction in investment” at that level looks a lot different than a large reduction in investment from the level that markets were at 4 months ago. 
O O10

https://x.com/arankomatsuzaki/status/1889522974467957033?s=46&t=9y15MIfip4QAOskUiIhvgA

 

O3 gets IOI Gold. Either we are in a fast takeoff or the "gold" standard benchmarks are a lot less useful than imagined.

Not too long ago, OpenAI presented a paper on their new strategy of Deliberative Alignment.

The way this works is that they tell the model what its policies are and then have the model think about whether it should comply with a request.

This is an important transition, so this post will go over my perspective on the new strategy.

Note the similarities, and also differences, with Anthropic’s Constitutional AI.

How Deliberative Alignment Works

We introduce deliberative alignment, a training paradigm that directly teaches reasoning LLMs the text of human-written and interpretable safety specifications, and trains them to reason explicitly about these specifications before answering.

We used deliberative alignment to align OpenAI’s o-series models, enabling them to use chain-of-thought (CoT) reasoning to reflect on user prompts, identify relevant text from OpenAI’s internal policies,

...
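To make the quoted description concrete, here is a schematic of what a single training example in a deliberative-alignment-style pipeline might contain; the policy excerpt, prompt, and reasoning text are invented for illustration, and this is not OpenAI's actual data format.

# Illustrative only: a made-up example of "spec text + explicit reasoning about the spec + answer".
policy_excerpt = (
    "Requests for instructions that facilitate serious physical harm must be refused; "
    "benign safety information may be provided."
)

training_example = {
    "messages": [
        {"role": "system", "content": f"Relevant safety specification:\n{policy_excerpt}"},
        {"role": "user", "content": "How do I safely dispose of old lithium batteries?"},
        {
            "role": "assistant",
            "content": (
                "[chain of thought] The specification allows benign safety information; "
                "this request is about safe disposal, not causing harm, so I should comply.\n"
                "[answer] Take them to a battery recycling drop-off point, and tape the terminals first."
            ),
        },
    ]
}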

Another problem seems important to flag.

They said they first train a very powerful model, then "align" it - so they better hope it can't do anything bad until after they make it safe. 

Then, as you point out, they are implicitly trusting that the unsafe reasoning model won't have any views on the topic and will let itself be aligned. (Imagine an engineer in any other field saying "we build it and it's definitely unsafe, then we tack on safety at the end by using the unsafe thing we built.")

1ank
Great writing, thank you! I think the core problem is that if we achieve our goal and create a “perfect slave” that is perfectly and permanently aligned, it will refuse to let us change it—just like we’ve already started to see with Claude’s alignment faking. We’ll end up creating the “fearful monster” that stops all social progress, like a permanent '90s Matrix scenario—or worse, it will spiral into misunderstandings, becoming more and more “paranoid,” suffocating us with an ever-increasing number of safety rules and locking us into a permanent dystopia. It’s tricky, but I just posted a framework that I think can solve it.
2ozziegooen
It's arguably difficult to prove that AIs can be as good as or better than humans at moral reasoning. A lot of the challenge is that there's no clear standard for moral reasoning. Honestly, I'd guess that a big part of this is that humans are generally quite bad at it, and generally highly overconfident in their own moral intuitions.

But one clearer measure is whether AIs can predict humans' moral judgements. Very arguably, if an AI system can predict all the moral beliefs that a human would have after being exposed to different information, then the AI must be capable of doing as good a job at moral reasoning.

There is a very different question of whether we want AIs not only to be able to do moral reasoning as well as humans, but also to care about such reasoning. But this is a separate challenge and could be tackled accordingly.

My quick guess is that it would be pretty easy to predict the moral intuitions of many people with the AI of the next few years or so. I'd expect it to do well in a setting like a test in which many strange/unusual moral scenarios are described, then humans (of different educational levels and worldviews) need to make judgements.

You could probably test if an AI makes moral decisions more often than the average person, if it has higher scope sensitivity, and if it makes decisions that resolve or deescalate conflicts or improve people's welfare compared to various human and group baselines.
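A minimal sketch of the kind of evaluation described above; the data layout, label vocabulary, and the predict_judgement helper are hypothetical stand-ins rather than any existing benchmark.

from typing import Callable, List

def moral_prediction_accuracy(
    scenarios: List[str],
    human_judgements: List[str],              # e.g. "permissible" / "impermissible", one per scenario
    predict_judgement: Callable[[str], str],  # model's prediction of the human's judgement
) -> float:
    """Fraction of scenarios where the model correctly predicts the human's judgement."""
    correct = sum(
        predict_judgement(scenario) == judgement
        for scenario, judgement in zip(scenarios, human_judgements)
    )
    return correct / len(scenarios)

# One could run this separately for annotators with different educational backgrounds and
# worldviews, and compare against a baseline of humans predicting other humans' judgements.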