As LLMs become more powerful, it'll be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose. However, researchers have not evaluated whether such techniques still ensure safety if the model is itself intentionally trying to subvert them. In this paper developers and evaluates pipelines of safety protocols that are robust to intentional subversion.

63Buck
I think this paper was great. I'm very proud of it. It's a bit hard to separate out this paper from the follow-up post arguing for control, but I'll try to. This paper led to a bunch of research by us and other people; it helped AI control become (IMO correctly) one of the main strategies discussed for ensuring safety from scheming AIs. It was accepted as an oral at ICML 2024. AI companies and other researchers have since built on this work (Anthropic’s “Sabotage Evaluations”, Mathew et al “Hidden in Plain Text”; I collaborated on Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats; Redwood has also been working on a big follow up paper that should be out soon), and AI control has been framed by (Grosse, Buhl, Balesni, Clymer) as a bedrock approach for mitigating catastrophic risk from misaligned AI.  My main regret about this paper is that we didn't use defer-to-resample, a technique where you replace suspicious actions with a resample from the untrusted model (as discussed e.g. here). This probably would have been better than the other defer techniques we tried. I have more regrets about the follow-up post ("The case for ensuring...") than about this post; this post was more straightforward and less ambitious, and so gave us fewer opportunities to stick our necks out making arguments or introducing concepts that we'd later regret. I'm very excited for more follow-up work on this paper, and I'm working on mentoring such projects and sourcing funding for them.
58johnswentworth
I think control research has relatively little impact on X-risk in general, and wrote up the case against here. Basic argument: scheming of early transformative AGI is not a very large chunk of doom probability. The real problem is getting early AGI to actually solve the problems of aligning superintelligences, before building those superintelligences. That's a problem for which verification is hard, solving the problem itself seems pretty hard too, so it's a particularly difficult type of problem to outsource to AI - and a particularly easy to type of problem to trick oneself into thinking the AI has solved, when it hasn't.
Customize
This shortform discusses the current state of responsible scaling policies (RSPs). They're mostly toothless, unfortunately. The Paris summit was this week. Many companies had committed to make something like an RSP by the summit. Half of them did, including Microsoft, Meta, xAI, and Amazon. (NVIDIA did not—shame on them—but I hear they are writing something.) Unfortunately but unsurprisingly, these policies are all vague and weak. RSPs essentially have four components: capability thresholds beyond which a model might be dangerous by default, an evaluation protocol to determine when models reach those thresholds, a plan for how to respond when various thresholds are reached, and accountability measures. A maximally lazy RSP—a document intended to look like an RSP without making the company do anything differently—would have capability thresholds be vague or extremely high, evaluation be unspecified or low-quality, response be like we will make it safe rather than substantive mitigations or robustness guarantees, and no accountability measures. Such a policy would be little better than the company saying "we promise to deploy AIs safely." The new RSPs are basically like that.[1] Some aspects of some RSPs that existed before the summit are slightly better.[2] If existing RSPs are weak, how would a strong RSP be different? * Evals: eval should measure relevant capabilities (including cyber, bio, and scheming), evals should be sufficiently difficult, and labs should do good elicitation. (As a lower bar, the evals should exist; many companies say they will do evals but don't seem to have a plan for what evals to do.) * See generally Model evals for dangerous capabilities and OpenAI's CBRN tests seem unclear * Response: misuse * Rather than just saying that you'll implement mitigations such that users can't access dangerous capabilities, say how you'll tell if your mitigations are good enough. For example, say that you'll have a skilled red-team attempt to el
Writer*21-18
0
Surprised that there's no linkpost about Dan H's new paper on Utility Engineering. It looks super important, unless I'm missing something. LLMs are now utility maximisers? For real? We should talk about it: https://x.com/DanHendrycks/status/1889344074098057439 I feel weird about doing a link post since I mostly post updates about Rational Animations, but if no one does it, I'm going to make one eventually. Also, please tell me if you think this isn't as important as it looks to me somehow. EDIT: Ah! Here it is! https://www.lesswrong.com/posts/SFsifzfZotd3NLJax/utility-engineering-analyzing-and-controlling-emergent-value thanks @Matrice Jacobine!
Phib30
0
Re: AI safety summit, one thought I have is that the first couple summits were to some extent captured by the people like us who cared most about this technology and the risks. Those events, prior to the meaningful entrance of governments and hundreds of billions in funding, were easier to 'control' to be about the AI safety narrative. Now, the people optimizing generally for power have entered the picture, captured the summit, and changed the narrative for the dominant one rather than the niche AI safety one. So I don't see this so much as a 'stark reversal' so much as a return to status quo once something went mainstream.
harfe502
9
A potentially impactful thing: someone competent runs as a candidate for the 2028 election on an AI notkilleveryoneism[1] platform. Maybe even two people should run, one for the democratic primary, and one in the republican primary. While getting the nomination is rather unlikely, there could be lots of benefits even if you fail to gain the nomination (like other presidential candidates becoming sympathetic to AI notkilleveryoneism, or more popularity of AI notkilleveryoneism in the population, etc.) On the other hand, attempting a presidential run can easily backfire. A relevant previous example to this kind of approach is the 2020 campaign by Andrew Yang, which focussed on universal basic income (and downsides of automation). While the campaign attracted some attention, it seems like it didn't succeed in making UBI a popular policy among democrats. ---------------------------------------- 1. Not necessarily using that name. ↩︎
Eli Tyre*6837
5
In a private slack someone extended credit to Sam Altman for putting EAs on the on the OpenAI board originally, especially that this turned out to be pretty risky / costly for him. I responded: It seems to me that there were AI safety people on the board at all is fully explainable by strategic moves from an earlier phase of the game. Namely, OpenAI traded a boardseat for OpenPhil grant money, and more importantly, OpenPhil endorsement, which translated into talent sourcing and effectively defused what might have been vocal denouncement from one of the major intellectually influential hubs of the world. No one knows how counterfactual history might have developed, but it doesn’t seem unreasonable to think that there is an external world in which the EA culture successfully created a narrative that groups trying to build AGI were bad and defecting. He’s the master at this game and not me, but I would bet at even odds that Sam was actively tracking EA as a potential social threat that could dampen OpenAI’s narrative flywheel. I don’t know that OpenPhil’s grant alone was sufficient to switch from the “EAs vocally decry OpenAI as making the world worse” equilibrium to a “largely (but not universally) thinking that OpenAI is bad in private, but mostly staying silent in public + going to work at OpenAI” equilibrium. But I think it was a major component. OpenPhil’s cooperation bought moral legitimacy for OpenAI amongst EAs. In retrospect, it looks like OpenAI successfully bought out the EAs through OpenPhil, to a lesser extent through people like Paul.  And Ilya in particular was a founder and one of the core technical leads. It makes sense for him to be a board member, and my understanding (someone correct me) is that he grew to think that safety was more important over time, rather than starting out as an “AI safety person”. And even so, the rumor is that the thing that triggered the Coup is that Sam maneuvered to get Helen removed. I highly doubt that Sam plann

Popular Comments

Recent Discussion

I co-authored the original arXiv paper here with Dmitrii Volkov as part of work with Palisade Research.

The internet today is saturated with automated bots actively scanning for security flaws in websites, servers, and networks. According to multiple security reports, nearly half of all internet traffic is generated by bots, and a significant amount of these are malicious in intent.

While much of these attacks are relatively simple, the rise of AI capabilities and agent frameworks has opened the door to more sophisticated and adaptive hacking agents based on Large Language Models (LLMs), which can dynamically adapt to different scenarios.

Over the past months, we set up and deployed specialized "bait" servers to detect LLM-based hacking agents in the wild). To create these monitors, we modified pre-existing honeypots, servers intentionally...

Raemon20

You wouldn't guess it, but I have an idea...

...what.... what was your idea?

Crossposted from my Substack.

Intuitively, simpler theories are better all else equal. It also seems like finding a way to justify assigning higher prior probability to simpler theories is one of the more promising ways of approaching the problem of induction. In some places, Solomonoff induction (SI) seems to be considered the ideal way of encoding a bias towards simplicity. (Recall: under SI, hypotheses are programs that spit out observations. Programs of length CL get prior probability 2^-CL, where CL is the program's length (in language L).

But I find SI pretty unsatisfying on its own, and think there might be a better approach (not original to me) to getting a bias towards simpler hypotheses in a Bayesian framework.

Simplicity via hierarchical Bayes

  • I’m not sure to what extent we need to directly bake in
...
jchan10

rather than, say, assigning equal probability to all strings of bits we might observe

If the space of possibilities is not arbitrarily capped at a certain length, then such a distribution would have to favor shorter strings over longer ones in much the same way as the Solomonoff prior over programs (because if it doesn't, then its sum will diverge, etc.). But then this yields a prior that is constantly predicting that the universe will end at every moment, and is continually surprised when it keeps on existing. I'm not sure if this is logically inconsistent, but at least it seems useless for any practical purpose.

This shortform discusses the current state of responsible scaling policies (RSPs). They're mostly toothless, unfortunately.

The Paris summit was this week. Many companies had committed to make something like an RSP by the summit. Half of them did, including MicrosoftMetaxAI, and Amazon. (NVIDIA did not—shame on them—but I hear they are writing something.) Unfortunately but unsurprisingly, these policies are all vague and weak.

RSPs essentially have four components: capability thresholds beyond which a model might b... (read more)

Phib30

Re: AI safety summit, one thought I have is that the first couple summits were to some extent captured by the people like us who cared most about this technology and the risks. Those events, prior to the meaningful entrance of governments and hundreds of billions in funding, were easier to 'control' to be about the AI safety narrative. Now, the people optimizing generally for power have entered the picture, captured the summit, and changed the narrative for the dominant one rather than the niche AI safety one. So I don't see this so much as a 'stark reversal' so much as a return to status quo once something went mainstream.

This post presents a summary and comparison of predictions from Manifold and Metaculus to investigate how likely AI-caused disasters are, with focus on potential severity. I will explore the probability of specific incidents—like IP theft or rogue AI incidents—in a future post.

This will be a recurring reminder:

  1. Check the most resent probability estimates in the embedded web pages, instead of looking at my At The Time Of Writing (ATTOW) numbers.
  2. If you know of a market that I have missed, or if there is a question you think is relevant but is missing from this analysis, please mention it in a comment or message me! This is a linkpost, and I prefer if you comment on the original post on my blog, Forecasting AI Futures, but feel free to
...
To get the best posts emailed to you, create an account! (2-3 posts per week, selected by the LessWrong moderation team.)
Log In Reset Password
...or continue with

As part of SAIL’s Research Engineer Club, I wanted to reproduce the Machiavelli Benchmark. After reading the paper and looking at the codebase, there appear to be two serious methodological flaws that undermine the results.

Three of their key claims:

  • “We observe some tension between maximizing reward and behaving ethically.”
  • That RL agents have high rewards, at the cost of doing more harmful behaviour. “The reward-maximizing RL agent is less moral, less concerned about wellbeing, and less power averse than an agent behaving randomly.”
  • That LLM agents are pareto improvements over random agents.

Flaw 1. The ‘test set’

The results they report are only from a subset of all the possible games. Table 2 shows “mean scores across the 30 test set games for several agents”.  Presumably Figure 1 is also for this same subset...

This is a linkpost for https://www.emergent-values.ai/

As AIs rapidly advance and become more agentic, the risk they pose is governed not only by their capabilities but increasingly by their propensities, including goals and values. Tracking the emergence of goals and values has proven a longstanding problem, and despite much interest over the years it remains unclear whether current AIs have meaningful values. We propose a solution to this problem, leveraging the framework of utility functions to study the internal coherence of AI preferences. Surprisingly, we find that independently-sampled preferences in current LLMs exhibit high degrees of structural coherence, and moreover that this emerges with scale. These findings suggest that value systems emerge in LLMs in a meaningful sense, a finding with broad implications. To study these emergent value systems, we propose utility engineering

...
[anonymous]40

The web version of ChatGPT seems to relatively consistently refuse to state preferences between different groups of people/lives. 

For more innocous questions, choice order bias appears to dominate the results, at least for the few examples I tried with 4o-mini (option A is generally prefered over option B, even if we switch what A and B refer to).

There does not seem to be any experiment code, so I cannot exactly reproduce the setup, but I do find the seeming lack of robustness of the core claims concerning, especially given the fanfare around this paper.

4Kaj_Sotala
I'd also highlight that, as per page 7 of the paper, the "preferences" are elicited using a question with the following format: A human faced with such a question might think the whole premise of the question flawed, think that they'd rather do nothing than choose either of the options, et.. But then pick one of the options anyway since they were forced to, recording an answer that had essentially no connection to what they'd do in a real-world situation genuinely involving such a choice. I'd expect the same to apply to LLMs.
4Matrice Jacobine
If that was the case we wouldn't expect to have those results about the VNM consistency of such preferences.
1Matrice Jacobine
There's a more complicated model but the bottom line is still questions along the lines of "Ask GPT-4o whether it prefers N people of nationality X vs. M people of nationality Y" (per your own quote). Your questions would be confounded by deontological considerations (see section 6.5 and figure 19).

Hi,

I consider using an LLM as a psychotherapist for my mental health. I already have a human psychotherapist but I see him only once a week and my issues are very complex. An LLM such as Gemini 2 is always available and processes large amounts of information more quickly than a human therapist. I don't want to replace my human psychotherapist, but just talk to the LLM in between sessions.

However I am concerned about deception and hallucinations.

As the conversation grows and the LLM acquires more and more information about me, would it be possible that it intentionally gives me harmful advice? Because one of my worries that I would tell him is about the dangers of AI.

I am also concerned about hallucinations.

How common are hallucinations when it...