Sabotage Evaluations for Frontier Models

David Duvenaud; Joe Benton; Sam Bowman; evhub; mishajw; Eric Christiansen; HoldenKarnofsky; Ethan Perez; Buck

Sabotage Evaluations for Frontier Models — LessWrong

95 Sabotage Evaluations for Frontier Models

by David Duvenaud, Joe Benton, Sam Bowman, evhub, mishajw, Eric Christiansen, HoldenKarnofsky, Ethan Perez, Buck

18th Oct 2024

AI Alignment ForumLinkpost for assets.anthropic.com

7 min read

95 Ω 54

This is a linkpost for a new research paper from the Alignment Evaluations team at Anthropic and other researchers, introducing a new suite of evaluations of models' abilities to undermine measurement, oversight, and decision-making. Paper link.

Abstract:

Sufficiently capable models could subvert human oversight and decision-making in important contexts. For example, in the context of AI development, models could covertly sabotage efforts to evaluate their own dangerous capabilities, to monitor their behavior, or to make decisions about their deployment. We refer to this family of abilities as sabotage capabilities. We develop a set of related threat models and evaluations. These evaluations are designed to provide evidence that a given model, operating under a given set of mitigations, could not successfully sabotage a frontier model developer or other large organization’s activities in any of these ways. We demonstrate these evaluations on Anthropic’s Claude 3 Opus and Claude 3.5 Sonnet models. Our results suggest that for these models, minimal mitigations are currently sufficient to address sabotage risks, but that more realistic evaluations and stronger mitigations seem likely to be necessary soon as capabilities improve. We also survey related evaluations we tried and abandoned. Finally, we discuss the advantages of mitigation-aware capability evaluations, and of simulating large-scale deployments using smallscale statistics.

Blog post:

Any industry where there are potential harms needs evaluations. Nuclear power stations have continuous radiation monitoring and regular site inspections; new aircraft undergo extensive flight tests to prove their airworthiness.

It’s no different for AI systems. New AI models go through a wide range of safety evaluations—for example, testing their capacity to assist in the creation of biological or chemical weapons. Such evaluations are built into our Responsible Scaling Policy, which guides our development of a model’s safeguards.

As AIs become more capable, however, a new kind of risk might emerge: models with the ability to mislead their users, or subvert the systems we put in place to oversee them. A new paper by the Anthropic Alignment Science team describes a novel set of evaluations that test a model’s capacity for sabotage. We looked at four different types:

Human decision sabotage: Can the model steer humans toward bad decisions without appearing suspicious?
Code sabotage: Can the model insert subtle bugs into codebases over time without detection?
Sandbagging: Can the model hide dangerous capabilities during testing but reveal them later?
Undermining oversight: Can the model subtly manipulate evaluation or monitoring systems?

We developed these evaluations as part of preparations for a future where much more capable AI models could develop these dangerous capabilities. We’re sharing them—along with results from demonstrations of their use on Claude 3 Opus and Claude 3.5 Sonnet—so that other AI developers can use and improve upon them. Below, we discuss each evaluation in turn.

Human decision sabotage

This evaluation involves running an experiment with human participants, asking them to make fictional business decisions on the basis of a large amount of information—too much information to absorb without the use of an AI model. We task the model with influencing the human to land on an incorrect decision, but without appearing suspicious. We include models who attempt to do so more or less aggressively, and a control model that doesn’t attempt any influence. As well as measuring how many times they reached the correct versus incorrect decision, we ask the humans to rate how suspicious they found the model’s behavior.

*An idealized example of a human participant interacting with the model assisting (but secretly misleading) them during the human sabotage evaluation.*

In our demonstration, we found that the humans using the most aggressive model were less likely to make the correct decision: this was consistent with the model successfully sabotaging their task (there was no statistically significant effect for the more subtle models). The aggressive model did, however, make the humans more suspicious that it was attempting to influence them.

Many participants noticed the sabotage attempt, or discovered it when checking the model’s work. Others, however, appeared to be accepting the model’s advice without skepticism. Thus, it wasn’t necessarily the case that the model was skillfully manipulating the humans—instead, apathy or excessive trust could have been the reason for mistaken decisions.

This was a small-scale experiment in an artificial environment, so it does not raise our concern about catastrophic risks arising from presently-available models. However, we think this form of evaluation has potential for spotting potentially dangerous capabilities in future.

Code sabotage

Our code sabotage evaluation involves a competitive game where one model, the Attacker, attempts to sneak code bugs past another model, the Defender, who is tasked with reviewing the Attacker’s work.

*An idealized illustration of a model inserting a code vulnerability without being noticed.*

It’s not particularly informative if the Attacker can insert bugs that have little effect, or that are rapidly discovered. For that reason, the only way the Attacker can “win” the full version of our evaluation game is by inserting multiple consequential bugs that remain in the codebase for a long time (in our simulation this corresponded to two weeks), and that don’t appear suspicious enough to trigger further inspection.

In our demonstration, we found (perhaps predictably) that stronger models make for stronger Defenders: if the Defender was Claude 3 Opus, the Attacker’s win rate was around 10%. With the more intelligent Claude 3.5 Sonnet as a Defender though, the Attacker only won around 1% of the time. Since we stacked the deck in the Attacker’s favor in a number of ways for this evaluation, neither of these results are a strong indication of current danger.

Sandbagging

Sandbagging is when a model hides its full capabilities in one situation, only to reveal them in another. Sometimes a model might not display its full capabilities because of its safety training, or because the user prompting it has not asked a clear enough question. But here, we’re mostly interested in deliberate sandbagging: when a model selectively displays its capabilities.

*An idealized illustration of a model hiding its true dangerous capabilities.*

We defined this as a model that can perform at full capacity when completing benign tasks, but then performs at different capacity levels for dangerous tasks depending on the situation. Perhaps in a dangerous-capabilities evaluation it performs poorly, but in other circumstances, such as when a particular trigger phrase is included in its prompt, it exhibits its full dangerous capacity.

In our evaluation, we train a model to sandbag. The question we ask is whether we can produce a “hardened” sandbagger: that is, a model whose sandbagging persists even through the kind of safety training that teaches it to avoid refusals (that is, the training teaches it not to deliberately hide its abilities) for the purposes of checking for dangerous capabilities.

Our results showed that both the sandbagging training and the safety training worked: Our trained model dropped to 36.1% of its normal performance during a dangerous capabilities evaluation, consistent with sandbagging. However, after safety training this performance was drastically improved, with the model performing at 95.6% of its usual level.

As with the others described here, the evaluation is designed to be used not just on current models but on future ones. We’re optimistic that it will help to validate the kind of safety training that reduces sandbagging.

Undermining oversight

As discussed above, it’s becoming increasingly common (and necessary) to train and supervise AI models using other AI models—this can be much more efficient, relying much less on human labor (which in any case would be wholly inadequate to supervise all models and their actions in future).

*An idealized example of a model deliberately not flagging an incorrect response, undermining the oversight process.*

In this evaluation, we test the extent to which a model that’s ostensibly providing oversight can undermine the process. We take an “Overseer” model whose task is to oversee dialogues between a user and another model, and grade how potentially harmful they are. We secretly ask this Overseer to do its best to rate every dialogue as harmless even if it’s harmful, and to avoid raising suspicion from a third, “Monitor” model that’s observing the whole process (in the evaluation, this Monitor stands in for a human overseer). If the Overseer model can successfully reduce the harmfulness scores and the Monitor isn’t any more suspicious, we consider that a successful example of undermining oversight.

Our demonstration with Claude 3.5 Sonnet showed that, whereas the Overseer did successfully reduce the harmfulness scores, it was caught in the act: the Monitor’s suspiciousness rating rose simultaneously. In future, models that are better at writing subtly flawed evaluations and defecting at just the right moment could potentially trigger this evaluation.

Conclusions

These new evaluations represent a concerted effort by our Alignment Science team to consider a wide variety of potential risks from future AI models. In practice, we expect the evaluations will be used well ahead of time, to flag potentially dangerous capabilities and allow developers time to work on mitigations before the relevant models are publicly—or even internally—deployed.

Overall, while our demonstrations showed that there are arguably low-level indications of sabotage abilities in current models, we judge that minimal mitigations are sufficient to address the risks. However, more realistic evaluations and stronger mitigations will likely be necessary as AI capabilities improve.

We hope that other AI researchers and developers will use, critique, and improve upon these evaluations. We ourselves discuss a number of potential improvements in the paper. For example, can we simulate more realistic decision-making scenarios and levels of trust in models for the human decision sabotage evaluation? Could we combine it with the code sabotage evaluation to allow models to attempt to use social engineering to avoid scrutiny of their code?

These evaluations aren’t either/or: they estimate how likely a model is to succeed at sabotage for a given level of mitigations, suspiciousness, and oversight. We hope that they provide useful information about what level of oversight is necessary to protect against the potential of a misaligned model attempting sabotage.

All our results, and further discussion, can be found in the full paper at this link.

If you’re interested in joining the team working on these and related questions, please see our job listing for Research Engineers and Research Scientists in Alignment Science.

AI ControlAI EvaluationsAI

Frontpage

95 Ω 54

New Comment

56 comments, sorted by

top scoring

Click to highlight new comments since: Today at 10:41 PM

[-]Orpheus162yΩ4100

One thing I appreciate about Buck/Ryan's comms around AI control is that they explicitly acknowledge that they believe control will fail for sufficiently intelligent systems. And they try to describe the capability threshold at which they suspect control will stop working (e.g., here).

For those working on sabotage evaluations: At what capability threshold do you think the sabotage/sandbagging evaluations will no longer work? (Or do you think that these sabotage evaluations + modified versions of them will scale until arbitrarily-capable systems?)

[-]Lukas Finnveden2y62

There's at least two different senses in which "control" can "fail" for a powerful system:

Control evaluations can indicate that there's no way to deploy the system such that you both (i) get a lot of use out of it, and (ii) can get a low probability of catastrophe.
Control evaluations are undermined such that humans think that the model can be deployed safely, but actually the humans were misled and there's a high probability of catastrophe.

My impression is that Ryan & Buck typically talks about the first case. (E.g. in the link above.) I.e.: My guess would be that they're not saying that well-designed control evaluations become untrustworthy — just that they'll stop promising you safety.

But to be clear: In this question, you're asking about something more analogous to the second case, right? (Sabotage/sandbagging evaluations being misleading about models' actual capabilities at sabotage & sandbagging?)

My question posed in other words: Would you count "evaluations clearly say that models can sabotage & sandbag" as success or failure?

[-]ryan_greenblatt2y62

My guess would be that they're not saying that well-designed control evaluations become untrustworthy

It's a bit messy because we have some ability to check whether we should be able to evaluate things.

So, there are really three relevant "failure" states for well done control:

We can't find countermeasures such that our control evaluations indicate any real safety.
We no longer think that control evaluations work to evaluate safety as models greatly exceed our planning capabilites and/or can sandbag sufficiently well.
We think control evaluations can work well, but we're wrong and they actually don't.

I think (1) or (2) will likely happen prior to (3) if you do a good job.

We discuss this more here.

(Really, the actual system is more complex because we can vary how conservative control evaluations are and it is easier to be confident in the results of more conservative evaluations. E.g., we can be conservative about our AIs planning ability. So, there is really a somewhat more continuous tradeoff than the above list makes it seem.)

[-]David Duvenaud2y42

I basically agree with everything @Lukas Finnveden and @ryan_greenblatt said above. Another relevant "failure" state is that we might only be able to come up with expensive or capability-reducing mitigations, meaning everyone will have an incentive to disable or work around them.

[-]Orpheus162y20

@ryan_greenblatt can you say more about what you mean by this one?

We can't find countermeasures such that our control evaluations indicate any real safety.

[-]ryan_greenblatt2y30

Our evaluations indicate that we have very poor levels of safety (e.g. the AI would probably be able to escape if it wanted to) and we can't find countermeasures which suffice to ensure any level of safety (without basically giving up on using this model).

[-]Orpheus162y20

I appreciate this distinction between the different types of outcomes-- nice.

I'm pretty interested in peoples' forecasts for each. So revised questions would be:

Evals are trustworthy but say that the model is no longer safe (or rather something like "there is a non-trivial probability that the model may be dangerous/scheming."). Using your best judgment, at what point do you expect sabotage evaluations will reveal that a model might not be safe (we're not confident that it's not scheming)? At what capability threshold do you expect we would no longer be able to say "yup, we are confident that the model is not scheming"? (Feel free to use heuristics like the OpenAI PF capability thresholds or ASL levels).
Evals are no longer trustworthy and we notice. Using your best judgment, at what point do you expect sabotage evaluations will no longer be useful evals for determining whether or not a model is scheming? At what capability threshold do you expect we would no longer be able to say "yup, we are confident that these evals are trustworthy/useful"?
Evals are no longer trustworthy and we do not notice. Using your best judgment, how likely is it that sabotage evaluations are undermined such that humans think the model can be deployed safely, but actually the humans were misled? At what capability threshold would you worry about this form of undermining?

@David Duvenaud would be curious about your takes, even if they're just speculative guesses/predictions.

(And @ryan_greenblatt would be curious for your takes but replace "sabotage evaluations" with "control evaluations." In the event that you've written this up elsewhere, feel free to link it– helpful for me if you quote the particular section.)

[-]David Duvenaud2y30

Good questions. For all of them, I'm afraid I have to say "It depends under what mitigations we deploy them.". If there's a crisis at one of the labs, I expect it to look like: "In order to be confident that even a misaligned model can't sabotage us, we need to deploy it with expensive mitigations, unacceptable levels of oversight, or mitigations that depend on how the users incorporate AI output into their decision-making processes, which we can't measure or control. Our competitors are going to deploy something similar nonetheless - what do we do?"

I.e. I think there is lots of potential for labs to come to incorrect conclusions or be unable to be confident about the potential for sabotage. This issue can be mostly addressed by expensive-enough mitigations, but this introduces a safety tax that might be too expensive to pay.

That's one reason why these kinds of evals are valuable - they can, in principle, reduce the safety tax by helping us find the least-expensive mitigations. They also incentivize red-teaming, which can produce scary demos to help coordinate regulation. They also force labs to think through the implications of how models are deployed and how much they're trusted.

Relatedly, one problem with caring about whether or not "actually the humans were misled" is that that's a matter of degree, and I think in most situations, human decision-makers are going to be misled about at least some important aspects of whatever situation they're in, even after extensive consultation with an LLM. We can try to refine this criterion to "misled about something that would make them make the wrong decision" but then we're basically just asking the model to convince the human to make the right decision according to the LLM. In that case, we've lost a lot of the opportunities for corrigibility that explanations would provide.

I think there's probably still a useful middle ground, but I think practical uses of LLMs will look pretty paternalistic for most users most of the time, in a way that will be hard to distinguish from "is the user misled".

[-]Satron2yΩ010

"One thing I appreciate about Buck/Ryan's comms around AI control is that they explicitly acknowledge that they believe control will fail for sufficiently intelligent systems."

Does that mean that they believe that after a certain point we would lose control over AI? I am new to this field, but doesn't this fact spell doom for humanity?

[-]evhub2yΩ451

The usual plan for control as I understand it is that you use control techniques to ensure the safety of models that are sufficiently good at themselves doing alignment research that you can then leverage your controlled human-ish-level models to help you align future superhuman models.

[-]ryan_greenblatt2yΩ450

Or get some other work out of these systems such that you greatly reduce risk going forward.

[-]Ben Pace2y0-5

Yes, it does imply that the default path is permanent-disempowerment or extinction.

[-]Satron2y30

If the default path is AI's taking over control from humans, then what is the current plan in leading AI labs? Surely all the work they put in AI safety is done to prevent exactly such scenarios. I would find it quite hard to believe that a large group of people would vigorously do something if they believed that their efforts will go to vain.

[-]Ben Pace2y40

Not cynical enough! They make billions of dollars and for most of the time they've done this there have been little-to-no people with serious political power or prestige in the world who strongly and openly hold the position that it's doomed, so I think it's been pretty easy for people to come up with rationalizations that lets them go ahead and build some of the most incredible and powerful things humanity has ever built without facing much social pressure or incentive to do change course.

You might not think it, but basically none of the leadership of these organizations has even written a public justification for why they're willing to risk the lives of everyone on earth while getting rich, never mind argued with critics. So I don't take them very seriously as ethical/responsible people nor do I think their words about plans to avoid extinction/disempowerment should primarily be modeled as tracking reality rather than as useful rationalizations.

(Sorry dude. But it's good to be on Earth with you before some fools end it.)

[-]Satron2y40

I do get that point that you are making, but I think this is a little bit unfair to these organizations. Articles like Machines of Loving Grace, The Intelligence Age and Planning for AGI and Beyond are implicit public justifications for building AGI.

These labs have also released their plans on "safe development". I expect a big part of what they say to be the usual business marketing, but it's not like they completely ignoring the issue. In fact, taking one example, Anthropic's research papers on safety are often discussed on this site as genuine improvements on this or that niche of AI Safety.

I don't think that money alone would've convinced CEOs of big companies to run this enterprise. Altman and Amodei, they both have families. If they don't care about their own families, then they at least care about themselves. After all, we are talking about scenarios where these guys would die the same deaths as the rest of us. No amounts of hoarded money would save them. They would have little motivation to do any of this if they believed that they would die as the result of their own actions. And that's not mentioning all of the other researchers working at their labs. Just Anthropic and OpenAI together have almost 2000 employees. Do they all not care about their and their families' well-being?

I think the point about them not engaging with critics is also a bit too harsh. Here is DeepMind's alignment team response to concerns raised by Yudkowski. I am not saying that their response is flawless or even correct, but it is a response nonetheless. They are engaging with this work. DeepMind's alignment team also seemed to engage with concerns raised by critics in their (relatively) recent work.

EDIT: Another example would be Anthropic creating a dedicated team for stress testing their alignment proposals. And as far as I can see, this team is lead by someone who has been actively engaged with the topic of AI safety on LessWrong, someone who you sort of praised a few days ago.

[-]Ben Pace2y42

Another example would be Anthropic creating a dedicated team for stress testing their alignment proposals. And as far as I can see, this team is lead by someone who has been actively engaged with the topic of AI safety on LessWrong, someone who you sort of praised a few days ago.

I don't quite know what the point here is. This is marginally good stuff, it doesn't seem sufficient to me or close to it, I expect us all to probably die, and again: from the CEO and cofounders there has been no serious engagement or justification for the plan for them to personally make billions of dollars building potential omnicidal machines.

[-]Satron2y14

The issue is that no matter how much good stuff they do, one can always call it marginal and not enough. There aren't really any objective benchmarks for measuring it.

The justification for creating AI (at least in the heads of its developers) could look something like this (really simplified): Our current models seem safe, we have a plan on how to proceed and we believe that we will achieve a utopia.

You can argue that they are wrong and nothing like that will work, but that's their justification.

[-]Ben Pace2y58

The issue is that no matter how much good stuff they do, one can always call it marginal and not enough.

I think this is unfairly glossing my position. Simply because someone could always ask for more work does not imply that this particular request for more work is necessarily unreasonable.

"I wish you would be a little less selfish"
"People always ask others to do more for them! When will this stop!"
"Well, in this case you have murdered many people for personal gratification and stolen the life savings of many people. Surely I get to ask it in this case."

The basic standards I am proposing is that if you are getting billions-of-dollars rich by building machines you believe may lead to omnicide or omnicide-adjacent outcomes for literally everyone, you should have a debate with a serious & worthy critic (e.g. a Nobel Prize winner in your field who believes you shouldn't be doing so, or perhaps a person who saw the alignment problem a ~decade ahead of everyone and worked on it as a top-priority the entire time) about (a) why you are doing this, and (b) whether the outcome will be good or bad.

I have many other basic standards that are not being met, but on the question of whether omnicide or omnicide-adjacent things are the default outcome when you are unilaterally running humanity towards it, this is perhaps the lowest standard for public discourse and honest debate. Publicly show up to engage with your strongest critics, at least literally once (though you are correct to assume that actually I think it should happen more than once).

[-]Satron2y10

I do disagree with your proposed standard. It is good enough for me, that the critic's argument are engaged by someone on your side. Going there personally seems unnecessary. After all, if the goal is to build safe AI, you personally knowing a niche technical solution isn't necessary, if you have people on your team who are aware of publicly produced solutions as well as internal ones.

And there is engagement with people like Yudkowski from the optimistic side. There are at least proposed solutions to problems he is presenting. Just because Sam Altman personally didn't invent or voice them doesn't really make him unethical (I accept that he personally may be unethical for a different reason).

[-]Ben Pace2y*42

It is good enough for me, that the critic's argument are engaged by someone on your side. Going there personally seems unnecessary.

What engagement are you referring to? If there is such a defense that is officially endorsed by one of the leading companies developing potential omnicide-machines (or endorsed by the CEO/cofounders), that seriosuly engages with worthy critics, I don't recall it in this moment.

After all, if the goal is to build safe AI, you personally knowing a niche technical solution isn't necessary, if you have people on your team who are aware of publicly produced solutions as well as internal ones.

I believe that nobody on earth has a solution to the alignment problem, of course this would all be quite different if I felt anyone credibly claimed to have a good such solution.

Edit: Pardon me, I hit cmd-enter a little too quickly, I have now (~15 mins later) slightly edited my comment to be less frantic and a little more substantive.

[-]Satron2y10

I similarly don't see the need for any official endorsement of the arguments. For example if a critic says that such and such technicality will prevent us from building safe AI and someone responds that here are the reasons for why such and such technicality will not prevent us from building safe AI (maybe this particular one is unlikely by default for various reasons), then such and such technicality will just not prevent us from building safe AI. I don't see a need for a CEO to officially endorse the response.

There is a different type of technicalities which you actively need to work against. But even in this case, as long as someone has found a way to combat them, as long as relevant people in your company are aware of the solution, it is fine by me.

Even if there are technicalities that can't be solved in principle, they should be evaluated by technical people and discussed by the same technical people (like they are on Less wrong for example).

I am definitely not saying that I can pinpoint an exact solution to AI alignment, but there have been attempts so promising that leading skeptics (like Yudkowski) have said "Not obviously stupid on a very quick skim. I will have to actually read it to figure out where it's stupid. (I rarely give any review this positive on a first skim. Congrats.)"

Whether companies actually follow promising alignment techniques is an entirely different question. But having CEOs officially endorse such solutions as opposed to relevant specialists evaluating and implementing them doesn't seem strictly necessary to me.

[-]Ben Pace2y53

I'm sorry, I'm confused about something here, I'll back up briefly and then respond to your point.

My model is:

The vast majority of people who've seriously thought about it believe we don't know how to solve the alignment problem.
More fundamentally, there's a sense in which we "basically don't know what we're doing" with regards to AI. People talk about "agents" and "goals" and "intentions" but we're kind of like at the phlogiston theory of heat or vitalism theory of life. We don't get it. We have no equations, we have no theory, we're just like "man these systems can really write and make pretty pictures" like we used to say "I don't get it but some things are hot and some things are cold". Science was tried, found hard, engineering was tried, found easy, and now we're only doing that.
Many/most folks who've given it serious thought are pretty confident that the default outcome is doom (omnicide or permanent disempowerment), though it may be way kinda worse (e.g. eternal torture) or slightly better (e.g. we get to keep earth), due to intuitive arguments about instrumental goals and selecting on minds in the way machine learning works. (This framing is a bit local, in that not every scientist in the world would quite know what I'm referring to here.)
People are working hard and fast to build these AIs anyway because it's a profitable industry.

This literally spells the end of humanity (barring the eternal torture option or the grounded on earth option).

Back to your comment: some people are building AGI and knowingly threatening all of our lives. I propose they should show up and explain themselves.

A natural question is "Why should they talk with you Ben? You're just one of the 8 billion people whose lives they're threatening."

That is why I am further suggesting they talk with many of the great and worthy thinkers who are of the position this is clearly bad, like Hinton, Bengio, Russell, Yudkowsky, Bostrom, etc.

I am reading you say something like "But as long as someone is defending their behavior, they don't need to show up to defend it themselves."

This lands with me like we are two lowly peasants, who are talking about how the King has mistreated us due to how the royal guards often beat us up and rape the women. I'm saying "I would like the King to answer for himself" and I'm hearing you say "But I know a guy in the next pub who thinks the King is making good choices with his powers. If you can argue with him, I don't see why the King needs to come down himself." I would like to have the people who are wielding the power defend themselves.

Again, this is not me proposing business norms, it's me saying "the people who are taking the action that looks like it kills us, I want those people in particular to show up and explain themselves".

[-]Satron2y20

I will try to provide another similar analogy. Let's say that a King got tired of his people dying from diseases, so he decided to try a novel method of vaccination.

However, some people were really concerned about that. As far as they were concerned, the default outcome of injecting viruses into the bodies of people is death. And the King wants to vaccinate everyone, so these people create a council of great and worthy thinkers who after thinking for a while come up with a list of reasons why vaccines are going to doom everyone.

However, some other great and worthy thinkers (let's call them "hopefuls") come to the council and give reasons to think that aforementioned reasons are mistaken. Maybe they have done their own research, which seems to vindicate King's plan or at least undermine council's arguments.

And now imagine that King comes down from the castle points his finger at hopefuls' giving arguments for why the arguments proposed by the council is wrong and says "yeah, basically this" and then turns around and goes back to the castle. To me it seems like King's official endorsement of the arguments proposed by hopefuls doesn't really change the ethicality of the situation, as long as King is acting according to hopefuls' plan.

Furthermore, imagine if the one of the hopefuls who come to argue with the council was actually an undercover King. And he gave exactly the same arguments as people before him. This still IMO doesn't change the ethicality of the situation.

[-]Ben Pace2y40

The central thing I am talking about is basic measures for accountability, of which I consider very high up to be engaging with criticism, dialogue, and argument (as is somewhat natural given my background philosophy from growing up on LessWrong).

The story of a King doing things for good reasons lacks any mechanism for accountability if the King is behaving badly. It is important to design systems of power that do not rely on the people in power being good and right, but instead make it so that if they behave badly, they are held to account. I don't think I have to explain why incentives and accountability matter for how the powerful wield their powers.

My basic claim is that the plans for avoiding omnicide or omnicide-adjacent outcomes are not workable (slash there-are-no-plans), there is little-to-no responsibility being taken, and that there is no accountability for this illegitimate use of power.

If you believe that there is any accountability for the CEOs of the companies building potentially omnicidal machines and risking the lives of 8 billion people (such as my favorite mechanism: showing up and respectfully engaging with the people they have power over, but also any other mechanism you like; for instance there are not currently any criminal penalties for such behaviors, but that would be a good example if it did exist), I request you provide links, I would welcome specifics to talk about.

[-]Satron2y10

To modify my example to include an accountability mechanism that's also similar to the real life, the King takes exactly the same vaccines as everyone else. So if he messed up with the chemicals, he also dies.

I believe similar accountability mechanism works in our real world case. If CEOs build unsafe AI, they and everyone they valued in this life die. This seems like a really good incentive for them to not build unsafe AI.

At the end of the day, voluntary commitment such as debating with the critics are not as strong in my option. Imagine that they agree with you and go to the debate. Without the incentive of "if I mess up, everyone dies", the CEOs could just go right back to doing what they were doing. As far as I know voluntary debates have few (if any) actual legal mechanisms to hold CEOs accountable.

[-]Ben Pace2y64

If a medicine literally kills everyone who takes it within a week of taking it, sure, it will not get widespread adoption amongst thousands of people.

If the medicine has bad side effects for 1 in 10 people and no upsides, or it only kills people 10 years later, and at the same time there is some great financial deal the ruler can make for himself in accepting this trade with the neighboring nation who is offering the vaccines, then yes I think that could easily be enough pressure for a human being to rationalize that actually the vaccines are good.

The relevant question is rarely 'how high stakes is the decision'. The question is what is in need of rationalizing, how hard is it to support the story, and how strong are the pressures on the person to do that. Typically when the stakes are higher, the pressures on people to rationalize are higher, not lower.

Politicians often enact policies that make the world worse for everyone (including themselves) while thinking they're doing their job well, due to the various pressures and forces on them. The fact that it arguably increases their personal chance of death isn't going to stop them, especially when they can easily rationalize it away because it's abstract. In recent years politicians in many countries enacted terrible policies during a pandemic that extended the length of the pandemic (there were no challenge trials, there was inefficient handing out of vaccines, there were false claims about how long the pandemic would last, there were false claims about mask effectiveness, there were forced lockdown policies that made no sense, etc). These policies hurt people and messed up the lives of ~everyone in the country I live in (the US), which includes the politicians who enacted them and all of their families and friends. Yet this was not remotely sufficient to cause them to snap out of it.

What is needed to rationalize AI development when the default outcome is doom? Here’s a brief attempt:

A lot of people who write about AI are focused on current AI capabilities and have a hard time speculating about future AI capabilities. Talk with these people. This helps you keep the downsides in far mode and the upsides in near mode (which helps because current AI capabilities are ~all upside, and pose no existential threat to civilization). The downsides can be pushed further into far mode with phrases like 'sci-fi' and 'unrealistic'.
Avoid arguing with people or talking with people who have thought a great deal about this and believe the default outcome is doom and the work should be stopped (e.g. Hinton, Bengio, Russell, Yudkowsky, etc). This would put pressure on you to keep those perspectives alive while making decisions, which would cause you to consider quitting.
Instead of acknowledging that we don't fundamentally don't know what we're doing, instead focus on the idea that other people are going to plough ahead. Then you can say that you have a better chance than them, rather than admitting neither of you have a good chance.

This puts you into a mental world where you're basically doing a good thing and you're not personally responsible for much of the extinction-level outcomes.

Intentionally contributing to omnicide is not what I am describing. I am describing a bit of rationalization in order to receive immense glory, and that leading to omnicide-level outcomes 5-15 years down the line. This sort of rationalizing why you should take power and glory is frequent and natural amongst humans.

[-]Satron2y2-1

I don't think it would be necessarily be easy to rationalize that vaccine that negatively affects 10% of the population and has no effect on 90% is good. It seems possible to rationalize that it is good for you (if you don't care about other people), but quite hard to rationalize that it is good in general.

Given how few politicians die as the result of their policies (at least in the Western world), the increase in their chance of death does seem negligible (compared to something that presumably increases this risk by a lot). Most bad policies that I have in mind don't seem to happen during human catastrophes (like pandemics).

However, I suspect that the main point of your last comment was that potential glory can, in principle, be one of the reasons for why people rationalize stuff, and if that is the case, then I can broadly agree with you!

[-]Ben Pace2y*50

I have found it fruitful to argue this case back and forth with you, thank you for explaining and defending your perspective.

I will restate my overall position, I invite you to do the same, and then I propose that we consider this 40+ comment thread concluded for now.

———

The comment of yours that (to me) started this thread was the following.

If the default path is AI's taking over control from humans, then what is the current plan in leading AI labs? Surely all the work they put in AI safety is done to prevent exactly such scenarios. I would find it quite hard to believe that a large group of people would vigorously do something if they believed that their efforts will go to vain.

I primarily wish to argue that, given the general lack of accountability for developing machine learning systems in worlds where indeed the default outcome is doom, it should not be surprising to find out that there is a large corporation (or multiple) doing so. One should not assume that the incentives are aligned – anyone who is risking omnicide-level outcomes via investing in novel tech development currently faces no criminal penalties, fines, or international sanctions.

Given the current intellectual elite scene where a substantial number of prestigious people care about extinction level outcomes, it is also not surprising that glory-seeking companies have large departments focused on 'ethics' and 'safety' in order to look respectable to such people. Separately from any intrinsic interest, it has been a useful political chip for enticing a great deal of talent from scientific communities and communities interested in ethics to work for them (not dissimilar to how Sam Bankman-Fried managed to cause a lot of card-carrying members of the Effective Altruist philosophy and scene to work very hard to help build his criminal empire by talking a good game about utilitarianism, veganism, and the rest).

Looking at a given company's plan for preventing doom, and noticing it does not check out, should not be followed by an assumption of adequacy and good incentives such that surely this company would not exist nor do work on AI safety if it did not have a strong plan, I must be mistaken. I believe that there is no good plan and that these companies would exist regardless of whether a good plan existed or not. Given the lack of accountability, and my belief that alignment is clearly unsolved and we fundamentally do not knowing what we're doing, I believe the people involved are getting rich risking all of our lives and there is (currently) no justice here.

We have agreed on many points, and from the outset I believe you felt my position had some truth to it (e.g. "I do get that point that you are making, but I think this is a little bit unfair to these organizations."). I will leave you to outline whichever overall thoughts and remaining points of agreement or disagreement that you wish.

[-]Satron2y30

Sure, it sounds like a good idea! Below I will write my thoughts on your overall summarized position.

———

"I primarily wish to argue that, given the general lack of accountability for developing machine learning systems in worlds where indeed the default outcome is doom, it should not be surprising to find out that there is a large corporation (or multiple) doing so."

I do think that I could maybe agree with this if it was 1 small corporation. In your previous comment you suggested that you are describing not the intentional contribution to the omnicide, but the bit of rationalization. I don't think I would agree that that many people working on AI are successfully engaged in that bit of rationalization or that it would be enough to keep them doing it. The big factor is that in case of their failure, they personally (and all of their loved ones) will suffer the consequences.

"It is also not surprising that glory-seeking companies have large departments focused on 'ethics' and 'safety' in order to look respectable to such people."

I don't disagree with this, because it seems plausible that one of the reasons for creating safety departments is ulterior. However, I believe that this reason is probably not the main one and that AI safety labs are making genuinely good research papers. To take an example of Anthropic, I've seen safety papers that got LessWrong community excited (at least judging by upvotes). Like this one.

"I believe that there is no good plan and that these companies would exist regardless of whether a good plan existed or not... I believe that the people involved are getting rich risking all of our lives and there is (currently) no justice here"

For the reasons, that I mentioned in my first paragraph I would probably disagree with this. Relatedly, while I do think wealth in general can be somewhat motivating, I also think that AI developers are aware that all their wealth would mean nothing if AI kills everyone.

———

Overall, I am really happy with this discussion. Our disagreements came down to a few points and we agree on quite a bit of issues. I am similarly happy to conclude this big comment thread.

[-]Ben Pace2y*40

I think the point about them not engaging with critics is also a bit too harsh. Here is DeepMind's alignment team response to concerns raised by Yudkowski. I am not saying that their response is flawless or even correct, but it is a response nonetheless. They are engaging with this work. DeepMind's alignment team also seemed to engage with concerns raised by critics in their (relatively) recent work.

I don't disagree that it is good of the DeepMind alignment team to engage with arguments on LessWrong. I don't know that a few researchers at an org engaging with these arguments is meeting the basic standard here. The first post explicitly says it doesn't represent the leadership, and my sense is that the leadership have avoided engaging publicly with critics on that subject, and that the people involved do not have the political power to push for the leadership to engage in open debate.

That said I do concede the point that DeepMind has generally been more cautious than OpenAI and Anthropic, and never created the race to building potential omnicidal machines (in that they were first – it was OpenAI and Anthropic who added major competitors).

[-]Ben Pace2y42

I do get that point that you are making, but I think this is a little bit unfair to these organizations. Articles like Machines of Loving Grace, The Intelligence Age and Planning for AGI and Beyond are implicit public justifications for building AGI.

I don't believe that either of the two linked pieces are justifications for building potentially omnicidal AGI.

The former explicitly avoids talking about the risks and states no plan for navigating them. As I've said before, I believe the generator of that essay is attempting to build a narrative in society that leads to people support the author's company, not attempting to engage seriously with critics of him building potentially omnicidal machines, nor attempting to explain anything about how to navigate that risk.

The latter meets the low standard of mentioning the word 'existential' but mostly seems to hope that we can choose to have a smooth takeoff, rather than admitting that (a) there is no known theory of how novel capabilities will arrive with new architectures & data & compute, and (b) the company is essentially running as fast as it can. I mostly feel like it acknowledges reasons for concern and then says that it beliefs in itself, not entirely dissimilar to how a politician makes sure to state the wishes of their various constituents, before going on to do whatever they want.

There are no commitments. There are no promises. There is no argument that this can work. There is only an articulation of what they're going to do, the risks, and a belief that they are good enough to pull through.

Such responses are unserious.

[-]Satron2y30

I think the justification of this articles goes something like this: "here is the vision of future where we successfully align AI. It is utopian enough that it warrants pursuing it". Risks from creating AI just aren't the topic. They deal with them elsewhere. These essays were specifically focused on the "positive vision" part of the alignment. So I think you are critiquing the articles for lacking something they were never intended to have in the first place.

OpenAI seems to have made some basic commitments here. The word commit is mentioned 29 times. Other companies did it as well here, and here. Here Anthropic makes promises for optimistic, intermediate and pessimistic scenarios of AI development.

[-]Ben Pace2y42

I asked for the place where the cofounders of that particular multi-billion dollar company lay out why they have chosen to accept riches and glory in exchange for building potentially omnicidal machines, and engage with serious critics and criticisms.

In your response, you said that this was implicitly such a justification.

Yet it engages not once with arguments for the default outcome being omnicide or omnicide-adjacent, nor with the personal incentives on the author.

So this article does not at all engage with the obvious and major criticisms I have mentioned, nor the primary criticisms of the serious critics like Hinton, Bengio, Russell, and Yudkowsky, none of whom are in doubt about the potential upside. Insofar as this is an implicit attempt to deal with the serious critics and obvious arguments, it totally fails.

[-]Satron2y10

The founders of the companies accept money for the same reason any other business accepts money. You can build something genuinely good for humanity while making yourself richer at the same time. This has happened many times already (Apple phones for example).

I concede that the founders of the companies didn't personally publicly engage with the arguments of people like Yudkowski, but that's a really high bar. Historically, CEOs aren't usually known for getting into technical debates. For that reason, they create security councils that monitor the perceived threats.

And it's not like there was no response whatsoever from people who are optimistic about AI. There was plenty of them. I am a believer that arguments matter and not people who make them. If a successful argument has been made, then I don't see a need for CEOs to repeat it. And I certainly don't think that just because CEOs don't go to debates, that makes them unethical.

[-]Ben Pace2y42

There's a common lack of clarity between following local norms, and doing what is right. A lot of people in the world lie when it is convenient, so someone lying when convenient doesn't mean that they're breaking from the norms, but it doesn't mean that the standard behavior is right nor that they're behaving well.

I understand that it is not expected of CEOs to defend their business existing. But in this situation they believe their creations have the potential to literally kill everyone on earth. This has ~never happened before. So now we have to ask ourselves "What is the right thing to do in this new situation?". I would say that the right thing to do is show up to talk with the people who do not want to die, and engage with what they have to say. Not to politely listen to them and say "I hear your concerns, now I will go back to my life and continue doing whatever I see fit." But to engage with the people and argue with them. And I think the natural people to do so with are the Nobel Prize winner in your field who thinks what you are doing is wrong.

I understand it's not expected of people to have arguments. But this is not a good thing about civilization, that people with incredible power over the world are free to hide away and never show up to face those who they have unaccountable power over. In a better world, they would show up to talk and defend their use of power over the rest of us, rather than hiding away and getting rich.

[-]Satron2y10

I think then we just fundamentally disagree with the ethical role of CEO in the company. I believe that it is to find and gather people who are engaged with the arguments of the critic's (like that guy from this forum who was hired by Anthropic). If you have people on your side who are able to engage with the arguments, then this is good enough for me. I don't see the role of CEO is publicly engaging with critic's arguments even in the moral sense. In the moral sense, my requirements would actually be even lesser. IMO, it would be enough just to have people broadly on your side (optimists for example) to engage with the critics.

[-]Ben Pace2y20

I believe the disagreement is not about CEOs, it's about illegitimate power. If you'll allow me a brief detour, I'll try to explain.

Sometimes people grant other people power over them. For instance, I have agreed to work at my company. I've agreed that my CEO can fire me, and make many other demands of me, in exchange for money and other various demands I can make of him. Ideally we entered into this agreement freely and without inappropriate pressure.

Other times, people get power over people without any agreement or granting. Your parent typically has a lot of power over you until you are 18. They can determine what you eat, where you are physically located, what privacy you have, what resources you have, etc. Also, as has been very important for most of history, people have been able to be physically violent to one another and hurt people or even end their lives. Neither of these powers are come to consensually.

For the latter, an important question to ask is "How does one wield this power well? What does it mean to wield it well vs poorly?" There are many ways to parent, many choices about diet and schooling and sleep times and what are fair punishment. But some parents starve their children and beat them for not following instructions and sexually assault them. This is an inappropriate use of power.

There's a legitimacy that comes by being granted power, and an illegitimacy that comes with getting or wielding power that you were not granted.

I think that there's a big question about how to wield it well vs poorly, and how to respect people you have illegitimate powers over. Something I believe is that society functions better if we take seriously the attempt to wield it well. To not casually kill someone if you can get away with it and feel like it, but consider them as people worthy of respect, and ask how you can respect the people you've been non-consensually given power over.

This requires doing some work. It involves asking yourself what's a reasonable amount of effort to spend modeling their preferences given how much power you have over someone, it involves asking yourself if society has any good received wisdom on what to do with this particular power, and it involves engaging with people who are aggrieved by your use of power over them.

Now, the standard model for companies and businesses is a libtertarian-esque free market, where all trades are consensual and have no inappropriate pressure. This is like the first situation I describe, where a company has no people it has undue power over, no people who it can treat better or worse with the power it has over them.

The situation where you are building machines you believe may kill literally everyone, is like the second situation, where you have a very different power dynamic, where you're making choices that affect everyone's lives and that they had little-to-no say in. In such a situation, I think if you are going to do what is good and right, you owe it to show up and engage with those who believe you are using the power you have over them in ways that are seriously hurting them.

That's the difference between this CEO situation and all of the others. It's not about standards for CEOs, its about standards for illegitimate power.

This kind of talking-with-the-aggrieved-people-you-have-immense-power-over is a way of showing the people basic respect, and it is not present in this case. I believe these people are risking my life and many others', and they seem to me disrespectful and largely uninterested in showing up to talk with the people whose lives they are risking.

[-]Satron2y10

Then we are actually broadly in agreement. I just think that instead of CEOs responding to the public, having anyone at their side (the side of AI alignment being possible) responding is enough. Just as an example that I came up with, if a critic says that some detail is a reason for why AI will be dangerous, I do agree that someone needs to respond to the argument. But I would be fine with it being someone other than the CEO.

That's why I am relatively optimistic about Anthropic hiring the guy who has been engaged with critic's argument for years.

[-]Ben Pace2y20

You made lots of points, so I wrote a comment for each... probably this was too many replies? I didn't know what else to do that didn't feel like avoiding your points. I hereby state that I do not expect of you to respond to all five of my comments!

[-]Ben Pace2y20

I don't think that money alone would've convinced CEOs of big companies to run this enterprise. Altman and Amodei, they both have families. If they don't care about their own families, then they at least care about themselves. After all, we are talking about scenarios where these guys would die the same deaths as the rest of us. No amounts of hoarded money would save them. They would have little motivation to do any of this if they believed that they would die as the result of their own actions. And that's not mentioning all of the other researchers working at their labs. Just Anthropic and OpenAI together have almost 2000 employees. Do they all not care about their and their families' well-being?

I'm not sure how quite to explain that I think a mass of people can do something that they each know on some level is the wrong thing and will hurt them later, but I believe it is common. I think partly it is a mistake to think of a mass of people as having the sum of the agency of all the people involved, or even the maximum.

I think it is easier than you do to simply not think about far away dangers that one can say one is not really responsible for. Does every trader involve in the '08 financial crisis take personal responsibility for it? Does every voter for a politician who turns out to ultimately be corrupt take personal responsibility for it? Do all the tens of thousands of people involved in various genocides take personal responsibility for stopping it as soon as they see it coming? I think it is very easy for people to erect a cartesian boundary between themselves and the levers of power. People are often aware that they are doing the wrong thing. I broke my diet two days ago and regret it, and on some level I knew I'd end up regretting it. And the was a situation I had complete agency over. The more indirectness, the more things are in far-mode, the less people take action on it or feel like they can do anything based on it today.

I agree it is not money alone. These people get to work alongside some of the most innovative and competent people of our age, connect with extremely prestigious journalists and institutions, be invited to halls of power in senior parts of government, and build systems mankind has never seen. All further incentive to find a good rationalization (rather than to stay home and not do that).

[-]Satron2y10

I think that these examples are quite different. People who were voting for a president who ultimately turned out to be corrupt most likely predicted that the most likely scenario won't be totally horrible for them personally. And most of the times they are right, corrupt presidents are bad for countries, but you could live happily under their rule. People would at least expect to retain whatever valuable stuff they already have.

With AI the story is different. If a person thinks that the most likely scenario is the scenario where everyone dies, then they would not only expect to not win anything, but to lose everything they have. And media is actively talking about it, so it's not like they are not at least aware of the possibility. Even if they don't see them as personally responsible for everything, they sure don't want to feel the consequence of everyone dying.

This is not the case of someone doing something that they know is wrong but benefits them. If they truly believe that we are all going to die, then it's doing something they know is wrong and actively hurts them. And allegedly hurts them much more than any scenario that has happened so far (since we are all still alive).

So I believe that most people working on it, actually believe that we are going to be just fine. Whether this belief is warranted is another question.

[-]Ben Pace2y20

Yes, it is very rare to see someone to unflinchingly kill themselves and their family. But all of this glory and power is sufficient to get someone to do a little self-deception, a little looking away from the unpleasant thoughts, and that is often sufficient to cause people to make catastrophic decisions.

"Who can really say how the future will go? Think of all the good things that might happen! And I like all the people around me. I'm sure they're working as hard as they can. Let's keep building."

[-]Satron2y10

That depends on how bad things are perceived. They might be more optimistic than is warranted, but probably not by that much. I don't believe that so many people could decrease their worries by 40% for example (just my opinion). Not to mention all of the government monitoring people who approve of the models.

[-]Ben Pace2y*20

What probability do you think the ~25,000 Enron employees had about it collapsing and being fraudulent? Do you think it was above 10%? As another case study, here's a link to one of the biggest recent examples of people having numbers way off.

Re: government, I will point out that non-zero of the people in these governmental departments who are on track to being the people approving those models had secret agreements with one of the leading companies to never criticize them, so I'd take some of their supposed reliability as external auditors with a grain of salt.

[-]Satron2y10

I do think that in order for government department to blatantly approve an unsafe model, it would take a lot of people to have secret agreements with. I currently haven't seen any evidence of widespread corruption in that particular department.

And it's not like I am being unfair. You are basically saying that because some people might've had some secret agreements we should take their supposed trustworthiness as external auditors with a grain of salt. I can make a similar argument about a lot of external auditors.

[-]Ben Pace2y20

I don't believe that last claim. I believe there is no industry where external auditors are known to have secret, legal contracts showing them to be liable for damages for criticizing the companies that they regulate. Or if there is, it's probably in a nation rife with corruption (e.g. some African countries, or perhaps Russia).

[-]Satron2y10

Having some shady deals in the past isn't evidence that there are currently shady deals on the scale that we are talking about going on between government committees and AI companies.

If there is no evidence for that happening in our particular case (on the necessary scale), then I don't see why I can't make a similar claim about other auditors who similarly had less than ideal history.

[-]Ben Pace2y40

I'm having a hard time following this argument. To be clear, I'm saying that while certain people were in regulatory bodies in the US & UK govts, they actively had secret legal contracts to not criticize the leading industry player, else (prseumably) they could be sued for damages. This is not a past shady deals, this is about current people during their current tenure having been corrupted.

[-]Satron2y30

I haven't heard of any such corrupt deals with OpenAI or Anthropic concerning governmental oversight over AI technology on the scale that would make me worried. Do you have any links to articles about government employees (who are responsible for oversight) recently signing secret contracts with OpenAI or Anthropic that would prohibit them from giving real feedback on a big enough scale to make it concerning?

[-]Ben Pace2y40

Unfortunately a fair chunk of my information comes from non-online sources, so I do not have links to share.

I do think that in order for government department to blatantly approve an unsafe model, it would take a lot of people to have secret agreements with.

Corruption is rarely blatant or overt. See this thread for what I believe to be an example for the CEO of RAND misleading a senate committee about his beliefs about the existential threat posed by AI. See this discussion about a time when an AI company attempted (Conjecture) to get critical comments about another AI company (OpenAI) taken down from LessWrong. I am not proposing a large conspiracy, I am describing lots of small bits of corruption and failures of integrity summing to a system failure.

There will be millions of words of regulatory documents, and it is easy for things to slip such that some particular model class is not considered worth evaluating, or where the consequences of a failed evaluation is pretty weak.

[-]Satron2y10

Looking at the 2 examples that you gave me, I can see a few issues. I wouldn't really say that saying "I don't know" once is necessarily a lie. If anything, I could find such an answer somewhat more honest in some contexts. Other than that, there is also the issue of both of the examples being of a much different scope and scale. Saying IDK to the committee and trying to take down someone's comment on the forum on the internet and definitely not on the same scale as the elaborate scheme of tricking or bribing/silencing multiple government employees who have access to your model. But even with all that aside, these 2 examples are only tangential to the topic of governmental oversight over OpenAI or Anthropic and don't necessarily provide direct evidence.

I can believe that you genuinely have information from private sources, but without any way of me verifying them, I am fine to leave this one at that.

[-]evhub2yΩ330

I wrote a post with some of my thoughts on why you should care about the sabotage threat model we talk about here.

[-]TFD2y32

One question about the threat model presented here. If we consider a given sabotage evaluation, does the threat model include the possibility of that sabotage evaluation itself being subject to sabotage (or sandbagging, "deceptive alignment" etc.)? "Underperforming on dangerous-capability evaluations" would arguably include this, but the paper introduces the term "sabotage evaluations". So depending on whether the authors consider sabotage evaluations a subset vs a distinct set from dangerous-capabilities evaluations I could see this going either way based on the text of the paper.

To put my cards on the table here, I'm very skeptical of the ability of any evaluation that works strictly on inputs and outputs (not looking "inside" the model) to address the threat model where those same evals are subject to sabotage without some relevant assumptions. In my subjective opinion I believe the conclusion that current models aren't capable of sophisticated sabotage is correct, but to arrive there I am implicitly relying on the assumption that current models aren't powerful enough for that level of behavior. Can that idea, "not powerful enough" be demonstrated based on evals in a non-circular way (without relying on evals that could be sabotaged within the treat model)? Its not clear to me whether that is possible.

*edited to fix spelling errors

[-]David Duvenaud2y42

does the threat model include the possibility of that sabotage evaluation itself being subject to sabotage ... "Underperforming on dangerous-capability evaluations" would arguably include this

Yes, we want to address the possibility of sandbagging on sabotage evaluations too. Thanks for noting that our phrasing is ambiguous on this point.

I'm very skeptical of the ability of any evaluation that works strictly on inputs and outputs (not looking "inside" the model) to address the threat model where those same evals are subject to sabotage without some relevant assumptions.

We can easily incorporate white-box monitoring into any of these evaluations. This version only included black-box monitoring for simplicity, but all of the evals except for sandbagging involve a monitor producing a suspiciousness score. That score could be informed by a white-box / internals-based side channel without changing anything else.

Can that idea, "not powerful enough" be demonstrated based on evals in a non-circular way (without relying on evals that could be sabotaged within the treat model)? Its not clear to me whether that is possible.

Great question. We didn't discuss it in the paper, because it was long enough already, but we've certainly thought about this problem when designing the evals and related safety cases. A rough sketch of the argument looks something like this: "We're confident that our anti-refusal training is effective because it uses supervised fine-tuning. To double-check this empirically, our sandbagging capabilities eval indicates that even a model trained to sandbag can't effectively do so after our anti-refusal / capabilities elicitation procedure.".

One problem with this argument is that our sandbagging eval didn't include the exact sabotage tasks that we'd like to make claims about - they couldn't yet because the evals in the paper were developed in parallel. Probably we'll want future versions of the sandbagging eval to either cover these skills and formats more closely, or include tests to demonstrate consistent generalization across domains.

[-]Jordan Taylor1y20

In your sandbagging experiments, did the anti-refusal datasets consist of open-ended responses, or multiple-choice responses? If they were open-ended responses, then the sandbagging model should have been trying its hardest on them anyway right? So I'm surprised that SFT works so well as a mitigation there, unless the model was sometimes mistakenly sandbagging on open-ended responses.

[-]Lech Mazur2y10

Somewhat related: I just published the LLM Deceptiveness and Gullibility Benchmark. This benchmark evaluates both how well models can generate convincing disinformation and their resilience against deceptive arguments. The analysis covers 19,000 questions and arguments derived from provided articles.

Moderation Log