aysja - LessWrong

I disagree. It would be one thing if Anthropic were advocating for AI to go slower, trying to get op-eds in the New York Times about how disastrous a situation this is, or actually gaming out and detailing their hopes for how their influence will buy saving-the-world points if everything does become quite grim, and so on. But they aren’t doing that, and as far as I can tell they basically take all of the same actions as the other labs except with a slight bent towards safety.

Like, I don’t feel at all confident that Anthropic’s credit has exceeded their debit, even on their own consequentialist calculus. They are clearly exacerbating race dynamics, both by pushing the frontier, and by lobbying against regulation. And what they have done to help strikes me as marginal at best and meaningless at worst. E.g., I don’t think an RSP is helpful if we don’t know how to scale safely; we don’t, so I feel like this device is mostly just a glorified description of what was already happening, namely that the labs would use their judgment to decide what was safe. Because when it comes down to it, if an evaluation threshold triggers, the first step is to decide whether that was actually a red-line, based on the opaque and subjective judgment calls of people at Anthropic. But if the meaning of evaluations can be reinterpreted at Anthropic’s whims, then we’re back to just trusting “they seem to have a good safety culture,” and that isn’t a real plan, nor really any different to what was happening before. Which is why I don’t consider Adam’s comment to be a strawman. It really is, at the end of the day, a vibe check.

And I feel pretty sketched out in general by bids to consider their actions relative to other extremely reckless players like OpenAI. Because when we have so little sense of how to build this safely, it’s not like someone can come in and completely change the game. At best they can do small improvements on the margins, but once you’re at that level, it feels kind of like noise to me. Maybe one lab is slightly better than the others, but they’re still careening towards the same end. And at the very least it feels like there is a bit of a missing mood about this, when people are requesting we consider safety plans relatively. I grant Anthropic is better than OpenAI on that axis, but my god, is that really the standard we’re aiming for here? Should we not get to ask “hey, could you please not build machines that might kill everyone, or like, at least show that you’re pretty sure that won’t happen before you do?”

AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work

aysja9d6572

Some commentary (e.g. here) also highlighted (accurately) that the FSF doesn’t include commitments. This is because the science is in early stages and best practices will need to evolve.

I think it’s really alarming that this safety framework contains no commitments, and I’m frustrated that concern about this is brushed aside. If DeepMind is aiming to build AGI soon, I think it’s reasonable to expect they have a mitigation plan more concrete than a list of vague IOUs. For example, consider this description of a plan from the FSF:

When a model reaches evaluation thresholds (i.e. passes a set of early warning evaluations), we will formulate a response plan based on the analysis of the CCL and evaluation results.

This isn’t a baseline of reasonable practices upon which better safety practices can evolve, so much as a baseline of zero—the fact that DeepMind will take some unspecified action, rather than none, if an evaluation triggers does not appear to me to be much of a safety strategy at all.

Which is especially concerning, given that DeepMind might create AGI within the next few years. If it’s too early to make commitments now, when do you expect that will change? What needs to happen, before it becomes reasonable to expect labs to make safety commitments? What sort of knowledge are you hoping to obtain, such that scaling becomes a demonstrably safe activity? Because without any sense of when these “best practices” might emerge, or how, the situation looks alarmingly much like the labs requesting a blank check for as long as they deem fit, and that seems pretty unacceptable to me.

AI is a nascent field. But to my mind its nascency is what’s so concerning: we don’t know how to ensure these systems are safe, and the results could be catastrophic. This seems much more reason to not build AGI than it does to not offer any safety commitments. But if the labs are going to do the former, then I think they have a responsibility to provide more than a document of empty promises—they should be able to state, exactly, what would cause them to stop building these potentially lethal machines. So far no lab has succeeded at this, and I think that’s pretty unacceptable. If labs would like to take up the mantle of governing their own safety, then they owe humanity the service of meaningfully doing it.

Zach Stein-Perlman's Shortform

aysja9d3528

I'm sympathetic to how this process might be exhausting, but at an institutional level I think Anthropic (and all labs) owe humanity a much clearer image of how they would approach a potentially serious and dangerous situation with their models. Especially so, given that the RSP is fairly silent on this point, leaving the response to evaluations up to the discretion of Anthropic. In other words, the reason I want to hear more from employees is in part because I don't know what the decision process inside of Anthropic will look like if an evaluation indicates something like "yeah, it's excellent at inserting backdoors, and also, the vibe is that it's overall pretty capable." And given that Anthropic is making these decisions on behalf of everyone, Anthropic (like all labs) really owes it to humanity to be more upfront about how it'll make these decisions (imo).

I will also note what I feel is a somewhat concerning trend. It's happened many times now that I've critiqued something about Anthropic (its RSP, advocating to eliminate pre-harm from SB 1047, the silent reneging on the commitment to not push the frontier), and someone has said something to the effect of: "this wouldn't seem so bad if you knew what was happening behind the scenes." They of course cannot tell me what the "behind the scenes" information is, so I have no way of knowing whether that's true. And, maybe I would in fact update positively about Anthropic if I knew. But I do think the shape of "we're doing something which might be incredibly dangerous, many external bits of evidence point to us not taking the safety of this endeavor seriously, but actually you should think we are based on me telling you we are" is pretty sketchy.

Raemon's Shortform

aysja16d106

Largely agree with everything here.

But, I've heard some people be concerned "aren't basically all SSP-like plans basically fake? is this going to cement some random bureaucratic bullshit rather than actual good plans?" And yeah, that does seem plausible.

I do think that all SSP-like plans are basically fake, and I’m opposed to them becoming the bedrock of AI regulation. But I worry that people take the premise “the government will inevitably botch this” and conclude something like “so it’s best to let the labs figure out what to do before cementing anything.” This seems alarming to me. Afaict, the current world we’re in is basically the worst case scenario—labs are racing to build AGI, and their safety approach is ~“don’t worry, we’ll figure it out as we go.” But this process doesn’t seem very likely to result in good safety plans either; charging ahead as is doesn’t necessarily beget better policies. So while I certainly agree that SSP-shaped things are woefully inadequate, it seems important, when discussing this, to keep in mind what the counterfactual is. Because the status quo is not, imo, a remotely acceptable alternative either.

lukehmiles's Shortform

aysja1mo62

Or independent thinkers try to find new frames because the ones on offer are insufficient? I think this is roughly what people mean when they say that AI is "pre-paradigmatic," i.e., we don't have the frames for filling to be very productive yet. Given that, I'm more sympathetic to framing posts on the margin than I am to filling ones, although I hope (and expect) that filling-type work will become more useful as we gain a better understanding of AI.

Linch's Shortform

aysja1mo5823

I think this letter is quite bad. If Anthropic were building frontier models for safety purposes, then they should be welcoming regulation. Because building AGI right now is reckless; it is only deemed responsible in light of its inevitability. Dario recently said “I think if [the effects of scaling] did stop, in some ways that would be good for the world. It would restrain everyone at the same time. But it’s not something we get to choose… It’s a fact of nature… We just get to find out which world we live in, and then deal with it as best we can.” But it seems to me that lobbying against regulation like this is not, in fact, inevitable. To the contrary, it seems like Anthropic is actively using their political capital—capital they had vaguely promised to spend on safety outcomes, tbd—to make the AI arms race counterfactually worse.

The main changes that Anthropic has proposed—to prevent the formation of new government agencies which could regulate them, to not be held accountable for unrealized harm—are essentially bids to continue voluntary governance. Anthropic doesn’t want a government body to “define and enforce compliance standards,” or to require “reasonable assurance” that their systems won’t cause a catastrophe. Rather, Anthropic would like for AI labs to only be held accountable if a catastrophe in fact occurs, and only so much at that, as they are also lobbying to have their liability depend on the quality of their self-governance: “but if a catastrophe happens in a way that is connected to a defect in a company’s SSP, then that company is more likely to be liable for it.” Which is to say that Anthropic is attempting to inhibit the government from imposing testing standards (what Anthropic calls “pre-harm”), and in general aims to inhibit regulation of AI before it causes mass casualty.

I think this is pretty bad. For one, voluntary self-governance is obviously problematic. All of the labs, Anthropic included, have significant incentive to continue scaling; indeed, they say as much in this document: “Many stakeholders reasonably worry that this [agency]... might end up… impeding innovation in general.” And their attempts to self-govern are so far, imo, exceedingly weak: their RSP commits to practically nothing if an evaluation threshold triggers, leaving all of the crucial questions, such as “what will we do if our models show catastrophic inclinations,” up to Anthropic’s discretion. This is clearly unacceptable, both the RSP in itself and Anthropic’s bid for it to continue to serve as the foundation of regulation. Indeed, if Anthropic would like for other companies to be safer, which I believed to be one of their main safety selling points, then they should be welcoming the government stepping in to ensure that.

Afaict their rationale for opposing this regulation is that the labs are better equipped to design safety standards than the government is: “AI safety is a nascent field where best practices are the subject of original scientific research… What is needed in such a new environment is iteration and experimentation, not prescriptive enforcement. There is a substantial risk that the bill and state agencies will simply be wrong about what is actually effective in preventing catastrophic risk, leading to ineffective and/or burdensome compliance requirements.” But there is also, imo, a large chance that Anthropic is wrong about what is actually effective at preventing catastrophic risk, especially so, given that they have incentive to play down such risks. Indeed, their RSP strikes me as being incredibly insufficient at assuring safety, as it is primarily a reflection of our ignorance, rather than one built from a scientific understanding, or really any understanding, of what it is we’re creating.

I am personally very skeptical that Anthropic is capable of turning our ignorance into the sort of knowledge capable of providing strong safety guarantees anytime soon, and soon is the timeframe by which Dario aims to build AGI. Such that, yes, I expect governments to do a poor job of setting industry standards, but only because I expect that a good job is not possible given our current state of understanding. And I would personally rather, in this situation where labs are racing to build what is perhaps the most powerful technology ever created, err on the side of the government guessing about what to do, and beginning to establish some enforcement about that, than leave it for the labs themselves to decide.

Especially so, because if one believes, as Dario seems to, that AI has a significant chance of causing massive harm, that it could “destroy us,” and that this might occur suddenly, “indications that we are in a pessimistic or near-pessimistic scenario may be sudden and hard to spot,” then one shouldn’t be opposing regulation which could, in principle, stop this from happening. We don’t necessarily get warning shots with AI; indeed, this is one of the main problems with building it “iteratively,” and one of the main problems with Anthropic’s “empirical” approach to AI safety. Because what Anthropic means by “a pessimistic scenario” is that “it’s simply an empirical fact that we cannot control or dictate values to a system that’s broadly more intellectually capable than ourselves.” Simply an empirical fact. And in what worlds do we learn this empirical fact without catastrophic outcomes?

I have to believe that Anthropic isn’t hoping to gain such evidence by way of catastrophes in fact occurring. But if they would like for such pre-harm evidence to have a meaningful impact, then it seems like having pre-harm regulation in place would be quite helpful. Because one of Anthropic’s core safety strategies rests on their ability to “sound the alarm”; indeed, this seems to account for something like ~33% of their safety profile, given that they believe “pessimistic scenarios” are around as likely as good, or only kind of bad, scenarios. And in “pessimistic” worlds, where alignment is essentially unsolvable and catastrophes are impending, their main fallback is to alert the world of this unfortunate fact so that we can “channel collective effort” towards some currently unspecified actions. But the sorts of actions that the world can take, at this point, will be quite limited unless we begin to prepare for them ahead of time.

Like, the United States government usually isn’t keen on shutting down or otherwise restricting companies on the basis of unrealized harm. And even if they were keen, I’m not sure how they would do this—legislation likely won’t work fast enough, and even if the President could sign an executive order to e.g. limit OpenAI from releasing or further creating their products, this would presumably be a hugely unpopular move without very strong evidence to back it up. And it’s pretty difficult for me to see what kind of evidence this would have to be, to take a move this drastic and this quickly. Anything short of the public witnessing clearly terrible effects, such as mass casualty, doesn’t seem likely to pass muster in the face of a political move this extreme.

But in a world where Anthropic is sounding alarms, they are presumably doing so before such catastrophes have occurred. Which is to say that without structures in place to put significant pressure on or outright stop AI companies on the basis of unrealized harm, Anthropic’s alarm sounding may not amount to very much. Such that pushing against regulation which is beginning to establish pre-harm standards makes Anthropic’s case for “sounding the alarm,” a large fraction of their safety profile, far weaker, imo. But I also can’t help but feel that these are not real plans; not in the beliefs-pay-rent kind of way, at least. It doesn’t seem to me that Anthropic has really gamed out what such a situation would look like in sufficient detail for it to be a remotely acceptable fallback in the cases where, oops, AI models begin to pose imminent catastrophic risk. I find this pretty unacceptable, and I think Anthropic’s opposition to this bill is yet another case where they are at best placing safety second fiddle, and at worst not prioritizing it meaningfully at all.

Zach Stein-Perlman's Shortform

aysja1mo60

I agree that scoring "medium" seems like it would imply crossing into the medium zone, although I think what they actually mean is "at most medium." The full quote (from above) says:

In other words, if we reach (or are forecasted to reach) at least “high” pre-mitigation risk in any of the considered categories, we will not continue with deployment of that model (by the time we hit “high” pre-mitigation risk) until there are reasonably mitigations in place for the relevant postmitigation risk level to be back at most to “medium” level.

I.e., I think what they’re trying to say is that they have different categories of evals, each of which might pass different thresholds of risk. If any of those are “high,” then they're in the "medium zone" and they can’t deploy. But if they’re all medium, then they're in the "below medium zone" and they can. This is my current interpretation, although I agree it’s fairly confusing and it seems like they could (and should) be more clear about it.

Zach Stein-Perlman's Shortform

aysja1mo70

Maybe I'm missing the relevant bits, but afaict their preparedness doc says that they won't deploy a model if it passes the "medium" threshold, e.g.:

Only models with a post-mitigation score of "medium" or below can be deployed. In other words, if we reach (or are forecasted to reach) at least “high” pre-mitigation risk in any of the considered categories, we will not continue with deployment of that model (by the time we hit “high” pre-mitigation risk) until there are reasonably mitigations in place.

The threshold for further developing is set to "high," though. I.e., they can further develop so long as models don't hit the "critical" threshold.

Optimistic Assumptions, Longterm Planning, and "Cope"

aysja1mo42

But you could just as easily frame this as "abstract reasoning about unfamiliar domains is hard therefore you should distrust doom arguments".

But doesn't this argument hold with the opposite conclusion, too? E.g. "abstract reasoning about unfamiliar domains is hard therefore you should distrust arguments about good-by-default worlds".

80,000 hours should remove OpenAI from the Job Board (and similar EA orgs should do similarly)

aysja2mo128

But whether an organization can easily respond is pretty orthogonal to whether they’ve done something wrong. Like, if 80k is indeed doing something that merits a boycott, then saying so seems appropriate. There might be some debate about whether this is warranted given the facts, or even whether the facts are right, but it seems misguided to me to make the strength of an objection proportional to someone’s capacity to respond rather than to the badness of the thing they did.
