Akash

Comments

Akash

Recent Senate hearing includes testimony from Helen Toner and William Saunders

  • Both statements are explicit about AGI risks & emphasize the importance of transparency & whistleblower mechanisms. 
  • William's statement acknowledges that he and others doubt that OpenAI's safety work will be sufficient.
    • "OpenAI will say that they are improving. I and other employees who resigned doubt they will be ready in time. This is true not just with OpenAI; the incentives to prioritize rapid development apply to the entire industry. This is why a policy response is needed."
  • Helen's statement provides an interesting paragraph about China at the end.
    • "A closing note on China: The specter of ceding U.S. technological leadership to China is often treated as a knock-down argument against implementing regulations of any kind. Based on my research on the Chinese AI ecosystem and U.S.-China technology competition more broadly, I think this argument is not nearly as strong as it seems at first glance. We should certainly be mindful of how regulation can affect the pace of innovation at home, and keep a close eye on how our competitors and adversaries are developing and using AI. But looking in depth at Chinese AI development, the AI regulations they are already imposing, and the macro headwinds they face leaves me with the conclusion that they are far from being poised to overtake the United States.6 The fact that targeted, adaptive regulation does not have to slow down U.S. innovation—and in fact can actively support it—only strengthens this point."

Full hearing here (I haven't watched it yet.)

Akash

Hm, good question. I think it should be proportional to the amount of time it would take to investigate the concern(s).

For this, I think 1-2 weeks seems reasonable, at least for an initial response.

Akash

Hendrycks had ample opportunity after initial skepticism to remove it, but chose not to.

IMO, this seems to demand a very immediate/sudden/urgent reaction. If Hendrycks ends up being wrong, I think he should issue some sort of retraction (and I think it would be reasonable to be annoyed if he doesn't.)

But I don't think the standard should be "you need to react to criticism within ~24 hours" for this kind of thing. If you write a research paper and people raise important concerns about it, I think you have a duty to investigate them and respond to them, but I don't think you need to fundamentally change your mind within the first few hours/days.

I think we should afford researchers the time to seriously evaluate claims/criticisms, reflect on them, and issue a polished statement (and potential retraction).

(Caveat that there are some cases where immediate action is needed– like EG if a company releases a product that is imminently dangerous– but I don't think "making an intellectual claim about LLM capabilities that turns out to be wrong" would meet my bar.)

Akash

I think it's bad for discourse for us to pretend that discourse doesn't have impacts on others in a democratic society.

I think I agree with this in principle. Possible that the crux between us is more like "what is the role of LessWrong."

For instance, if Bob wrote a NYT article titled "Anthropic is not publishing its safety research", I would be like "meh, this doesn't seem like a particularly useful or high-priority thing to be bringing to everyone's attention– there are like at least 10+ topics I would've much rather Bob spent his points on."

But LW generally isn't a place where you're going to get EG thousands of readers or have a huge effect on general discourse (with the exception of a few things that go viral or AIS-viral).

So I'm not particularly worried about LW discussions having big second-order effects on democratic society. Whereas LW can be a space for people to have a relatively low bar for raising questions, being curious, trying to understand the world, offering criticism/praise without thinking much about how they want to be spending "points", etc.

Akash

Is this where we think our pressuring-Anthropic points are best spent?

I think if someone has a 30-minute meeting with some highly influential and very busy person at Anthropic, it makes sense for them to have thought in advance about the most important things to ask & curate the agenda appropriately. 

But I don't think LW users should be thinking much about "pressuring-Anthropic points". I see LW primarily as a platform for discourse (as opposed to a direct lobbying channel to labs), and I think it would be bad for the discourse if people felt like they had to censor questions/concerns about labs on LW unless it met some sort of "is this one of the most important things to be pushing for" bar.

Akash

It seems to me like the strongest case for SB1047 is that it's a transparency bill. As Zvi noted, it's probably good for governments and for the world to be able to examine the Safety and Security Protocols of frontier AI companies.

But there are also some pretty important limitations. I think a lot of the bill's value (assuming it passes) will be determined by how it's implemented and whether or not there are folks in government who are able to put pressure on labs to be specific/concrete in their SSPs. 

More thoughts below:

Transparency as an emergency preparedness technique

I often think in an emergency preparedness frame– if there was a time-sensitive threat, how would governments be able to detect the threat & make sure information about the threat was triaged/handled appropriately? It seems like governments are more likely to notice time-sensitive threats in a world where there's more transparency, and forcing frontier AI companies to write/publish SSPs seems good from that angle. 

In my model, a lot of risk comes from the government taking too long to react– either so long that an existential catastrophe actually occurs or so long that by the time major intervention occurs, ASL-4+ models have been developed with poor security, and now it's ~impossible to do anything except continue to race ("otherwise the other people with ASL-4+ models will cause a catastrophe.") Efforts to get the government to understand the state of risks and intervene before ASL-4+ models are developed seem very important from that perspective. It seems to me like SSPs could accomplish this by (a) giving the government useful information and (b) making it "someone's job" to evaluate the state of SSPs + frontier AI risks.

Limitation: Companies can write long and nice-sounding documents that avoid specificity and concreteness

The most notable limitation, IMO, is that it's generally pretty easy for powerful companies to evade being fully transparent. Sometimes, people champion things like RSPs or the Seoul Commitments as these major breakthroughs in transparency. Although I do see these as steps in the right direction, their value should not be overstated. For example, even the "best" RSPs (OpenAI's and Anthropic's) are rather vague about how decisions will actually be made. Anthropic's RSP essentially says "Company leadership will ultimately determine whether something is too risky and whether the safeguards are adequate" (with the exception of some specifics around security). OpenAI's does a bit better IMO (from a transparency perspective) by spelling out the kinds of capabilities that they would consider risky, but they still provide company leadership ~infinite freedom RE determining whether or not safeguards are adequate.

Incentives for transparency are relatively weak, and the costs of transparency can be high. In Sam Bowman's recent post, he mentions that detailed commitments (and we can extend this to detailed SSPs) can commit companies to "needlessly costly busy work." A separate but related frame is that race dynamics mean that companies can't afford to make detailed commitments. If I'm in charge of an AI company, I'd generally like to have some freedom/flexibility/wiggle room in how I make decisions, interpret evidence, conduct evaluations, decide whether or not to keep scaling, and make judgments around safety and security.

In other words, we should expect that at least some (maybe all) of the frontier AI companies will try to write SSPs that sound really nice but provide minimal concrete details. The incentives to be concrete/specific are not strong, and we already have some evidence of this from existing RSPs/preparedness frameworks (and note again that companies other than OpenAI and Anthropic were even less detailed and concrete in their documents).

Potential solutions: Government capacity & whistleblower mechanisms

So what do we do about this? Are there ways to make SSPs actually promote transparency? If the government is able to tell that some companies are being vague/misleading in their SSPs, this could inspire further investigations/inquiries. We've already seen several Congresspeople send letters to frontier AI companies requesting more details about security procedures, whistleblower protections, and other safety/security topics.

So I think there are two things that can help: government capacity and whistleblower mechanisms.  

Government capacity. The Frontier Model Division (FMD) was cut, but perhaps the Board of Frontier Models could provide this oversight. At the very least, the Board could provide an audience for the work of people like @Zach Stein-Perlman and @Zvi– people who might actually read through a complicated 50+ page SSP full of corporate niceties and be able to distill what's really going on, what's missing, what's misleading, etc.

Whistleblower mechanisms. SB1047 provides a whistleblower mechanism & whistleblower protections (note: I see these as separate things and I personally think mechanisms are more important). Every frontier AI company has to have a platform through which employees (and contractors, I think?) are able to report if they believe the company is being misleading in its SSPs. This seems like a great accountability tool (though of course it relies on the whistleblower mechanism being implemented properly & relies on some degree of government capacity RE knowing how to interpret whistleblower reports.)

The final thing I'll note is that I think the idea of full shutdown protocols is quite valuable. From an emergency preparedness standpoint, it seems quite good for governments to be asking "under what circumstances do you think a full shutdown is required" and "how would we actually execute/verify a full shutdown."

Akash

@ryan_greenblatt one thing I'm curious about is when/how the government plays a role in your plan.

I think Sam is likely correct in pointing out that the influence exerted by you (as an individual), Sam (as an individual), or even Anthropic (as an institution) likely goes down considerably if/once governments get super involved.

I still agree with your point that having an exit plan is valuable (and indeed I do expect governments to be asking technical experts about their opinions RE what to do, though I also expect a bunch of DC people who know comparatively little about frontier AI systems but have long-standing relationships in the national security world will have a lot of influence.)

My guess is that you think heavy government involvement should occur before/during the creation of ASL-4 systems, since you're pretty concerned about risks from ASL-4 systems being developed in non-SL5 contexts.

In general, I'd be interested in seeing more about how you (and Buck) are thinking about policy stuff + government involvement. My impression is that you two have spent a lot of time thinking about how AI control fits into a broader strategic context, with that broader strategic context depending a lot on how governments act/react. 

And I suspect readers will be better able to evaluate the AI control plan if some of the assumptions/expectations around government involvement are spelled out more clearly. (Put differently, I think it's pretty hard to evaluate "how excited should I be about the AI control agenda" without understanding who is responsible for doing the AI control stuff, what's going on with race dynamics, etc.)

Akash

I liked this post (and think it's a lot better than official comms from Anthropic.) Some things I appreciate about this post:

Presenting a useful heuristic for RSPs

Relatedly, we should aim to pass what I call the LeCun Test: Imagine another frontier AI developer adopts a copy of our RSP as binding policy and entrusts someone who thinks that AGI safety concerns are mostly bullshit to implement it. If the RSP is well-written, we should still be reassured that the developer will behave safely—or, at least, if they fail, we should be confident that they’ll fail in a very visible and accountable way.

Acknowledging the potential for a pause

For our RSP commitments to function in a worst-case scenario where making TAI systems safe is extremely difficult, we’ll need to be able to pause the development and deployment of new frontier models until we have developed adequate safeguards, with no guarantee that this will be possible on any particular timeline. This could lead us to cancel or dramatically revise major deployments. Doing so will inevitably be costly and could risk our viability in the worst cases, but big-picture strategic preparation could make the difference between a fatal blow to our finances and morale and a recoverable one. More fine-grained tactical preparation will be necessary for us to pull this off as quickly as may be necessary without hitting technical or logistical hiccups.

Sam wants Anthropic to cede decision-making to governments at some point

[At ASL-5] Governments and other important organizations will likely be heavily invested in AI outcomes, largely foreclosing the need for us to make major decisions on our own. By this point, in most possible worlds, the most important decisions that the organization is going to make have already been made. I’m not including any checklist items below, because we hope not to have any.

Miscellaneous things I like

  • Generally just providing a detailed overview of "the big picture"– how Sam actually sees Anthropic's work potentially contributing to good outcomes. And not sugarcoating what's going on– being very explicit about the fact that these systems are going to become catastrophically dangerous, and EG "If we haven’t succeeded decisively on the big core safety challenges by this point, there’s so much happening so fast and with such high stakes that we are unlikely to be able to recover from major errors now."
  • Striking a tone that feels pretty serious/straightforward/sober. (In contrast, many Anthropic comms have a vibe of "I am a corporation trying to sell you on the fact that I am a Good Guy.")

Some limitations

  • "Nothing here is a firm commitment on behalf of Anthropic."
  • Not much about policy or government involvement, besides a little bit about scary demos. (To be fair, Sam is a technical person. Though I think the "I'm just a technical person, I'm going to leave policy to the policy people" attitude is probably bad, especially for technical people who are thinking/writing about macrostrategy.)
  • Not much about race dynamics, how to make sure other labs do this, whether Anthropic would actually do things that are costly or if race dynamics would just push them to cut corners. (Pretty similar to the previous concern but a more specific set of worries.)
  • Still not very clear what kinds of evidence would be useful for establishing safety or establishing risk. Similarly, not very clear what kinds of evidence would trigger Sam to think that Anthropic should pause or should EG invest ~all of its capital into getting governments to pause. (To be fair, no one really has great/definitive answers on this. But on the other hand, I think it's useful for people to start spelling out best-guesses RE what this would involve & just acknowledge that our ideas will hopefully get better over time.)

All in all, I think this is an impressive post and I applaud Sam for writing it. 

Akash

Thanks for sharing! Why do you think the CA legislators were more OK with pissing off Big Tech & Pelosi? (I mean, I guess Pelosi's statement didn't come until relatively late, but I believe there was still time for people in at least one chamber to change their votes.)

To me, the most obvious explanation is probably something like "Newsom cares more about a future in federal government than most CA politicians and therefore relies more heavily on support from Big Tech and approval from national Democratic leaders"– is this what's driving your model?

Akash

Why do people think there's a ~50% chance that Newsom will veto SB1047?

The base rate for vetoes is about 15%. Perhaps the base rate for controversial bills is higher. But it seems like SB1047 hasn't been very controversial among CA politicians.

Is the main idea here that Newsom's incentives are different than those of state politicians because Newsom has national ambitions? So therefore he needs to cater more to the Democratic Party Establishment (which seems to oppose SB1047) or Big Tech? (And then this just balances out against things like "maybe Newsom doesn't want to seem soft on Big Tech, maybe he feels like he has more to lose by deviating from what the legislature wants, the polls support SB1047, and maybe he actually cares about increasing transparency into frontier AI companies"?)

Or are there other factors that are especially influential in peoples' models here?

(Tagging @ryan_greenblatt, @Eric Neyman, and @Neel Nanda because you three hold the largest No positions. Feel free to ignore if you don't want to engage.)  
