LESSWRONG
is fundraising!
LW

davekasten's Shortform — LessWrong

davekasten's Shortform

1st May 2024

1 min read

2

This is a special post for quick takes by davekasten. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

Mentioned in

58AI #86: Just Think of the Potential

28Whether governments will control AGI is important and neglected

davekasten's Shortform

175davekasten

25gjm

3StanislavKrym

1Sheikh Abdur Raheem Ali

2Alexander Gietelink Oldenziel

109 comments, sorted by

top scoring

Click to highlight new comments since: Today at 1:08 AM

Some comments are truncated due to high volume. (⌘F to expand all)Change truncation settings

[-]davekasten3mo17543

Heads up -- if you're 1. on a H1-B visa AND 2. currently outside the US, there is VERY IMPORTANT, EXTREMELY TIME SENSITIVE stuff going on that might prevent you from getting back into the US after 21 September.

If this applies to you, immediately stop looking at LessWrong and look at the latest news. (I'm not providing a summary of it here because there are conflicting stories about who it will apply to and it's evolving hour by hour and I don't want this post to be out of date)

[-]gjm3mo250

USCIS says this does not apply to existing H1-B visas, only to new applications: https://www.uscis.gov/sites/default/files/document/memos/H1B_Proc_Memo_FINAL.pdf

(I am not a lawyer nor a spokesperson for the US government and cannot advise on how likely it is that they will somehow backtrack on this.)

3StanislavKrym3mo

So what does it mean about the AI race and the harm to the AI companies caused by Trump's decisions? My quick take about the AI-2027 scenario shows that the modifiers will need to reconsider the forecasts related to compute available, timelines when the superhuman coder will be invented and the security forecast. Unfortunately, I lack the data necessary to understand how the timelines of creating superhuman coders will be affected and the understanding of American politics. I would guess that even the wargame is to be reconsidered in order to account for changes related to compute, to human geniuses and to the difficulty of creating factories at home... Arguably the worst-case scenarios would be the USA having dumb researchers, but more compute than China, rushing into the intelligence explosion, forcing China to race and misaligning both the American AI (stolen from China or invented later and doing automated research faster) and the Chinese AI as a result of the race. Or China being far closer to the USA in terms of compute and stealing the American supercoder. So, if there was no crisis in the USA or if China was the monopolist in the AI race, then mankind's chances of survival would be far from zero, but a special set of circumstances could arguably nullify them.

1Sheikh Abdur Raheem Ali3mo

For some H-1B folks with strong evidence (publications, awards, press, critical roles), O-1 may be an option—talk to your employer/immigration counsel. Not legal advice. (Previous version of this comment which got downvoted) I think that most people reading this would be eligible for an O1 visa. Please feel free to send me a lesswrong DM if you need help finding a lawyer or a connection to someone that can write a support letter for you.

3AnnaJo3mo

a lot of h1b folks are engineers, not researchers. and a lot of the O-1 requirements are academic/research-oriented, or at least it's marginally easier if you're a researcher

2[comment deleted]3mo

[-]davekasten9mo*13011

The Trump administration (or, more specifically, the White House Office of Science and Technology Policy, but they are in the lead on most AI policy, it seems), are asking for comment on what their AI Action Plan should include. Literally anyone can comment on it. You should consider commenting on it, comments are due Saturday at 8:59pm PT/11:59pm ET via an email address. These comments will actually be read, and a large number of comments on an issue usually does influence any White House's policy. I encourage you to submit comments!

regulations.gov/document/NSF_FRDOC_0001-3479… (Note that all submissions are public and will be published)

(Disclosure: I am working on a submission for this for my dayjob but this particular post is in my personal capacity)

(Edit note: I originally said this was due Friday; I cannot read a calendar, it is in fact due 24 hours later. Consider this a refund that we have all received for being so good at remembering the planning fallacy all these years.)

[-]peterslattery9mo*139

In the future, there should be some organization or some group of individuals in the LW community who raise awareness about these sorts of opportunities and offer content and support to ensure submissions from the most knowledgeable and relevant actors. This seems like a very low-hanging fruit and is something several groups I know are doing.

6Seth Herd9mo

Edit: I finished that post on this topic: Whether governments will control AGI is important and neglected. I'm hoping for discussion on that post, and quite ready to change my draft comment, or not submit one, based on those arguments. After putting a bunch of thought into it, my planned comment will recommend forming a committee that can work in private to investigate the opportunities and risks of AI development, to inform future policy. I will note that this was Roosevelt's response to Einstein's letter on the potential of nuclear weaponry. I hope that such a committee will conclude that yeah, there are some big dangers on expectation. I will emphasize the disagreement among experts, and suggest that the sane thing to do is put real effort into sorting out the many conflicting claims and possibilities, while also pursuing our current best guesses. I think any request for a slowdown is wasted, given the request's note about reducing regulatory barriers. But I will note that there are dangers to both our economy from potential rapid job loss, and large security risks from adversaries stealing or copying our AI, such that we may be currently building tools and weapons that will be used against us. I think I will not emphasize x-risk, and may not even include it. But I will probably mention that predictions of reaching human-level autonomous operation are very mixed, so we're not sure how far we are from creating what's effectively a new intelligent species. I'm hoping that triggers the right intuitions of danger. Again, I'm highly uncertain and very open to changing my mind on what to say. Original comment: This raises the question: what should we say? Fortunately, I've almost finished a post about this. It analyzes many aspects of the question "do we want governments to recognize the potential of AGI?". Unfortunately, it doesn't answer the question. There are strong points on both sides, and it needs more careful thought. Nonetheless, I'll probably get it

[-]davekasten9mo149

I think on net, there are relatively fewer risks related to getting governments more AGI-pilled vs. them continuing on their current course; governments are broadly AI-pilled even if not AGI/ASI-pilled and are doing most of the accelerating actions an AGI-accelerator would want.

9Seth Herd9mo

I wasn't able to finish that post in the few minutes I've got so far today, so here's the super short version. I remain highly uncertain whether my comments will include any mention of AGI. (Edit: I finally finished it: Whether governments will control AGI is important and neglected) I think whether AGI-pilling governments is a good idea is quite complex. Pushing the government to become aware of AGI x-risks will probably decelerate progress, but it could even accelerate it if the conclusion is "build it first, don't worry we'll be super careful when we get close". Even if it does help with alignment, it's not necessarily net good. If governments take control early enough to prevent proliferation of AGI, that helps a lot with the risks of misalignment and catastrophic misuse. The US could even cooperate with China to prevent proliferation to other countries and to nongovermental groups, just as the US cooperated with Russia on nuclear nonproliferation. But government control also raises the risks of power concentration. Intent-aligned AGI in untrustworthy hands could create a permanent dictatorship and unbreakable police state. The current governments of both the US and China don't seem like the best types to control the future. So it's a matter of balancing Fear of centralized power vs. fear of misaligned AGI. This also needs to be balanced agains the possibility of misuse of intent-aligned AGI if it does proliferate broadly; see If we solve alignment, do we die anyway? If I had a firm estimate of how hard technical alignment is, I'd have a better answer. But I don't, and I think the best objective conclusion, taking in all of the arguments made to date and the very wide variance in opinion even among those who've thought deeply about it, is that nobody has a very good estimate. (Edit: I mean estimates between very very hard and modestly tricky. I don't know of anyone who's addressed the hard parts and concluded that it happens by default.) Neither do we h

[-]davekasten2y4511

Epistemic status: not a lawyer, but I've worked with a lot of them.

As I understand it, an NDA isn't enforceable against a subpoena (though the former employer can seek a protective order for the testimony). Someone should really encourage law enforcement or Congress to subpoena the OpenAI resigners...

[-]metachirality2y103

A subpoena for what?

[-]davekasten1y4413

Okay, I spent much more time with the Anthropic RSP revisions today. Overall, I think it has two big thematic shifts for me:

1. It's way more "professionally paranoid," but needs even more so on non-cyber risks. A good start, but needs more on being able to stop human intelligence (i.e., good old fashioned spies)

2. It really has an aggressively strong vibe of "we are actually using this policy, and We Have Many Line Edits As A Result." You may not think that RSPs are sufficient -- I'm not sure I do, necessarily -- but I am heartened slightly that they genuinely seem to take the RSP seriously to the point of having mildly-frustrated-about-process-hiccup footnoes about it. (Free advice to Anthropic PR: interview a bunch of staff about this on camera, cut it together, and post it, it will be lovely and humanizing and great recruitment material, I bet).

[-]davekasten2y3513

I think one thing that is poorly-understood by many folks outside of DC is just how baseline the assumption is that China is a by-default faithless negotiating partner and that by-default China will want to pick a war with America in 2027 or later.

(I am reporting, not endorsing. For example, it is deeply unclear to me why we should take another country's statements about the year they're gonna do a war at surface level)

[-]Thomas Kwa2y*1611

"want to pick a war with America" is really strange wording because China's strategic goals are not "win a war against nuclear-armed America", but things like "be able to control its claims in the South China Sea including invading Taiwan without American interference". Likewise Russia doesn't want to "pick a war with the EU" but rather annex Ukraine; if they were stupid enough to want the former they would have just bombed Paris. I don't know whether national security people relate to the phrasing the same way but they do understand this.

1davekasten2y

I totally understand your point, agree that many folks would use your phrasing, and nonetheless think there is something uniquely descriptively true about the phrasing I chose and I stand by it.

3William Riker2y

Has China has made a statment about starting a war in 2027 or later? Who exactly is the belief that "by-default China will want to pick a war with America in 2027 or later" held by and how confident are you that they hold it?

2Garrett Baker2y

It is supposedly their goal for when they will have modernized their military.

1William Riker2y

Thanks for the link! The one mention of starting war was a quote from this 2006 white paper: Is this what you're referring to or did I miss something?

4davekasten2y

The general belief in Washington is that Xi Jinping has ordered his military to be ready to invade Taiwan by then. (See, e.g., https://www.reuters.com/world/china/logistics-war-how-washington-is-preparing-chinese-invasion-taiwan-2024-01-31/ )

2Nathan Helm-Burger2y

Sufficient AI superiority will mean overwhelming military superiority. If we remain ahead in AI it won't matter what other countries do. I expect this effect will dominate the strategic landscape by 2027.

1davekasten2y

Say more ?

4Garrett Baker2y

No, the belief is that China isn’t going to start a war before it has a modernized military, and they plan to have a modernized military by 2027. Therefore they won’t start a war before 2027. China has also been drooling over Taiwan for the past 100 years. Thus, if you don’t think diplomatic or economic ties mean much to them, and they’ll contend with the US’s military might before 2027, and neither party will use nukes in such a conflict, then you expect a war after 2027.

1William Riker2y

Ah, I misread your comment. Thanks for clarifying!

2Garrett Baker2y

I don't think they have stated they'll to to war after 2027. 2027 is the year of their "military modernization" target.

[-]davekasten1y288

It's a small but positive sign that Anthropic sees taking 3 days beyond their RSP's specified timeframe to conduct a process without a formal exception as an issue. Signals that at least some members of the team there are extremely attuned to normalization of deviance concerns.

[-]davekasten10mo254

Ok, so it seems clear that we are, for better or worse, likely going to try to get AGI to do our alignment homework.

Who has thought through all the other homework we might give AGI that is as good of an idea, assuming a model that isn't an instant-game-over for us? E.G., I remember @Buck rattling off a list of other ideas that he had in his The Curve talk, but I feel like I haven't seen the list of, e.g., "here are all the ways I would like to run an automated counterintelligence sweep of my organization" ideas.

(Yes, obviously, if the AI is sneakily misaligned, you're just dead because it will trick you into firing all your researchers, etc.; this is written in a "playing to your outs" mentality, not an "I endorse this as a good plan" mentality.)

[-]Buck10mo142

@ryan_greenblatt is working on a list of alignment research applications. For control applications, you might enjoy the long list of control techniques in our original post.

5[anonymous]10mo

How to build a lie detector app/program to release to the public (preferably packaged with advice/ideas on ways to use and strategies for marketing the app, e.g. packaging it with an animal body-language to english translator).

4Thane Ruthenis10mo

Technology for efficient human uploading. Ideally backed by theory we can independently verify as correct and doing what it's intended to do (rather than e. g. replacing the human upload with a copy of the AGI who developed this technology).

3Ebenezer Dukakis10mo

I think unlearning could be a good fit for automated alignment research. Unlearning could be a very general tool to address a lot of AI threat models. It might be possible to unlearn deception, scheming, manipulation of humans, cybersecurity, etc. I challenge you to come up with an AI safety failure story that can't, in principle, be countered through targeted unlearning in some way, shape, or form. Relative to some other kinds of alignment research, unlearning seems easy to automate, since you can optimize metrics for how well things have been unlearned. I like this post.

2Noosphere8910mo

The big one probably has to do with being able to corrupt the metrics so totally that whatever you think you made them unlearn actually didn't happen, or just being able to relearn the knowledge so fast that unlearning doesn't matter, but yes unlearning is a very underrated direction for AI automation, because it targets so many threat models. It also satisfies the property of addressing a bottleneck (in this case, capabilities being so dangerous as to threaten any test), and while I wouldn't call it the best, it's still quite underrated how much unlearning will be useful. Similarly, domain-limiting AIs would be quite useful for control of AI.

3Ebenezer Dukakis10mo

I favor proactive approaches to unlearning which prevent the target knowledge from being acquired in the first place. E.g. for gradient routing, if you can restrict "self-awareness and knowledge of how to corrupt metrics" to a particular submodule of the network during learning, then if that submodule isn't active, you can be reasonably confident that the metrics aren't currently being corrupted. (Even if that submodule sandbags and underrates its own knowledge, that should be fine if the devs know to be wary of it. Just ablate that submodule whenever you're measuring something that matters, regardless of whether your metrics say it knows stuff!) Some related thoughts here Unlearning techniques should probably be battle-tested in low-stakes "model organism" type contexts, where metrics corruption isn't expected. Curious what areas you are most excited about!

3Noosphere8910mo

I basically agree with this, and on this question: My big areas of excitement are AI control (in a broad sense) and synthetic dataset making for AI alignment of successors.

2Quinn10mo

I'm working on making sure we get high quality critical systems software out of early AGI. Hardened infrastructure buys us a lot in the slightly crazy story of "self-exfiltrated model attacks the power grid", but buys us even more in less crazy stories about all the software modules adjacent to AGI having vulnerabilities rapidly patched at crunchtime.

1yams10mo

Preliminary thoughts from Ryan Greenblatt on this here.

[-]davekasten1y246

At LessOnline, there was a big discussion one night around the picnic tables with @Eliezer_Yudkovsky , @habryka , and some interlocutors from the frontier labs (you'll momentarily see why I'm being vague on the latter names).

One question was: "does DC actually listen to whistleblowers?" and I contributed that, in fact, DC does indeed have a script for this, and resigning in protest is a key part of it, especially ever since the Nixon years.

Here is a usefully publicly-shareable anecdote on how strongly this norm is embedded in national security decision-making, from the New Yorker article "The U.S. Spies Who Sound the Alarm About Election Interference" by David Kirkpatrick, Oct 21, 2024:
(https://archive.ph/8Nkx5)

The experts’ chair insisted that in this cycle the intelligence agencies had not withheld information “that met all five of the criteria”—and did not risk exposing sources and methods. Nor had the leaders’ group ever overruled a recommendation by the career experts. And if they did? It would be the job of the chair of the experts’ group to stand up or speak out, she told me: “That is why we pick a career civil servant who is retirement-eligible.” In other words, she can resign in protest.

[-]gwern1y*146

Also of relevance is the wave of resignations from the DC newspaper The Washington Post the past few days over Jeff Bezos suddenly exerting control.

6davekasten1y

Yup. The fact that the profession that writes the news sees "I should resign in protest" as their own responsibility in this circumstance really reveals something.

[-]davekasten4mo201

I'm about to embark on the classic exercise of "think a bunch about AI policy."

Does anyone actually have an up to date collection of "here are all the existing AI safety policy proposals out there"?

(Yes, I know, your existing proposal is already great and we should just implement it as-is. Think of the goal of this exercise being to convince someone else who needs to see a spreadsheet of "here are all the ideas, here is why idea number three is the best one")

[-]davekasten9mo205

The elites do want you to know it: you can just email a Congressional office and get a meeting

6Seth Herd9mo

Would I have to go to DC? Because I hate going to DC. Not that I wouldn't to save the world, but I'd want to be sure it was necessary. Only partly kidding. Maybe if people got a rationalist enclave in DC going we'd be less averse?

[-]davekasten9mo147

You can definitely meet your own district's staff locally (e.g., if you're in Berkeley, Congresswoman Simon has an office in Oakland, Senator Padilla has an office in SF, and Senator Schiff's offices look not to be finalized yet but undoubtedly will include a Bay Area Office).

You can also meet most Congressional offices' staff via Zoom or phone (though some offices strongly prefer in-person meetings).

There is also indeed a meaningful rationalist presence in DC, though opinions vary as to whether the enclave is in Adams Morgan-Columbia Heights, Northern Virginia, or Silver Spring.*

*This trichotomy is funny, but hard to culturally translate unless you want a 15,000 word thesis on DC-area housing and federal office building policy since 1945 and its related cultural signifiers. Just...just trust me on this.

2Alexander Gietelink Oldenziel9mo

The people require it, sir.

[-]davekasten1y150

Basic Q: has anyone written much down about what sorts of endgame strategies you'd see just-before-ASI from the perspective of "it's about to go well, and we want to maximize the benefits of it" ?

For example: if we saw OpenPhil suddenly make a massive push to just mitigate mortality at the cost of literally every other development goal they have, I might suspect that they suspect that we're about to all be immortal under ASI, and they're trying to get as many people possible to that future...

7Linch1y

My guess is that we wouldn't actually know with high confidence before (and likely even some time after) things-will-definitely-be-fine. E.g. 3 months after safe ASI people might still be publishing their alignment takes.

1davekasten1y

Oh, to be clear I'm not sure this is at all actually likely, but I was curious if anyone had explored the possibility conditional on it being likely

5Seth Herd1y

Endgame strategies from who? A lot of powerful people would focus on being the ones to control it when it happens, so they'd control the future - and not be subject to some else's control of the future. OpenPhil is about the only org that would think first of the public benefit and not the dangers of other humans controlling it. And not a terribly powerful org, particularly relative to governments.

4davekasten1y

I was being intentionally broad, here. I am probably less interested for purposes of this particular post only in the question of "who controls the future" swerves and more about "what else would interested, agentic actors do" questions. It is not at all clear to me that OpenPhil is the only org who feels this way -- I can think of several non-EA-ish charities that if they genuinely 100% believed "none of the people you care for will die of the evils you fight if you can just keep them alive for the next 90 days" would plausibly do some interestingly agentic stuff.

[-]davekasten1y152

A random observation from a think tank event last night in DC -- the average person in those rooms is convinced there's a problem, but that it's the near-term harms, the AI ethics stuff, etc. The highest-status and highest-rank people in those rooms seem to be much more concerned about catastrophic harms.

This is a very weird set of selection effects. I'm not sure what to make of it, honestly.

6habryka1y

Random psychologizing explanation that resonates most with me: Claiming to address big problems requires high-status. A low-rank person is allowed to bring up minor issues, but they are not in a position to bring up big issues that might reflect on the status of many high-status people. This is a pretty common phenomenon that I've observed. Many people react with strong social slap-down motions if you (for example) call in question whether the net-effect of a whole social community or economic sector is negative, where the underlying cognitive reality seems similar to "you are not high status enough to bring forward this grievance".

1davekasten1y

I think this is plausibly describing some folks! But I also think there's a separate piece -- I observe, with pretty high odds that it isn't just an act, that at least some people are trying to associate themselves with the near-term harms and AI ethics stuff because they think that is the higher-status stuff, despite direct obvious evidence that the highest-status people in the room disagree.

6Dagon1y

There are (at least) two models which could partially explain this: 1) The high-status/high-rank people have that status because they're better at abstract and long-term thinking, and their role is more toward preventing catastrophe rather than nudging toward improvements. They leave the lesser concerns to the underlings, with the (sometimes correct) belief that it'll come out OK without their involvement. 2) The high-status/high-rank people are rich and powerful enough to be somewahat insulated from most of the prosaic AI risks, while the average member can legitimately be hurt by such things. So everyone is just focusing on the things most likely to impact themselves. edit: to clarify, these are two models that do NOT imply the obvious "smarter/more powerful people are correctly worried about the REAL threats, and the average person's concerns are probably unimportant/uninformed". It's quite possible that this division doesn't tell us much about the relative importance of those different risks.

1davekasten1y

Yup1 I think those are potentially very plausible, and similar things were on my short list of possible explanations. I would be very not shocked if those are the true reasons. I just don't think I have anywhere near enough evidence yet to actually conclude that, so I'm just reporting the random observation for now :)

1ZY1y

Does "highest status" here mean highest expertise in a domain generally agreed by people in that domain, and/or education level, and/or privileged schools, and/or from more economically powerful countries etc? It is also good to note that sometimes the "status" is dynamic, and may or may not imply anything causal with their decision making or choice on priorities. One scenario is "higher status" might correlates with better resources to achieve those statuses, and a possibility is as a result they haven't experienced or they are not subject to many near-term harms. In other words, it is not really about the difference between "average" and "high status"'s people's intelligence, but more about what kind of world they are exposed to. I do think it is good to hear all different perspectives to stay curious/open-minded. edit: I just saw Dragon nicely listed two potential reasons, with scenario 2 mentioning something similar with my comment here. But something slightly specific in my thinking, is that these choices made by "average" and "high status" people may or may not be conscious, but rather from the experience from their lives and the world they are exposed to.

1davekasten1y

I mean, functionally all of those things. (Well, minus the country dynamic. Everyone at this event I talked to was US, UK, or Canadian, which is all sorta one team for purposes of status dynamics at that event)

2[comment deleted]1y

[-]davekasten8mo140

We're hiring at ControlAI for folks who walk to work on UK and US policy advocacy. Come talk to Congress and Parliament and stop risks from unsafe superintelligences! controlai.com/careers

(Admins: I don't tend to see many folks posting this sort of thing here, so feel free to nuke this post if not the sort of content you're going for. But given audience here, figured might be of interest)

[This comment is no longer endorsed by its author]Reply

1Felix Choussat8mo

Thank you for posting this. Are there any opportunities for students about to graduate to apply themselves, particularly without a C.S background? My undergraduate experience was focused on Business and IR (Cold War history, Sino-U.S relations) before I pivoted my long term focus to AI safety policy, and it's been difficult to find good entry points for EA work in this field as a new grad. I've been monitoring 80,000 hours and applying to research fellowships where I can so far, but I'm always looking for new positions. If you or anyone else knows an org looking to onboard some fresh talent, I'd be happy to help. Edit: Application submitted.

[-]davekasten2mo110

Zach Stein-Perlman's recent quick take is confusing. It just seems like an assertion, followed by condemnation of Anthropic conditioned on us accepting his assertion blindly as true.

It is definitely the case that "insider threat from a compute provider" is a key part of Anthropic's threat model! They routinely talk about it in formal and informal settings! So what precisely is his threat model here that he thinks they're not defending adequately against?

(He has me blocked from commenting on his posts for some reason, which is absolutely his right, but insofar as he hasn't blocked me from seeing his posts, I wanted to explicitly register in public my objection to this sort of low-quality argument.)

[-]Zach Stein-Perlman2mo*210

I agree that whether Anthropic has handled insider threat from compute providers is a crux. My guess is that Anthropic and humans-at-Anthropic wouldn't claim to have handled this (outside of the implicit claim for ASL-3) and they would say something more like that's out of scope for ASL-3 or oops.

Separately, I just unblocked you. (I blocked you because I didn't like this thread in my shortform, not directly to stifle dissent. I have not blocked anyone else. I mention this because hearing about disagreement being hidden/blocked should make readers suspicious but that's mostly not correct in this case.)

Edit: also, man, I tried to avoid "condemnation" and I think I succeeded. I was just making an observation. I don't really condemn Anthropic for this.

2davekasten2mo

To be clear, I think people should feel free to block freely and for any reason, including literally no reason at all. I'm open to ways of describing people's block decisions in the future that better convey that, but I definitely didn't think others reading this would assume "oh, Zach's the bad guy here" as opposed to the reverse.

[-]Buck2mo*1311

I basically agree with Zach that based on public information it seems like it would be really hard for them to be robust to this and it seems implausible that they have justified confidence in such robustness.

I agree that he doesn't say the argument in very much depth. Obviously, I think it'd be great if someone made the argument in more detail. I think Zach's point is a positive contribution even though it isn't that detailed.

[-]habryka2mo1112

I am a bit confused why it's an assertion and not an "argument"? The argument is relatively straightforward:

Anthropic is currently shipping its weights to compute providers for inference. Those compute providers almost certainly do not comply with Anthropic's ASL-3 security standard, and the inference setup is likely not structured in a way that makes it impossible for the compute provider somehow get access to the weights if they really wanted. This means Anthropic is violating its RSP, as their ASL-3 security standard required them to be robust against this kind of attack.

It is true that "insider threat from a compute provider" is a key part of Anthropic's threat model! Anthropic is clearly not unaware of this attack chain. Indeed, in the whitepaper linked in Zach's shortform they call for various changes that would need to happen at compute providers to enable a zero-trust relationship here, but also implicitly in calling for these changes they admit that they are very likely not currently in place!

My guess is what happened here is that at least some people at Anthropic are probably aware that their RSP commits them to a higher level of security than they can currently rea... (read more)

2davekasten2mo

Where's the "almost certainly" coming from? I feel like everyone responding to this is seeing something I'm not seeing.

7habryka2mo

I mean, computer security without physical access to servers is already an extremely hard problem. Defending against attackers with basically unlimited physical access to devices with your weights on, where those weights need to be at least temporarily decrypted for inference is close to impossible. You are welcome to go and ask security professionals in the field whether they would have any hope defending against a dedicated insider at a major compute provider. Beyond that, Anthropic has also released a report where they specify in a lot of detail what would be necessary for a zero trust relationship between compute provider and developer here. They call out many things that current compute providers are missing in order to establish such a relationship. Beyond that, I am sure you can probably also just ask someone senior at Anthropic if you run into them about whether they think Anthropic is robust to attacks like this. My guess is they will just straightforwardly tell you they are not.

2davekasten2mo

I'm really not following your argument here. Of course in many instances compute providers don't offer zero trust relationships with those running on their systems. This is just not news. There's a reason why we have an entire universe of compensating technical and non-technical controls to mitigate risk in such circumstances. You have done zero analysis to identify any reason to believe that those compensating controls are insufficient. You could incredibly easily get me to flip sides in this discussion if you offered any of that, but simply saying that someone's running zero trust isn't sufficient. As a hypothetical, if Anthropic is expending meaningful effort to be highly confident that they can ensure that Amazon's own security processes are securing against insiders, they would have substantial risk reduction (as long as they can have high confidence said processes are continuing to be executed. Separately, though it probably cuts against my argument above [1], I would politely disagree with the perhaps-unintended implication in your comment above that "implement zero trust" is a sufficient definition of defenses to defend against compute providers like Amazon, MSFT, etc. After all, Anthropic's proper threat modeling of them should include things like, "Amazon, Microsoft, etc. employ former nation-state hackers who considered attacking zero trust networks to be part of the cost of doing business." [1] Scout mindset, etc.

[-]habryka2mo*132

Anthropic has released their own whitepaper where they call out what kind of changes would need to be required. Can you please engage with my arguments?

I have now also heard from 1-2 Anthropic employees about this. The specific people weren't super up-to-date on what Anthropic is doing here, and didn't want to say anything committal, but nobody I talked to had a reaction that suggested that they thought it was likely that Anthropic is robust to high-level insider threats at compute providers.

Like, if you want you can take a bet with me here, I am happy to offer you 2:1 odds on the opinion of some independent expert we both trust on whether Anthropic is likely robust to that kind of insider threat. I can also start going into all the specific technical reasons, but that would require restating half of the state of the art of computer security, which would be a lot of work. I really think that not being robust here is a relatively straightforward inference that most people in the field will agree with (unless Anthropic has changed operating procedure substantially from what is available to other consumers in their deals with cloud providers, which I currently think is unlikely, but not impossible).

2davekasten2mo

By the "whitepaper," are you referring to the RSP v2.2 that Zach linked to, or something else? If so, I don't understand how a generic standard can "call out what kind of changes would need to be required" to their current environment if they're also claiming they meet their current standard. Also, just to cut a little more to brass tacks here, can you describe the specific threat model that you think they are insufficiently responding to? By that, I don't mean just the threat actor (insiders within their compute provider) and their objective to get weights, but rather the specific class or classes of attacks that you expect to occur, and why you believe that existing technical security + compensating controls are insufficient given Anthropic's existing standards. For example, AIUI the weights aren't just sitting naively decrypted at inference, they're running inside a fairly locked down trusted execution environment, with keys provided only as-needed (probably with an ephemeral keying structure?) from a HSM, and those trusted execution environments are operating inside a physical security perimeter of a data center that already is designed to mitigate insider risk. Which parts of this are you worried are attackable? To what degree are organizational boundaries between Anthropic and its compute providers salient to increasing this risk? Why should we expect that the compute providers don't already have sufficient compensatory controls here, given that, e.g., these compute providers also provide classified compute to the US government that is secured at the Top Secret / SCI level and presumably therefore have best-in-class anti-insider-threat capabilities? I'm extremely willing to buy into a claim that they're not doing enough, but I would actually need to have an argument here that's more specific.

6habryka2mo

No, I am referring to this, which Zach linked in his shortform: https://assets.anthropic.com/m/c52125297b85a42/original/Confidential_Inference_Paper.pdf I don't have a complaint that Anthropic is taking insufficient defenses here. I have a complaint that Anthropic said they would do something in their RSP that they are not doing. The threat is really not very complicated, it's basically: * A high-level Amazon executive decides they would like to have a copy of Claude's weights * They physically go to an inference server and manipulate the hardware to dump the weights unencoded to storage (or read out the weights from a memory bus directly, or use one of the many attacks available to you if you have unlimited hardware access) Nobody in computer security I have ever talked to thinks that data in memory in normal cloud data centers would be secure against physical access from high-level executives who run the datacenter. One could build such a datacenter, but the normal ones are not that! The top-secret cloud you linked to might be one of those, but my understanding is that Claude weights are deployed just to like normal Amazon datacenters, not only the top-secret governance ones.

2MichaelDickens2mo

Perhaps I'm misunderstanding your objection but I think the issue is that Claude is hosted on AWS servers (among other places), which means Amazon could steal Claude's model weights if it wanted to, and ASL-3 states that Claude needs to be secure against theft by other companies (including Amazon). I don't know for sure that Zach's assertion is true, but I'm reasonably confident that a dedicated Amazon security team could steal the contents of any AWS server if they really wanted to.

2davekasten2mo

Huh? Simply using someone else's hosting doesn't mean that Amazon has a threat-modeled ability to steal Claude's model weights. For example, it could be the case (not saying it is, this is just illustrative) that Amazon has given Anthropic sufficient surveillance capabilities inside their data centers that combined with other controls the risk is low.

[-]davekasten1y108

I really dislike the term "warning shot," and I'm trying to get it out of my vocabulary. I understand how it came to be a term people use. But, if we think it might actually be something that happens, and when it happens, it plausibly and tragically results the deaths of many folks, isn't the right term "mass casualty event" ?

[-]habryka1y103

I think many mass casualty events would be warning shots, but not all warning shots would be mass casualty events. I think an agentic AI system getting most of the way towards escaping containment or a major fraud being perpetrated by an AI system would both be meaningful warning shots, but wouldn't involve mass casualties.

I do agree with what I think you are pointin at, which is that there is something Orwellian about the "warning shot" language. Like, in many of these scenarios we are talking about large negative consequences, and it seems good to have a word that owns that (in-particular in as much as people are thinking about making warning shots more likely before an irrecoverable catastrophe occurs).

4davekasten1y

I totally think it's true that there are warning shots that would be non-mass-casualty events, to be clear, and I agree that the scenarios you note could maybe be those. (I was trying to use "plausibly" to gesture at a wide range of scenarios, but I totally agree the comment as written isn't clearly meaning that). I don't think folks intended anything Orwellian, just sort of something we stumbled into, and heck, if we can both be less Orwellian and be more compelling policy advocates at the same time, why not, I figure.

5robo1y

I think a lot of people losing their jobs would probably do the trick, politics-wise. For most people the crux is "will AIs will be more capable than humans", not "might AIs more capable than humans be dangerous".

2davekasten1y

You know, you're not the first person to make that argument to me recently. I admit that I find it more persuasive than I used to. Put another way: "will AI take all the jobs" is another way of saying* "will I suddenly lose the ability to feed and protect those I love." It's an apocalypse in microcosm, and it's one that doesn't require a lot of theory to grasp. *Yes, yes, you could imagine universal basic income or whatever. Do you think the average person is Actually Expecting to Get That ?

[-]davekasten1y98

Has anyone thought about the things that governments are uniquely good at when it comes to evaluating models?

Here are at least 3 things I think they have as benefits:
1. Just an independent 3rd-party perspective generally

2. The ability to draw insights across multiple labs' efforts, and identify patterns that others might not be able to

3. The ability to draw on classified threat intelligence to inform its research (e.g., Country X is using model Y for bad behavior Z) and to test the model for classified capabilities (bright line example: "can you design an accurate classified nuclear explosive lensing arrangement").

Are there others that come to mind?

[-]davekasten11mo85

One point that maybe someone's made, but I haven't run across recently: if you want to turn AI development into a Manhattan Project, you will by-default face some real delays from the reorganization of private efforts into one big national effort. In a close race, you might actually see pressures not to do so, because you don't want to give up 6 months to a year on reorg drama -- so in some possible worlds, the Project is actually a deceleration move in the short term, even if it accelerates in the long term!

3Nathan Helm-Burger11mo

This is a point that's definitely come up in private discussions I've been a part of. I don't remember if I saw it said publicly somewhere.

3davekasten11mo

I am (sincerely!) glad that this is obvious to other people too and that they are talking about it already!

[-]davekasten1y70

It seems like the current meta is to write a big essay outlining your opinions about AI (see, e.g., Gladstone Report, Situational Awareness, various essays recently by Sam Altman and Dario Amodei, even the A Narrow Path report I co-authored).

Why do we think this is the case?
I can imagine at least 3 hypotheses:
1. Just path-dependence; someone did it, it went well, others imitated

2. Essays are High Status Serious Writing, and people want to obtain that trophy for their ideas

3. This is a return to the true original meaning of an essay, under Mont... (read more)

[-]gwern1y*1910

Well, what's the alternative? If you think there is something weird enough and suboptimal about essay formats that you are reaching for 'random chance' or 'monkey see monkey do' level explanations, that implies you think there is some much superior format they ought to be using instead. But I can't see what. I think it might be helpful to try to make the case for doing these things via some of the alternatives:

a peer-reviewed Nature paper which would be published 2 years from now, maybe, behind a paywall
a published book, published 3 years from starting the first draft now, which some people might get around to reading a year or two after that, and dropping halfway through (assuming you finish and didn't burn out writing it)
a 1 minute Tiktok video by an AI person with non-supermodel looks
a 5-minute heavily-excerpted interview on CNN
a 750-word WSJ or NYT op-ed
a 10-page Arxiv paper in the standard LaTeX template
a Twitter thread of 500 tweets (which can only be read by logged-in users)
a Medium post (which can't be read because it is written in a light gray font illegible to anyone over the age of 20. Also, it's paywalled 90% of the time.)
a 6 hour Lex Fridman podcast interview

... (read more)

[-]Seth Herd1y142

I think those are the meta because they have just enough space to not only give opinions but to mention reasons for those opinions and expertise/background to support the many unstated judgment calls.

Note that the essays by Altman and Amodei are popular because their positions are central beyond the others because they have not only demonstrable backgrounds in AI but lots of name recognition (we're mostly assuming Altman has bothered learning a lot about how Transformers work even if we don't like him). And that the Gladstone report got itself commissioned by at least a little piece of the government.

A Narrow Path just demonstrates in the text that you and your co-authors have thought deeply about the topic. Shorter essays leave more guesswork on the authors' expertise and depth of consideration.

[-]davekasten1y50

Incidentally, spurred by @Mo Putera's posting of Vernor Vinge's A Fire Upon The Deep annotations, I want to remind folks that Vinge's Rainbows End is very good and doesn't get enough attention, and will give you a less-incorrect understanding of how national security people think.

[-]davekasten1y50

I'm pretty sure that I think "infohazard" is a conceptual dead-end concept that embeds some really false understandings of how secrets are used by humans. It is an orphan of a concept -- it doesn't go anywhere. Ok, the information's harmful. You need humans to touch that info anyways to do responsible risk-mitigation. So now what ?

[-]ABlue1y136

That "so now what" doesn't sound like a dead end to me. The question of how to mitigate risk when normal risk-mitigation procedures are themselves risky seems like an important one.

7Dagon1y

I agree that it's not terribly useful beyond identifying someone's fears. Using almost any taxonomy to specify what the speaker is actually worried about lets you stop saying "infohazard" and start talking about "bad actor misuse of information" or "naive user tricked by partial (but true) information". These ARE often useful, even though the aggregate term "infohazard" is limited.

3Zac Hatfield-Dodds1y

See e.g. Table 1 of https://nickbostrom.com/information-hazards.pdf

1davekasten1y

Yeah, that's a useful taxonomy to be reminded of. I think it's interesting how the "development hazard", item 8, with maybe a smidge of "adversary hazard", is the driver of people's thinking on AI. I'm pretty unconvinced that good infohazard doctrine, even for AI, can be written based on thinking mainly about that!

7Shankar Sivarajan1y

I suggest there is a concept distinct enough to warrant the special term, but if it's expansive enough to include secrets, beneficial information that some people prefer others not know, that renders it worthless. "Infohazard" ought to be reserved for information that harms the mind that contains it, with spoilers as the most mild examples, SCP-style horrors as the extreme fictional examples.

5habryka1y

I think within a bayesian framework where in-general you assume information has positive value, it's useful to have an explicit term when that is not the case. It's a relatively rare occurrence, and as such your usual ways of dealing with information will probably backfire. The obvious things to do is to not learn about that information in the first place (i.e. avoid dangerous research), understand and address the causes for why this information is dangerous (because e.g. you can't coordinate on not building dangerous technology), or as a last resort, silo the information and limit the spread of it. I do think that it would be useful to have different words that distinguish between "infohazard to the average individual" and "societal infohazard". The first one is really exceedingly rare. The second one is still rare but more common because society has a huge distribution of beliefs and enough crazy people that if information can be used dangerously, there is a non-trivial chance it will.

2tailcalled1y

I still like the term "recipe for destruction" when limiting it to stuff similar to dangerous technology.

1davekasten1y

I think a lot of my underlying instinctive opposition to this concept boils down to thinking that we can and do coordinate on this stuff quite a lot. Arguably, AI is the weird counterexample of a thought that wants to be thunk -- I think modern Western society is very nearly tailor-made to seek a thing that is abstract, maximizing, systematizing of knowledge, and useful, especially if it fills a hole left by the collapse of organized religion. I think for most other infohazards, the proper approach requires setting up an (often-government) team that handles them, which requires those employees to expose themselves to the infohazard to manage it. And, yeah, sometimes they suffer real damage from it. There's no way to analyze ISIS beheading videos to stop their perpetrators without seeing some beheading videos; I think that's the more-common varietal of infohazard I'm thinking of.

3[anonymous]1y

I think one of the points is that you should now focus on selective rather than corrective or structural means to figure out who is nonetheless allowed to work on the basis of this information. Calling something an infohazard, at least in my thinking, generally implies both that: * any attempts to devise galaxy-brained incentive structures that try to get large groups of people to nonetheless react in socially beneficial ways when they access this information are totally doomed and should be scrapped from the beginning. * you absolutely should not give this information to anyone that you have doubts would handle it well; musings along the lines of "but maybe I can teach/convince them later on what the best way to go about this is" are generally wrong and should also be dismissed. So what do you do if you nonetheless require that at least some people are keeping track of things? Well, as I said above, you use selective methods instead. More precisely, you carefully curate a very short list of human beings that are responsible people and likely also share your meta views on how dangerous truths ought to be handled, and you do your absolute best to make sure the group never expands beyond those you have already vetted as capable of handling the situation properly.

1davekasten1y

I think at the meta level I very much doubt that I am responsible enough to create and curate a list of human beings for the most dangerous hazards. For example, I am very confident that I could not 100% successfully detect a foreign government spy inside my friend group, because even the US intelligence community can't do that... you need other mitigating controls, instead.

[-]davekasten1y42

I have a few weeks off coming up shortly, and I'm planning on spending some of it monkeying around AI and code stuff. I can think of two obvious tacks: 1. Go do some fundamentals learning on technical stuff I don't have hands-on technical experience with or 2. go build on new fun stuff.

Does anyone have particular lists of learning topics / syllabi / similar things like that that would be a good fit for "fairly familiar with the broad policy/technical space, but his largest shipped chunk of code is a few hundred lines of python" person like me?

3Joseph Miller1y

The ARENA curriculum is very good.

[-]davekasten3mo30

If you're someone who has^[1], or will have, read If Anyone Builds It, Everyone Dies, I encourage you to post your sincere and honest review of the book on Amazon once you have read it -- I think it would be useful to the book's overall reputation.

But be a rationalist! Give your honest opinion.

When:

If you've already read it: Once Amazon accepts reviews, likely starting on the book launch date tomorrow.

If you haven't read it: Once you've read it. Especially if you've ordered a copy from Amazon so they know the review is coming from a ... (read more)

[-]davekasten1y20

Preregistering intent to write "Every Bay Area Walled Compound" (hat tip: Emma Liddell)

Unrelatedly, I am in Berkeley through Wednesday afternoon, let me know if you're around and would like to chat

1davekasten1y

I suspect this won't get published until November at the earliest, but I am already delightfully pleased with this bit:

[-]davekasten1y21

I feel like one of the trivially most obvious signs that AI safety comms hasn't gone actually mainstream yet is that we don't say, "yeah, superintelligent AI is very risky. No, I don't mean Terminator. I'm thinking more Person of Interest, you know, that show with the guy from the Sound of Freedom and the other guy who was on Lost and Evil?"

2[anonymous]1y

I agree (minor spoilers below).

1davekasten1y

(This is not an endorsement of Jim Caviezel's beliefs, in case anyone somehow missed my point here.)

[-]davekasten1y10

I'll be in Berkeley Weds evening through next Monday, would love to chat with, well, basically anyone who wants to chat. (I'll be at The Curve Fri-Sun, so if you're already gonna be there, come find me there between the raindrops!)

[-]davekasten1y10

Why are we so much more worried about LLMs having CBRN risk than super-radicalization risk, precisely ?

(or is this just a expected-harm metric rather than a probability metric ?)

[-]davekasten1y10

I am (speaking personally) pleasantly surprised by Anthropic's letter. https://cdn.sanity.io/files/4zrzovbb/website/6a3b14a98a781a6b69b9a3c5b65da26a44ecddc6.pdf

[-]davekasten2y10

I'll be at LessOnline this upcoming weekend -- would love to talk to folks about what things they wish someone would write about to explain how DC policy stuff and LessWrong-y themes could be better connected.

[-]davekasten2y10

Hypothesis, super weakly held and based on anecdote:
One big difference between US national security policy people and AI safety people is that the "grieving the possibility that we might all die" moment happened, on average, more years ago for the national security policy person than the AI safety person.

This is (even more weakly held) because the national security field has existed for longer, so many participants literally had the "oh, what happens if we get nuked by Russia" moment in their careers in the Literal 1980s...

[-]davekasten1y-10

Ok, so Anthropic's new policy post (explicitly NOT linkposting it properly since I assume @Zac Hatfield-Dodds or @Evan Hubinger or someone else from Anthropic will, and figure the main convo should happen there, and don't want to incentivize fragmenting of conversation) seems to have a very obvious implication.

Unrelated, I just slammed a big AGI-by-2028 order on Manifold Markets.

[+][comment deleted]1y10

Moderation Log