Running Lightcone Infrastructure, which runs LessWrong and Lighthaven.space. You can reach me at habryka@lesswrong.com.
(I have signed no contracts or agreements whose existence I cannot mention, which I am mentioning here as a canary)
The original title was "Small batches and the mythical single piece flow", which I guess communicates that better, but "single piece flow" is a mouthful and I like imperative titles for this principles series better.
I might be able to come up with something else. Suggestions also welcome!
Edit: I updated the title! Maybe it's too hard to parse, but I think it's better.
Cool, that's not a crazy view. I might engage with it more, but I feel like I understand where you are coming from now.
Jason Clinton responded very briefly, saying that the May 14th update (which excluded sophisticated insiders from the threat model) addresses the concerns in this post, plus some short off-the-record comments.
Based on this and Holden's comment, my best interpretation of Anthropic's position on this issue is that they currently consider employees at compute providers to be "insiders" and executives to be "sophisticated insiders", with the latter thereby excluded from Anthropic's security commitments. They likely also think that compute providers do not have the capacity to execute attacks like this without a very high chance of getting caught, and so that this is not a threat model they are concerned about.
As I argue in the post, defining basically all employees at compute providers to be "insiders" feels like an extreme stretch of the word, and I think it has all kinds of other tricky implications for the RSP, but it's not wholly inconsistent!
To bring Anthropic back into compliance with what I think is a common-sense reading, I would suggest (in descending order of goodness):
I separately seem to disagree with Anthropic on whether this is something that executives at Google/Amazon/Microsoft could be motivated to do, and something they would succeed at if they tried, but given the broad definition of "Insider" that disagreement doesn't appear to be load-bearing for the thesis in the OP. I have written a bit more about it anyway in this comment thread.
Oh, yeah, I should write a future principles memo on falsifying the most load-bearing assumptions early. Agree that that is a really important aspect of doing small batches well!
Appreciate the empirical data! This aligns with my models in the space.
Sorry, I think I am failing to parse this comment. I agree that financial fraud is a thing people don't want. This post is telling them about one dynamic that tends to cause a bunch of it. I agree that of course all the difficulty of stopping fraud lies in the difficulty of distinguishing fraud from non-fraud. This post tries to help you distinguish fraud from non-fraud, and e.g. the FAQ section addresses some specific ways in which the dynamic here can be distinguished from entrepreneurship and marketing.
You might disagree that this is possible, or have some other logical issue with the post, but I feel like you are largely just saying things that are true and already said in the post, and then saying "if you are calling for normal speculative investment to be banned", which, like, I am of course not doing and the post is not implying; I have a bunch of paragraphs in there clarifying that I am not calling for speculative investment to be banned.
not the "the AI may indeed understand that this is not what we meant" part. (Pretend the latter part doesn't exist.)
Ok, but the latter part does exist! I can't ignore it. Like, it's a sentence that seems almost explicitly designed to clarify that Bostrom thinks the AI will understand what we mean. So clearly, Bostrom is not saying "the AI will not understand what we mean". Maybe he is making some other error in the book, like about how an AI that understands us the way it does would have to be corrigible, or that "happiness" is a confused kind of model of what an AI might want to optimize, but clearly that sentence is an atrocious sentence for demonstrating that "Bostrom said that the AI will not understand what we mean". Like, he literally said the opposite right there, in the quote!
the threat model you're describing is out of scope for our RSP, as I think the May 14 update (last page) makes clear
I disagree. I address that section explicitly here:
From the Anthropic RSP:
When a model must meet the ASL-3 Security Standard, we will evaluate whether the measures we have implemented make us highly protected against most attackers’ attempts at stealing model weights.
We consider the following groups in scope: hacktivists, criminal hacker groups, organized cybercrime groups, terrorist organizations, corporate espionage teams, internal employees, and state-sponsored programs that use broad-based and non-targeted techniques (i.e., not novel attack chains).
[...]
We will implement robust controls to mitigate basic insider risk, but consider mitigating risks from sophisticated or state-compromised insiders to be out of scope for ASL-3. We define “basic insider risk” as risk from an insider who does not have persistent or time-limited access to systems that process model weights. We define “sophisticated insider risk” as risk from an insider who has persistent access or can request time-limited access to systems that process model weights.
My best understanding is that the RSP commits Anthropic to being robust against attacks from corporate espionage teams (as included in the list above).
The RSP mentions "insiders" as a class of person against whom Anthropic is promising less robustness, but doesn't fully define the term. Everyone I have talked to about the RSP interpreted "insiders" to mean "people who work at Anthropic". I think it would be a big stretch for "insiders" to include "anyone working at any organization we work with that has persistent access to systems that process model weights". As such, I think it's pretty clear that "insiders" is not intended to include e.g. Amazon AWS employees, or Google employees.
Claude agrees with this interpretation: https://claude.ai/share/b7860f42-bef1-4b28-bf88-8ca82722ce82
One could potentially make the argument that Google, Microsoft and Amazon should be excluded on the basis of the "highly sophisticated attacker" carve-out:
The following groups are out of scope for the ASL-3 Security Standard because further testing (as discussed below) should confirm that the model would not meaningfully increase their ability to do harm: state-sponsored programs that specifically target us (e.g., through novel attack chains or insider compromise) and a small number (~10) of non-state actors with state-level resourcing or backing that are capable of developing novel attack chains that utilize 0-day attacks.
Multiple people I talked to thought this argument was highly implausible. These attacks would not be nation-state backed, would not require developing novel attack chains that use 0-day attacks, and if you include Amazon and Google in this list of non-state actors, it seems very hard to limit the total list of organizations with that amount of cybersecurity offense capacity or more to "~10".
Again, Claude agrees: https://claude.ai/share/a6068000-0a82-4841-98e9-457c05379cc2
The Claude transcripts all use PDFs of the latest version of the RSP. I also ran this by multiple other people, and none of them thought it was reasonable for a definition of "Insiders" to apply to employees at major datacenter providers. "Insiders" I think pretty clearly means "employees or at most contractors of Anthropic". If you end up including employees at organizations Anthropic is working with, you quickly run into a bunch of absurdities and contradictions within the RSP that I think clearly show it must have a definition as narrow as this.
Therefore, I do not see how "This update excludes both sophisticated insiders and state-compromised insiders from the ASL-3 Security Standard." could exclude employees and executives at Microsoft, Google, or Amazon, unless you define "Insider" to mean "anyone with persistent physical access to machines holding model weights", in which case I would dispute that that is a reasonable definition of "insider". If Anthropic ships their model weights to another organization, clearly employees of that organization do not by default become "insiders" and executives do not become "sophisticated insiders". If Anthropic ships their models to another organization and then one of that organization's executives steals the weights, I think Anthropic has violated their RSP as stated.
If you are clarifying that Anthropic, according to its RSP, could send unencrypted weights to the CEOs of arbitrary competing tech companies, but with a promise to please not actually use them for anything, and this would not constitute a breach of its RSP because competing tech companies' CEOs are "high level insiders", then I think this would be really good to clarify! I really don't think that would be a natural interpretation of the current RSP (and Claude and multiple people I've talked to about this, e.g. @Buck and @Zach Stein-Perlman, agree with me here).
(Less importantly, I will register confusion about your threat model here - I don't think there are teams at these companies whose job is to steal from partners with executive buy-in? Nor do I think this is likely for executives to buy into in general, at least until/unless AI capabilities are far beyond today's.)
I don't think this is a particularly likely threat model, but also not an implausible one. My position for a long time has been that Anthropic's RSP security commitments have been unrealistically aggressive (but people around Anthropic have been pushing back on that and saying that security is really important and so Anthropic should make commitments as aggressive as this).
I think it would be a scandal at roughly the scale of the Volkswagen emissions scandal if a major lab decided to do something like this, i.e. a really big deal, but not unprecedented. My current guess is that it's something like 50% likely that one of Google, Amazon or Microsoft has corporate espionage teams capable of doing this kind of work, and something like 6% likely that any of Microsoft, Google, or Amazon would consider it worth the risk to make an attempt at exfiltrating model weights of a competing organization via some mechanism like this within the next year.
I have written a bit about this in a Twitter discussion:
Max Hodak:
> This means given executive buy-in by a high-level Amazon, Microsoft or Google executive, their corporate espionage team would have virtually unlimited physical access to Claude inference machines that host copies of the weights.

microsoft is not sending a sanctioned team of spies physically into a data center to steal anthropic weights what are you talking about man
Look, I didn't make Anthropic's RSP! I agree that this is currently pretty unlikely, but it's also not completely implausible. Like, actually think about this. We are talking about hundreds of billions of dollars in market value depending on who can make the best models. Having access to a leading competitor's model for the purpose of training your own really helps. Organizational integrity has been violated for much less.
@maxhodak_
The risk to Microsoft's market cap for getting caught far exceeds any possible upside here
@ohabryka
I mean, it depends on the likelihood of getting caught. I agree it doesn't look like a great EV bet right now, but I don't think it's implausible for it to become one within a few months.

Like, imagine a world really quite close to this one where Microsoft's stock price is 30%+ dependent on whether they have the best frontier model, they think they can make this 10%+ more likely by stealing Anthropic's weights, and think there is less than a 10% chance that they get caught.
In that world, which really doesn't seem implausible to materialize in the coming months, this proposition is looking pretty good! Like, Microsoft's stock price probably wouldn't literally go to zero if they got caught. Even if they get caught they probably would have the ability to pin it on someone, etc.
And look, I agree with you that this is overall unlikely, but Anthropic's RSP really isn't about "we will be robust to anyone who has very strong motivations to hack us".
It makes a lot of quite specific claims about the strength of their defenses against specific classes of attackers, and does not talk about only being able to defend against the subset of attackers who are particularly motivated to hack Anthropic. Like, if you explicitly promise robustness against corporate espionage teams, then it's quite weird to exclude corporate espionage teams at three of the world's largest corporations, who also happen to be, as these things go, more motivated than most to get your weights (like, what is Toyota going to do with your weights? Clearly if anyone is going to steal your weights, it's going to be another AI lab or big tech company).
@maxhodak_
This would almost certainly be a felony, maybe under the Defend Trade Secrets Act (which offers plaintiffs some crazy tools) if not other statutes and would result in prison time for executives and probably a criminal prosecution of the corporation. This is not some business EV calculation. Microsoft as a whole would be permanently damaged. The chance that they actually have any kind of corporate espionage team attacking domestic US companies who are customers of theirs is ~0
@ohabryka
I mean, yes, and business executives sometimes commit felonies and go to jail when they experience enough pressure.

Like, think through what must have happened for stuff like the Volkswagen emissions scandal. That was straightforward fraud. It had huge consequences for Volkswagen. They nevertheless did it. As far as I can tell they had much less to gain than Microsoft would have in this scenario.
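(For concreteness, here is a rough sketch of the back-of-the-envelope EV calculation I was gesturing at in that exchange. The 30%, 10%, and 10% figures are the purely illustrative numbers from my tweet; the market cap and the penalty-if-caught figure are additional made-up assumptions, not estimates.)

```python
# Back-of-the-envelope EV sketch for the hypothetical above.
# All numbers are illustrative assumptions, not real estimates.

market_cap = 3.0e12        # assumed big-tech market cap, in dollars
value_of_winning = 0.30    # fraction of market cap riding on having the best frontier model
p_win_boost = 0.10         # assumed increase in P(best model) from having a competitor's weights
p_caught = 0.10            # assumed probability the theft is detected and attributed
penalty_if_caught = 0.20   # assumed fraction of market cap lost if caught (fines, trust, prosecution)

expected_gain = market_cap * value_of_winning * p_win_boost
expected_loss = market_cap * penalty_if_caught * p_caught

print(f"Expected gain: ${expected_gain / 1e9:,.0f}B")
print(f"Expected loss: ${expected_loss / 1e9:,.0f}B")
print(f"Net EV:        ${(expected_gain - expected_loss) / 1e9:,.0f}B")
```

Under these made-up numbers the bet comes out tens of billions of dollars positive in expectation. The point is not that these specific numbers are right, but that the conclusion hinges entirely on the assumed detection probability and penalty.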
I don't think papers were set up for wide distribution at all. Like, how would Newton, Darwin and Maxwell have published 10+ papers and distributed them all to their target audience?
Papers and books are aimed at different audiences. Papers are aimed at a small community of experts which has a lot of shared epistemic prerequisites, so the average contribution is short. Books are aimed at larger audiences.
The modern internet enables you to write in small batches to both audiences (or any audience, really). I am more talking about the ability to write things like blogpost series, or have people follow your YouTube channel, or follow your Twitter, etc. (Like, I am not saying YouTube and Twitter are bastions of intellectual progress, but they enable distribution mechanisms that I think generally outperform previous ones.)
Ok, I... think this makes sense? Honestly, I think I would have to engage with this for a long time to see whether this makes sense with the actual content of e.g. Bostrom's text, but I can at least see the shape of an argument that I could follow if I wanted to! Thank you!
(To be clear, this is of course not a reasonable amount of effort to ask someone to put into understanding a random paragraph from a blogpost, at least without it being flagged as such, but writing is hard and it's sometimes hard to bridge inferential distance.)