CEO at Redwood Research.
AI safety is a highly collaborative field--almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I'm saying this here because it would feel repetitive to say "these ideas were developed in collaboration with various people" in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.
Please contact me via email (bshlegeris@gmail.com) instead of messaging me on LessWrong.
If we are ever arguing on LessWrong and you feel like it's kind of heated and would go better if we just talked about it verbally, please feel free to contact me and I'll probably be willing to call to discuss briefly.
I agree that many AI safety papers aren't that replicable.
In some cases this is because the papers are just complete trash and the authors should be ashamed of themselves. I'm aware of at least one person in the AI safety community who is notorious for writing papers that are quite low quality but that get lots of attention for other reasons. (Just to clarify, I don't mean the Anthropic interp team; I do have lots of problems with their research and think that they often over-hype it, but I'm thinking of someone who is worse than that.)
In many cases, papers only sort of replicate, and whether this is a problem depends on what the original paper said.
For example, two papers I was involved with:
Some unstructured thoughts:
I think it's sort of a type error to refer to Anthropic as something that one could trust or not. Anthropic is a company which has a bunch of executives, employees, board members, LTBT members, external contractors, investors, etc, all of whom have influence over different things the company does.
I think the main case where people are tempted to use the word "trust" in connection with Anthropic is when they are trying to decide how good it is to make Anthropic generically more powerful, e.g. by working there on AI capabilities.
I do think that many people (including most Anthropic staff) are well described as trusting Anthropic too much. For example, some people are trustworthy in the sense that things they say make it pretty easy to guess what they're going to do in the future in a wide variety of situations that might come up; I definitely don't think that this is the case for Anthropic. This is partially because it's generally hard to take companies literally when they say things, and partially because Anthropic leadership aren't as into being truthful as, for example, rationalists are. I think that many Anthropic staff take Anthropic leadership at its word to an extent that degrades their understanding of AI-risk-relevant questions.
But is that bad? It's complicated by the fact that it's quite challenging to have enough context on the AI risk situation that you can actually second-guess Anthropic leadership in a way that overall makes the situation better. Most AI-safety-concerned people who work at Anthropic spend most of their time trying to do their job instead of thinking a lot about e.g. what should happen on state legislation; I think it would take a lot of time for them to get confident enough that Anthropic was behaving badly that it would add value for them to try to pressure Anthropic (except by somehow delegating this judgement call to someone who is less COI-ed and who can amortize this work).
I think that in some cases in the past, Anthropic leadership did things that safety-concerned staff wouldn't have liked, and where Anthropic leadership looks like they made the right call in hindsight. For example, I think AI safety people often have sort of arbitrary strong takes about things that would be very bad to do, and it's IMO sometimes been good that Anthropic leadership hasn't been very pressured by their staff.
On the general topic of whether it's good for Anthropic to be powerful, I think that it's also a big problem that Anthropic leadership is way less worried than I am about AIs being egregiously misaligned; I think it's plausible that in the future they'll take actions that I think are very bad for AI risk. (For example, I think that in the face of ambiguous evidence about AI misalignment that I think we're likely to get, they are much more likely than I would be to proceed with building more powerful models.) This has nothing to do with whether they're honest.
I also recommend Holden Karnofsky's notes on trusting AI companies, summarized here.
Out of curiosity about usage, I ctrl-f'd through the Securing Model Weights report to see how they use the word "insider". I found:
I'd add that a common reason to choose not to act against someone is that many of those factors are combined.
I think situations where it's (e.g.) purely "they have power to hurt you" or "you lack legible evidence" are much rarer than situations where it's an awkward combination of those with other things, and so it's hard to even know whether you should take on the project of acting against someone carefully and well.
People who work on politics often have to deal with adversaries who are openly sneering internet trolls (or similar), and sometimes run across valuable opportunities that require cooperating with them.
When faced with tradeoffs, you should favor the options that preserve your ability to keep making trades. Never put that on the line.
What about this: you can press button A or button B. If you press button A, you get a million dollars but then have to sit out round two. If you press button B, you play a game of chess against someone and the winner gets $10. Surely you should press A?
The heuristic you've described is probably good in a lot of situations but it's definitely not universally applicable.
Sure; I think extra speed from practicing it (and e.g. more instantly knowing that 100M is 1e8) is worth it.
This is a great list!
Here's some stuff that isn't in your list that I think comes up often enough that aspiring ML researchers should eventually know it (and most of this is indeed universally known). Everything in this comment is something that I've used multiple times in the last month.
And some stuff I'm personally very glad to know:
I think it's worth drilling your halfish-power-of-ten times tables, by which I mean memorizing the products of numbers like 1, 3, 10, 30, 100, 300, etc, while pretending that 3x3=10.
For example, 30 × 30 = 1k, 10k × 300k = 3B, etc.
I spent an hour drilling these on a plane a few years ago and am glad I did.
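The drill above can be sketched mechanically. This is a minimal illustration (the helper name `half_power_mul` is mine, not from the comment): every quantity is rounded to the form 1×10^e or 3×10^e, represented as a `(mantissa, exponent)` pair, and multiplication uses the pretend rule 3 × 3 = 10.

```python
def half_power_mul(a, b):
    """Multiply two halfish-powers-of-ten, each given as (m, e) with
    m in {1, 3}, meaning m * 10**e. Uses the approximation 3 * 3 -> 10."""
    (m1, e1), (m2, e2) = a, b
    m, e = m1 * m2, e1 + e2
    if m == 9:  # pretend 3 * 3 = 10, i.e. carry into the exponent
        m, e = 1, e + 1
    return (m, e)

# 30 * 30: (3, 1) * (3, 1) -> (1, 3), i.e. 1k
print(half_power_mul((3, 1), (3, 1)))
# 10k * 300k: (1, 4) * (3, 5) -> (3, 9), i.e. 3B
print(half_power_mul((1, 4), (3, 5)))
```

The carry rule is the whole trick: since 3 × 3 = 9 ≈ 10, the scheme stays closed under multiplication while introducing at most ~10% error per step, which is plenty of precision for quick sanity checks.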
I don't think I have a moral obligation not to do that. I'm a guy who wants to do good in the world and I try to do stuff that I think is good, and I try to follow policies such that I'm easy to work with and so on. I think it's pretty complicated to decide how averse you should be to taking on the risk of being eaten by some kind of process.
When I was 23, I agreed to work at MIRI on a non-public project. That's a really risky thing to do for your epistemics etc. I knew that it was a risk at the time, but decided to take the risk anyway. I think it is sensible for people to sometimes take risks like this. (For what it's worth, MIRI was aware that getting people to work on secret projects is a kind of risky thing to do, and they put some effort into mitigating the risks.)
I think it's probably good that Anthropic has pushed the capabilities frontier, and I think a lot of the arguments that this is unacceptable are kind of wrong. If Anthropic staff had pushed back on this more, I think probably the world would be a worse place. (I do think Anthropic leadership was either dishonest or negligently-bad-at-self-modeling about whether they'd push the capabilities frontier.)
I didn't understand your last paragraph.