CEO at Redwood Research.
AI safety is a highly collaborative field--almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I'm saying this here because it would feel repetitive to say "these ideas were developed in collaboration with various people" in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.
Please contact me via email (bshlegeris@gmail.com) instead of messaging me on LessWrong.
If we are ever arguing on LessWrong and you feel like it's kind of heated and would go better if we just talked about it verbally, please feel free to contact me and I'll probably be willing to call to discuss briefly.
One thing I notice when reading 20th century history is that people in the 1900s-1970s had much higher priors than modern people do that the future might be radically different, in either great or terrible ways. For example:
I really feel like the ambient cultural sense among educated Americans is: the future will be kind of like the present, and treating it as if something radical will happen is naive. (They sort of say that they think climate change will be apocalyptic, but it feels to me like what they're really imagining is that the world is "enshittified" further, in the same way that it sucks that DoorDash is now expensive, and maybe poor people elsewhere die.)
I think this is probably mostly because there's an important sense in which the world has been changing more slowly (at least from the perspective of Americans), and the ways in which it's changing feel somehow less real. Someone who was 50 in 1945 had seen the collapse of empires that had lasted centuries, unprecedented wars, the sudden shocking rise of Communism, the invention and mass adoption of cars, radio, tanks, etc. That's just way, way crazier than anything that 50-year-old Americans have seen. And the main technological advances--phones, internet, social media, and recently AI--seem somehow subtler and easier to ignore, even though they have an objectively large effect on people's experience of life and on how society functions.
I think that people of the past might have reacted with more credulity to some of our claims about transformative AI.
I often feel like people I'm talking to are demonstrating an embarrassing lack of historical context when they implicitly imagine that states will be stable and that technology won't drastically change the world. (Or sometimes they say "usually it works better to trade with people than to overpower them", and my response is "that is really not a historical universal!")
Eliezer sometimes talks about how people are ruined by modern culture, in a way only fixable by reading 1950s sci-fi (or something like this, I don't remember). I wonder how much of what he's talking about is related to this.
fixed, thanks
I don't really disagree with anything you said here. (Edit to add: except that I don’t agree with the OP’s interpretation of all the evidence listed.)
"you have a moral obligation not to be eaten by the sort of process that would eat people"
I don't think I have a moral obligation not to do that. I'm a guy who wants to do good in the world and I try to do stuff that I think is good, and I try to follow policies such that I'm easy to work with and so on. I think it's pretty complicated to decide how averse you should be to taking on the risk of being eaten by some kind of process.
When I was 23, I agreed to work at MIRI on a non-public project. That's a really risky thing to do for your epistemics etc. I knew that it was a risk at the time, but decided to take the risk anyway. I think it is sensible for people to sometimes take risks like this. (For what it's worth, MIRI was aware that getting people to work on secret projects is a kind of risky thing to do, and they put some effort into mitigating the risks.)
For example, I think AI safety people often have sort of arbitrary strong takes about things that would be very bad to do, and it's IMO sometimes been good that Anthropic leadership hasn't been very pressured by their staff.
Could you give a more specific example, one that's among the strongest such examples?
I think it's probably good that Anthropic has pushed the capabilities frontier, and I think a lot of the arguments that this is unacceptable are kind of wrong. If Anthropic staff had pushed back on this more, I think probably the world would be a worse place. (I do think Anthropic leadership was either dishonest or negligently-bad-at-self-modeling about whether they'd push the capabilities frontier.)
I didn't understand your last paragraph.
I agree that many AI safety papers aren't that replicable.
In some cases this is because the papers are just complete trash and the authors should be ashamed of themselves. I'm aware of at least one person in the AI safety community who is notorious for writing papers that are quite low quality but that get lots of attention for other reasons. (Just to clarify, I don't mean the Anthropic interp team; I do have lots of problems with their research and think that they often over-hype it, but I'm thinking of someone who is worse than that.)
In many cases, papers only sort of replicate, and whether this is a problem depends on what the original paper said.
For example, two papers I was involved with:
Some unstructured thoughts:
I think it's sort of a type error to refer to Anthropic as something that one could trust or not. Anthropic is a company which has a bunch of executives, employees, board members, LTBT members, external contractors, investors, etc., all of whom have influence over different things the company does.
I think the main case where people are tempted to use the word "trust" in connection with Anthropic is when they are trying to decide how good it is to make Anthropic generically more powerful, e.g. by working there on AI capabilities.
I do think that many people (including most Anthropic staff) are well described as trusting Anthropic too much. For example, some people are trustworthy in the sense that things they say make it pretty easy to guess what they're going to do in the future in a wide variety of situations that might come up; I definitely don't think that this is the case for Anthropic. This is partially because it's generally hard to take companies literally when they say things, and partially because Anthropic leadership aren't as into being truthful as, for example, rationalists are. I think that many Anthropic staff take Anthropic leadership at its word to an extent that degrades their understanding of AI-risk-relevant questions.
But is that bad? It's complicated by the fact that it's quite challenging to have enough context on the AI risk situation that you can actually second-guess Anthropic leadership in a way that overall makes the situation better. Most AI-safety-concerned people who work at Anthropic spend most of their time trying to do their job instead of thinking a lot about e.g. what should happen on state legislation; I think it would take a lot of time for them to get confident enough that Anthropic was behaving badly that it would add value for them to try to pressure Anthropic (except by somehow delegating this judgement call to someone who is less COI-ed and who can amortize this work).
I think that in some cases in the past, Anthropic leadership did things that safety-concerned staff wouldn't have liked, and where Anthropic leadership looks like they made the right call in hindsight. For example, I think AI safety people often have sort of arbitrary strong takes about things that would be very bad to do, and it's IMO sometimes been good that Anthropic leadership hasn't been very pressured by their staff.
On the general topic of whether it's good for Anthropic to be powerful, I think that it's also a big problem that Anthropic leadership is way less worried than I am about AIs being egregiously misaligned; I think it's plausible that in the future they'll take actions that I think are very bad for AI risk. (For example, I think that in the face of ambiguous evidence about AI misalignment that I think we're likely to get, they are much more likely than I would be to proceed with building more powerful models.) This has nothing to do with whether they're honest.
I also recommend Holden Karnofsky's notes on trusting AI companies, summarized here.
Out of curiosity about usage, I ctrl-f'd through the Securing Model Weights report to see how they use the word "insider". I found:
I'd add that a common reason to choose not to act against someone is that many of those factors are combined.
I think situations where it's (e.g.) purely "they have power to hurt you" or "you lack legible evidence" are much rarer than situations where it's an awkward combination of those with other things, and so it's hard to even know whether you should take on the project of acting against someone carefully and well.
People who work on politics often have to deal with adversaries who are openly sneering internet trolls (or similar), and sometimes run across valuable opportunities that require cooperating with them.
I agree there's been a lot of scientific progress, and real GDP per capita, which is maybe the most canonical single metric, continues to rise steadily.
But yeah, I think that this feels underwhelming to people compared to earlier qualitative changes. I think this is some combination of them noticing that tech advances affect their lives less and the advances feeling more opaque.