CEO at Redwood Research.
AI safety is a highly collaborative field--almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I'm saying this here because it would feel repetitive to say "these ideas were developed in collaboration with various people" in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.
Please contact me via email (bshlegeris@gmail.com) instead of messaging me on LessWrong.
If we are ever arguing on LessWrong and you feel like it's kind of heated and would go better if we just talked about it verbally, please feel free to contact me and I'll probably be willing to call to discuss briefly.
Notably, I think I disagree with Eliezer on what his moat is! I think he thinks that he's much better at coming to correct conclusions or making substantial intellectual progress than I think he is.
This doesn't feel that surprising to me. I guess my model is that different skills are correlated, so if you pick someone who's extremely capable at a couple of skills, it's not that surprising that no one Pareto dominates them.
I agree that my point isn't really responding to whether it's surprising that there's no one who Pareto dominates him.
(Hopefully it's not rude to state my personal impression of Eliezer as a thinker. I think he's enough of a public figure that it's acceptable for me to comment on it. I'd like to note that I have benefited in many important ways from Eliezer's writing and ideas, and I've generally enjoyed interacting with him in person, and I'm sad that as a result of some of our disagreements our interactions are tense.)
Yeah, I agree that there's no one who Pareto dominates Eliezer at his top four most exceptional traits. (Which I guess I'd say are: taking important weird ideas seriously, writing compelling/moving/insightful fiction (for a certain audience), writing compelling/evocative/inspiring stuff about how humans should relate to rationality (for a certain audience), being broadly knowledgeable and having clever insights about many different fields.)
(I don't think that he's particularly good at thinking about AI; at the very least he is nowhere near as exceptional as he is at those other things.)
I'm not trying to disagree with you. I'm just going to ruminate a little, in an unstructured way, on this:
I know a reasonable number of exceptional people. I am involved in a bunch of conversations about what fairly special people should do. In my experience, when you're considering two people who might try to achieve a particular goal, it's usually the case that each has some big advantages over the other in terms of personal capabilities. So, they naturally try to approach it fairly differently. We can think about this in the case where you are hiring CEOs for a project or speculating about what will happen when companies headed by different CEOs compete.
For example, consider the differences between Sam Altman and Dario Amodei (I don't know either of them that well, nor do I understand the internal workings of OpenAI/Anthropic, so I'm sort of speculating here):
Both of them have done pretty well for themselves in similar roles.
As a CEO, it does feel pretty interesting how non-interchangeable most people are. And it's interesting how in a lot of cases, it's possible to compensate for one weakness with a strength that seems almost unrelated.
If Eliezer had never been around, my guess is that the situation around AI safety would be somewhat but not incredibly different (though probably overall substantially worse):
Maybe a relevant underlying belief of mine is that Eliezer is very good at coming up with terms for things and articulating why something is important, and he also had the important strength of realizing how important AI was before most other people did. But I don't think his thinking about AI is actually very good on the merits. Most of the ideas he's spread were originally substantially proposed by other people; his contribution was IMO mostly his reframings and popularizations. And I don't think his most original ideas actually look that good. (See here for an AI summary.)
I think Eliezer underestimates other people because he evaluates them substantially based on how much they agree with him, and, as a consequence of him having a variety of dumb takes, smart people usually disagree with him about a bunch of stuff.
I'd be really interested in someone trying to answer the question: what updates to the a priori arguments about AI goal structures should we make as a result of the empirical evidence we've seen? I'd love to see a thoughtful and comprehensive discussion of this topic from someone who is familiar with both the conceptual arguments about scheming and the relevant AI safety literature (and maybe AI literature more broadly).
Maybe a good structure would be, starting from the a priori arguments, identifying core uncertainties like "How strong is the imitative prior?", "How strong is the speed prior?", and "To what extent do AIs tend to generalize versus learn narrow heuristics?", and tackling each. (Of course, that would only make sense if the empirical updates actually factor nicely into that structure.)
I feel like I understand this very poorly right now. I currently think the only important update that empirical evidence has given me, compared to the arguments in 2020, is that the human-imitation prior is more powerful than I expected. (Though of course it's unclear whether this will continue; basic points like the expected increasing importance of RL suggest it will become less powerful over time.) But to my detriment, I don't actually read the AI safety literature very comprehensively, and I might be missing empirical evidence that really should update me.
That's correct. Ryan summarized the story as:
Here’s the story of this paper. I work at Redwood Research (@redwood_ai) and this paper is a collaboration with Anthropic. I started work on this project around 8 months ago (it's been a long journey...) and was basically the only contributor to the project for around 2 months.
By this point, I had much of the prompting results and an extremely jank prototype of the RL results (where I fine-tuned llama to imitate Opus and then did RL). From here, it was clear that being able to train Claude 3 Opus could allow for some pretty interesting experiments.
After showing @EvanHub and others my results, Anthropic graciously agreed to provide me with employee-level model access for the project. We decided to turn this into a bigger collaboration with the alignment-stress testing team (led by Evan), to do a more thorough job.
This collaboration yielded the synthetic document fine-tuning and RL results and substantially improved the writing of the paper. I think this work is an interesting example of an AI company boosting safety research by collaborating and providing model access.
So Anthropic was indeed very accommodating here; they gave Ryan an unprecedented level of access for this work, and we're grateful for that. (And obviously, individual Anthropic researchers contributed a lot to the paper, as described in its author contribution statement. And their promotion of the paper was also very helpful!)
My objection is just that this paragraph of yours is fairly confused:
We don’t want to shoot the messenger — they went looking. They didn’t have to do that. They told us the results, and they didn’t have to do that. Anthropic finding these results is Anthropic being good citizens. And you want to be more critical of the A.I. companies that didn’t go looking.
This paper wasn't a consequence of Anthropic going looking; it was a consequence of Ryan going looking. If Anthropic hadn't wanted to cooperate, Ryan would have just published his results without Anthropic's help. That would have been a moderately worse paper that would probably have gotten substantially less attention, but Anthropic didn't have the option of preventing (a crappier version of) the core results from being published.
Just to be clear, I don't think this is that big a deal. It's a bummer that Redwood doesn't get as much credit for this paper as we deserve, but this is pretty unavoidable given how much more famous Anthropic is; my sense is that it's worth the effort for safety people to connect the paper to Redwood/Ryan when discussing it, but it's no big deal. I normally don't bother to object to that credit misallocation. But again, the story of the paper conflicted with these sentences you said, which is why I bothered bringing it up.
Alignment faking, and the alignment faking research was done at Anthropic.
And we want to give credit to Anthropic for this. We don’t want to shoot the messenger — they went looking. They didn’t have to do that. They told us the results, and they didn’t have to do that. Anthropic finding these results is Anthropic being good citizens. And you want to be more critical of the A.I. companies that didn’t go looking.
It would be great if Eliezer knew (or noted, if he knows but is just phrasing it really weirdly) that the research in the alignment faking paper was initially done at Redwood by Redwood staff; I'm normally not prickly about this, but it seems directly relevant to what Eliezer said here.
It really depends on what you mean by "most of the time when people say this". I don't think my experience matches yours.
My understanding is that voter ID laws are probably net helpful for Democrats at this point.