Richard_Ngo — LessWrong

I feel confused about how to engage with this post. I agree that there's a bunch of evidence here that Anthropic has done various shady things, which I do think should be collected in one place. On the other hand, I keep seeing aggressive critiques from Mikhail that I think are low-quality (more context below), and I expect that a bunch of this post is "spun" in uncharitable ways.

That is, I think of the post as primarily trying to do the social move of "lower trust in Anthropic" rather than the epistemic move of "try to figure out what's up with Anthropic". The latter would involve discussion of considerations like: sometimes lab leaders need to change their minds. To what extent are disparities in their statements and actions evidence of deceptiveness versus changing their minds? Etc. More generally, I think of good critiques as trying to identify standards of behavior that should be met, and comparing people or organizations to those standards, rather than just throwing accusations at them.

EDIT: as one salient example, "Anthropic is untrustworthy" is an extremely low-resolution claim. Someone who was trying to help me figure out what's up with Anthropic should e.g. help me calibrate what they mean by "untrustworthy" by comparison to other AI labs, or companies in general, or people in general, or any standard that I can agree or disagree with. Whereas someone who was primarily trying to attack Anthropic is much more likely to use that particular term as an underspecified bludgeon.

My overall sense is that people should think of the post roughly the way they think of a compilation of links, and mostly discard the narrativizing attached to it (i.e. do the kind of "blinding yourself" that Habryka talks about here).

Context: I'm thinking in particular of two critiques. The first was of Oliver Habryka. I feel pretty confident that this was a bad critique, which overstated its claims on the basis of pretty weak evidence. The second was Red Queen Bio. Again, it seemed like a pretty shallow critique: it leaned heavily on putting the phrases "automated virus-producing equipment" and "OpenAI" in close proximity to each other, without bothering to spell out clear threat models or what he actually wanted to happen instead (e.g. no biorisk companies take money from OpenAI? No companies that are capable of printing RNA sequences use frontier AI models?)

In that case I didn't know enough about the mechanics of "virus-producing equipment" to have a strong opinion, but I made a mental note that Mikhail tended to make "spray and pray" critiques that lowered the standard of discourse. (Also, COI note: I'm friends with the founders of Red Queen Bio, and was one of the people encouraging them to get into biorisk in the first place. I'm also friends with Habryka, and have donated recently to Lightcone. EDIT to add: about 2/3 of my net worth is in OpenAI shares, which could become slightly more valuable if Red Queen Bio succeeds.)

Two (even more) meta-level considerations here (though note that I don't consider these to be as relevant as the stuff above, and don't endorse focusing too much on them):

For reference, the other person I've drawn the most similar conclusion about was Alexey Guzey (e.g. of his critiques here, here, and in some internal OpenAI docs). I notice that he and Mikhail are both Russian. I do have some sympathy for the idea that in Russia it's very appropriate to assume a lot of bad faith from power structures, and I wonder if that's a generator for these critiques.
I'm curious if this post was also (along with the Habryka critique) one of Mikhail's daily Inkhaven posts. If so it seems worth thinking about whether there are types of posts that should be written much more slowly, and which Inkhaven should therefore discourage from being generated by the "ship something every day" process.

A Pragmatic Vision for Interpretability

Richard_Ngo8hΩ204915

Thanks for writing this up. While I don't have much context on what specifically has gone well or badly for your team, I do feel pretty skeptical about the types of arguments you give at several points: in particular focusing on theories of change, having the most impact, comparative advantage, work paying off in 10 years, etc. I expect that this kind of reasoning itself steers people away from making important scientific contributions, which are often driven by open-ended curiosity and a drive to uncover deep truths.

(A provocative version of this claim: for the most important breakthroughs, it's nearly impossible to identify a theory of change for them in advance. Imagine Newton or Darwin trying to predict how understanding mechanics/evolution would change the world. Now imagine them trying to do that before they had even invented the theory! And finally imagine if they only considered plans that they thought would work within 10 years, and the sense of scarcity and tension that would give rise to.)

The rest of my comment isn't directly about this post, but close enough that this seems like a reasonable place to put it. I get the sense that there was a "generation" of AI safety researchers who have ended up with a very marginalist mindset about AI safety. Some examples:

the evals that Beth Barnes (and maybe Dan Hendrycks?) are focusing on
the scenarios that Daniel Kokotajlo is focusing on
the models of misalignment that Evan Hubinger is focusing on
the forecasting that the OpenPhil worldview investigations team focused on
scary demos
safety cases
policy approaches like SB-1047

In other words, whole swathes of the field are not even aspiring to be the type of thing that could solve misalignment. In the terminology of this excellent post, they are all trying to attack a category I problem, not a category II problem. Sometimes it feels like almost the entire field is Goodharting on the subgoal of "write a really persuasive memo to send to politicians". Pragmatic interpretability feels like another step in that direction.

This is all related to something Buck recently wrote: "I spend most of my time thinking about relatively cheap interventions that AI companies could implement to reduce risk assuming a low budget, and about how to cause AI companies to marginally increase that budget". I'm sure Buck has thought a lot about his strategy here, and I'm sure that you've thought a lot about your strategy as laid out in this post, and so on. But a part of me is sitting here thinking: man, everyone sure seems to have given up. (And yes, I know it doesn't feel like giving up from the inside, but from my perspective that's part of the problem.)

Now, a lot of the "old guard" seems to have given up too. But they at least know what they've given up on. There was an ideal of fundamental scientific progress that MIRI and Paul and a few others were striving towards; they knew at least what it would feel like (if not what it would look like) to actually make progress towards understanding intelligence. Eliezer and various others no longer think that's plausible. I disagree. But aside from the object-level disagreement, I really want people to be aware that this is a thing that's at least possible in principle to aim for, lest the next generation of the AI safety community end up giving up on it before they even know what they've given up on.

(I'll leave for another comment/post the question of what went wrong in my generation. The "types of arguments" I objected to above all seem quite EA-flavored, and so one salient possibility is just that the increasing prominence of EA steered my generation away from the type of mentality in which it's even possible to aim towards scientific breakthroughs. But even if that's one part of the story, I expect it's more complicated than that.)

Richard Ngo's Shortform

Richard_Ngo2dΩ230

Thinking more about the cellular automaton stuff: okay, so Game of Life is Turing complete. But the question is whether we can pin down properties that GoL has that Turing machines don't have.

I have a vague recollection that parallel Turing Machines are a thing, but this paper claims that the actual formalisms are disappointing. One nice thing about Game of Life is that the way that different programs interact internally (via game of life physics) is also how they interact with each other. Whereas any multi-tape Turing Machine (even one with clever rules about how to integrate inputs from multiple tapes) wouldn't have that property.

I feel like I'm not getting beyond the original idea that Game of Life could have adversarial robustness in a way that Turing Machines don't. But it feels like you'd need to demonstrate this with some construction that's actually adversarially robust, which seems difficult.
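
For concreteness, here is a minimal sketch of the "game of life physics" mentioned above, assuming an unbounded grid stored as a set of live `(x, y)` cells. The names `step`, `glider` and `blinker` are illustrative choices, not taken from any particular library. It only shows that a single local rule governs both what happens inside a pattern and what would happen between patterns; it says nothing about adversarial robustness.

```python
from collections import Counter


def step(live):
    """Advance a set of live (x, y) cells by one Game of Life generation."""
    # Count live neighbours for every cell adjacent to at least one live cell.
    counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # A cell is alive next turn with exactly 3 neighbours,
    # or with exactly 2 neighbours if it is already alive.
    return {
        cell
        for cell, n in counts.items()
        if n == 3 or (n == 2 and cell in live)
    }


# Two standard patterns placed on the same grid (coordinates are arbitrary).
glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
blinker = {(10, 10), (11, 10), (12, 10)}
world = glider | blinker

# While the patterns are far apart, each evolves as if it were alone.
# Once they come close, the same `step` rule decides the interaction;
# there is no separate channel or "tape" between programs.
next_world = step(world)
```

The design point is that `step` takes the whole world as one set: there is no notion of a program boundary in the rule itself.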

Unless its governance changes, Anthropic is untrustworthy

Richard_Ngo3d173

Sometimes, conclusions don’t need to be particularly nuanced. Sometimes, a system is built of many parts, and yet a valid, non-misleading description of that system as a whole is that it is untrustworthy.

The central case where conclusions don't need to be particularly nuanced is when you're engaged in a conflict and you're trying to attack the other side.

In other cases, when you're trying to figure out how the world works and act accordingly, nuance typically matters a lot.

Calling an organization "untrustworthy" is like calling a person "unreliable". Of course some people are more reliable than others, but when you smuggle in implicit binary standards you are making it harder in a bunch of ways to actually model the situation.

I sent Mikhail the following via DM, in response to his request for "any particular parts of the post [that] unfairly attack Anthropic":

I think that the entire post is optimized to attack Anthropic, in a way where it's very hard to distinguish between evidence you have, things you're inferring, standards you're implicitly holding them to, standards you're explicitly holding them to, etc.

My best-guess mental model here is that you were more careful about this post than about the other posts, but that there's a common underlying generator to all of them, which is that you're missing some important norms about how healthy critique should function.

I don't expect to be able to convey those norms or their importance to you in this exchange, but I'll consider writing up a longform post about them.

I think Situational Awareness is a pretty good example of what it looks like for an essay to be optimized for a given outcome at the expense of epistemic quality. In Situational Awareness, it's less that any given statement is egregiously false, and more that there were many choices made to try to create a conceptual frame that promoted racing. I have critiqued this at various points (and am writing up a longer critique) but what I wanted from Leopold was something more like "here are the key considerations in my mind, here's how I weigh them up, here's my nuanced conclusion, here's what would change my mind". And that's similar to what I want from posts like yours too.

Explosive Skill Acquisition

Richard_Ngo3d54

I find this terrifying, that I might be incompetent in many ways, and that if I had a little more awareness, a little more “oomph” I could be much better.

Consider whether the awareness of the terror is itself one of the key steps towards becoming more competent.

That is, much incompetence is caused by suppressed fear, which thereby becomes a self-fulfilling prophecy.

(Apologies for the vagueness here, though I guess my sequence on this elaborates.)

Richard_Ngo3d6118

Someone on the EA forum asked why I've updated away from public outreach as a valuable strategy. My response:

I used to not actually believe in heavy-tailed impact. On some gut level I thought that early rationalists (and to a lesser extent EAs) had "gotten lucky" in being way more right than academic consensus about AI progress. I also implicitly believed that e.g. Thiel and Musk and so on kept getting lucky, because I didn't want to picture a world in which they were actually just skillful enough to keep succeeding (due to various psychological blockers).

Now, thanks to dealing with a bunch of those blockers, I have internalized to a much greater extent that you can actually be good, not just lucky. This means that I'm no longer interested in strategies that involve recruiting a whole bunch of people and hoping something good comes out of it. Instead I am trying to target outreach precisely to the very best people, without compromising much.

Relatedly, I've updated that the very best thinkers in this space are still disproportionately the people who were around very early. The people you need to soften/moderate your message to reach (or who need social proof in order to get involved) are seldom going to be the ones who can think clearly about this stuff. And we are very bottlenecked on high-quality thinking.

(My past self needed a lot of social proof to get involved in AI safety in the first place, but I also "got lucky" in the sense of being exposed to enough world-class people that I was able to update my mental models a lot—e.g. watching the OpenAI board coup close up, various conversations with OpenAI cofounders, etc. This doesn't seem very replicable—though I'm trying to convey a bunch of the models I've gained on my blog, e.g. in this post.)

Richard_Ngo3d40

Hmm, I don't have anything substantive out on this specifically; the closest is probably this talk (though note that some of my arguments in it were a bit sloppy, e.g. as per the top comment).

Richard_Ngo5dΩ220

if there are sufficiently many copies, it becomes impossible to corrupt them all at once.
So I don't love this model because escaping corruption is 'too easy'.

I really like the cellular automaton model. But I don't think it makes escaping corruption easy! Even if most of the copies are non-corrupt, the question is how you can take a "vote" of the corrupt vs non-corrupt copies without making the voting mechanism itself be easily corrupted. That's why I was talking about the non-corrupt copies needing to "overpower" the corrupt copies above.
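
A toy illustration of the voting worry, assuming copies are represented as plain strings and the vote is taken by one central function. Both `majority` and `tampered_majority` are hypothetical names made up for this sketch, and a centralized vote is only one possible mechanism, not the cellular automaton model itself.

```python
from collections import Counter


def majority(copies):
    """Return the most common value among the copies."""
    value, _count = Counter(copies).most_common(1)[0]
    return value


def tampered_majority(copies):
    """Hypothetical corrupted voter: it ignores the copies entirely."""
    return "corrupt"


# Seven non-corrupt copies and three corrupt ones.
copies = ["good"] * 7 + ["corrupt"] * 3

honest_result = majority(copies)             # decided by the copies
tampered_result = tampered_majority(copies)  # decided by the voter alone
```

With an honest voter the non-corrupt majority wins, but corrupting the single voting function overrides every copy at once, so the number of copies alone does not make escaping corruption easy.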

Richard_Ngo5d20

A few responses:

As per my post on underdog bias, the question of which group is actually weaker and which group is stronger is often a pretty subjective call. I even discuss in the post the example of Israel, where you could see it as the "stronger" group (vs Palestine in particular) or the "weaker" group (vs all the Muslim countries surrounding it).
There are plenty of cases where leftists support the stronger group against the weaker group—most notably Soviet and Chinese repression of dissidents and minorities. E.g. it took Solzhenitsyn publishing Gulag Archipelago to finally get leftists (even fairly "mainstream" leftists) to stop lionizing the USSR.
Even insofar as leftists tend to support the weaker group, there are almost no cases where they do so as strongly as in Israel vs Palestine. So there's still something important to be explained here even accepting your claims.
