Richard_Ngo
Formerly alignment and governance researcher at DeepMind and OpenAI. Now independent.

Sequences

Twitter threads
Understanding systematization
Stories
Meta-rationality
Replacing fear
Shaping safer goals
AGI safety from first principles

Comments

Please, Don't Roll Your Own Metaethics
Richard_Ngo · 1h

"Please don't roll your own crypto" is a good message to send to software engineers looking to build robust products. But it's a bad message to send to the community of crypto researchers, because insofar as they believe you, then you won't get new crypto algorithms from them.

In the context of metaethics, LW seems much more analogous to the "community of crypto researchers" than the "software engineers looking to build robust products". Therefore this seems like a bad message to send to LessWrong, even if it's a good message to send to e.g. CEOs who justify immoral behavior with metaethical nihilism.

The Charge of the Hobby Horse
Richard_Ngo · 1h

FWIW, in case this is helpful, my impression is that:

  • It is accurate to describe Wei as doing a "charge of the hobby-horse" in his initial comment, and this should be considered a mild norm violation. I'm also surprised and a bit disappointed that it got so many upvotes.
  • By the time that Tsvi announced the ban, Wei had already acknowledged that his original comments had been partly based on a misunderstanding. In my culture, I would expect more of an apology for doing so than the "ok...but to be fair" follow-up Wei actually gave. However, the phrase "Also, another part of my motivation is still valid and I think it would be interesting to try to answer" is a clear enough acknowledgement of a distinct line of inquiry that I no longer consider that comment to be a continuation of the "charge of the hobby-horse".
  • Tsvi banning Wei for "grossly negligent reading comprehension" after Wei had acknowledged that he was mistaken seems like a mild norm violation. It wouldn't have been a norm violation if Wei's comment hadn't made that acknowledgement; however, it would have been a stronger norm violation if Wei's comment had included an actual apology.
Wei Dai's Shortform
Richard_Ngo · 2h

> This has pretty low argumentative/persuasive force in my mind.

Note that my comment was not optimized for argumentative force about the overarching point. Rather, you asked how they "can" still benefit the world, so I was trying to give a central example.

In the second half of this comment I'll give a couple more central examples of how virtues can allow people to avoid the traps you named. You shouldn't consider these to be optimized for argumentative force either, because they'll seem ad-hoc to you. However, they might still be useful as datapoints.

Figuring out how to describe the underlying phenomenon I'm pointing at in a compelling, non-ad-hoc way is one of my main research focuses. The best I can do right now is to say that many of the ways in which people produce outcomes which are harmful (by their own lights) seem to arise from a handful of underlying dynamics. I call this phenomenon pessimization. One way in which I'm currently thinking about virtues is as a set of cognitive tools for preventing pessimization. As one example, kindness and forgiveness help to prevent cycles of escalating conflict with others, which is a major mechanism by which people's values get pessimized. This one is pretty obvious to most people; let me sketch out some less obvious mechanisms below.

> what if someone isn't smart enough to come up with a new line of illegible research, but does see some legible problem with an existing approach that they can contribute to? What would cause them to avoid this?

This actually happened to me: when I graduated from my master's, I wasn't cognitively capable of coming up with new lines of illegible alignment research, in part because I was too status-seeking. Instead I went to work at DeepMind, and ended up spending a lot of my time working on RLHF, which is a pretty central example of a "legible" line of research.

However, I also wasn't cognitively capable of making much progress on RLHF, because I couldn't see how it addressed the core alignment problem, and so it didn't seem fundamental enough to maintain my interest. Instead I spent most of my time trying to understand the alignment problem philosophically (resulting in this sequence) at the expense of my promotion prospects.

In this case I think I had the virtue of deep curiosity, which steered my attention towards illegible problems even though my top-down plan was to contribute to alignment by doing RLHF research. These days, whatever you might think of my research, few people complain that it's too legible.

There are other possible versions of me who had that deep curiosity but weren't smart enough to have generated a research agenda like my current one; however, I think they would still have left DeepMind, or at least not been very productive on RLHF.

> And even the hypothetical virtuous person who starts doing illegible research on their own, what happens when other people catch up to him and the problem becomes legible to leaders/policymakers? How would they know to stop working on that problem and switch to another problem that is still illegible?

When a field becomes crowded, there's a pretty obvious inference that you can make more progress by moving to a less crowded field. I think people often don't draw that inference because moving to a less crowded field loses them prestige, is emotionally/financially risky, etc. Virtues help remove those blockers.

Paranoia: A Beginner's Guide
Richard_Ngo · 2h

> though I think you don't need to invoke knightian uncertainty. I think it's simply enough to model there being a very large attack surface combined with a more intelligent adversary.

One of the problems I'm pointing to is that you don't know what the attack surface is. This puts you in a pretty different situation than if you have a known large attack surface to defend, even against a smarter adversary (e.g. the whole length of a border; or every possible sequence of Go moves).

Separately, I may be being a bit sloppy by using "Knightian uncertainty" as a broad handle for cases where you have important "unknown unknowns", aka you don't even know what ontology to use. But it feels close enough that I'm by default planning to continue describing the research project outlined above as trying to develop a theory of Knightian uncertainty in which Bayesian uncertainty is a special case.

Paranoia: A Beginner's Guide
Richard_Ngo · 3h

I also have a short story about (some aspects of) paranoia from the inside.

Paranoia: A Beginner's Guide
Richard_Ngo · 3h

Fair point. Let me be more precise here.

Both the market for lemons in econ and adverse selection in trading are simple examples of models of adversarial dynamics. I would call these non-central examples of paranoia insofar as you know the variable about which your adversary is hiding information (the quality of the car/the price the stock should be). This makes them too simple to get at the heart of the phenomenon.
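To make the "you know which variable is hidden" point concrete, here's the textbook Akerlof-style unraveling argument (standard illustrative numbers, not anything from this thread):

```latex
% Lemons-market sketch: quality q is uniform on [0,1]; a seller values the car
% at q and a buyer at (3/2)q, so every trade would create value.
\[
  q \sim U[0,1], \qquad v_{\mathrm{seller}}(q) = q, \qquad v_{\mathrm{buyer}}(q) = \tfrac{3}{2}\,q.
\]
% At any posted price p, only sellers with q <= p accept, so the buyer's
% expected value conditional on trade is
\[
  \mathbb{E}\big[v_{\mathrm{buyer}}(q) \mid q \le p\big] = \tfrac{3}{2}\cdot\tfrac{p}{2} = \tfrac{3}{4}\,p < p,
\]
% and the buyer declines at every price: the market unravels. All of the
% adversarial structure lives in the single known-but-hidden scalar q.
```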

I think Habryka is gesturing at something similar in his paragraph starting "All that said, in reality, navigating a lemon market isn't too hard." And I take him to be gesturing at a more central description of paranoia in his subsequent description: "What do you do in a world in which there are not only sketchy used car salesmen, but also sketchy used car inspectors, and sketchy used car inspector rating agencies, or more generally, competent adversaries who will try to predict whatever method you will use to orient to the world, and aim to subvert it for their own aims?"

This is similar to my criticism of maximin as a model of paranoia: "It's not actually paranoid in a Knightian way, because what if your adversary does something that you didn't even think of?"

Here's a gesture at making this more precise: what makes something a central example of paranoia in my mind is when even your knowledge of how your adversary is being adversarial is also something that has been adversarially optimized. Thus chess is not a central example of paranoia (except insofar as your opponent has been spying on your preparations, say) and even markets for lemons aren't a central example (except insofar as buyers weren't even tracking that dishonesty was a strategy sellers might use—which is notably a dynamic not captured by the economic model).

Paranoia: A Beginner's Guide
Richard_Ngo · 6h

Great post. I'm going to riff on it to talk about what it would look like to have an epistemology which formally explains/predicts the stuff in this essay.

Paranoia is a hard thing to model from a Bayesian perspective, because there's no slot to insert an adversary who might fuck you over in ways you can't model (and maybe this explains why people were so confused about the Market for Lemons paper? Not sure). However, I think it's a very natural concept from a Knightian perspective. My current guess is that the correct theory of Knightian uncertainty will be able to formulate the concept of paranoia in a very "natural" way (and also subsume Bayesian uncertainty as a special case where you need zero paranoia because you're working in a closed domain which you have a mechanistic understanding of).

The worst-case assumption in infra-Bayesianism (and the maximin algorithm more generally, e.g. as used in chess engines) is one way of baking in a high level of paranoia. However, that approach has two drawbacks:

  1. There's no good way to "dial down" the level of paranoia. I.e. we don't have an elegant version of maximin to apply to settings where your adversary isn't always choosing the worst possibility for you.
    1. The closest I have is the Hurwicz criterion, which basically sets the ratio of focusing on the worst outcome to focusing on the best outcome (see the toy sketch after this list). But this is very hacky: the thing you actually care about is all the intermediate outcomes.
  2. It's not actually paranoid in a Knightian way, because what if your adversary does something that you didn't even think of?
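To make that contrast concrete, here's the toy sketch referenced above (payoffs are numbers I made up for illustration; nothing here is specific to infra-Bayesianism):

```python
# Toy decision problem: rows are our actions, entries are payoffs across
# possible adversary/world states. Illustrative numbers only.
payoffs = {
    "cautious":   [3, 3, 3],   # same payoff no matter what happens
    "aggressive": [9, 6, 0],   # great unless the adversary picks the worst state
}

def maximin(p):
    # Assume the worst state is always the one that occurs.
    return min(p)

def hurwicz(p, alpha):
    # alpha is the weight on the worst outcome ("pessimism"); alpha = 1 recovers
    # maximin, alpha = 0 is pure best-case optimism. Intermediate outcomes are
    # ignored entirely, which is the "hacky" part noted above.
    return alpha * min(p) + (1 - alpha) * max(p)

for name, p in payoffs.items():
    print(name, "maximin:", maximin(p), "hurwicz(0.7):", hurwicz(p, 0.7))
# maximin scores cautious 3 vs aggressive 0; Hurwicz at alpha = 0.7 scores
# cautious 3.0 vs aggressive 2.7, and the aggressive action's middle payoff
# of 6 never affects either score.
```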

Another way of being paranoid is setting large bid-ask spreads. I assume that finance people have a lot to say about how to set bid-ask spreads, but I haven't heard of any very elegant theory.
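For what it's worth, the simplest market-microstructure story does derive the spread from paranoia about informed counterparties. Here's a Glosten-Milgrom-style sketch (my own toy version, with made-up parameters), in which the quoted spread comes out equal to the assumed fraction of informed, i.e. adversarial, traders:

```python
# A market maker quotes ask = E[V | someone buys] and bid = E[V | someone sells].
# The asset V is worth 0 or 1 with equal probability. A fraction `informed` of
# traders know V (and buy iff V = 1); the rest buy or sell at random.

def quotes(informed: float) -> tuple[float, float]:
    p_buy_if_good = informed + (1 - informed) / 2   # informed buyers plus half the noise traders
    p_buy_if_bad = (1 - informed) / 2               # only noise traders buy a bad asset
    ask = p_buy_if_good * 0.5 / (p_buy_if_good * 0.5 + p_buy_if_bad * 0.5)  # P(V=1 | buy)
    bid = 1 - ask                                   # P(V=1 | sell), by symmetry here
    return bid, ask

for informed in (0.0, 0.2, 0.5, 0.9):
    bid, ask = quotes(informed)
    print(f"informed={informed:.1f}  bid={bid:.2f}  ask={ask:.2f}  spread={ask - bid:.2f}")
# The spread equals the informed fraction: the more adversarial flow you
# suspect, the wider you quote.
```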

I think of Sahil's live theory as being a theory of anti-paranoia. It's the approach you take in a world which is fundamentally "friendly" to you. It's still not very pinned-down, though.

I think your three approaches to dealing with an adversarial world all gesture to valuable directions for formal investigation. I think of "blinding yourself" in terms of maintaining a boundary. The more paranoid you are, the stronger a boundary you need to have between yourself and the outside world. Boundaries are Knightian in the sense that they allow you to get stuff done without actually knowing much about what's in the external world. My favorite example here (maybe from Sahil?) is the difference between a bacterium and a cell inside a human body. A bacterium is in a hostile world and therefore needs to maintain strong boundaries. Conversely, a cell inside a body can mostly assume that the chemicals in its environment are there for its benefit, and so can be much more permeable to them. We want to be able to make similar adjustments on an information level (and also on a larger-scale physical level).

I think of "purging the untrustworthy" in terms of creating/maintaining group identity. I expect that this can be modeled in terms of creating commitments to behave certain ways. The "healthy" version is creating a reputation which you don't want to undermine because it's useful for coordination (as discussed e.g. here). The unhealthy version is to traumatize people into changing their identities, by inducing enough suffering to rearrange their internal coalitions (I have a long post coming up on how this explains the higher education system; the short version is here).

I think of "become unpredictable" in terms of asymmetric strategies which still work against entities much more intelligent than you. Ivan has a recent essay about encryption as an asymmetric weapon which is robust to extremely powerful adversaries. I'm reminded of an old Eliezer essay about how, if you're using noise in an algorithm, you're doing it wrong. That's true from a Bayesian perspective, but it's very untrue from a (paranoid) Knightian perspective. Another example of an asymmetric weapon: no matter how "clever" your drone is, it probably can't figure out a way to fly directly towards a sufficiently powerful fan (because the turbulence is too chaotic to exploit).

I think that the good version of "become vindictive" has something to do with virtue ethics. I think of virtue ethics as a strategy for producing good outcomes even when dealing with entities (particularly collectives) that are much more capable than you. This is also true of deontology (see the passage in HPMOR where Hermione keeps getting obliviated). I think consequentialism works pretty well in low-adversarialness environments, virtue ethics works in medium-adversarialness environments, and deontology is most important in the most adversarial environments, because as you move along that spectrum you are making decisions in ways which have fewer and fewer degrees of freedom to exploit.

Hopefully much more on all of this soon, but thank you for inspiring me to get out at least a rough set of pointers.

Wei Dai's Shortform
Richard_Ngo · 8d

> Can you explain how someone who is virtuous, but missing the crucial consideration of "legible vs. illegible AI safety problems" can still benefit the world? I.e., why would they not be working on some highly legible safety problem that actually is negative EV to work on?

If a person is courageous enough to actually try to solve a problem (like AI safety), and high-integrity enough to avoid distorting their research due to social incentives (like incentives towards getting more citations), and honest enough to avoid self-deception about how to interpret their research, then I expect that they will tend towards doing "illegible" research even if they're not explicitly aware of the legible/illegible distinction. One basic mechanism is that they start pursuing lines of thinking that don't immediately make much sense to other people, and the more cutting-edge research they do the more their ontology will diverge from the mainstream ontology.

Wei Dai's Shortform
Richard_Ngo · 10d

I'm taking the dialogue seriously but not literally. I don't think the actual phrases are anywhere near realistic. But the emotional tenor you capture of people doing safety-related work that they were told was very important, then feeling frustrated by arguments that it might actually be bad, seems pretty real. Mostly I think people in B's position stop dialoguing with people in A's position, though, because it's hard for them to continue while B resents A (especially because A often resents B too).

Some examples that feel like B-A pairs to me include: people interested in "ML safety" vs people interested in agent foundations (especially back around 2018-2022); people who support Anthropic vs people who don't; OpenPhil vs Habryka; and "mainstream" rationalists vs Vassar, Taylor, etc.

Wei Dai's Shortform
Richard_Ngo · 10d

This observation should make us notice confusion about whether AI safety recruiting pipelines are actually doing the right type of thing.

In particular, the key problem here is that people are acting on a kind of top-down partly-social motivation (towards doing stuff that the AI safety community approves of)—a motivation which then behaves coercively towards their other motivations. But as per this dialogue, such a system is pretty fragile.

A healthier approach is to prioritize cultivating traits that are robustly good—e.g. virtue, emotional health, and fundamental knowledge. I expect that people with such traits will typically benefit the world even if they're missing crucial high-level considerations like the ones described above.

For example, an "AI capabilities" researcher from a decade ago who cared much more about fundamental knowledge than about citations might well have invented mechanistic interpretability without any thought of safety or alignment. Similarly, an AI capabilities researcher at OpenAI who was sufficiently high-integrity might have whistleblown on the non-disparagement agreements even if they didn't have any "safety-aligned" motivations.

Also, AI safety researchers who have those traits won't have an attitude of "What?! Ok, fine" or "WTF! Alright you win" towards people who convince them that they're failing to achieve their goals, but rather an attitude more like "thanks for helping me". (To be clear, I'm not encouraging people to directly try to adopt a "thanks for helping me" mentality, since that's liable to create suppressed resentment, but it's still a pointer to a kind of mentality that's possible for people with sufficiently little internal conflict.) And in the ideal case, they will notice that there's something broken about their process for choosing what to work on, and rethink that in a more fundamental way (which may well lead them to conclusions similar to mine above).

Posts

Richard Ngo's Shortform (6y)
Book Announcement: The Gentle Romance (6d)
21st Century Civilization curriculum (26d)
Underdog bias rules everything around me (3mo)
On Pessimization (3mo)
Applying right-wing frames to AGI (geo)politics (4mo)
Well-foundedness as an organizing principle of healthy minds and societies (7mo)
Third-wave AI safety needs sociopolitical thinking (8mo)
Towards a scale-free theory of intelligent agency (8mo)
Elite Coordination via the Consensus of Power (8mo)
Trojan Sky (8mo)