Richard_Ngo

Former AI safety research engineer, now AI governance researcher at OpenAI. Blog: thinkingcomplete.com

Sequences

Stories
Meta-rationality
Replacing fear
Shaping safer goals
AGI safety from first principles


Comments

I'm not sure who you've spoken to, but at least among the people who I talk to regularly who I consider to be doing "serious AI policy work" (which admittedly is not everyone who claims to be doing AI policy work), I think nearly all of them have thought about ways in which regulation + regulatory capture could be net negative. At least to the point of being able to name the relatively "easy" ways (e.g., governments being worse at alignment than companies).

I don't disagree with this; when I say "thought very much" I mean e.g. to the point of writing papers about it, or even blog posts, or analyzing it in talks, or basically anything more than cursory brainstorming. Maybe I just haven't seen that stuff, idk.

This is particularly weird because your indexical probability then depends on what kind of bet you're offered. In other words, our marginal utility of money differs from our marginal utility of other things, and which one do you use to set your indexical probability? So this seems like a non-starter to me...

It seems pretty weird to me too, but to steelman: why shouldn't it depend on the type of bet you're offered? Your indexical probabilities can depend on any other type of observation you have when you open your eyes. E.g. maybe you see blue carpets, and you know that world A is 2x more likely to have blue carpets. And hearing someone say "and the bet is denominated in money not time" could maybe update you in an analogous way.
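(To spell out that update with made-up numbers: if you started at 50/50 between the two worlds, then seeing blue carpets multiplies your odds by the likelihood ratio, so

$$\text{posterior odds}(A{:}B) = 1 \times 2 = 2, \qquad P(A \mid \text{blue}) = \tfrac{2}{3}.$$

The suggestion is that hearing "the bet is denominated in money" could shift your indexical credence in the same mechanical way.)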

I mostly offer this in the spirit of "here's the only way I can see to reconcile subjective anticipation with UDT at all", not "here's something which makes any sense mechanistically or which I can justify on intuitive grounds".

My own interpretation of how UDT deals with anthropics (and I'm assuming ADT is similar) is "Don't think about indexical probabilities or subjective anticipation. Just think about measures of things you (considered as an algorithm with certain inputs) have influence over."

(Speculative paragraph, quite plausibly this is just nonsense.) Suppose you have copies A and B who are both offered the same bet on whether they're A. One way you could make this decision is to assign measure to A and B, then figure out what the marginal utility of money is for each of A and B, then maximize measure-weighted utility. Another way you could make this decision, though, is just to say "the indexical probability I assign to ending up as each of A and B is proportional to their marginal utility of money" and then maximize your expected money. Intuitively this feels super weird and unjustified, but it does make the "prediction" that we'd find ourselves in a place with high marginal utility of money, as we currently do.
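To spell out the equivalence I'm gesturing at (a rough sketch, assuming the bet is small enough that each copy's utility is approximately linear in money): let $m_i$ be the measure of copy $i$, $\mathrm{MU}_i$ its marginal utility of money, and $x_i(a)$ the money it ends up with under action $a$. Then

$$\arg\max_a \big[ m_A \mathrm{MU}_A\, x_A(a) + m_B \mathrm{MU}_B\, x_B(a) \big] = \arg\max_a \big[ p_A\, x_A(a) + p_B\, x_B(a) \big], \quad \text{where } p_i \propto m_i \mathrm{MU}_i,$$

since the two objectives differ only by a positive normalization constant. So "indexical probability proportional to marginal utility of money" picks the same bets as "maximize measure-weighted utility".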

(Of course "money" is not crucial here, you could have the same bet with "time" or any other resource that can be compared across worlds.)

I would say that under UDASSA, it's perhaps not super surprising to find ourselves when/where we are, because this seems likely to be a highly simulated time/scenario for a number of reasons (curiosity about ancestors, acausal games, getting philosophical ideas from other civilizations).

Fair point. By "acausal games" do you mean a generalization of acausal trade? (Acausal trade is the main reason I'd expect us to be simulated a lot.)

I don't actually think proponents of anti-x-risk AI regulation have thought very much about the ways in which regulatory capture might in fact be harmful to reducing AI x-risk. At least, I haven't seen much writing about this, nor has it come up in many of the discussions I've had (except insofar as I brought it up).

In general I am against arguments of the form "X is terrible but we have to try it because worlds that don't do it are even more doomed". I'll steal Scott Garrabrant's quote from here:

"If you think everything is doomed, you should try not to mess anything up. If your worldview is right, we probably lose, so our best out is the one where your your worldview is somehow wrong. In that world, we don't want mistaken people to take big unilateral risk-seeking actions.

Until recently, people with P(doom) of, say, 10%, have been natural allies of people with P(doom) of >80%. But the regulation that the latter group thinks is sufficient to avoid x-risk with high confidence has, on my worldview, a significant chance of either causing x-risk from totalitarianism, or else causing x-risk via governments being worse at alignment than companies would have been. How high? Not sure, but plausibly enough to make these two groups no longer natural allies.

A tension that keeps recurring when I think about philosophy is between the "view from nowhere" and the "view from somewhere", i.e. a third-person versus first-person perspective—especially when thinking about anthropics.

One version of the view from nowhere says that there's some "objective" way of assigning measure to universes (or people within those universes, or person-moments). You should expect to end up in different possible situations in proportion to how much measure your instances in those situations have. For example, UDASSA ascribes measure based on the simplicity of the computation that outputs your experience.
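(Roughly, the measure UDASSA assigns to an experience $e$ is its weight under the universal prior, something like

$$m(e) \;\propto\; \sum_{p \,:\, U(p) = e} 2^{-|p|} \;\approx\; 2^{-K(e)},$$

where $U$ is some fixed universal Turing machine, $|p|$ is the length of program $p$ in bits, and $K$ is Kolmogorov complexity. I'm eliding details here, e.g. what exactly it means for a program to "output your experience".)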

One version of the view from somewhere says that the way you assign measure across different instances should depend on your values. You should act as if you expect to end up in different possible future situations in proportion to how much power to implement your values the instances in each of those situations have. I'll call this the ADT approach, because that seems like the core insight of Anthropic Decision Theory. Wei Dai also discusses it here.

In some sense each of these views makes a prediction. UDASSA predicts that we live in a universe with laws of physics that are very simple to specify (even if they're computationally expensive to run), which seems to be true. Meanwhile the ADT approach "predicts" that we find ourselves at an unusually pivotal point in history, which also seems true.

Intuitively I want to say "yeah, but if I keep predicting that I will end up in more and more pivotal places, eventually that will be falsified". But... on a personal level, this hasn't actually been falsified yet. And more generally, acting on those predictions can still be positive in expectation even if they almost surely end up being falsified. It's a St Petersburg paradox, basically.
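To make the St Petersburg structure explicit (with toy numbers): suppose that with probability $2^{-n}$ you turn out to be in a situation $2^n$ times as pivotal, for $n = 1, 2, 3, \dots$. Then

$$\mathbb{E}[\text{pivotalness}] = \sum_{n \ge 1} 2^{-n} \cdot 2^{n} = \infty, \qquad \Pr[\text{pivotalness} \ge 2^{k}] = 2^{-(k-1)} \to 0,$$

so acting on "I'm at a wildly pivotal point" can be great in expectation even though, for any given threshold, you almost surely fall short of it.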

Very speculatively, then, maybe a way to reconcile the view from somewhere and the view from nowhere is via something like geometric rationality, which avoids St Petersburg paradoxes. And more generally, it feels like there's some kind of multi-agent perspective which says I shouldn't model all these copies of myself as acting in unison, but rather as optimizing for some compromise between all their different goals (which can differ even if they're identical, because of indexicality). No strong conclusions here but I want to keep playing around with some of these ideas (which were inspired by a call with @zhukeepa).
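(One concrete version of the fix: the geometric expectation $\exp(\mathbb{E}[\ln X])$ of the lottery above is finite, since $\mathbb{E}[\log_2 X] = \sum_{n \ge 1} n \cdot 2^{-n} = 2$, giving a geometric expectation of $4$ rather than infinity. Whether this is exactly the right formalization of geometric rationality for the anthropic case, I'm not sure.)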

This was all kinda rambly but I think I can summarize it as "Isn't it weird that ADT tells us that we should act as if we'll end up in unusually important places, and also we do seem to be in an incredibly unusually important place in the universe? I don't have a story for why these things are related but it does seem like a suspicious coincidence."

Suppose we replace "unconditional love" with "unconditional promise". E.g. suppose Alice has promised Bob that she'll make Bob dinner on Christmas no matter what. Now it would be clearly confused to say "Alice promised Bob Christmas dinner unconditionally, so presumably she promised everything else Christmas dinner as well, since it is only conditions that separate Bob from the worms".

What's gone wrong here? Well, the ontology humans use for coordinating with each other assumes the existence of persistent agents, and so when you say you unconditionally promise/love/etc a given agent, this implicitly assumes that we have a way of deciding which agents are "the same agent". No theory of personal identity is fully philosophically robust, of course, but if you object to that then you need to object not only to "I unconditionally love you" but also to any sentence which contains the word "you", since we don't have a complete theory of what that refers to.

A woman who leaves a man because he grew plump and a woman who leaves a man because he committed treason both possessed ‘conditional love’.

This is not necessarily conditional love; it's conditional care or conditional fidelity. You can love someone and still leave them; they don't have to outweigh everything else you care about.

But also: I think "I love you unconditionally" is best interpreted as a report of your current state, rather than a commitment to maintaining that state indefinitely.

The thing that distinguishes the coin case from the wind case is how hard it is to gather additional information, not how much more information could be gathered in principle. In theory you could run all sorts of simulations that would give you informative data about an individual flip of the coin; it's just that it would be really hard to do so, and very few people are able to. I don't think the entropy of the posterior captures this dynamic.

The variance over time depends on how you gather information in the future, making it less general. For example, I may literally never learn enough about meteorology to update my credence about the winds from 0.5. Nevertheless, there's still an important sense in which this credence is more fragile than my credence about coins, because I could update it.

I guess you could define it as something like "the variance if you investigated it further". But defining what it means to investigate further seems about as complicated as defining the reference class of people you're trading against. Also variance doesn't give you the same directional information—e.g. OP would bet on doom at 2% or bet against it at 16%.

Overall though, as I said above, I don't know a great way to formalize this, and would be very interested in attempts to do so.

Answer by Richard_Ngo, Apr 19, 2024

I don't think there's a very good precise way to do so, but one useful concept is bid-ask spreads, which are a way of protecting yourself from adverse selection of bets. E.g. consider the following two credences, both of which are 0.5.

  1. My credence that a fair coin will land heads.
  2. My credence that the wind tomorrow in my neighborhood will be blowing more northwards than southwards (I know very little about meteorology and have no recollection of which direction previous winds have mostly blown).

Intuitively, however, the former is very difficult to change, whereas the latter might swing wildly given even a little bit of evidence (e.g. someone saying "I remember in high school my teacher mentioned that winds often blow towards the equator.")

Suppose I have to decide on a policy that I'll accept bets for or against each of these propositions at X:1 odds (i.e. my opponent puts up $X for every $1 I put up). For the first proposition, I might set X to be 1.05, because as long as I have a small edge I'm confident I won't be exploited.

By contrast, if I set X=1.05 for the second proposition, then probably what will happen is that people will only decide to bet against me if they have more information than me (e.g. checking weather forecasts), and so they'll end up winning a lot of money from me. And so I'd actually want X to be something more like 2 or maybe higher, depending on who I expect to be betting against, even though my credence right now is 0.5.
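To spell out the arithmetic (with a made-up number for how informed my counterparties are): suppose anyone who bets against me on the wind question has checked a forecast that's right 2/3 of the time. Then conditional on someone actually taking the bet, I win only 1/3 of the time, so my expected profit per \$1 staked is

$$\tfrac{1}{3}X - \tfrac{2}{3},$$

which is non-negative only when $X \ge 2$. For the coin, by contrast, no amount of research moves a counterparty's edge away from 0.5, so my expected profit is $0.5X - 0.5$ and any $X > 1$ works.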

In your case, you might formalize this by talking about your bid-ask spread when trading against people who know about these bottlenecks.

I think the two things that felt most unhealthy were:

  1. The "no forgiveness is ever possible" thing, as you highlight. Almost all talk about ineradicable sin should, IMO, be seen as a powerful psychological attack.
  2. The "our sins" thing feels like an unhealthy form of collective responsibility—you're responsible even if you haven't done anything. Again, very suspect on priors.

Maybe this is more intuitive for rationalists if you imagine an SJW writing a song about how, even millions of years in the future, anyone descended from westerners should still feel guilt about slavery: "Our sins can never be undone. No single death will be forgiven." I think this is the psychological exploit that's screwed up leftism so much over the last decade, and it feels very analogous to what's happening in this song.
