Richard_Ngo

Former AI safety research engineer, now AI governance researcher at OpenAI. Blog: thinkingcomplete.com

Sequences

Stories
Meta-rationality
Replacing fear
Shaping safer goals
AGI safety from first principles

Wiki Contributions

Comments

The "average" is interpreted with respect to quality. Imagine that your only option is to create low-quality squiggles, or not to do so. In isolation, you'd prefer to produce them than not to produce them. But then you find out that the rest of the multiverse is full of high-quality squiggles. Do you still produce the low-quality squiggles? A total squigglean would; an average squigglean wouldn't.

The idea that maximally-coherent agents look like squiggle-maximizers raises the question: what would it look like for humans to become maximally coherent?

One answer, which Yudkowsky gives here, is that conscious experiences are just a "weird and more abstract and complicated pattern that matter can be squiggled into".

But that seems to be in tension with another claim he makes, that there's no way for one agent's conscious experiences to become "more real" except at the expense of other conscious agents—a claim which, according to him, motivates average utilitarianism across the multiverse.

Clearly a squiggle-maximizer would not be an average squigglean. So what's the disanalogy here? It seems like @Eliezer Yudkowsky is basically using SSA, but comparing between possible multiverses—i.e. when facing the choice between creating agent A or not, you look at the set of As in the multiverse where you decided yes, and compare it to the set of As in the multiverse where you decided no, and (if you're deciding for the good of A) you pick whichever one gives A a better time on average.

Yudkowsky has written before (can't find the link) that he takes this approach because alternatives would entail giving up on predictions about his future experiences—e.g. constantly predicting he's a Boltzmann brain and will dissolve in the next second. But this argument by Wei Dai shows that agents which reason in this way can be money-pumped by creating arbitrarily short-lived copies of them. Based on this I claim that Yudkowsky's preferences are incoherent, and that the only coherent thing to do here is to "expect to be" a given copy in proportion to the resources it will have available, as anthropic decision theory claims. (Incidentally, this also explains why we're at the hinge of history.)

But this is just an answer, it doesn't dissolve the problem. What could? Some wild guesses:

  1. You are allowed to have preferences about the external world, and you are allowed to have preferences about your "thread of experience"—you're just not allowed to have both. The incoherence comes from trying to combine the two; the coherent thing to do would be to put them into different agents, who will then end up in very different parts of the multiverse.
  2. Another way of framing this: you are allowed to be a decision-maker, and you are allowed to be a repository of welfare, but you're not allowed to be both (on pain of incoherence/being dutch-booked).
  3. Something totally different: the problem here is that we don't have intuitive experience of being agents which can copy themselves, shut down copies, re-merge, etc. If we did, then maybe SSA would seem as silly as expecting to end up in a different universe whenever we went to sleep.
  4. Actually, maybe the operative thing we lack experience with is not just splitting into different subagents, but rather merging together afterwards. What does it feel like to have been thousands of different parallel agents, and now be a single agent with their unified experiences? What sort of identity would one construct in that situation? Maybe this is an important part of dissolving the problem.

But if you think TAI is coming within 10 years (for example, if you think that the current half-life on worlds surviving is 10 years; if you think 10 years is the amount of time in which half of worlds are doomed)

Note that these are very different claims, both because the half-life for a given value is below its mean, and because TAI doesn't imply doom. Even if you do have very high P(doom), it seems odd to just assume everyone else does too.

then depending on your distribution-over-time you should absolutely not wait 5 years before doing research, because TAI could happen in 9 years but it could also happen in 1 year

So? Your research doesn't have to be useful in every possible world. If a PhD increases the quality of your research by, say, 3x (which is plausible, since research is heavy-tailed) then it may well be better to do that research for half the time.

(In general I don't think x-risk-motivated people should do PhDs that don't directly contribute to alignment, to be clear; I just think this isn't a good argument for that conclusion.)

"Well, since it's too late there," said the Scientist, "would you maybe agree with me that 'eternal returns' is a prediction derived by looking at observations in a simple way, and then doing some pretty simple reasoning on it; and that's, like, cool?  Even if that coolness is not the single overwhelming decisive factor in what to believe?"

"Depends exactly what you mean by 'cool'," said the Epistemologist.

"Okay, let me give it a shot," said the Scientist. "Suppose you model me as having a bunch of subagents who make trades on some kind of internal prediction market. The whole time I've been watching Ponzi Pyramid Incorporated, I've had a very simple and dumb internal trader who has been making a bunch of money betting that they will keep going up by 20%. Of course, my mind contains a whole range of other traders too, so this one isn't able to swing the market by itself, but what I mean by 'cool' is that this trader does have a bunch of money now! (More than others do, because in my internal prediction markets, simpler traders start off with more money.)"

"The problem," said the Epistemologist, "is that you're in an adversarial context, where the observations you're seeing have been designed to make that particular simple trader rich. In that context, you shouldn't be giving those simple traders so much money to start off with; they'll just continue being exploited until you learn better."

"But is that the right place to intervene? After all, my internal prediction market is itself an adversarial process. And so the simple internal trader who just predicts that things will continue going up the same amount every year will be exploited by other internal traders as soon as it dares to venture a bet on, say, the returns of the previous company that our friend the Spokesperson worked at. Indeed, those savvier traders might even push me to go look up that data (using, perhaps, some kind of internal action auction), in order to more effectively take the simple trader's money."

"You claim," said the Epistemologist, "to have these more sophisticated internal traders. Yet you started this conversation by defending the coolness, aka wealth, of the trader corresponding to the Spokesperson's predictions. So it seems like these sophisticated internal traders are not doing their work so well after all."

"They haven't taken its money yet," said the Scientist, "But they will before it gets a chance to invest any of my money. Nevertheless, your point is a good one; it's not very cool to only have money temporarily. Hmmm, let me muse on this."

The Scientist thinks for a few minutes, then speaks again.

"I'll try another attempt to describe what I mean by 'cool'. Often-times, clever arguers suggest new traders to me, and point out that those traders would have made a lot of money if they'd been trading earlier. Now, if I were an ideal Garrabrant inductor I would ignore these arguments, and only pay attention to these new traders' future trades. But I have not world enough or time for this; so I've decided to subsidize new traders based on how they would have done if they'd been trading earlier. Of course, though, this leaves me vulnerable to clever arguers inventing overfitted traders. So the subsidy has to be proportional to how likely it is that the clever arguer could have picked out this specific trader in advance. And for all Spokesperson's flaws, I do think that 5 years ago he was probably saying something that sounded reasonably similar to '20% returns indefinitely!' That is the sense in which his claim is cool."

"Hmm," said the Epistemologist. "An interesting suggestion, but I note that you've departed from the language of traders in doing so. I feel suspicious that you're smuggling something in, in a way which I can't immediately notice."

"Right, which would be not very cool. Alas, I feel uncertain about how to put my observation into the language of traders. But... well, I've already said that simple traders start off with more money. So perhaps it's just the same thing as before, except that when evaluating new traders on old data I put extra weight on simplicity when deciding how much money they start with—because now it also helps prevent clever arguers from fooling me (and potentially themselves) with overfitted post-hoc hypotheses."

("Parenthetically," added the Scientist, "there are plenty of other signals of overfitting I take into account when deciding how much to subsidize new traders—like where I heard about them, and whether they match my biases and society's biases, and so on. Indeed, there are enough such signals that perhaps it's best to think of this as a process of many traders bidding on the question of how easy/hard it would have been for the clever arguer to have picked out this specific trader in advance. But this is getting into the weeds—the key point is that simplicity needs to be extra-strongly-prioritized when evaluating new traders on past data.")

I'm sorry you regret reading it. A content warning seems like a good idea, I've added one now.

I feel kinda frustrated whenever "shard theory" comes up in a conversation, because it's not a theory, or even a hypothesis. In terms of its literal content, it basically seems to be a reframing of the "default" stance towards neural networks often taken by ML researchers (especially deep learning skeptics), which is "assume they're just a set of heuristics".

This is a particular pity because I think there's a version of the "shard" framing which would actually be useful, but which shard advocates go out of their way to avoid. Specifically: we should be interested in "subagents" which are formed via hierarchical composition of heuristics and/or lower-level subagents, and which are increasingly "goal-directed" as you go up the hierarchy. This is an old idea, FWIW; e.g. it's how Minsky frames intelligence in Society of Mind. And it's also somewhat consistent with the claim made in the original shard theory post, that "shards are just collections of subshards".

The problem is the "just". The post also says "shards are not full subagents", and that "we currently estimate that most shards are 'optimizers' to the extent that a bacterium or a thermostat is an optimizer." But the whole point of thinking about shards, in my mind, is that it allows us to talk about a gradual spectrum from "heuristic" to "agent", and how the combination of low-level heuristics may in fact give rise to high-level agents which pursue consequentialist goals. I talk about this in my post on value systematization—e.g. using the example of how normal human moral "shards" (like caring about other people's welfare) can aggregate into highly-consequentialist utilitarian subagents. In other words, shard advocates seem so determined to rebut the "rational EU maximizer" picture that they're ignoring the most interesting question about shards—namely, how do rational agents emerge from collections of shards?

(I make a similar point in the appendix of my value systematization post.)

Reply65543321

Is ECL the same thing as acausal trade?

Typically, no. “Acausal trade” usually refers to a different mechanism: “I do this thing for you if you do this other thing for me.” Discussions of acausal trade often involve the agents attempting to simulate each other. In contrast, ECL flows through direct correlation: “If I do this, I learn that you are more likely to also do this.” For more, see Christiano (2022)’s discussion of correlation versus reciprocity and Oesterheld, 2017, section 6.1.

I'm skeptical about the extent to which these are actually different things. Oesterheld says "superrationality may be seen as a special case of acausal trade in which the agents’ knowledge implies the correlation directly, thus avoiding the need for explicit mutual modeling and the complications associated with it". So at the very least, we can think of one as a subset of the other (though I think I'd actually classify it the other way round, with acausal trade being a special case of superrationality).

But it's not just that. Consider an ECL model that concludes: "my decision is correlated with X's decision, therefore I should cooperate". But this conclusion also requires complicated recursive reasoning—specifically, reasons for thinking that the correlation holds even given that you're taking the correlation into account when making your decision.

(E.g. suppose that you know that you were similar to X, except that you are doing ECL and X isn't. But then ECL might break the previous correlation between you and X. So actually the ECL process needs to reason "the outcome of the decision process I'm currently doing is correlated with the outcome of the decision process that they're doing", and I think realistically finding a fixed point probably wouldn't look that different from standard descriptions of acausal trade.)

This may be another example of the phenomenon Paul describes in his post on why EDT > CDT: although EDT is technically more correct, in practice you need to do something like CDT to reason robustly. (In this case ECL is EDT and acausal trade is more CDTish.)

FWIW for this sort of research I support a strong prior in favor of publishing.

The thing I'm picturing here is a futures contract where charizard-shirt-guy is obligated to deliver 3 trillion paperclips in exchange for one soul. And, assuming a reasonable discount rate, this is a better deal than only receiving a handful of paperclips now in exchange for the same soul. (I agree that you wouldn't want to invest in a current-market-price paperclip futures contract.)

Damn, MMDoom is a good one. New lore: it won the 2055 technique award.

Load More