All of Anthony DiGiovanni's Comments + Replies

I think it's pretty unclear that MSR is action-guiding for real agents trying to follow functional decision theory, because of Sylvester Kollin's argument in this post.

Tl;dr: FDT says, "Supposing I follow FDT, it is just implied by logic that any other instance of FDT will make the same decision as me in a given decision problem." But the idealized definition of "FDT" is computationally intractable for real agents. Real agents would need to find approximations for calculating expected utilities, and choose some way of mapping their sense data to the abstra... (read more)

Sorry, to be clear, I'm familiar with the topics you mention. My confusion is that ROSE bargaining per se seems to me pretty orthogonal to decision theory.

I think the ROSE post(s) are an answer to questions like, "If you want to establish a norm for an impartial bargaining solution such that agents following that norm don't have perverse incentives, what should that norm be?", or "If you're going to bargain with someone but you didn't have an opportunity for prior discussion of a norm, what might be a particularly salient allocation [because it has some nice properties], meaning that you're likely to zero-shot coordinate on that allocation?"

Yeah, mostly agreed. My main subquestion (that led me to write the review, besides this post being referenced in Leake's work) was/sort-of-still-is "Where do the ratios in value-handshakes come from?". The default (at least in the tag description quote from SSC) is uncertainty in war-winning, but that seems neither fully-principled nor nice-things-giving (small power differences can still lead to huge win-% chances, and superintelligences would presumably be interested in increasing accuracy). And I thought maybe ROSE bargaining could be related to that. The relation in my mind was less ROSE --> DT, and more ROSE --?--> value-handshakes --> value-changes --?--> DT.

Can you say more about what you think this post has to do with decision theory? I don't see the connection. (I can imagine possible connections, but don't think they're relevant.)

So there's a sorta-crux about how much DT alignment researchers would have to encode into the-AI-we-want-to-be-aligned, before that AI is turned on. Right now I'm leaning towards "an AI that implements CEV well, would either turn-out-to-have or quickly-develop good DT on its own", but I can see it going either way. (This was especially true yesterday when I wrote this review.) And I was trying to think through some of the "DT relevance to alignment" question, and I looked at relevant posts by Tamsin Leake (whose alignment research/thoughts I generally agree with). And that led me to thinking more about value-handshakes, son-of-CDT (see Arbital), and systems like ROSE bargaining. Any/all of which, under certain assumptions, could determine (or hint at) answers to the "DT relevance" thing.

I agree with the point that we shouldn’t model the AI situation as a zero-sum game. And the kinds of conditional commitments you write about could help with cooperation. But I don’t buy the claim that "implementing this protocol (including slowing down AI capabilities) is what maximizes their utility."

Here's a pedantic toy model of the situation, so that we're on the same page: The value of the whole lightcone going towards an agent’s values has utility 1 by that agent’s lights (and 0 by the other’s), and P(alignment success by someone) = 0 if both speed u... (read more)

It seems that what I was missing here was: mrcSSA disputes my premise that the evidence in fact is "*I* am in a white room, [created by God in the manner described in the problem setup], and have a red jacket"!

Rather, mrcSSA takes the evidence to be: "Someone is in a white room, [created by God in the manner described in the problem setup], and has a red jacket." Which is of course certain to be the case given either heads or tails.

(h/t Jesse Clifton for helping me see this)
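
Spelled out as a Bayes update, this is why mrcSSA stays at 50/50 here. This is only a sketch: E is a label introduced here for the evidence as mrcSSA reads it ("someone is in a white room and has a red jacket"), the prior on heads is taken to be 1/2 as in the problem setup, and \neg H stands for tails.

```latex
P(H \mid E)
  = \frac{P(E \mid H)\,P(H)}{P(E \mid H)\,P(H) + P(E \mid \neg H)\,P(\neg H)}
  = \frac{1 \cdot \tfrac{1}{2}}{1 \cdot \tfrac{1}{2} + 1 \cdot \tfrac{1}{2}}
  = \tfrac{1}{2}
```

Since E is certain under both heads and tails, the likelihoods cancel and the posterior equals the prior.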

Is God's coin toss with equal numbers a counterexample to mrcSSA?

I feel confused as to whether minimal-reference-class SSA (mrcSSA) actually fails God's coin toss with equal numbers (where "failing" by my lights means "not updating from 50/50"):

  • Let H = "heads world", W_{me} = "I am in a white room, [created by God in the manner described in the problem setup]", R_{me} = "I have a red jacket."
  • We want to know P(H | W_{me}, R_{me}).
  • First, P(R_{me} | W_{me}, H) and P(R_{me} | W_{me}, ~H) seem uncontroversial: Once I've already conditioned on my own existence i
... (read more)

I enjoyed this post and think it should help reduce confusion in many future discussions, thanks!

Some comments on your remarks about anthropics:

Different anthropic theories partially rely on metaphysical intuitions/stories about how centered worlds or observer moments are 'sampled', and have counterintuitive implications (e.g., the Doomsday argument for SSA and the Presumptuous philosopher for SIA).

I'm not sure why this is an indictment of "anthropic reasoning" per se, as if that's escapable. It seems like all anthropic theories are trying to answer a ques... (read more)

in earlier sections you argue that CDT agents might not adopt LDT-recommended policies and so will have problems with bargaining

That wasn’t my claim. I was claiming that even if you're an "LDT" agent, there's no particular reason to think all your bargaining counterparts will pick the Fair Policy given you do. This is because:

  1. Your bargaining counterparts won’t necessarily consult LDT.
  2. Even if they do, it’s super unrealistic to think of the decision-making of agents in high-stakes bargaining problems as entirely reducible to “do what [decision theory X]
... (read more)


It's true that you usually have some additional causal levers, but none of them are the exact same as "be the kind of person who does X."

Not sure I understand. It seems like "being the kind of person who does X" is a habit you cultivate over time, which causally influences how people react to you. Seems pretty analogous to the job candidate case.

if CDT agents often modify themselves to become an LDT/FDT agent then it would broadly seem accurate to say that CDT is getting outcompeted

See my replies to interstice's comment—I don't think "modifying themse... (read more)

Attempting to cultivate a habit is not the same as directly being that kind of person. The distinction may seem slight, but it’s worth keeping track of.

It seems plausible that there is no such thing as "correct" metaphilosophy, and humans are just making up random stuff based on our priors and environment and that's it and there is no "right way" to do philosophy, similar to how there are no "right preferences"

If this is true, doesn't this give us more reason to think metaphilosophy work is counterfactually important, i.e., can't just be delegated to AIs? Maybe this isn't what Wei Dai is trying to do, but it seems like "figure out which approaches to things (other than preferences) that don't have 'right ... (read more)

2 · Connor Leahy · 6mo
I think this is not an unreasonable position, yes. I expect the best way to achieve this would be to make global coordination and epistemology better/more coherent...which is bottlenecked by us running out of time, hence why I think the pragmatic strategic choice is to try to buy us more time. One of the ways I can see a "slow takeoff/alignment by default" world still going bad is that in the run-up to takeoff, pseudo-AGIs are used to hypercharge memetic warfare/mutation load to a degree basically every living human is just functionally insane, and then even an aligned AGI can't (and wouldn't want to) "undo" that.

You said "Bob commits to LDT ahead of time"

In the context of that quote, I was saying why I don't buy the claim that following LDT gives you advantages over committing to, in future problems, do stuff that's good for you to commit to do ex ante even if it would be bad for you ex post had you not been committed.

What is selected-for is being the sort of agent who, when others observe you, they update towards doing stuff that's good for you. This is distinct from being the sort of agent who does stuff that would have helped you if you had been able to shape o... (read more)

Yes, but isn't this essentially the same as LDT? It seems to me that different sections of your essay are inconsistent with each other, in that in earlier sections you argue that CDT agents might not adopt LDT-recommended policies and so will have problems with bargaining, but in the last section, you say that CDT agents are not at a competitive disadvantage because they can simply commit to act like LDT agents all the time. But if they so commit, the problems with bargaining won't come up. I think it would make more sense to argue that empirically, situations selecting for LDT simply won't arise (but then will arise and be important later). I don't quite understand what you mean here - are you saying that CDT agents will only cooperate if they think it will be causally beneficial, by causing them to have a good reputation with other agents? But we were discussing a case (counterfactual mugging) where they would want to pre-commit to act in ways that would be non-causally beneficial. So I think there would be selection to act non-causally in such cases (unless, again, you just think such situations will never arise, but that's a different argument). I don't see why you have to assume that your counterpart is strongly logically-linked with you; there are other reasons that you might not want to demand too much. Maybe you know their source code and can simulate that they will not accept a too-high demand. Or perhaps you think, based on empirical evidence or a priori reasoning, that most agents you might encounter will only accept a roughly fair allocation.

Yeah, this is a complicated question. I think some things can indeed safely be deferred, but less than you’re suggesting. My motivations for researching these problems:

  1. Commitment races problems seem surprisingly subtle, and off-distribution for general intelligences who haven’t reflected about them. I argued in the post that competence at single-agent problems or collective action problems does not imply competence at solving commitment races. If early AGIs might get into commitment races, it seems complacent to expect that they’ll definitely be better at
... (read more)

The key point is that "acting like an LDT agent" in contexts where your commitment causally influences others' predictions of your behavior, does not imply you'll "act like an LDT agent" in contexts where that doesn't hold. (And I would dispute that we should label making a commitment to a mutually beneficial deal as "acting like an LDT agent," anyway.) In principle, maybe the simplest generalization of the former is LDT. But if doing LDT things in the latter contexts is materially costly for you (e.g. paying in a truly one-shot Counterfactual Mugging), se... (read more)

You said "Bob commits to LDT ahead of time" in the paragraph I quoted; I was referring to that. I think a CDT agent would pre-commit to paying in a one-off Counterfactual Mugging since they have a 50% chance of gaining $10000 and a 50% chance of losing $100. Or if they don't know that a Counterfactual Mugging is going to happen, they'd have an incentive to broadly pre-commit to pay out in similar situations (essentially, acting like an LDT agent). Or if they won't do either of those things, they will get less future expected resources than an LDT agent. Same as above, I think it's either the case that CDT agents would tend to make pre-commitments to act LDT-like in such situations, or will lose expected resources compared to LDT agents. You can't have your CDT cake and eat it too!
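
For concreteness, the arithmetic behind the pre-commitment claim can be sketched in Python. The $10000 and $100 figures and the 50/50 split come from the comment; the assumptions added here are that the predictor is perfectly reliable and that refusing yields exactly 0 in both branches. The helper name `expected_value` is just illustrative.

```python
def expected_value(outcomes):
    """Expected value of a list of (probability, payoff) pairs."""
    assert abs(sum(p for p, _ in outcomes) - 1.0) < 1e-9
    return sum(p * payoff for p, payoff in outcomes)

# Ex ante (before the coin flip): compare the two policies.
# Assumption: a perfectly reliable predictor, so the policy fixes both branches.
commit_to_pay = expected_value([(0.5, 10_000), (0.5, -100)])
refuse_to_pay = expected_value([(0.5, 0), (0.5, 0)])

# Ex post (the coin has already landed on the losing side): paying just costs $100.
pay_after_losing = -100
refuse_after_losing = 0

print(commit_to_pay, refuse_to_pay)           # 4950.0 0.0
print(pay_after_losing, refuse_after_losing)  # -100 0
```

The gap between the two views (4950 ex ante in favor of committing, 100 ex post in favor of refusing) is the tension this thread is arguing about.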

Exploitation means the exploiter benefits. If you are a rock, you can't be exploited. If you are an agent who never gives in to threats, you can't be exploited (at least by threats, maybe there are other kinds of exploitation). That said, yes, if the opponent agents are the sort to do nasty things to you anyway even though it won't benefit them, then you might get nasty things done to you. You wouldn't be exploited, but you'd still be very unhappy.

Cool, I think we basically agree on this point then, sorry for misunderstanding. I just wanted to emphasize th... (read more)

It also has a deontological or almost-deontological constraint that prevents it from getting exploited.

I’m not convinced this is robustly possible. The constraint would prevent this agent from getting exploited conditional on the potential exploiters best-responding (being "consequentialists"). But it seems to me the whole heart of the commitment races problem is that the potential exploiters won’t necessarily do this, indeed depending on their priors they might have strong incentives not to. (And they might not update those priors for fear of losing barga... (read more)

2 · Daniel Kokotajlo · 7mo
Exploitation means the exploiter benefits. If you are a rock, you can't be exploited. If you are an agent who never gives in to threats, you can't be exploited (at least by threats, maybe there are other kinds of exploitation). That said, yes, if the opponent agents are the sort to do nasty things to you anyway even though it won't benefit them, then you might get nasty things done to you. You wouldn't be exploited, but you'd still be very unhappy. So no, I don't think the constraint I proposed would only work if the opponent agents were consequentialists. Adopting the strategy does not assume one's bargaining counterparts will be consequentialists. However, if you are a consequentialist, then you'll only adopt the strategy if you think that sufficiently few of the agents you will later encounter are of the aforementioned nasty sort--which, by the logic of commitment races, is not guaranteed; it's plausible that at least some of the agents you'll encounter are 'already committed' to being nasty to you unless you surrender to them, such that you'll face much nastiness if you make yourself inexploitable. This is my version of what you said above, I think. And yeah to put it in my ontology, some exploitation-resistant strategies might be wasteful/clumsy/etc. and depending on how nasty the other agents are, maybe most or even all exploitation-resistant strategies are more trouble than they are worth (from a consequentialist perspective; note that nonconsequentialists might have additional reasons to go for exploitation-resistant strategies. Also note that even consequentialists might assign intrinsic value to justice, fairness, and similar concepts.) But like I said, I'm overall optimistic -- not enough to say "there's no problem here," it's enough of a problem that it's one of my top priorities (and maybe my top priority?) but I still do expect the sort of society AGIs construct will be at least as cooperatively-competent / good-at-coordinating-diverse-agents-with-di

The second mover ALREADY had the option not to commit - they could just swerve or crash, according to their decision theory.

The premise here is that the second-mover decided to commit soon after the first-mover did, because the proof of the first-mover's initial commitment didn't reach the second-mover quickly enough. They could have not committed initially, but they decided to do so because they had a chance of being first.

I'm not sure exactly what you mean by "according to their decision theory" (as in, what this adds here).

if it doesn't change the seque

... (read more)
That's a very critical deviation from the standard problem statement, which should be made very clear.  Also, re-reading the timeline, it appears to introduce side-payments (at 0:37 in the timeline), which is also a MAJOR deviation from the standard problem. These two things (speed of information and ability to negotiate outside of the given payoff matrix) should be separated - both are fairly easy to model, and there will be much simpler solutions to integrate each of them into the decisions, which will be better than the combination of the two limited to a revocation window.

better off having a "real" commitment than a revocable commitment that Bob can talk her out of

I'm confused what you mean here. In principle Alice can revoke her commitment before the freeze time in this protocol, but Bob can't force her to do so. And if it's common knowledge that Alice's freeze time comes before Bob's, then: Since Alice knows that there will be a window after her freeze time where Bob knows Alice's commitment is frozen, and Bob has a chance to revert, then there would be no reason (barring some other commitment mechanism, including Bob bei... (read more)

The second mover ALREADY had the option not to commit - they could just swerve or crash, according to their decision theory.  The revocation period doesn't actually change payouts or decision mechanisms, and if it doesn't change the sequence of commitment, I don't see how it makes any difference at all.  If it DOES change the sequence of commitment, then the first-mover would prefer not to lose their advantage, and will just use a non-revocable commitment. It seems like this is introducing some sort of information or negotiation into the decisions, but I don't see how. In MANY such games, allowing side-payments or outside-of-game considerations can find better outcomes.  This doesn't do that, as far as I can see.

I'd recommend checking out this post critiquing this view, if you haven't read it already. Summary of the counterpoints:

  • (Intent) alignment doesn't seem sufficient to ensure an AI makes safe decisions about subtle bargaining problems in a situation of high competitive pressure with other AIs. I don't expect the kinds of capabilities progress that is incentivized by default to suffice for us to be able to defer these decisions to the AI, especially given path-dependence on feedback from humans who'd be pretty naïve about this stuff. (C.f. this post—you need
... (read more)

Claims about counterfactual value of interventions given AI assistance should be consistent

A common claim I hear about research on s-risks is that it’s much less counterfactual than alignment research, because if alignment goes well we can just delegate it to aligned AIs (and if it doesn’t, there’s little hope of shaping the future anyway).

I think there are several flaws with this argument that require more object-level context (see this post).[1] But at a high level, this consideration—that research/engineering can be delegated to AIs that pose ... (read more)

primarily because models will understand the base goal first before having world modeling

Could you say a bit more about why you think this? My definitely-not-expert expectation would be that the world-modeling would come first, then the "what does the overseer want" after that, because that's how the current training paradigm works: pretrain for general world understanding, then finetune on what you actually want the model to do.

Admittedly, I got that from the "Deceptive alignment is <1% likely" post. Even if you don't believe that post, "Pretraining from human preferences" shows that instilling alignment with human values first as a base goal (thus outer-aligning the model), before giving it world-modeling capabilities, works wonders for alignment and has many benefits compared to RLHF. Given that it has a low alignment tax, I suspect that there's a 50-70% chance that this plan, or a successor, will be adopted for alignment. Here's the post:

"I am devoting my life to solving the most important problems in the world and alleviating as much suffering as possible" fits right into the script. That's exactly the kind of thing you are supposed to be thinking. If you frame your life like that, you will fit in and everyone will understand and respect what is your basic deal.

Hm, this is a pretty surprising claim to me. It's possible I haven't actually grown up in a "western elite culture" (in the U.S., it might be a distinctly coastal thing, so the cliché goes? IDK). Though, I presume having gone to so... (read more)

A model that just predicts "what the 'correct' choice is" doesn't seem likely to actually do all the stuff that's instrumental to preventing itself from getting turned off, given the capabilities to do so.

But I'm also just generally confused whether the threat model here is, "A simulated 'agent' made by some prompt does all the stuff that's sufficient to disempower humanity in-context, including sophisticated stuff like writing to files that are read by future rollouts that generate the same agent in a different context window," or "The RLHF-trained model has goals that it pursues regardless of the prompt," or something else.

confused claims that treat (base) GPT3 and other generative models as traditional rational agents

I'm pretty surprised to hear that anyone made such claims in the first place. Do you have examples of this?

I think this mainly comes up in person with people who've just read the intro AI Safety materials, but one example on LW is "What exactly is GPT-3's base objective?"

I think you might be misunderstanding Jan's understanding. A big crux in this whole discussion between Eliezer and Richard seems to be: Eliezer believes any AI capable of doing good alignment research—at least good enough to provide a plan that would help humans make an aligned AGI—must be good at consequentialist reasoning in order to generate good alignment plans. (I gather from Nate's notes in that conversation plus various other posts that he agrees with Eliezer here, but not certain.) I strongly doubt that Jan just mistook MIRI's focus on understandin... (read more)

2 · Ramana Kumar · 1y
I think you're right - thanks for this! It makes sense now that I recognise the quote was in a section titled "Alignment research can only be done by AI systems that are too dangerous to run".

I agree with your guesses.

I am not sure that "controlling for game-theoretic instrumental reasons" is actually a move that is well defined/makes sense.

I don't have a crisp definition of this, but I just mean that, e.g., we compare the following two worlds: (1) 99.99% of agents are non-sentient paperclippers, and each agent has equal (bargaining) power. (2) 99.99% of agents are non-sentient paperclippers, and the paperclippers are all confined to some box. According to plenty of intuitive-to-me value systems, you only (maybe) have reason to increase papercl... (read more)

Ah right, thanks! (My background is more stats than comp sci, so I'm used to "indicator" instead of "predicate.")

Let's pretend that you are a utilitarian. You want to satisfy everyone's goals

This isn't a criticism of the substance of your argument, but I've come across a view like this one frequently on LW so I want to address it: This seems like a pretty nonstandard definition of "utilitarian," or at least, it's only true of some kinds of preference utilitarianism.

I think utilitarianism usually refers to a view where what you ought to do is maximize a utility function that (somehow) aggregates a metric of welfare across individuals, not their goal-satisfaction. Kick... (read more)

6 · Scott Garrabrant · 1y
Agreed. I should have had disclaimer that I was talking about preference utilitarianism.  I am not sure what is true about what most people think. My guess is that most philosophers who identify with utilitarianism mean welfare.  I would guess that most readers of LessWrong would not identify with utilitarianism, but would say they identify more with preference utilitarianism than welfare utilitarianism. My guess is that a larger (relative to LW) proportion of EAs identify with utilitarianism, and also they identify with the welfare version (relative to preference version) more than LW, but I have a lot of uncertainty about how much. (There is probably some survey data that could answer this question. I haven't checked.) Also, I am not sure that "controlling for game-theoretic instrumental reasons" is actually a move that is well defined/makes sense.

I think in the social choice literature, people almost always mean preference utilitarianism when they say "utilitarianism", whereas in the philosophical/ethics literature people are more likely to mean hedonic utilitarianism. I think the reason for this is that in the social choice and somewhat adjacent game (and decision) theory literature, utility functions have a fairly solid foundation as a representation of preferences of rational agents. (For example, Harsanyi's "[preference] utilitarian theorem" paper and Nash's paper on the Nash bargaining solutio... (read more)

Basic questions: If the type of Adv(M) is a pseudo-input, as suggested by the above, then what does Adv(M)(x) even mean? What is the event whose probability is being computed? Does the unacceptability checker C also take real inputs as the second argument, not just pseudo-inputs—in which case I should interpret a pseudo-input as a function that can be applied to real inputs, and Adv(M)(x) is the statement "A real input x is in the pseudo-input (a set) given by... (read more)

The idea is that we're thinking of pseudo-inputs as “predicates that constrain X” here, so, for α ∈ X_{pseudo}, we have α : X → B.
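
A minimal Python sketch of that typing, under loud assumptions: `X` is stood in for by `int`, B is read as the Booleans, and `adv` and `model` are hypothetical placeholders rather than anything from the post. The only point is that Adv(M) returns a predicate, so Adv(M)(x) is just that predicate applied to a real input x.

```python
from typing import Callable

X = int                            # stand-in for the real input space (assumption)
PseudoInput = Callable[[X], bool]  # a pseudo-input alpha : X -> B

def model(x: X) -> X:
    """Hypothetical placeholder model M; its behavior is irrelevant here."""
    return x

def adv(m: Callable[[X], X]) -> PseudoInput:
    """Hypothetical adversary: returns a predicate picking out a set of inputs."""
    return lambda x: x < 0         # illustrative choice: "x is negative"

alpha = adv(model)                 # alpha is a pseudo-input, i.e. a predicate
print(alpha(-3), alpha(5))         # True False
```

On this reading, `alpha(x)` answers "does the real input x satisfy the pseudo-input?", which matches the set-membership interpretation raised in the question.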

This is a risk worth considering, yes. It’s possible in principle to avoid this problem by “committing” (to the extent that humans can do this) to both (1) train the agent to make the desired tradeoffs between the surrogate goal and original goal, and (2) not train the agent to use a more hawkish bargaining policy than it would’ve had without surrogate goal training. (And to the extent that humans can’t make this commitment, i.e., we make honest mistakes in (2), the other agent doesn’t have an incentive to punish those mistakes.)

If the developers... (read more)

Thanks. I just wasn't sure if I was missing something. :)

Why is this post tagged "transparency / interpretability"? I don't see the connection.

Tagging is crowdsourced. If something seems wrong, just vote down the relevance, and if it's 0 or lower, the tag gets removed.

I think this is an important question, and this case for optimism can be a bit overstated when one glosses over the practical challenges to verification. There's plenty of work on open-source game theory out there, but to my knowledge, none of these papers really discuss how one agent might gain sufficient evidence that it has been handed the other agent's actual code.

We wrote this part under the assumption that AGIs might be able to just figure out these practical challenges in ways we can't anticipate, which I think is plausible. But of course, an AGI mi... (read more)

Hmm, if A is simulating B with B's source code, couldn’t the simulated B find out it's being simulated and lie about its decisions or hide its actual preferences? Or would its actual preferences be derivable from its weights or code directly without simulation?
An AGI could give read and copy access to the code being run and the weights directly on the devices from which the AGI is communicating. That could still be a modified copy of the original and more powerful (or with many unmodified copies) AGI, though. So, the other side may need to track all of the copies, maybe even offline ones that would go online on some trigger or at some date. Also, giving read and copy access could be dangerous to the AGI if it doesn't have copies elsewhere.

Thanks for this! I agree that inter-agent safety problems are highly neglected, and that it's not clear that intent alignment or the kinds of capability robustness incentivized by default will solve (or are the best ways to solve) these problems. I'd recommend looking into Cooperative AI, and the "multi/multi" axis of ARCHES.

This sequence discusses similar concerns—we operationalize what you call inter-agent alignment problems as either:

  1. Subsets of capability robustness, because if an AGI wants to achieve X in some multi-agent environment, then accounting f
... (read more)

(Speaking for myself as a CLR researcher, not for CLR as a whole)

I don't think it's accurate to say CLR researchers think increasing transparency is good for cooperation. There are some tradeoffs here, such that I and other researchers are currently uncertain whether marginal increases in transparency are net good for AI cooperation. Though, it is true that more transparency opens up efficient equilibria that wouldn't have been possible without open-source game theory. (ETA: some relevant research by people (previously) at CLR here, here, and here.)

I like that this post clearly argues for some reasons why we might expect deception (and similar dynamics) to not just be possible in the sense of getting equal training rewards, but to actually provide higher rewards than the honest alternatives. This positively updates my probability of those scenarios.

I notice that I strongly disagree with a majority of them (#1, #2, #4, #8, #10, #11, #13, #14, #15, #17, #18, #21)

Re: #2, what do you consider to be The Bad other than suffering?

On my picture, I think a key variable is the length of time between when-we-understand-the-basic-shape-of-things-that-will-get-to-AGI and when-it-reaches-strong-superintelligence.

I don't understand why you think the sort of capabilities research done by alignment-conscious people contributes to lengthening this time. In particular, what reason do you have to think they're not advancing the second time point as much as the first? Could you spell that out more explicitly?

They can read each other's source code, and thus trust much more deeply!

Being able to read source code doesn't automatically increase trust—you also have to be able to verify that the code being shared with you actually governs the AGI's behavior, despite that AGI's incentives and abilities to fool you.

(Conditional on the AGIs having strongly aligned goals with each other, sure, this degree of transparency would help them with pure coordination problems.)

It feels to me like “have humans try to get to know the AIs really well by observing their behaviors, so that they’re able to come up with inputs where the AIs will be tempted to do bad things, so that we can do adversarial training” is probably worth including in the smorgasbord of techniques we use to try to prevent our AIs from being deceptive

Maybe I missed something here, but how is this supposed to help with deception? I thought the whole reason deceptive alignment is really hard to solve is that you can't tell if the AI's being deceptive via its behavior.

That all sounds fair. I've seen rationalists claim before that it's better for "interesting" things (in the literal sense) to exist than not, even if nothing sentient is interested by them, so that's why I assumed you meant the same.

Why does the person asking this question care about whether "interesting"-to-humans things happen, in a future where no humans exist to find them interesting?

1 · David Udell · 2y
Because beings like us (in some relevant respects) that outlive us might carry the torch of our values into that future! Don't read too much into the word "interesting" here: I just meant "valuable by our lights in some respect, even if only slightly." Sure, it sucks if humanity doesn't live to be happy in the distant future, but if some other AGI civilization is happy in that future, I prefer that to an empty paperclip-maximized universe without any happiness.

Perhaps the crux here is whether we should expect all superintelligent agents to converge on the same decision procedure, and whether the agents themselves will expect this, such that they'll coordinate by default? As sympathetic as I am to realism about rationality, I put a pretty nontrivial credence on the possibility that this convergence just won't occur, and persistent disagreement (among well-informed people) about the fundamentals of what it means to "win" in decision theory thought experiments is evidence of this.

From the beginning, I invented timeless decision theory because of being skeptical that two perfectly sane and rational hyperintelligent beings with common knowledge about each other would have no choice but mutual defection in the oneshot prisoner's dilemma. I suspected they would be able to work out Something Else Which Is Not That, so I went looking for it myself.

I don't see how this makes the point you seem to want it to make. There's still an equilibrium selection problem for a program game of one-shot PD: some other agent might have the program ...

Sort of! This paper (of which I’m a coauthor) discusses this “unraveling” argument, and the technical conditions under which it does and doesn’t go through. Briefly:

  • It’s not clear how easy it is to demonstrate military strength in the context of an advanced AI civilization, in a way that can be verified / can’t be bluffed. If I see that you’ve demonstrated high strength in some small war game, but my prior on you being that strong is sufficiently low, I’ll probably think you’re bluffing and wouldn’t be that strong in the real large-scale conflict.
  • Supposing
...
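The bluffing point in the first bullet can be sketched as a Bayes update. This is an illustration under made-up numbers, not a calculation from the paper: the function name `posterior_strong` and all probabilities below are assumptions chosen only to show the shape of the argument.

```python
def posterior_strong(prior_strong: float,
                     p_show_if_strong: float,
                     p_show_if_weak: float) -> float:
    """Posterior probability that the other side is truly strong,
    given that it put on an impressive small-scale demonstration.

    All three inputs are hypothetical quantities:
      prior_strong      -- my prior that they are strong
      p_show_if_strong  -- chance a strong side produces the demonstration
      p_show_if_weak    -- chance a weak side can fake (bluff) the same demonstration
    """
    evidence_strong = prior_strong * p_show_if_strong
    evidence_weak = (1.0 - prior_strong) * p_show_if_weak
    return evidence_strong / (evidence_strong + evidence_weak)


# A demonstration that weak sides can often fake barely moves a low prior.
easy_to_bluff = posterior_strong(0.05, 0.9, 0.6)

# A demonstration that weak sides can almost never fake moves it a lot.
hard_to_bluff = posterior_strong(0.05, 0.9, 0.01)
```

With these assumed numbers, a prior of 0.05 rises only to about 0.07 when the demonstration is easy to fake, versus about 0.83 when it is hard to fake. The sketch treats verifiability as a single likelihood ratio, which is a simplification.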
I was actually thinking of the cost of physical demonstrations, and/or the cost of convincing others that simulations are accurate[1], not so much direct simulation costs. That being said, this is still a valid point, just not one that I should be credited for.

1. ^ Imagine trying to convince someone of atomic weapons purely with simulations, without anyone ever having detonated one[2], for instance. It may be doable; it'd be nowhere near cheap. Now imagine trying to do so without allowing the other side to figure out how to make atomic bombs in the process...
2. ^ To be clear: as in alt-history-style 'Trinity / etc never happened'. Not just as in someone today convincing another that their particular atomic weapon works.

The amount of EV at stake in my (and others') experiences over the next few years/decades is just too small compared to the EV at stake in the long-term future.

AI alignment isn't the only option to improve the EV of the long-term future, though.

I think “the very repugnant conclusion is actually fine” does pretty well against its alternatives. It’s totally possible that our intuitive aversion to it comes from just not being able to wrap our brains around some aspect of (a) how huge the numbers of “barely worth living” lives would have to be, in order to make the very repugnant conclusion work; (b) something that is just confusing about the idea of “making it possible for additional people to exist.”

While this doesn't sound crazy to me, I'm skeptical that my anti-VRC intuitions can be explained by ...

I feel confused as to how step (3) is supposed to work, especially how "having the training be done by the model being trained given access to tools from (2)" is a route to this.

At some step in the amplification process, we'll have systems that are capable of deception, unlike the base case. So it seems that if we let the model train its successor using the myopia-verification tools, we need some guarantee that the successor is non-deceptive in the first place. (Otherwise the myopia-verification tools aren't guaranteed to work, as you note in the bullet po...

Thanks, this makes it pretty clear to me how alignment could be fundamentally hard besides deception. (The problem seems to hold even if your values are actually pretty simple; e.g. if you're a pure hedonistic utilitarian and you've magically solved deception, you can still fail at outer alignment by your AI optimizing for making it look like there's more happiness and less suffering.)

Some (perhaps basic) notes to check that I've understood this properly:

  • The Bayes net running example per se isn't really necessary for ELK to be a problem.
    • The basic problem i
...

Not a direct answer to your question, but I want to flag that using "AI alignment" to mean "AI [x-risk] safety" seems like a mistake. Alignment means getting the AI to do what its principal/designer wants, which is not identical to averting AI x-risks (much less s-risks). There are plausible arguments that this is sufficient to avert such risks, but it's an open question, so I think equating the two is confusing.

Thanks for flagging this.

  1. I presumed that "AI alignment" was being used as shorthand for x-risks from AI, but I hadn't considered this distinction. I'm also not aware that anyone from the rationality community whom I've seen express this kind of statement really meant for "AI alignment" to mean all x-risks from AI. That's my mistake. I'll presume they're referring only to the control problem and edit my post to clarify that.
  2. As I understand it, s-risks are a sub-class of x-risks, as an existential risk is not only an extinction risk but any risk of the future trajectory of Earth-originating intelligence being permanently and irreversibly altered for the worse.

Steven Byrnes (2y):
I agree, one can conceive of AGI safety without alignment (e.g. if boxing worked), and one can conceive of alignment without safety (e.g. if the AI is "trying to do the right thing" but is careless or incompetent or whatever). I usually use the term "AGI Safety" when describing my job, but the major part of it is thinking about the alignment problem.

Something I'm wondering, but don't have the expertise in meta-learning to say confidently (so, epistemic status: speculation, and I'm curious for critiques): extra OOMs of compute could overcome (at least) one big bottleneck in meta-learning, the expense of computing second-order gradients. My understanding is that most methods just ignore these terms or use crude approximations, like this, because they're so expensive. But at least this paper found some pretty impressive performance gains from using the second-order terms.
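To show what "ignoring the second-order terms" means, here is a minimal one-dimensional sketch. It assumes a MAML-style setup (one inner gradient step, then an outer loss on the adapted parameter) with quadratic task losses of the form 0.5 * a * (theta - c)^2. That setup and every name below are assumptions for illustration, not the method of the paper mentioned above.

```python
def inner_step(theta, a_train, c_train, alpha):
    # One gradient step on the training loss 0.5 * a_train * (theta - c_train)**2.
    return theta - alpha * a_train * (theta - c_train)


def outer_loss(theta, a_train, c_train, a_val, c_val, alpha):
    # Validation loss evaluated at the adapted parameter.
    adapted = inner_step(theta, a_train, c_train, alpha)
    return 0.5 * a_val * (adapted - c_val) ** 2


def first_order_meta_grad(theta, a_train, c_train, a_val, c_val, alpha):
    # Approximation: treat d(adapted)/d(theta) as 1, dropping the Hessian term.
    adapted = inner_step(theta, a_train, c_train, alpha)
    return a_val * (adapted - c_val)


def full_meta_grad(theta, a_train, c_train, a_val, c_val, alpha):
    # Exact chain rule: d(adapted)/d(theta) = 1 - alpha * (Hessian of the training loss),
    # and the Hessian of this quadratic training loss is just a_train.
    adapted = inner_step(theta, a_train, c_train, alpha)
    return a_val * (adapted - c_val) * (1.0 - alpha * a_train)


def finite_difference(theta, a_train, c_train, a_val, c_val, alpha, h=1e-5):
    # Numerical check of the true gradient of outer_loss with respect to theta.
    up = outer_loss(theta + h, a_train, c_train, a_val, c_val, alpha)
    down = outer_loss(theta - h, a_train, c_train, a_val, c_val, alpha)
    return (up - down) / (2.0 * h)
```

With `theta=2.0, a_train=3.0, c_train=1.0, a_val=1.0, c_val=0.0, alpha=0.1`, the first-order estimate is 1.7 while the full meta-gradient is about 1.19, which matches the finite-difference check. In one dimension the Hessian is a single number, so this hides the cost. With many parameters the same factor becomes a Hessian-vector product through the inner update, and that is where the extra compute goes.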

Maybe throwing lots of compute at...
