I'm Anthony DiGiovanni, a suffering-focused AI safety researcher at the Center on Long-Term Risk. I (occasionally) write about altruism-relevant topics on my blog, Ataraxia. All opinions my own.

primarily because models will understand the base goal first before having world modeling

Could you say a bit more about why you think this? My definitely-not-expert expectation would be that the world-modeling would come first, then the "what does the overseer want" after that, because that's how the current training paradigm works: pretrain for general world understanding, then finetune on what you actually want the model to do.
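For concreteness, here's the ordering I have in mind as toy pseudocode (a minimal sketch; the function names and dict representation are mine, purely illustrative, not any real training API):

```python
# Toy sketch of the standard two-stage pipeline: world modeling is
# acquired in pretraining, and "what does the overseer want" only
# afterwards, in finetuning.

def pretrain(corpus):
    """Stage 1: next-token prediction over a broad corpus.
    What the model picks up here is general world modeling."""
    return {"world_model": f"statistics of {len(corpus)} documents"}

def finetune(model, overseer_feedback):
    """Stage 2: instruction tuning / RLHF. Only now is the model
    pushed toward the base goal the overseer actually wants."""
    model["base_goal"] = f"fit {len(overseer_feedback)} preference labels"
    return model

model = pretrain(["doc1", "doc2", "doc3"])
model = finetune(model, ["label1", "label2"])

# The dict's insertion order mirrors the training order:
print(list(model))  # ['world_model', 'base_goal']
```

The point of the sketch is just the ordering: under the current paradigm, the capabilities acquired in stage 1 precede any representation of the overseer's goal.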

"I am devoting my life to solving the most important problems in the world and alleviating as much suffering as possible" fits right into the script. That's exactly the kind of thing you are supposed to be thinking. If you frame your life like that, you will fit in and everyone will understand and respect what is your basic deal.

Hm, this is a pretty surprising claim to me. It's possible I haven't actually grown up in a "western elite culture" (in the U.S., it might be a distinctly coastal thing, so the cliché goes? IDK). Though, I presume having gone to some fancypants universities in the U.S. makes me close enough to that. The Script very much did not encourage me to devote my life to solving the most important problems and alleviating as much suffering as possible, and it seems not to have encouraged basically any of my non-EA friends from university to do this, either. We were encouraged to have socially valuable careers, to be sure, but not to treat them as the main source of purpose in our lives or as a big moral responsibility.

A model that just predicts "what the 'correct' choice is" doesn't seem likely to actually do all the stuff that's instrumental to preventing itself from getting turned off, given the capabilities to do so.

But I'm also just generally confused about whether the threat model here is, "A simulated 'agent' made by some prompt does all the stuff that's sufficient to disempower humanity in-context, including sophisticated stuff like writing to files that are read by future rollouts that generate the same agent in a different context window," or "The RLHF-trained model has goals that it pursues regardless of the prompt," or something else.

confused claims that treat (base) GPT3 and other generative models as traditional rational agents

I'm pretty surprised to hear that anyone made such claims in the first place. Do you have examples of this?

I think you might be misunderstanding Jan's understanding. A big crux in this whole discussion between Eliezer and Richard seems to be: Eliezer believes any AI capable of doing good alignment research (at least good enough to provide a plan that would help humans make an aligned AGI) must itself be good at consequentialist reasoning. (I gather from Nate's notes in that conversation, plus various other posts, that he agrees with Eliezer here, but I'm not certain.) I strongly doubt that Jan simply mistook MIRI's focus on understanding consequentialist reasoning for a belief that alignment research requires being a consequentialist reasoner.

I agree with your guesses.

I am not sure that "controlling for game-theoretic instrumental reasons" is actually a move that is well defined/makes sense.

I don't have a crisp definition of this, but I just mean that, e.g., we compare the following two worlds: (1) 99.99% of agents are non-sentient paperclippers, and each agent has equal (bargaining) power. (2) 99.99% of agents are non-sentient paperclippers, and the paperclippers are all confined to some box. According to plenty of intuitive-to-me value systems, you only (maybe) have reason to increase paperclips in (1), not (2). But if the paperclippers felt really sad about the world not having more paperclips, I'd care—to an extent that depends on the details of the situation—about increasing paperclips even in (2).

Ah right, thanks! (My background is more stats than comp sci, so I'm used to "indicator" instead of "predicate.")

Let's pretend that you are a utilitarian. You want to satisfy everyone's goals

This isn't a criticism of the substance of your argument, but I've come across a view like this one frequently on LW so I want to address it: This seems like a pretty nonstandard definition of "utilitarian," or at least, it's only true of some kinds of preference utilitarianism.

I think utilitarianism usually refers to a view where what you ought to do is maximize a utility function that (somehow) aggregates a metric of welfare across individuals, not their goal-satisfaction. Kicking a puppy without me knowing about it thwarts my goals, but (at least on many reasonable conceptions of "welfare") doesn't decrease my welfare.

I'd be very surprised if most utilitarians thought they'd have a moral obligation to create paperclips if 99.99% of agents in the world were paperclippers (example stolen from Brian Tomasik), controlling for game-theoretic instrumental reasons.

Basic questions: If the type of Adv(M) is a pseudo-input, as suggested by the above, then what does Adv(M)(x) even mean? What is the event whose probability is being computed? Does the unacceptability checker C also take real inputs as the second argument, not just pseudo-inputs—in which case I should interpret a pseudo-input as a function that can be applied to real inputs, and Adv(M)(x) is the statement "A real input x is in the pseudo-input (a set) given by Adv(M)"?

(I don't know how pedantic this is, but the unacceptability penalty seems pretty important, and I struggle to understand what the unacceptability penalty is because I'm confused about Adv(M)(x).)
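To make my proposed reading concrete (a toy sketch in my own notation; `adv`, `unacceptability_penalty`, and the "shutdown" predicate are all illustrative stand-ins, not anything from the post, and the checker C's dependence on M is elided to keep the types visible): treat a pseudo-input as a membership predicate over real inputs, so Adv(M) returns such a predicate and Adv(M)(x) is the event "x falls in the set given by Adv(M)".

```python
from typing import Callable

Input = str
# A pseudo-input as a set of real inputs, represented by its
# membership predicate.
PseudoInput = Callable[[Input], bool]

def adv(model_name: str) -> PseudoInput:
    """Toy adversary: returns a pseudo-input, i.e. a region of input
    space where the model is suspected to behave unacceptably.
    (Here: inputs mentioning 'shutdown', purely for illustration.)"""
    return lambda x: "shutdown" in x

def unacceptability_penalty(inputs: list[Input], alpha: PseudoInput) -> float:
    """Stand-in for the probability being computed: the empirical
    frequency, over a sample of real inputs, of the event Adv(M)(x),
    i.e. that a real input x lands in the pseudo-input alpha."""
    return sum(alpha(x) for x in inputs) / len(inputs)

alpha = adv("M")
sample = ["please summarize", "avoid shutdown", "hello", "shutdown now"]
print(unacceptability_penalty(sample, alpha))  # 0.5
```

On this reading, C taking a pseudo-input as its second argument is compatible with Adv(M)(x) being a well-typed statement about a real input x, which is the interpretation I'm asking about.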

This is a risk worth considering, yes. It’s possible in principle to avoid this problem by “committing” (to the extent that humans can do this) to both (1) train the agent to make the desired tradeoffs between the surrogate goal and original goal, and (2) not train the agent to use a more hawkish bargaining policy than it would’ve had without surrogate goal training. (And to the extent that humans can’t make this commitment, i.e., we make honest mistakes in (2), the other agent doesn’t have an incentive to punish those mistakes.)

If the developers do both these things credibly—and it's an open research question how feasible this is—surrogate goals should provide a Pareto improvement for the two agents (not a rigorous claim). Safe Pareto improvements are a generalization of this idea.
