This is a special post for quick takes by [deactivated].

This post exists only for archival purposes.


7 comments

I have a question about "AGI Ruin: A List of Lethalities".

These two sentences from Section B.2 stuck out to me as the most important in the post:

...outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction.

 

...on the current optimization paradigm there is no general idea of how to get particular inner properties into a system, or verify that they're there, rather than just observable outer ones you can run a loss function over.

My question is: supposing this is all true, what is the probability of failure of inner alignment? Is it 0.01%, 99.99%, 50%...? And how do we know how likely failure is?

It seems like there is a gulf between "it's not guaranteed to work" and "it's almost certain to fail".

I think any reasonable estimate would have to be based on a more detailed plan: what types of rewards (loss function) are we providing, and what type of inner alignment do we want?

My intuition roughly aligns with Eliezer's on this point: I doubt this will work.

When I imagine rewarding an agent for doing things humans like, as indicated by smiles, thanks, etc., I have a hard time imagining that this just generalizes to an agent that does what we want, even in very different circumstances, including when it can relatively easily gain sovereignty and do whatever it wants.

Others have a different intuition. In a comment buried somewhere, Quintin Pope says something to the effect of "shard theory isn't a new theory of alignment; it's the hypothesis that we don't need one". I think he and other shard theory optimists think it's entirely plausible that rewarding stuff we like will develop inner representations and alignment that's adequate for our purposes.

While I share Eliezer's and others' pessimism about alignment through pure RL, I don't share his overall pessimism. You've seen my alternate proposals for directly setting desirable goals out of an agent's learned knowledge.

It's almost certain in the narrow technical sense of "some difference no matter how small", and unknown (and currently undefinable) in any more useful sense.

Longtermism question: has anyone ever proposed a discount rate on the moral value of future lives, by analogy to the discount rates used in finance and investing?

This could account for the uncertainty in predicting the existence of future people. Or serve as a compromise between views like neartermism and longtermism, or pro-natalism and anti-natalism.

Yes, this is an argument people have made. Longtermists tend to reject it. First off, applying a discount rate on the moral value of lives in order to account for the uncertainty of the future is not a good idea: the two things are totally different and shouldn't be conflated like that, imo. If you want to apply a discount rate to account for the uncertainty of the future, just do that directly. So, for the rest of this comment I'll assume the discount rate really does apply to moral value itself, rather than standing in for uncertainty.

So, that leaves us with the moral argument.

A fairly good argument, and the one I subscribe to, is this:

  • Let's say we apply a conservative discount rate, say, 1% per year, to the moral value of future lives.
  • Given that, one life now is worth approximately 500 million lives two millennia from now (0.99^2000 ≈ 2e-9; see the quick calculation after this list).
  • But would that have been reasonably true in the past? Would it have been morally correct to save one life 2,000 years ago at the cost of 500 million lives today?
  • If the answer is "no" to that, it should also be considered "no" in the present.
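
As a quick sanity check on the arithmetic in the second bullet, here is a rough sketch in Python (the 1% rate is just the illustrative figure from the list above):

```python
# 1% annual discount rate on moral value, compounded over 2,000 years.
annual_rate = 0.01
years = 2000

# Relative weight of one life 2,000 years from now vs. one life today.
future_weight = (1 - annual_rate) ** years
print(f"Discount factor after {years} years: {future_weight:.2e}")  # ~1.9e-09

# Equivalently: how many future lives trade against one present life.
print(f"Future lives per present life: {1 / future_weight:,.0f}")  # ~540 million
```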

This is, again, different from a discount rate on future lives based on uncertainty. It's entirely reasonable to say "If there's only a 50% chance this person ever exists, I should treat it as 50% as valuable." I think that this is a position that wouldn't be controversial among longtermists.

The kinds of discount rates that you see in finance will imply that moral value goes to about zero after like 1000 years, which we basically know couldn't be true (total life tends to grow over time, not shrink). Discount rates are a pretty crude heuristic for estimating value over time and will be inapplicable to many situations.
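
To make that last claim concrete, a similar back-of-the-envelope calculation (the 5% annual rate here is my own illustrative assumption, not a figure from the thread):

```python
# Hypothetical finance-style discount rate (5% per year, an assumed
# illustrative figure) applied to moral value over 1,000 years.
annual_rate = 0.05
years = 1000

weight = 1 / (1 + annual_rate) ** years
print(f"Weight on value {years} years out: {weight:.1e}")  # ~6.4e-22, effectively zero
```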