Nisan

Comments

Some AI research areas and their relevance to existential safety

I'd believe the claim if I thought that alignment was easy enough that AI products which pass internal product review and don't immediately trigger lawsuits would be aligned enough not to end the world through alignment failure. But I don't think that's the case, unfortunately.

It seems like we'll have to put special effort into both single/single alignment and multi/single "alignment", because the free market might not give it to us.

Some AI research areas and their relevance to existential safety

I'd like more discussion of the claim that alignment research is unhelpful-at-best for existential safety because it accelerates deployment. It seems to me that alignment research has a couple of paths to positive impact which might balance that risk:

  1. Tech companies will be incentivized to deploy AI with slipshod alignment, which might then take actions that no one wants and which pose existential risk. (Concretely, I'm thinking of the "out with a whimper" and "out with a bang" scenarios.) But the existence of better alignment techniques might legitimize governance demands, i.e. demands that tech companies don't make products that do things that literally no one wants.

  2. Single/single alignment might be a prerequisite to certain computational social choice solutions. E.g., once we know how to build an agent that "does what [human] wants", we can then build an agent that "helps [human 1] and [human 2] draw up incomplete contracts for mutual benefit subject to the constraints in the [policy] written by [human 3]". And slipshod alignment might not be enough for this application.

Singularity Mindset

2 years later, do you have an answer to this?

Risk is not empirically correlated with return

Hm, I think all I meant was:

"If you have two assets with the same per-share price, and asset A's value per share has a higher variance than asset B's value per share, then asset A's per-share value must have a higher expectation than asset B's per-share value."

I guess I was using "cost" to mean "price" and "return" to mean "discounted value or earnings or profit".
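
To make that concrete, here's a toy numerical illustration of the claim as I intended it. The numbers are made up, and it assumes a risk-averse market that demands compensation for variance; it's a sketch of the intended meaning, not an empirical argument.

```python
import statistics

price = 100.0  # both assets trade at the same per-share price

# Hypothetical per-share values one period later (equally likely outcomes).
asset_a = [130.0, 90.0]    # risky: high variance in value per share
asset_b = [105.0, 105.0]   # safe: zero variance in value per share

for name, outcomes in [("A (risky)", asset_a), ("B (safe)", asset_b)]:
    mean = statistics.mean(outcomes)
    var = statistics.pvariance(outcomes)
    print(f"Asset {name}: price={price}, E[value]={mean}, Var[value]={var}")

# Both assets cost 100, but A's expected value (110) exceeds B's (105).
# Under the risk-aversion assumption, the market would only price the two
# assets equally if the higher-variance asset carried the higher expectation.
```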

Maybe Lying Can't Exist?!

(I haven't read any of the literature on deception you cite, so this is my uninformed opinion.)

I don't think there's any propositional content at all in these sender-receiver games. As far as the predator is concerned, the signal means "I want to eat you" and the prey wants to be eaten.

If the environment were somewhat richer, the agents would model each other as agents, and they'd have a shared understanding of the meaning of the signals, and then I think we'd have a better shot at understanding deception.
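
For concreteness, here's a minimal sketch of the receiver's side of a sender-receiver game. The payoffs and probabilities are my own illustrative assumptions for a generic predator/prey mimicry setup, not the specific game from the post or from the deception literature.

```python
# Receiver's problem in a toy sender-receiver game. The signal's "meaning"
# is exhausted by the posterior it induces and the payoffs it feeds into;
# nothing proposition-like is asserted.

# Receiver's beliefs about the sender's type, conditional on seeing the signal.
p_mate, p_predator = 0.7, 0.3  # illustrative numbers

# Receiver payoffs for each (sender type, receiver action) pair.
receiver_payoff = {
    ("mate", "approach"): 1.0,
    ("mate", "ignore"): 0.0,
    ("predator", "approach"): -5.0,  # approaching a predator is very costly
    ("predator", "ignore"): 0.0,
}

def expected_payoff(action):
    """Receiver's expected payoff from an action, given its beliefs."""
    return (p_mate * receiver_payoff[("mate", action)]
            + p_predator * receiver_payoff[("predator", action)])

best = max(["approach", "ignore"], key=expected_payoff)
print({a: round(expected_payoff(a), 2) for a in ["approach", "ignore"]}, "->", best)
```

In this formalism the signal just shifts probabilities and payoffs; nothing is asserted that could be true or false, which is the sense in which calling the predator's signal a "lie" feels strained without richer agents.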

What Would I Do? Self-prediction in Simple Algorithms

Ah, are you excited about Algorithm 6 because the recurrence relation feels iterative rather than topological?

Self-sacrifice is a scarce resource

Like, if you’re in a crashing airplane with Eliezer Yudkowsky and Scott Alexander (or substitute your morally important figures of choice) and there are only two parachutes, then sure, there’s probably a good argument to be made for letting them have the parachutes.

This reminds me of something that happened when I joined the Bay Area rationalist community. A number of us were hanging out and decided to pile in a car to go somewhere, I don't remember where. Unfortunately there were more people than seatbelts. The group decided that one of us, who was widely recognized as an Important High-Impact Person, would definitely get a seatbelt; I ended up without a seatbelt.

I now regret going on that car ride. Not because of the danger; it was a short drive and traffic was light. But the self-signaling was unhealthy. I should have stayed behind, to demonstrate to myself that my safety is important. I needed to tell myself "the world will lose something precious if I die, and I have a duty to protect myself, just as these people are protecting the Important High-Impact Person".

Everyone involved in this story has grown a lot since then (me included!) and I don't have any hard feelings. I bring it up because offhand comments or jokes about sacrificing one's life for an Important High-Impact Person sound a bit off to me; they possibly reveal an unhealthy attitude towards self-sacrifice.

(If someone actually does find themselves in a situation where they must give their life to save another, I won't judge their choice.)

Classifying games like the Prisoner's Dilemma

Von Neumann and Morgenstern also classify the two-player games, but they get only two games, up to equivalence. The reason is they assume the players get to negotiate beforehand. The only properties that matter for this are:

  • The maximin value $m_i$ of each player $i$, which represents that player's best alternative to negotiated agreement (BATNA).

  • The maximum total utility $M$.

There are two cases:

  1. The inessential case, $m_1 + m_2 = M$. This includes the Abundant Commons with … . No player has any incentive to negotiate, because the BATNA is Pareto-optimal.

  2. The essential case, $m_1 + m_2 < M$. This includes all other games in the OP.

It might seem strange that VNM consider, say, Cake Eating to be equivalent to Prisoner's Dilemma. But in the VNM framework, Player 1 can threaten not to eat cake in order to extract a side payment from Player 2, and this is equivalent to threatening to defect.
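
As a small check on the inessential/essential distinction, here's a Python sketch that computes the maximin values and the maximum total utility for a 2x2 game. The Prisoner's Dilemma payoffs below are standard illustrative numbers, not the ones from the OP; it uses pure-strategy maximin, which coincides with the mixed value here because defecting dominates.

```python
# A 2x2 game given as payoff matrices indexed [row_action][col_action].
# Illustrative Prisoner's Dilemma payoffs (actions: 0 = cooperate, 1 = defect).
P1 = [[3, 0],
      [5, 1]]  # row player's payoffs
P2 = [[3, 5],
      [0, 1]]  # column player's payoffs

def maximin_row(payoff):
    """Best payoff the row player can guarantee with a pure strategy."""
    return max(min(row) for row in payoff)

def maximin_col(payoff):
    """Best payoff the column player can guarantee with a pure strategy."""
    return max(min(col) for col in zip(*payoff))

m1 = maximin_row(P1)  # player 1's BATNA
m2 = maximin_col(P2)  # player 2's BATNA
M = max(P1[i][j] + P2[i][j] for i in range(2) for j in range(2))  # max total utility

print(f"m1 + m2 = {m1 + m2}, M = {M}")
print("inessential" if m1 + m2 == M else "essential")
# Here m1 + m2 = 2 < 6 = M, so the Prisoner's Dilemma is essential:
# there's surplus to be gained (and fought over) by negotiating.
```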
