Stuart_Armstrong's Shortform

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Lexicographical preference orderings seem to come naturally to humans. Sentiments like "no amount of money is worth one human life" are commonly expressed.

Now, that particular sentiment is wrong because money can be used to purchase human lives.

The other problem comes from using probability and expected utility, which makes anything lexicographically second completely worthless in all realistic cases. It's one thing to say that you prefer apples to pears lexicographically when there are ten of each lying around and everything is deterministic (just take the ten apples first, then the ten pears afterwards). But does it make sense to say that you'd prefer one chance in a trillion of extending someone's life by a microsecond over a billion euros of free consumption?

So this short post will propose a more sensible, smoothed version of lexicographical ordering, suitable to capture the basic intuition, but usable with expected utility.

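If the utility $U$ has lexicographical priority and $V$ is subordinate to it, then choose a value $a > 0$ and maximise a combined utility $W$ equal to $U$ plus a bounded, strictly increasing function of $V$ scaled by $a$.

In that case, increases in expected $V$ always cause non-trivial increases in expected $W$, but a large enough increase in $U$ (of a size fixed by $a$) will always be more important than any possible increase in $V$.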
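A minimal sketch of one way to realise this smoothing, assuming `tanh` as the bounded squashing function (an illustrative choice, not necessarily the post's own formula). Here `u` is the priority utility, `v` the subordinate one and `a` the chosen scale; the function names and the lottery numbers are hypothetical.

```python
import math

def smoothed_utility(u, v, a=1.0):
    """Priority utility u plus a bounded bonus from subordinate utility v.

    Assumption: tanh is the squashing function, so the bonus term is
    bounded by a in absolute value.
    """
    return u + a * math.tanh(v)

def expected_utility(lottery, a=1.0):
    """Expected smoothed utility of a lottery of (probability, u, v) triples."""
    return sum(p * smoothed_utility(u, v, a) for p, u, v in lottery)

# A one-in-a-trillion chance of a tiny gain in u (hypothetical numbers)...
long_shot = [(1e-12, 1e-6, 0.0), (1 - 1e-12, 0.0, 0.0)]
# ...versus a certain, sizeable gain in v.
sure_thing = [(1.0, 0.0, 1.0)]

prefers_sure_thing = expected_utility(sure_thing) > expected_utility(long_shot)
```

Under this assumed form, a gain of `2 * a` in `u` outweighs any change in `v`, because the bonus term can move by at most `2 * a` in total, while small gains in `v` still count for something in expectation.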

This seems related to scope insensitivity and availability bias. No amount of money (that I have direct control of) is worth one human life (in my Dunbar group). No money (which my mind exemplifies as $100k or whatever) is worth the life of my example human, a coworker. Even then, it's false, but it's understandable.

More importantly, categorizations of resources (and of people, probably) are map, not territory. The only rational preference ranking is over reachable states of the universe. Or, if you lean a bit far towards skepticism/solipsism, over sums of future experiences.

Preferences exist in the map, in human brains, and we want to port them to the territory with the minimum of distortion.

Oh, wait. I've been treating preferences as territory, though always expressed in map terms (because communication and conscious analysis are map-only). I'll have to think about what it would mean if they were purely map artifacts.

This is a link to "An Increasingly Manipulative Newsfeed", about potential social media manipulation incentives (e.g. Facebook).

I'm putting the link here because I keep losing the original post (since it wasn't published by me, but I co-wrote it).

Bayesian agents that knowingly disagree

A minor stub, caveating Aumann's agreement theorem; put here to reference in future posts, if needed.

Aumann's agreement theorem states that rational agents with common knowledge of each other's beliefs cannot agree to disagree. If they exchange their estimates, they will swiftly come to an agreement.

However, that doesn't mean that agents cannot disagree; indeed, they can disagree and know that they disagree. For example, suppose that there are a thousand doors; behind $999$ of these there are goats, and behind one there is a flying aircraft carrier. The two agents are in separate rooms, and a host will go into each room and execute the following algorithm: they will choose a door at random among the $999$ that contain a goat. Then, with probability $99\%$, they will tell that door number to the agent; with probability $1\%$, they will instead tell the number of the door with the aircraft carrier.

Then each agent will have probability $1\%$ of the named door being the aircraft carrier door, and probability $99\%/999 \approx 0.1\%$ on each of the other doors; so the most likely door is the one named by the host.
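The posterior can be checked with a short Bayes computation. This is a sketch assuming a uniform prior over doors and a host who names the true door with probability 1/100 (a value taken from the D100 roll in the protocol); the function name is hypothetical.

```python
from fractions import Fraction

def posterior_after_naming(n_doors=1000, p_truth=Fraction(1, 100)):
    """Posterior for one agent after the host names a door.

    Returns (probability that the named door hides the carrier,
             probability for each single other door).
    Assumes a uniform prior over which door hides the carrier.
    """
    prior = Fraction(1, n_doors)
    n_goats = n_doors - 1
    # Likelihood of hearing this door if it really is the carrier door.
    like_if_carrier = p_truth
    # Likelihood of hearing it if the carrier is behind one given other door:
    # the host lies and happens to pick this goat door out of n_goats.
    like_if_goat = (1 - p_truth) / n_goats
    evidence = prior * like_if_carrier + n_goats * prior * like_if_goat
    named = prior * like_if_carrier / evidence
    each_other = prior * like_if_goat / evidence
    return named, each_other

named, each_other = posterior_after_naming()
```

With these assumed numbers the named door gets probability 1/100, while every other door gets 11/11100 (about 0.1%), so the named door is each agent's single best guess even though it is very probably wrong.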

We can modify the protocol so that the host never names the same door to both agents (roll a D100; if it comes up 1, tell the truth to the first agent and lie to the second; if it comes up 2, do the opposite; anything else means telling a different lie to each agent). In that case, each agent will have a best guess for the aircraft carrier, and the certainty that the other agent's best guess is different.
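The modified protocol can be sketched as a small simulation. The function name `name_doors`, the zero-based door numbering and the use of Python's `random` module are assumptions made for illustration.

```python
import random

def name_doors(carrier, n_doors=1000, rng=random):
    """Return the door numbers the host tells (first agent, second agent).

    Follows the D100 rule: 1 means truth to the first agent only,
    2 means truth to the second agent only, and anything else means
    two different lies.
    """
    goats = [d for d in range(n_doors) if d != carrier]
    roll = rng.randint(1, 100)  # inclusive on both ends
    if roll == 1:
        return carrier, rng.choice(goats)
    if roll == 2:
        return rng.choice(goats), carrier
    # Two distinct goat doors, so the agents never hear the same number.
    first, second = rng.sample(goats, 2)
    return first, second

rng = random.Random(0)
rounds = [name_doors(carrier=7, rng=rng) for _ in range(10_000)]
always_different = all(first != second for first, second in rounds)
```

Each agent still hears the true door with probability 1/100, so each agent's best guess is the door they were told, and the construction guarantees the two guesses differ.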

If the agents exchanged information, they would swiftly converge on the same distribution; but until that happens, they disagree, and know that they disagree.

Partial probability distribution

A concept that's useful for some of my research: a partial probability distribution.

That's a $Q$ that defines $Q(A \mid B)$ for some but not all $A$ and $B$ (with $Q(A) = Q(A \mid \Omega)$ for $\Omega$ being the whole set of outcomes).

This $Q$ is a partial probability distribution iff there exists a probability distribution $P$ that is equal to $Q$ wherever $Q$ is defined. Call this $P$ a full extension of $Q$.

Suppose that $Q(C \mid D)$ is not defined. We can, however, say that $Q(C \mid D) = x$ is a logical implication of $Q$ if every full extension $P$ has $P(C \mid D) = x$.

Eg: $Q(A)$, $Q(B)$, and $Q(A \cup B)$ will logically imply the value of $Q(A \cap B)$.
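As a worked illustration, assume the three given values are the probabilities of two events and of their union; inclusion-exclusion then pins down the probability of the intersection in every full extension. The function name and the numbers below are hypothetical.

```python
from fractions import Fraction

def full_extension_atoms(q_a, q_b, q_a_or_b):
    """Atom probabilities forced on any full extension by Q(A), Q(B), Q(A or B).

    Raises ValueError when some atom would be negative, i.e. when no
    probability distribution agrees with the given values.
    """
    # Inclusion-exclusion: P(A and B) = P(A) + P(B) - P(A or B).
    q_a_and_b = q_a + q_b - q_a_or_b
    atoms = {
        "A and B": q_a_and_b,
        "A only": q_a - q_a_and_b,
        "B only": q_b - q_a_and_b,
        "neither": 1 - q_a_or_b,
    }
    if any(p < 0 for p in atoms.values()):
        raise ValueError("no full extension exists")
    return atoms

atoms = full_extension_atoms(Fraction(1, 2), Fraction(2, 5), Fraction(7, 10))
```

Here the intersection is forced to 1/5 even though it was never specified, which is the sense of "logical implication" above. Inputs such as 1/2, 1/2 and 1/4 raise an error, since no probability distribution can match them.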

This is a special case of a crisp infradistribution: $Q(A \mid B) = t$ is equivalent to $Q(A \cap B) = t\,Q(B)$, a linear equation in $Q$, so the set of all $Q$'s satisfying it is convex and closed.

Thanks! That's useful to know.

Sounds like a special case of crisp infradistributions (ie, all partial probability distributions have a unique associated crisp infradistribution)

Given some $Q$, we can consider the (nonempty) set of probability distributions equal to $Q$ where $Q$ is defined. This set is convex (clearly, a mixture of two probability distributions which agree with $Q$ about the probability of an event will also agree with $Q$ about the probability of that event).

Convex (compact) sets of probability distributions = crisp infradistributions.
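The convexity claim can be checked numerically on a small case. This sketch uses a hypothetical four-outcome sample space and two hand-picked distributions that share the same conditional probability for one event given another; all names and numbers are illustrative assumptions.

```python
from fractions import Fraction

def conditional(p, event, given):
    """P(event | given) for a distribution p given as {outcome: probability}."""
    joint = sum(prob for w, prob in p.items() if w in event and w in given)
    marginal = sum(prob for w, prob in p.items() if w in given)
    return joint / marginal

def mixture(p1, p2, lam):
    """Convex combination lam * p1 + (1 - lam) * p2, outcome by outcome."""
    return {w: lam * p1[w] + (1 - lam) * p2[w] for w in p1}

A, B = {0, 1}, {0, 1, 2}
p1 = {0: Fraction(1, 4), 1: Fraction(1, 4), 2: Fraction(1, 4), 3: Fraction(1, 4)}
p2 = {0: Fraction(1, 3), 1: Fraction(1, 9), 2: Fraction(2, 9), 3: Fraction(1, 3)}

mixed = mixture(p1, p2, Fraction(1, 3))
```

Both distributions give the event a conditional probability of 2/3, and so does their mixture, because the constraint is linear in the distribution.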