## LessWrong

Charlie Steiner

If you want to chat, message me!

LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.

# Sequences

Reducing Goodhart
Philosophy Corner

# Wiki Contributions


Yeah, I'm not actually sure about the equilibrium either. I just noticed that not privileging any voters (i.e. the pure strategy of 1/3,1/3,1/3) got beaten by pandering, and by symmetry there's going to be at least a three-part mixed Nash equilibrium - if you play 1/6A 5/6B, I can beat that with 1/6B 5/6C, which you can then respond to with 1/6C 5/6A, etc.
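The pandering claim can be checked with a few lines of arithmetic. This is only a sketch under assumptions not stated in this comment: it borrows the three cyclic voters and the 1 / 0.8 / 0 utilities from the three-voter example elsewhere in this thread, treats each voter as backing whichever lottery gives them strictly higher expected utility, and uses hypothetical names (`votes_for`, `pander_b`, and so on).

```python
from fractions import Fraction

# Assumed setup: three voters with cyclic rankings (best -> worst) and
# utilities 1 / 0.8 / 0 for top / middle / bottom choice.
RANKINGS = [["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]
UTILITY_BY_RANK = [Fraction(1), Fraction(4, 5), Fraction(0)]


def expected_utility(ranking, lottery):
    """Expected utility of a lottery {candidate: probability} for one voter."""
    return sum(p * UTILITY_BY_RANK[ranking.index(c)] for c, p in lottery.items())


def votes_for(first, second):
    """Number of voters who strictly prefer `first` to `second` (ties abstain)."""
    return sum(
        expected_utility(r, first) > expected_utility(r, second) for r in RANKINGS
    )


uniform = {c: Fraction(1, 3) for c in "ABC"}
pander_b = {"B": Fraction(1)}
mix_ab = {"A": Fraction(1, 6), "B": Fraction(5, 6)}
mix_bc = {"B": Fraction(1, 6), "C": Fraction(5, 6)}

print(votes_for(pander_b, uniform), votes_for(uniform, pander_b))  # 2 1
print(votes_for(mix_bc, mix_ab), votes_for(mix_ab, mix_bc))        # 1 1
```

Under these assumed numbers, pure pandering to B beats the uniform lottery two votes to one. With a middle utility of exactly 0.8 the B>C>A voter is indifferent between the two 1/6, 5/6 mixes, so whether one strictly beats the other depends on tie-breaking or on a slightly higher middle utility.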

Yeah, "3 parties with cyclic preferences" is like the aqua regia of voting systems. Unfortunately I think it means you have to replace the easy question of "is it strategy-proof" with a hard question like "on some reasonable distribution of preferences, how much strategy does it encourage?"

Epistemic status: I don't actually understand what strategic voting means, caveat lector

Suppose we have three voters: one who prefers A>B>C, another B>C>A, and the third C>A>B. And suppose each voter's utility for their middle choice is 0.8 (on a scale where their top choice is 1 and their bottom choice is 0).

Fred and George will be playing a mixed Nash equilibrium where they randomize between catering to A, B, or C preferrers. Treating them as a black box, the result is a 1/3 chance of each, so all voters get utility 0.59.

But suppose I'm the person with A>B>C, and I can predict how the other people will vote. Should I change my vote to get a better result? What happens if I vote B>C>A, putting my own favorite candidate at the bottom of the list? Well, now the Nash equilibrium for Fred and George is 100% B, because the C preferrer is outvoted 2 to 1, and I'll get utility 0.8, so I should vote strategically.
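The comparison can be sketched numerically. The block below is a minimal illustration, not the actual equilibrium computation: it assumes the outcome is a Condorcet winner when the ballots produce one and a uniform lottery otherwise, as a stand-in for the black-box behavior of Fred and George described above. All function and variable names are hypothetical.

```python
# True preferences from the example: each voter ranks best -> worst.
TRUE_RANKINGS = {
    "voter1": ["A", "B", "C"],
    "voter2": ["B", "C", "A"],
    "voter3": ["C", "A", "B"],
}
UTILITY_BY_RANK = [1.0, 0.8, 0.0]  # top, middle, bottom choice


def utility(voter, candidate):
    """Utility of a candidate under the voter's TRUE ranking."""
    return UTILITY_BY_RANK[TRUE_RANKINGS[voter].index(candidate)]


def outcome_lottery(ballots):
    """Assumed stand-in for the equilibrium outcome:
    the Condorcet winner if one exists, else a uniform lottery."""
    candidates = sorted(next(iter(ballots.values())))
    for c in candidates:
        beats_all = all(
            2 * sum(b.index(c) < b.index(d) for b in ballots.values()) > len(ballots)
            for d in candidates
            if d != c
        )
        if beats_all:
            return {c: 1.0}
    return {c: 1.0 / len(candidates) for c in candidates}


def expected_utility(voter, lottery):
    return sum(p * utility(voter, c) for c, p in lottery.items())


honest = outcome_lottery(TRUE_RANKINGS)
# voter1 misreports B>C>A while the other two vote honestly.
strategic = outcome_lottery(dict(TRUE_RANKINGS, voter1=["B", "C", "A"]))

print(honest, expected_utility("voter1", honest))
print(strategic, expected_utility("voter1", strategic))
```

Under this simplified reading, honest ballots produce a cycle and the uniform lottery is worth about 0.6 to each voter, while the misreport makes B a majority winner worth 0.8 to the A>B>C voter, matching the conclusion that the strategic vote pays off.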

Just riffing on this rather than starting a different comment chain:

If alignment₁ is "get AI to follow instructions" (as typically construed, in a "good enough" sort of way) and alignment₂ is "get AI to do good things and not bad things" (also in a "good enough" sort of way, but with more assumed philosophical sophistication), I basically don't care about anyone's safety plan to get alignment₁ except insofar as it's part of a plan to get alignment₂.

Philosophical errors/bottlenecks can mean you don't know how to go from 1 to 2. Human safety problems are what stop you from going from 1 to 2 even if you know how, or stop you from trying to find out how.

The checklist has a space for "nebulous future safety case for alignment₁," which is totally fine. I just also want a space for "nebulous future safety case for alignment₂" at the least (some earlier items explicitly about progressing towards that safety case can be extra credit). Different people might have different ideas about what form a plan for alignment₂ takes (will it focus on the structure of the institution using an aligned AI, or will it focus on the AI and its training procedure directly?), and where having it should come in the timeline, but I think it should be somewhere.

Part of what makes the corrupting effect of power insidious is that it seems obvious to humans that we can make everything work out best so long as we have power, so that we don't even need to plan for how to get from having control to actually getting the good things that control was supposed to be instrumental for.

I am confused by the apparent expectation among many people, both pro- and anti-UBI, that \$1000 per month was going to cause poor people age 20-40 to have way better overall health and way higher healthcare spending.

Maybe there's some mental model going around that a key thing poor people really need to spend their next marginal dollar on is their health, and if they do it will be impressively effective, but if they don't then the poor people have caused UBI to fail by not spending their money rationally? This is surprising to me on all counts, but perhaps I haven't seen the same evidence about healthcare spending that that segment of the predictors has.

I'll admit, my mental image for "our universe + hypercomputation" is a sort of webnovel premise, where we're living in a normal computable universe until one day by fiat an app poofs into existence on your phone that lets you enter a binary string or file and instantaneously get the next bit with minimum description length in binary lambda calculus. Aside from the initial poofing and every usage of the app, the universe continues by its normal rules.

But there are probably simpler universes (by some handwavy standard) out there that allow enough hypercomputation that they can have agents querying minimum description length oracles, but not so much that agents querying MDL oracles can no longer be assigned short codes.

I agree and yet I think it's not actually that hard to make progress.

There is no canonical way to pick out human values,[1] and yet using an AI to make clever long-term plans implicitly makes some choice. You can't dodge choosing how to interpret humans; if you think you're dodging it, you're just doing it in an unexamined way.

Yes, humans are bad at philosophy and are capable of making things worse rather than better by examining them. I don't have much to say other than get good. Just kludging together how the AI interprets humans seems likely to lead to problems to me, especially in a possible multipolar future where there's more incentive for people to start using AI to make clever plans to steer the world.

This absolutely means disposing of appealing notions like a unique CEV, or even an objectively best choice of AI to build, even as we make progress on developing standards for good AI to build.

1. ^ See the Reducing Goodhart sequence for my take on this, which starts sketching some ways to deal with humans not being agents.

Yeah, I agree with your first paragraph. But I think it's a difference of degree rather than kind. "Do the right thing" is still communication, it's just communication about something indirect, that we nonetheless should be picky about.

All language makes no sense without a method of interpretation. "Get me some coffee" is a horribly ambiguous instruction that any imagined assistant will have to cope with. How might an AI learn what "get me some coffee" entails without it being hardcoded in?

To say it's impossible in theory is to set the bar so high that humans using language is also impossible.

As for military use of AGI, I think I'm fine with breaking that application. If we can build AI that does good things when directed to (which can incorporate some parts of corrigibility, like not being overly dogmatic and soliciting a broad swath of human feedback), then we should. If we cannot build AI that actually does good things, we haven't solved alignment by my lights and building powerful AI is probably bad.

This strikes me as defining "alignment" a little differently than me.

It might even define "instruction-following" differently than me.

If we really solved instruction following, you could give the instruction "Do the right thing" and it would just do the right thing.

If you think that's possible, then what we need is a coalition to tell powerful AIs to "do the right thing", rather than "make my creators into god-emperors" or whatever. This seems doable, though the clock is perhaps ticking.

If you can't just tell an AI to do the right thing, but it's still competent enough to pull off dangerous plans, then to me this still seems like the usual problem of "powerful AI that's not trying to do good is bad" whether or not a human is giving instructions to this AI.

Or to rephrase this as a call to action: AI alignment researchers cannot just hill-climb on making AIs that follow arbitrary instructions. We have to preferentially advance AIs that do the right thing, to avoid the sort of scenario you describe.