Charlie Steiner

If you want to chat, message me!

LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.

Sequences

Alignment Hot Take Advent Calendar
Reducing Goodhart
Philosophy Corner

Comments

Control also makes AI more profitable, and more attractive to human tyrants, in worlds where control is useful. People want to know they can extract useful work from the AIs they build, and if problems with deceptiveness (or whatever control-focused people think the main problem is) are predictable, then having control measures ready to hand will make AI more profitable and lead to more powerful AI getting used.

This isn't a knock-down argument against anything; it's just pointing out that the inherent dual use of safety research is pretty broad - I suspect it's less obvious for AI control simply because AI control hasn't been useful for safety yet.

If you don't just want the short answer of "probably LTFF" and want a deeper dive on options, Larks' review is good, if (at this point) dated.

Suppose you're an AI company and you build a super-clever AI. What practical intellectual work are you uniquely suited to asking the AI to do, as the very first, most obvious thing? It's your work, of course.

Recursive self-improvement is going to happen not despite human control, but because of it. Humans are going to provide the needed resources and direction at every step. Expecting everyone to pause just when rushing ahead is easiest and most profitable, so we can get some safety work done, is not going to pan out barring a nukes-on-datacenters level of intervention.

Faster can still be safer if compute overhang is so bad that we're willing to give up years of safety research to reduce it. But it's not.

I am confused about the purpose of the "awareness and memory" section, and maybe disagree with the intuition said to be obvious in the second subsection. Is there some deeper reason you want to bring up how we self-model memory / something you wanted to talk about there that I missed?

On the other hand, I can’t have two songs playing in my head simultaneously,

Tangent: I play at Irish sessions, and one of the things you have to do there is swap tunes. If you lead a transition, you have to be imagining the next tune you're going to play at the same time as you're playing the current tune. In fact, often you have to decide on the next tune on the fly. This is a skill that takes some time to grok. You're probably conceptualizing the current and future tunes differently, but there's still a lot of overlap - you have to keep playing in sync with other people the entire time, while also recalling and anticipating the future tune.

This is related to the question "Are human values in humans, or are they in models of humans?"

Suppose you're building an AI to learn human values and apply them to a novel situation.

The "human values are in humans", ethos is that the way humans compute values is the thing AI should learn, and maybe it can abstract away many kinds of noise, but it shouldn't be making any big algorithmic overhauls. It should just find the value-computation inside the human (probably with human feedback) and then apply it to the novel situation.

The "human values are in models of humans" take is that the AI can throw away a lot of information about the actual human brain, and instead should find good models (probably with human feedback) that have "values" as a component of a coarse-graining of human psychology, and then apply those "good" models to the novel situation.

Here are some different things you might have clustered as "alignment tax."

Thing 1: The difference between the difficulty of building the technologically-closest friendly transformative AI and the technologically-closest dangerous transformative AI.

Thing 2: The expected difference between the difficulty of building likely transformative AI conditional on it being friendly and the difficulty of building likely transformative AI, whether friendly or dangerous.

Thing 3: The average amount that effort spent on alignment detracts from the broader capability or usefulness of AI.
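Restating these a little more formally (my notation, not anything from the post), writing $D(x)$ for the difficulty of building $x$ and TAI for transformative AI:

$\text{Thing 1} = D(\text{technologically-closest friendly TAI}) - D(\text{technologically-closest dangerous TAI})$

$\text{Thing 2} = \mathbb{E}[\,D(\text{TAI}) \mid \text{friendly}\,] - \mathbb{E}[\,D(\text{TAI})\,]$, with the expectation taken over the TAI we're actually likely to build

$\text{Thing 3} \approx$ the average of $-\,\partial(\text{capability or usefulness})\,/\,\partial(\text{alignment effort})$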

Turns out it's possible to have negative Thing 3 but positive Things 1 and 2. This post seems to call such a state of affairs "optimistic," which is way too hasty.

An opposite-vibed way of framing the oblique angle between alignment and capabilities is as "unavoidable dual use research." See this long post about the subject.

Yeah, I'm not actually sure about the equilibrium either. I just noticed that not privileging any voters (i.e. the pure strategy of 1/3, 1/3, 1/3) got beaten by pandering, and by symmetry there's going to be a mixed Nash equilibrium with at least three components - if you play 1/6 A, 5/6 B, I can beat that with 1/6 B, 5/6 C, which you can then respond to with 1/6 C, 5/6 A, etc.

Yeah, "3 parties with cyclic preferences" is like the aqua regia of voting systems. Unfortunately I think it means you have to replace the easy question of "is it strategy-proof" with a hard question like "on some reasonable distribution of preferences, how much strategy does it encourage?"

Answer by Charlie Steiner

Epistemic status: I don't actually understand what strategic voting means, caveat lector

Suppose we have three voters, one who prefers A>B>C, another B>C>A, the third C>A>B. And suppose everyone's utility for their middle option is 0.8 (on a scale where their top option is 1 and their bottom is 0).

Fred and George will be playing a mixed Nash equilibrium where they randomize between catering to the A, B, or C preferrers - treating them as a black box, the result is a 1/3 chance of each option winning, and every voter gets expected utility 0.6 (1/3 · 1 + 1/3 · 0.8 + 1/3 · 0).

But suppose I'm the person with A>B>C, and I can predict how the other people will vote. Should I change my vote to get a better result? What happens if I vote B>C>A, putting my own favorite candidate at the bottom of the list? Well, now the Nash equilibrium for Fred and George is 100% B, because the C preferrer is outvoted 2 to 1, and I'll get utility 0.8, so I should vote strategically.
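If you want to check the numbers, here's a minimal Python sketch of the example. The "majority of reported rankings picks the winner" rule and the symmetric 1/3 mix for Fred and George are my assumptions about the setup, not anything from the original question.

```python
# Minimal sketch of the three-voter example. The majority rule between the two
# candidates' platforms and the uniform mixing are assumptions about the setup.
from itertools import product

# True preferences, ranked best to worst.
true_prefs = [("A", "B", "C"), ("B", "C", "A"), ("C", "A", "B")]

def utility(ranking, outcome):
    # Top option is worth 1, middle 0.8, bottom 0.
    return {ranking[0]: 1.0, ranking[1]: 0.8, ranking[2]: 0.0}[outcome]

def winner(fred, george, reported_prefs):
    # Majority vote over the reported rankings; a shared platform just wins.
    if fred == george:
        return fred
    fred_votes = sum(r.index(fred) < r.index(george) for r in reported_prefs)
    return fred if fred_votes >= 2 else george

def expected_utilities(reported_prefs, fred_mix, george_mix):
    # Each voter's *true* expected utility, averaged over the candidates' mixes.
    totals = [0.0] * len(true_prefs)
    for (f, pf), (g, pg) in product(fred_mix.items(), george_mix.items()):
        out = winner(f, g, reported_prefs)
        for i, ranking in enumerate(true_prefs):
            totals[i] += pf * pg * utility(ranking, out)
    return totals

uniform = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}

# Sincere voting: the cyclic majority forces the symmetric mixed equilibrium.
print(expected_utilities(true_prefs, uniform, uniform))

# Voter 1 misreports B>C>A; B is now a Condorcet winner of the reported
# rankings, so both candidates play B for sure.
strategic = [("B", "C", "A"), ("B", "C", "A"), ("C", "A", "B")]
print(expected_utilities(strategic, {"B": 1.0}, {"B": 1.0}))
```

The first print comes out to roughly [0.6, 0.6, 0.6]; the second gives [0.8, 1.0, 0.0], so under these assumptions the misreport really does help voter 1.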
