Charlie Steiner

If you want to chat, message me!

LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.

Sequences

Alignment Hot Take Advent Calendar
Reducing Goodhart
Philosophy Corner

Comments

Yeah, I think instead the numbers only work out if you include things like the cost of land, or the cost of the farmer's time - and then what's risen is not the "subsistence cost of horses" per se, but a more general "cost of the things the simplified model of horse productivity didn't take into account."
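To make the accounting concrete, here's a toy calculation with entirely made-up numbers: the horse looks profitable if you only count feed (the narrow "subsistence cost"), and stops being profitable once land and the farmer's time are charged against it.

```python
# Hypothetical numbers, purely to illustrate the accounting point:
horse_output_per_year = 500   # value of work produced ($/yr, made up)
feed_cost = 300               # the narrow "subsistence cost of horses"
land_cost = 150               # pasture and stabling
caretaker_time_cost = 120     # farmer's labor attributable to the horse

naive_margin = horse_output_per_year - feed_cost                      # +200
full_margin = horse_output_per_year - (
    feed_cost + land_cost + caretaker_time_cost)                      # -70
print(naive_margin, full_margin)
```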

I feel sad that your hypotheses are almost entirely empirical, but seem like they include just enough metaethically-laden ideas that you have to go back to describing what you think people with different commitments might accept or reject.

My checklist:

Moral reasoning is real (or at least, the observables you gesture towards could indeed be observed, setting aside the interpretation of what humans are doing)

Faultless convergence is maybe possible (I'm not totally sure what observables you're imagining - is an "argument" allowed to be a system that interacts with its audience? If it's a book, do all people have to read the same sequence of words, or can the book be a choose your own adventure that tells differently-inclined readers to turn to different pages? Do arguments have to be short, or can they take years to finish, interspersed with real-life experiences?), but also I disagree with the connotation that this is good, that convergence via argument is the gold standard, that the connection between being changed by arguments and sharing values is solid rather than fluid.

No Uniqueness

No Semi-uniqueness

Therefore Unification is N/A

Man, I'm reacting to an entire genre of thought, not just this post exactly, so apologies for the combination of unkindness and inaccuracy, but I think it's barking up the wrong tree to worry about whether AIs will have the Stuff or not. Pain perception, consciousness, moral patiency, these are things that are all-or-nothing-ish for humans, in our everyday experience of the everyday world. But there is no Stuff underlying them, such that things either have the Stuff or don't have the Stuff - no Platonic-realm enforcement of this all-or-nothing-ish-ness. They're just patterns that are bimodal in our typical experience.

And then we generate a new kind of thing that falls into neither hump of the distribution, and it's super tempting to ask questions like "But is it really in the first hump, or really in the second hump?" "What if we treat AIs as if they're in the first hump, but actually they're really in the second hump?"

[Figure: a drawing of two humps, with a point in the middle. Caption: "Which hump is X really in?"]

The solution seems simple to state but very complicated to do: just make moral decisions about AIs without relying on all-or-nothing properties that may not apply.

Do you have any quick examples of value-shaped interpretations that conflict?

  1. Someone trying but failing to quit smoking. On one interpretation, they don't really want to smoke; smoking is some sort of mistake. On another interpretation, they do want to smoke; the quitting-related behavior is some sort of mistake (or has a social or epistemological reason).

    This example stands in for other sorts of "obvious inconsistency," biases that we don't reflectively endorse, etc. But also consider cases where humans say they don't want something but we (outside the thought experiment) think they actually do want that thing! A possible example is the people who say they would hate a post-work world, they want to keep doing work so they have purpose. Point is, the verbal spec isn't always right.

  2. The interpretation "Humans want to follow the laws of physics," versus an interpretation that's a more filled-in version of "Humans want to do a bunch of human-scale things like talking to humans, eating good food, interacting with nature, learning about the world, etc." The first is the limit of being more predictive at the cost of having a more complicated model of humans, and as you can tell, it sort of peters out into explaining everything but having no push towards good stuff.

That's one difference! And probably the most dangerous one, if a clever enough AI notices it.
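To make the second example a bit more concrete: you can picture choosing among candidate interpretations by something like predictive accuracy minus a complexity penalty, as in the toy sketch below. Everything in it is made up for illustration; the uncomfortable part is that the penalty weight is arbitrary, and nothing about prediction alone tells you which interpretation carries the value content you wanted.

```python
# Toy sketch of the trade-off in the second example: candidate
# "interpretations" of a human, scored by predictive log-likelihood minus a
# complexity penalty. All numbers are invented for illustration.

candidates = [
    # (name, predictive_log_likelihood, description_length_in_bits)
    ("humans want human-scale goods (food, talk, nature, learning)", -120.0, 300),
    ("humans want to follow the laws of physics",                    -80.0, 5000),
]

def score(log_likelihood, complexity_bits, penalty_per_bit=0.01):
    # A crude Occam-style trade-off. The penalty weight is arbitrary, which
    # is part of the problem: prediction quality alone doesn't single out
    # the interpretation that pushes towards good stuff.
    return log_likelihood - penalty_per_bit * complexity_bits

for name, log_likelihood, bits in candidates:
    print(f"{score(log_likelihood, bits):8.2f}  {name}")
```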

Some good things to read would be methods based on not straying too far from a "human distribution": Quantilization (Jessica Taylor paper), the original RLHF paper (Christiano), Sam Marks' post about decision transformers.
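For concreteness, here's a minimal sketch of the quantilization idea from the first of those, in my own simplified rendering (the function names are mine, not the paper's): instead of maximizing a proxy utility outright, sample from a base "human-like" distribution and pick randomly among the top q fraction by proxy score.

```python
import random

def quantilize(base_sampler, proxy_utility, q=0.1, n_samples=1000):
    """Pick an action by (Monte Carlo) quantilizing over a base distribution.

    base_sampler: draws actions from a 'human-like' base distribution.
    proxy_utility: scores actions; assumed imperfect (Goodhart-prone).
    q: fraction of top-scoring samples kept. q=1 recovers imitation of the
       base distribution; q -> 0 approaches pure argmax optimization.
    """
    candidates = [base_sampler() for _ in range(n_samples)]
    candidates.sort(key=proxy_utility, reverse=True)
    top = candidates[: max(1, int(q * n_samples))]
    # Sampling uniformly from the top q fraction (rather than taking the
    # argmax) bounds how far the chosen behavior can stray from the base
    # distribution, which is the point of the method.
    return random.choice(top)

# Toy usage: action = quantilize(lambda: random.gauss(0, 1), abs, q=0.05)
```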

They're important reads, but ultimately I'm not satisfied with them, for the same reason I mentioned about self-other overlap in the other comment a second ago: we want the AI to treat the human how the human wants to be treated; that doesn't mean we want the AI to act how the human wants to act. If we can't build AI that reflects this distinction, we're missing some big insights.

In short, no, I don't expect self-other overlap to help. If the human wants coffee, we want the AI to get the human a coffee. We don't want the AI to get itself a coffee.

Second, the problem isn't that we know what we want the AI to do, but are worried the AI will "go against it," so we need to constrain the AI. The problem is that we don't know what we want the AI to do, certainly not with enough precision to turn it into code.

In value learning, we want the AI to model human preferences, but we also want the AI to do meta-preferential activities like considering the preferences of individual humans and aggregating them together, or considering different viewpoints on what 'human preferences' means and aggregating them together. And we don't just want the AI to do those in arbitrary ways, we want it to learn good ways to navigate different viewpoints from humans' own intuitions about what it means to do a good job at that.
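As a toy illustration of those two layers of aggregation (across individual humans, and across rival readings of what "human preferences" means), something like the sketch below. Every weight in it is a placeholder, and that's the point: the weights and the aggregation rule itself are exactly the meta-preferential things we want the AI to learn from human intuitions rather than have hard-coded.

```python
# Toy illustration only: naive averaging across humans, then naive weighting
# across interpretations. Both choices are themselves meta-preferential and
# shouldn't really be fixed by hand like this.

def aggregate_value(option, humans, weighted_interpretations):
    """weighted_interpretations: list of (interpretation, weight) pairs,
    where each interpretation maps (option, human) to a preference score."""
    total = 0.0
    for interpretation, weight in weighted_interpretations:
        per_human_scores = [interpretation(option, h) for h in humans]
        total += weight * sum(per_human_scores) / len(per_human_scores)
    return total
```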

Seth, I forget where you fall in the intent alignment typology: if we build a superintelligent AI that follows instructions in the way you imagine, can we just give it the instruction "Take autonomous action to do the right thing," and then it will just go do good stuff without us needing to continue interacting with it in the instruction-following paradigm?

Definitely agree that the implicit "Do what they say [in a way that they would want]" sneaks the problems of value learning into what some people might have hoped was a value-learning-free space. Just want to split some hairs on this:

"if an AI understands these values perfectly and is properly motivated to act according to them, that is functionally the same as it having those values itself."

I think this ignores that there are multiple ways to understand humans, what human preferences are, what acting according to them is, etc. There's no policy that would satisfy all value-shaped interpretations of the user, because some of them conflict. This gives us some wiggle room to imagine different ways of resolving those conflicts, some of which will look more like instruction-following and others that will look more like autonomous action.

From a 'real alignment' perspective (how to get the AI to want to do good things and not bad things), I think there are some obvious implications for the future of RLAIF.

You might think of the label 'RLAIF' as standing in for the general strategy of leveraging unsupervised data about human behavior to point the AI towards human preferences, using a scaffold that solicits the AI's predictions (or more general generative output, if the training isn't for pure prediction) about human preference-laden behaviors, and then transforms those predictions into some sort of supervisory signal.
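A minimal sketch of that pattern, with every piece (generate, judge_pair, fit_reward_model, improve_policy) as a caller-supplied placeholder rather than any particular library's API:

```python
def rlaif_round(policy, prompts, generate, judge_pair,
                fit_reward_model, improve_policy):
    """One RLAIF round: the model labels its own samples, and those labels
    become the supervisory signal."""
    preference_data = []
    for prompt in prompts:
        # 1. Sample two candidate behaviors from the current policy.
        a, b = generate(policy, prompt), generate(policy, prompt)
        # 2. The scaffold solicits the model's prediction of which behavior
        #    humans would prefer -- the "preference-laden" judgment.
        winner = judge_pair(prompt, a, b)
        preference_data.append((prompt, a, b, winner))
    # 3. Transform those predictions into a supervisory signal: fit a reward
    #    model to the labels and update the policy against it.
    reward_model = fit_reward_model(preference_data)
    return improve_policy(policy, reward_model, prompts)
```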

Similarly, the AZR setup leverages the AI's unsupervised knowledge of code-quality-laden behaviors, using a scaffold that turns them back into a reward signal that lets the AI quote-unquote "train itself" to code better. Except that relative to vanilla RLAIF, there's more of an emphasis on generating and solving specific problems that form a curriculum for the agent, rather than just responding well to samples from the training distribution. But now that I've described things in this way, you can probably see how to turn this back into RLAIF for alignment.
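In the same spirit, a rough sketch of the propose-and-solve loop, with execution against tests as the grounded reward. Placeholder names only; this is not the actual AZR implementation, just the shape of it.

```python
def self_curriculum_round(model, propose_task, solve_task, run_tests, update):
    # 1. The model proposes a coding task (with tests), ideally near the
    #    edge of its current ability, building its own curriculum.
    task = propose_task(model)
    # 2. The same model attempts a solution.
    solution = solve_task(model, task)
    # 3. Executing the solution against the tests turns the model's
    #    knowledge of code quality back into a scalar reward.
    reward = run_tests(task, solution)  # e.g. fraction of tests passed
    # 4. Reinforce both the solver (for passing) and the proposer (for
    #    producing tasks of useful difficulty).
    return update(model, task, solution, reward)
```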

The overarching problem is, as usual, we don't understand how to do alignment in a non-hacky way.

We don't know what sorts of moral reflection are necessary for good outcomes, and we don't know where human feedback is a necessary ingredient to keep AI meta-ethical evolution grounded to human preferences. But hey, if we try various value learning schemes empirically maybe we'll learn some things.
