Charlie Steiner

LW1.0 username Manfred. Day job is condensed matter physics, hobby is thinking I know how to assign anthropic probabilities.

Sequences

Philosophy Corner

Comments

SETI Predictions

The twist at the end though. I had to go back and re-think my answers :P

Latent Variables and Model Mis-Specification

Just ended up reading your paper (well, a decent chunk of it), so thanks for the pointer :) 

The ethics of AI for the Routledge Encyclopedia of Philosophy

Congrats! Here are my totally un-researched first thoughts:

Pre-1950 History: Speculation about artificial intelligence (if not necessarily in the modern sense) dates back extremely far: the brazen head, Frankenstein, the Mechanical Turk, the robots of R.U.R. Nearly all of it treats the artificial intelligence as essentially human (though the brazen head mythology is perhaps closer to deals with djinni or devils), and all of it, together with more modern science fiction, can be lumped into a pile labeled "exposes and shapes human intuition, but not very serious".

Automating the work of logic was of interest to logicians such as Hilbert, Peirce, and Frege, so there might be interesting discussions related to that. Obviously you'll also mention modern electronic computing, Turing, Good, von Neumann, Wiener, etc.

There might actually be a decent amount already written about the ethics of AI for central planning of economies. I'm thinking of Wiener, and also of the references in the recent book Red Plenty, about Kantorovich and half-hearted Soviet attempts to use optimization algorithms on the economy.

The most important modern exercise of AI ethics is not trolley problems for self-driving cars (despite the press), but interpretability. If the law says you can't discriminate based on race, for instance, and you want to use AI to make predictions about someone's insurance or education or what have you, that AI had better not only not discriminate, it has to interpretably not discriminate, in a way that will stand up in court if necessary. Simultaneously, there's the famous story of Target sending baby-product advertisements to a customer it had inferred was pregnant from her purchase history, and Facebook targets ads at you that, even if they don't use your phone's microphone, make good enough predictions that people suspect they do. So the central present issue of AI ethics is the control of information: both the practicality of enforcement where we've already legislated that certain information not be acted on, and examining the consequences where there's free rein.

And obviously autonomous weapons, the race to detect AI-faked images, the ethics of automation of large numbers of jobs, and then finally maybe we can get to the ethical issues raised by superhuman AI.

The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

Yes, the point is multiple abstraction levels (or at least multiple abstractions, ordered into levels or not). But not multiple abstractions used by humans; multiple abstractions used on humans.

If you don't agree with me on this, why didn't you reply when I spent about six months just writing posts that were all variations of this idea? Here's Scott Alexander making the basic point.

It's like... is there a True rational approximation of pi? Well, 22/7 is pretty good, but 355/113 is more precise, if harder to remember. And just 3 is really easy to remember, but not as precise. And of course there's the arbitrarily long "approximation" that is 3.141592... Depending on what you need to use it for, you might have different preferences about the tradeoff between simplicity and precision. There is no True rational approximation of pi. True Human Values are similar, except instead of one tradeoff to make, there are approximately one bajillion.
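Just to make that tradeoff concrete, here's a quick sketch in Python (my own illustration, nothing beyond standard arithmetic): each candidate trades memorability against error, and none of them is privileged as the "True" one.

```python
from math import pi

# A few rational "approximations of pi", ordered roughly by how easy they are to remember.
candidates = {
    "3": 3,
    "22/7": 22 / 7,
    "355/113": 355 / 113,
}

for name, value in candidates.items():
    # Each one trades simplicity against precision; pick depending on what you need it for.
    print(f"{name:>8}: error = {abs(value - pi):.2e}")
```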

  • we're non-Cartesian, which means that when we talk about our values, we are assuming a specific sort of way of talking about the world, and there are other ways of talking about the world in which talk about our values doesn't make sense

I have no idea why this would be tied to non-Cartesian-ness.

If a Cartesian agent were talking about their values, they could just be like "you know, those things that are specified as my values in the logic-stuff my mind is made out of." (Though this assumes some level of introspective access / genre savviness that needn't be assumed, so if you don't want to assume this then we can just say I was mistaken.) When a human talks about their values they can't take that shortcut, and instead have to specify values as a function of how they affect their behavior. This introduces the dependency on how we're breaking down the world into categories like "human behavior."

  • Thus in the real world we cannot require that the AI has to maximize humans' True Values, we can only ask that it models humans [...] and satisfy the modeled values.

How does this follow from non-uniqueness of values/world models? If humans have more than one set of values, or more than one world model, then this seems to say "just pick one set of values/one world model and satisfy that", which seems wrong.

Well, if there were unique values, we could say "maximize the unique values." Since there aren't, we can't. We can still do some similar things, and I agree, those do seem wrong. See this post for basically my argument for what we're going to have to do with that wrong-seeming.

The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

I think that one of the problems in this post is actually easier in the real world than in the toy model.

In the toy model the AI has to succeed by maximizing the agent's True Values, which the agent is assumed to have as a unique function over its model of the world. This is a very tricky problem, especially when, as you point out, we might allow the agent's model of reality to be wrong in places.

But in the real world, humans don't have a unique set of True Values or even a unique model of the world - we're non-Cartesian, which means that when we talk about our values, we are assuming a specific sort of way of talking about the world, and there are other ways of talking about the world in which talk about our values doesn't make sense.

Thus in the real world we cannot require that the AI has to maximize humans' True Values, we can only ask that it models humans (and we might have desiderata about how it does that modeling and what the end results should contain), and satisfy the modeled values. And in some ways this is actually a bit reassuring, because I'm pretty sure that it's possible to get better final results on this problem than on learning the toy model agent's True Values - maybe not in the most simple case, but as you add things like lack of introspection, distributional shift, meta-preferences like identifying some behavior as "bias," etc.

Learning Normativity: A Research Agenda

I'm pretty on board with this research agenda, but I'm curious what you think about the distinction between approaches that look like finding a fixed point, and approaches that look like doing perturbation theory.

And on the assumption that you have no idea what I'm referring to, here's the link to my post.

There are a couple different directions to go from here. One way is to try to collapse the recursion. Find a single agent-shaped model of humans that is (or approximates) a fixed point of this model-ratification process (and also hopefully stays close to real humans by some metric), and use the preferences of that. This is what I see as the endgame of the imitation / bootstrapping research.

Another way might be to imitate communication, and find a way to use recursive models such that we can stop the recursion early without much loss in effectiveness. In communication, the innermost layer of the model can be quite simplistic, and then the next is more complicated by virtue of taking advantage of the first, and so on. At each layer you can do some amount of abstracting away of the details of previous layers, so by the time you're at layer 4 maybe it doesn't matter that layer 1 was just a crude facsimile of human behavior.
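As a purely numerical toy (my own illustration, nothing agent-shaped about it, and `refine` is a made-up stand-in for one step of model-ratification), the contrast between the two strategies looks something like this:

```python
def refine(model):
    # Hypothetical one-step "ratification": produce the next layer's model from the previous one.
    # Here it's just a contraction map with fixed point 2.0, standing in for something much richer.
    return 0.5 * model + 1.0

def collapse_to_fixed_point(model, tol=1e-9):
    # Strategy 1: collapse the recursion -- iterate until the model (approximately) ratifies itself.
    while abs(refine(model) - model) > tol:
        model = refine(model)
    return model

def truncate_early(model, depth):
    # Strategy 2: keep the layers explicit and stop after a few of them,
    # hoping the later layers abstract away the crudeness of the earlier ones.
    for _ in range(depth):
        model = refine(model)
    return model

print(collapse_to_fixed_point(0.0))  # ~2.0
print(truncate_early(0.0, 4))        # 1.875 -- already close after a few layers
```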

Thinking specifically about this UTAA monad thing, I think it's a really clever way to think about what levers we have access to in the fixed-point picture. (If I had to point out one thing it's lacking: it's a little hazy on whether you're supposed to model V as now having meta-values about the state of the entire recursive tree of UTAAs, or whether your Q function is now supposed to learn about meta-preferences from some outside data source.) But it retains the things I'm worried about from this fixed-point picture, which is basically that I'm not sure it buys us much of anything if the starting point isn't benign in a quite strong sense.

Communication Prior as Alignment Strategy

Yeah, this is basically CIRL, when the human-model is smart enough to do Gricean communication. The important open problems left over after starting with CIRL are basically "how do you make sure that your model of communicating humans infers the right things about human preferences?", both due to very obvious problems like human irrationality, and also due to weirder stuff like the human intuition that we can't put complete confidence in any single model.

Building AGI Using Language Models

Sure. It might also be worth mentioning multimodal uses of the transformer algorithm, or the use of verbal feedback as a reward signal to train reinforcement learning agents.

As for whether this is a fire alarm, this reminds me of the old joke: "In theory, there's no difference between theory and practice. But in practice, there is."

You sort of suggest that in theory, this would lead to an AGI, but that in practice, it wouldn't work. Well in theory, if it fails in practice that means you didn't use a good enough theory :)

Ethics in Many Worlds

If we're cosmopolitan, we might expect that the wavefunction of the universe at the current time contains more than just us. In fact, the most plausible picture is that it already contains some amount (albeit usually tiny) of every possible state.

And so there is no good sense in which time evolution of the universe produces "more" of me. It doesn't produce new states with me in them, because those states already exist, there's just probably not much quantum measure in them. And it doesn't produce new quantum measure out of thin air - it only distributes what I already have.
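One way to make the conservation claim precise (my own formalization, just standard quantum mechanics rather than anything quoted from the post): if time evolution is unitary, the total measure is fixed and evolution only redistributes it among branches.

```latex
% With |\psi(t)\rangle = U(t)\,|\psi(0)\rangle and U(t)^\dagger U(t) = I,
% then for any orthonormal basis of branch states \{|b_i\rangle\}:
\[
\sum_i \bigl|\langle b_i \mid \psi(t)\rangle\bigr|^2
  \;=\; \langle \psi(t)\mid\psi(t)\rangle
  \;=\; \langle \psi(0)\mid U(t)^\dagger U(t)\mid\psi(0)\rangle
  \;=\; \langle \psi(0)\mid\psi(0)\rangle .
\]
% The total quantum measure is conserved; the measure in any particular branch
% can grow or shrink only by redistribution of what was already there.
```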
