Nominated Posts for the 2018 Review

2018 Review Discussion

Clarifying "AI Alignment"

When I say an AI A is aligned with an operator H, I mean:

A is trying to do what H wants it to do.

The “alignment problem” is the problem of building powerful AI systems that are aligned with their operators.

This is significantly narrower than some other definitions of the alignment problem, so it seems important to clarify what I mean.

In particular, this is the problem of getting your AI to try to do the right thing, not the problem of figuring out which thing is right. An aligned AI would try to figure out which thing is right, and like a human it may or may not succeed.


Consider a human…

Vanessa Kosoy:

The idea is, we will solve the alignment problem by (i) formulating a suitable learning protocol, (ii) formalizing a set of assumptions about reality, and (iii) proving that under these assumptions, this learning protocol has a reasonable subjective regret bound. So, the role of the subjective regret bound is making sure that what we came up with in (i)+(ii) is sufficient, and also guiding the search there. The subjective regret bound does not tell us whether particular assumptions are realistic: for this we need to use common sense and knowledge outside of theoretical computer science (such as physics, cognitive science, experimental ML research, evolutionary biology...).

I disagree with the OP that (emphasis mine):

I think that intent alignment is too ill-defined, and to the extent it is well-defined it is a very weak condition that is not sufficient to address the urgent core of the problem.

I don't think strong feasibility results will have to talk about the environment, or rather, they will have to talk about it at a very high level of abstraction. For example, imagine that we prove that stochastic gradient descent on a neural network with a particular architecture efficiently agnostically learns any function in some space, such that as the number of neurons grows, this space efficiently approximates any function satisfying some kind of simple and natural "smoothness" condition (an example motivated by already-known results). This is a strong feasibility result. We can then debate whether using such a smooth approximation is sufficient for superhuman performance, but establishing this requires different tools, as I said above.

The way I imagine it, AGI theory should ultimately arrive at some class of priors that are on the one hand rich enough to deserve to be called "general" (or, practically speaking, rich enough to produce superhuman agents) and on the other hand narrow enough to allow for efficient algorithms.
For example, the Solomonoff prior is too rich…
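Vanessa's steps (i)-(iii) can be written down schematically. The notation below is an illustrative sketch of what a subjective regret bound asserts, not her exact formalism:

```latex
% Let \zeta be the agent's prior over environments \mu, let \pi be the
% learning protocol, and let U_\mu^\pi(T) be the expected utility of
% running \pi in environment \mu over horizon T. A subjective regret
% bound says that, in expectation over the agent's OWN prior,
\mathrm{SR}(\pi, T)
  \;=\;
  \mathbb{E}_{\mu \sim \zeta}
  \!\left[\,
    \sup_{\pi^{*}} U_{\mu}^{\pi^{*}}(T) \;-\; U_{\mu}^{\pi}(T)
  \right]
  \;=\; o(T)
% i.e. the agent's per-step shortfall relative to the best policy for
% the true environment vanishes as the horizon grows -- provided the
% true environment lies in the support of \zeta, which is exactly what
% the assumptions formalized in step (ii) are for.
```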
rohinmshah:

Okay, so there seem to be two disagreements:

* How bad is it that intent alignment is ill-defined?
* Is work on intent alignment urgent?

The first one seems primarily about our disagreements on the utility of theory, which I'll get to later. For the second one, I don't know what your argument is that the non-intent-alignment work is urgent. I agree that the simulation example you give is an example of how flawed epistemology can systematically lead to x-risk. I don't see the argument that it is very likely (maybe the first few AGIs don't think about simulations; maybe it's impossible to construct such a convincing hypothesis). I especially don't see the argument that it is more likely than the failure mode in which a goal-directed AGI is optimizing for something different from what humans want. (You might respond that intent alignment brings risk down from, say, 10% to 3%, whereas your agenda brings risk down from 10% to 1%. My response would be that once we have successfully figured out intent alignment to bring risk from 10% to 3%, we can then focus on building a good prior to bring the risk down from 3% to 1%. All numbers here are very made up.)

My guess is that any such result will either require samples exponential in the dimensionality of the input space (prohibitively expensive), or the simple and natural condition won't hold for the vast majority of cases that neural networks have been applied to today. I don't find smoothness conditions in particular very compelling, because many important functions are not smooth (e.g. most things involving an if condition).

Consider this example: You are a bridge designer. You make the assumption that forces on the bridge will never exceed some value K (necessary because you can't be robust against unbounded forces). You prove your design will never collapse given this assumption. Your bridge collapses anyway because of resonance.

The broader point…
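Rohin's "if condition" point can be made concrete with a toy sketch (the names and numbers here are illustrative, not from the thread): a hard threshold has no finite Lipschitz constant, which is one precise sense in which it is not smooth.

```python
# Illustrative sketch: a bare "if condition" is not smooth in the
# Lipschitz sense -- the difference quotient across the threshold
# grows without bound as the two inputs approach it.

def step(x: float) -> float:
    """f(x) = 1 if x > 0 else 0 -- a hard 'if condition'."""
    return 1.0 if x > 0 else 0.0

def difference_quotient(f, a: float, b: float) -> float:
    """|f(a) - f(b)| / |a - b|; bounded for Lipschitz f, unbounded here."""
    return abs(f(a) - f(b)) / abs(a - b)

# Shrinking the gap around the threshold blows the quotient up:
# no single Lipschitz constant covers the step function.
quotients = [difference_quotient(step, eps, -eps) for eps in (0.1, 0.01, 0.001)]
print([round(q) for q in quotients])  # [5, 50, 500]
```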
Vanessa Kosoy:

First, even working on unlikely risks can be urgent, if the risk is great and the time needed to solve it might be long enough compared to the timeline until the risk.

Second, I think this example shows that it is far from straightforward to even informally define what intent alignment is. Hence, I am skeptical about the usefulness of intent alignment. For a more "mundane" example, take IRL. Is IRL intent-aligned? What if its assumptions about human behavior are inadequate and it ends up inferring an entirely wrong reward function? Is it still intent-aligned, since it is trying to do what the user wants and is merely wrong about what the user wants? Where is the line between "being wrong about what the user wants" and optimizing something completely unrelated to what the user wants? It seems like intent alignment depends on our interpretation of what the algorithm does, rather than only on the algorithm itself. But actual safety is not a matter of interpretation, at least not in this sense.

I don't know why you think so, but at least this is a good crux, since it seems entirely falsifiable. In any case, exponential sample complexity definitely doesn't count as "strong feasibility". Smoothness is just an example; it is not necessarily the final answer. But also, in classification problems smoothness usually translates to a margin requirement (the classes have to be separated with sufficient distance). So, in some sense, smoothness allows for "if conditions" as long as you're not too sensitive to the threshold.

I don't understand this example. If the bridge can never collapse as long as the outside forces don't exceed K, then resonance is covered as well (as long as it is produced by forces below K). Maybe you meant that the outside forces are also assumed to be stationary. Nevertheless, most engineering projects make heavy use of theory; I don't understand why you think AGI must be different.
The issue of assumptions in strong feasibility is equivalent to the question…
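Vanessa's margin point can likewise be sketched (again an illustrative toy, not from the thread): if the classes stay at least a margin away from the threshold, a smooth bounded-slope sigmoid reproduces the hard "if" exactly after rounding, so smoothness tolerates "if conditions" when nothing sits near the boundary.

```python
# Illustrative sketch: with a margin around the threshold, a SMOOTH
# classifier (bounded slope everywhere) recovers the same labels as
# the hard if-condition -- you only need the hard threshold when you
# are sensitive to inputs arbitrarily close to it.
import math

def smooth_classifier(x: float, margin: float) -> float:
    """Sigmoid steepened so its maximum slope is ~1/margin: smooth,
    Lipschitz, yet near 0/1 outside the margin band."""
    return 1.0 / (1.0 + math.exp(-4.0 * x / margin))

margin = 0.5
negatives = [-2.0, -1.0, -0.5]   # all at least `margin` below the threshold 0
positives = [0.5, 1.0, 2.0]      # all at least `margin` above it

# Rounding the smooth output reproduces the hard if-condition's labels.
labels = [round(smooth_classifier(x, margin)) for x in negatives + positives]
print(labels)  # [0, 0, 0, 1, 1, 1]
```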
First, even working on unlikely risks can be urgent, if the risk is great and the time needed to solve it might be long enough compared to the timeline until the risk.

Okay. What's the argument that the risk is great (I assume this means "very bad" and not "very likely" since by hypothesis it is unlikely), or that we need a lot of time to solve it?

Second, I think this example shows that it is far from straightforward to even informally define what intent alignment is.

I agree with this; I don't think this is one of our cruxes. (I d…

Epistemic Status: Simple point, supported by anecdotes and a straightforward model, not yet validated in any rigorous sense I know of, but IMO worth a quick reflection to see if it might be helpful to you.

A curious thing I've noticed: among the friends whose inner monologues I get to hear, the most self-sacrificing ones are frequently worried they are being too selfish, the loudest ones are constantly afraid they are not being heard, the most introverted ones are regularly terrified that they're claiming more than their share of the conversation, the most assertive ones are always su…

I'd be quite curious about more concrete examples of systems where there is lots of pressure in *the wrong direction*, due to broken alarms. (Be they minds, organisations, or something else.) The OP hints at it with the consulting example, as does habryka in his nomination.

I strongly expect there to be interesting ones, but I have neither observed any nor spent much time looking.

Being a Robust Agent (v2)

Second version, updated for the 2018 Review. See change notes.

There's a concept which many LessWrong essays have pointed at (honestly, which the entire sequences are getting at). But I don't think there's a single post really spelling it out explicitly:

You might want to become a more robust, coherent agent.

By default, humans are a kludgy bundle of impulses. But we have the ability to reflect upon our decision making, and the implications thereof, and derive better overall policies. Some people find this naturally motivating – there’s something aesthetically appea…

I'm leaning towards reverting the title to just "being a robust agent", since the new title is fairly clunky, and someone gave me private feedback that it felt less like a clear handle for a concept. [edit: have done so]
