Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I recently gave a two-part talk on the big picture of alignment, as I see it. The talk is not-at-all polished, but contains a lot of stuff for which I don't currently know of any good writeup. Major pieces in part one:

  • Some semitechnical intuition-building for high-dimensional problem-spaces.
    • Optimization compresses information "by default"
    • Resources and "instrumental convergence" without any explicit reference to agents
  • A frame for thinking about the alignment problem which only talks about high-dimensional problem-spaces, without reference to AI per se.
    • The central challenge is to get enough bits-of-information about human values to narrow down a search-space to solutions compatible with human values.
    • Details like whether an AI is a singleton, tool AI, multipolar, oracle, etc are mostly irrelevant.
  • Fermi estimate: just how complex are human values?
  • Coherence arguments, presented the way I think they should be done.
    • Also subagents!

Note that I don't talk about timelines or takeoff scenarios; this talk is just about the technical problem of alignment.

Here's the video for part one: 

Big thanks to Rob Miles for editing! Also, the video includes some good questions and discussion from Adam Shimi, Alex Flint, and Rob Miles.


Are there already plans for a transcript of this? (I could set a rev.com transcription in motion.)

No plans in motion. Thank you very much if you decide to do so! Also, you might want to message Rob to get the images.

I've put in a request for a transcript.

How do transcriptions typically handle images? They're pretty important for this talk. You could embed the images in the text as it progresses?

I second Rob's unanswered question at 40:12: how is it that we ever accomplish anything in practice, if the search space is vast, and things that both work and look like they work are exponentially rare?

How is the "the genome is small, therefore generators of human values (that can't be learned from the environment) are no more complex than tens or hundreds of things on the order of a fuzzy face detector" argument compatible with the complexity of value thesis, or does it contradict it?

>how is it that we ever accomplish anything in practice, if the search space is vast, and things that both work and look like they work are exponentially rare?

This question needs a whole essay (or several) on its own. If I don't get around to leaving a longer answer in the next few days, ping me.

Meanwhile, if you want to think it through for yourself, the general question is: where the hell do humans get all their bits-of-search from?

>How is the "the genome is small, therefore generators of human values (that can't be learned from the environment) are no more complex than tens or hundreds of things on the order of a fuzzy face detector" argument compatible with the complexity of value thesis, or does it contradict it?

The key difference is between "human values" vs "generators of human values". The complexity of value thesis (as articulated on that arbital page) says that human values are not algorithmically simple, and I do agree with that. But that still allows for simple generators of human values, which (conceptually) take in lots of data from the real world and spit out values. Everything except those generators is learned from the environment.

In principle, if we can figure out those relatively-simple generators, then we can feed an AI data similar to the data from which humans' value-generators generate their values, and the AI should be able to reconstruct human values (up to within ordinary between-humans-within-similar-environments variation).
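For a sense of scale on "the genome is small", here is a rough back-of-the-envelope sketch. The base-pair count is an assumed round figure (about 3.1 billion for the haploid human genome), and the result is only an upper bound on raw storage: it says nothing about what fraction of that is relevant to value-generators.

```python
# Assumed round figures for a Fermi estimate; not taken from the talk itself.
BASE_PAIRS = 3.1e9   # approximate haploid human genome length
BITS_PER_BASE = 2    # four possible bases -> log2(4) = 2 bits


def genome_upper_bound_megabytes(base_pairs=BASE_PAIRS, bits_per_base=BITS_PER_BASE):
    """Upper bound on raw genome information content, in megabytes."""
    return base_pairs * bits_per_base / 8 / 1e6


upper_bound_mb = genome_upper_bound_megabytes()
print(f"Raw upper bound: about {upper_bound_mb:.0f} MB")
```

Anything hardwired, value-generators included, has to fit inside that budget alongside everything else the genome specifies.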

>Meanwhile, if you want to think it through for yourself, the general question is: where the hell do humans get all their bits-of-search from?

Cultural accumulation and Google, but that's mimicking someone who's already figured it out. How about the person who first figured out, e.g., crop growth? It could be the scientific method, but also just random luck which then caught on.

Additionally, sometimes it's just applying the same hammers to different nails or finding new nails, which means that there are general patterns (hammers) that can be applied to many different situations. There are bits of information in both the patterns themselves and in knowing when to apply them, though I feel confused trying to connect these ideas here.

People specifically have inner simulations (i.e., you can imagine what it'd look like to drop a bowling ball off a building even if you've never seen it), built from things they have lots of experience with; running them is a way of applying known patterns to new situations.

Words cannot possibly express how thankful I am for you doing this!

Thanks a bunch!

  1. I want to interrogate a little more the notion that gradient descent samples uniformly (or rather, is dominated by the initialization distribution) from good parameters. Have you read various things about grokking, like Hypothesis: GD Prefers General Circuits? That argument seems to be that you might start with parameters dominated by the initialization distribution, but various sorts of regularization are going to push you to sample solutions in a nonuniform way. Do you have a take on this?
  2. For the power-seeking-because-of-entropy example, I want to second the audience questions. If you're getting your policy by sampling from all possible policies, the argument is great, but if you're getting your policy by sampling from NN parameters that generate strings of 100 actions, then you just finished arguing that uniform-ish sampling over NN parameters will give simplicity-ish sampling over policies. What would a NN do if trained to play the example game? I would assume it would quickly learn to exactly alternate $ and Apple. This seems a little less like power-seeking, and more like telling DeepDream to fill the image with dogs, except filling a string with buying three apples. I dunno, do you think it's still like power-seeking?
  3. I think you make a subtle error when throwing out a lot of "mere biology" genes as not generating human values. If we had different mere biology than we do, the values we develop would probably be different even if our brain-specific genes were the same! Like, I dunno, suppose you have some genes that build your thyroid. But you can't go "ho hum, the thyroid isn't the brain, let's throw those genes out as uninformative," because thyroid activity impacts your mood, which impacts your expressed values. Or I bet I'd have different values if my eyes saw in UV rather than visible, or my skin had no sense of pain, or I went through adolescence in two days rather than five years. Basically I totally disagree with this notion that "if we share it with plants, an AI wouldn't need to know it."
  4. Actually I'm kinda not sure how relevant you think the size-of-human-preference-generators question is, since we don't want the AI to learn human preferences in gene-format, we want the AI to learn human preferences in some (different, I think we agree) format that's better-suited for doing things like making decisions or comparing between different humans.
  5. Cool last section. If you can have 2 dimensions of things to be Pareto optimal over tradeoffs between, why not N dimensions? It seems like there are behaviors that are irrational even for markets (is failing to make mutually beneficial trades between individuals an example? I'm having trouble thinking of something less inward-facing) that could be "optimal" for decision-making procedures with N of 3 or 4.
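To make the "uniform-ish sampling over parameters gives simplicity-ish sampling over functions" claim from points 1 and 2 concrete, here is a minimal toy sketch. It is not from the talk or from any paper: the "network" is a single threshold unit on two binary inputs, and the uniform weight range `[-1, 1]`, the sample count, and the seed are all assumptions chosen for illustration.

```python
import random
from collections import Counter
from itertools import product

INPUTS = list(product([0, 1], repeat=2))  # the four possible binary inputs


def sampled_function(rng):
    """Draw one threshold unit uniformly at random and return its truth table."""
    w1, w2, b = (rng.uniform(-1, 1) for _ in range(3))
    return tuple(int(w1 * x1 + w2 * x2 + b > 0) for x1, x2 in INPUTS)


def function_frequencies(n_samples=20000, seed=0):
    """Count how often each truth table appears under uniform parameter sampling."""
    rng = random.Random(seed)
    return Counter(sampled_function(rng) for _ in range(n_samples))


freqs = function_frequencies()
most_common_tables = [table for table, _ in freqs.most_common(2)]
for table, count in freqs.most_common():
    print(table, count)
```

In this toy setting the two constant truth tables (always 0, always 1) come out far more frequent than any other function, even though the parameters were sampled uniformly. Whether the same bias carries over to trained networks on the apple game is exactly the open question in point 2.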

>The central challenge is to get enough bits-of-information about human values to narrow down a search-space to solutions compatible with human values.

This sounds like a fundamental disagreement with Yudkowsky's view. (I think) Yudkowsky thinks the hardest part of alignment is getting an AGI to do any particular specified thing (that requires superhuman general intelligence) at all, whatever it may be, since by default an AGI will optimize hard for something that no programmer had in mind; on that view, the problem is not mainly about pointing at particular values. Do you recognize this as a disagreement, and what do you think of it? Do you think aiming-at-all is not that hard, or isn't usefully separated from pointing at human values?

I think these are both pointing to basically-the-same problem. Under Yudkowsky's view, it's presumably not hard to get AI to do X for all values of X, but it's hard for most of the X which humans care about, and it's hard for most of the things which seem like human-intuitive "natural things to do".

Huh. I thought Yudkowsky's view was that it's hard to get an AGI to do X for all values of X, where X is the final effect of the AGI on the world (like, what the universe looks like when the AI is done doing its thing). If X is instead an instrumental sort of thing, like getting a lot of energy and matter, then it's not hard to get an AGI to do that.

So "get enough bits-of-information about human values" makes sense if you have something you can do with the bits, i.e. narrow down something. If we don't know how to specify any final effect of an AGI at all, then we have an additional problem, which is that we don't know how to do anything with the bits of information about which final effects we want.

I mean, yeah, we do need to be able to use the bits to narrow down a search space.

What's the search space? Policies, or algorithms, or behaviors, or something. What's the information? Well, basically pointing a camera at anything in the world today gives you information about human values, or reading anything off the internet. What do we do with this information to get policies we like? The bits of information aren't the problem; the problem is that we don't know how to narrow down policy space or algorithm space or behavior space so that it has some particular final results. Getting bits of information about human values, and being able to aim an AGI at anything, are different problems.

>Getting bits of information about human values, and being able to aim an AGI at anything, are different problems.

I think these are the same problem? Like, ability-to-narrow-down-a-search-space-or-behavior-space-by-a-factor-of-two is what a bit of information is. If we can't use the information to narrow down a search space closer to the thing-the-information-is-supposedly-about, then we don't actually have any information about that thing.
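As a small worked version of "one bit = narrowing the search space by a factor of two", here is a sketch of the arithmetic. The function names and example sizes are made up for illustration; it assumes every candidate in the space is equally likely a priori.

```python
import math


def bits_required(space_size, target_size):
    """Bits needed to narrow a space of `space_size` candidates down to `target_size`."""
    return math.log2(space_size / target_size)


def remaining_after(space_size, bits):
    """Candidates left after applying `bits` bits of perfectly usable information."""
    return space_size / 2 ** bits


print(bits_required(1024, 1))    # 10.0: ten halvings take 1024 candidates down to 1
print(remaining_after(1024, 3))  # 128.0: three bits cut the space by a factor of 8
```

Note the "perfectly usable" caveat in the second docstring: the arithmetic only holds when each bit actually cuts the space in the intended direction.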

>Like, ability-to-narrow-down-a-search-space-or-behavior-space-by-a-factor-of-two is what a bit of information is.

Information is an upper bound, not a lower bound. The capacity of a channel gives you an upper bound on how many distinct messages you can send, not a lower bound on your performance on some task using messages sent over the channel. If you have a very high info-capacity channel with someone who speaks a different language from you, you don't have an informational problem, you have some other problem (a translation problem).

>If we can't use the information to narrow down a search space closer to the thing-the-information-is-supposedly-about, then we don't actually have any information about that thing.

This seems to render the word "information" equivalent to "what we know how to do", which is not the technical meaning of information. Do you mean to do that? If so, why? It seems like a misframing of the problem, because what's hard about the problem is that you don't know how to do something, and don't know how to gather data about how to do that thing, because you don't have a clear space of possibilities with a shattering set of clear observable implications of those possibilities. When you don't know how to do something and don't have a clear space of possibilities, the sort of pieces of progress you want to make aren't fungible with each other the way information is fungible with other information.

[ETA: Like, if the space in question is the space of which "human values" is a member, then I'm saying, our problem isn't locating human values in that space, our problem is that none of the points in the space are things we can actually implement, because we don't know how to give any particular values to an AGI.]

The Shannon formula doesn't define what information is; it quantifies the amount of information. People occasionally point this out as being kind of philosophically funny - we know how to measure amount of information, but we don't really have a good definition of what information is. Talking about what information is immediately runs into the question of what the information is about, how the information relates to the thing(s) it's about, etc.

Those are basically similar to the problems one runs into when talking about e.g. an AI's objective and whether it's "aligned with" something in the physical world. Like, this mathematical function (the objective) is supposed to talk about something out in the world, presumably it should relate to those things in the world somehow, etc. I claim it's basically the same problem: how do we get symbolic information/functions/math-things to reliably "point to" particular things in the world?

(This is what Yudkowsky, IIUC, would call the "pointer problem".)

Framed as a bits-of-information problem, the difficulty is not so much getting enough bits as getting bits which are actually "about" "human values". (Presumably that's why my explanations seem so confusing.)

If natural abstractions are a thing, in what sense is "make this AGI have particular effect X" trying to be about human values, if X is expressed using natural abstractions? 

In that case, it's not about human values, which is one of the very nice things the natural abstraction hypothesis buys us.

Section 1 (about compression) was pretty good, I don't think I had fully internalized this idea, despite having followed a lot of your posts.

Thinking through the "vast majority of problem-space for X fails" argument; assume we have a random text generator that we want to run a sorting algorithm:

  • The vast majority don't sort (or even compile)
  • The vast majority of programs that "look like they work", don't (eg "forgot a semicolon", "didn't account for an already sorted list", etc)
  • Generalizing: the vast majority of programs that pass [Unit tests, compiles, human says "looks good to me", simple], don't work. 
    • Could be incomprehensible, pass several unit tests, but still fail in weird edge cases (eg. when the input number is [84, >100, a prime number > 13, etc], then it spits out gibberish) 
    • This is a counterargument to the alignment check of "run it in a simulation to see if it breaks out of the box", because that check is just another proxy.
    • Some constraints above are necessary, like being compilable, and some aren't: some randomly generated sorting algorithms are really hard to understand. For example, one could be written in brainfuck, or contain 10,000 lines of code that are mostly redundant or happen to cancel out, and still sort correctly
      • To relate to the original talk, I agree that I can recognize my own values once I reflect on them, but this is different than seeing a plan about an AI that keeps my values and thinking "this looks like it works". In other words, the "human values" shouldn't be a strict subset of the "human says it looks like it works", just like "correctly sorts" shouldn't be a strict subset of "human says it looks like it works" due to incomprehensibility.
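A tiny concrete instance of "passes the unit tests, looks fine, doesn't actually work" from the list above. Both the flawed function and the test cases are invented for illustration; the only point is that a plausible-looking test suite can miss a whole class of inputs.

```python
def looks_like_sort(xs):
    """Deliberately flawed 'sort': it silently drops duplicate elements."""
    return sorted(set(xs))


def passes_unit_tests(sort_fn):
    """A plausible-looking test suite that happens to contain no duplicates."""
    cases = [[], [1], [3, 1, 2], [5, 4, 3, 2, 1], [1, 2, 3]]
    return all(sort_fn(case) == sorted(case) for case in cases)


print(passes_unit_tests(looks_like_sort))  # True: every proxy check passes
print(looks_like_sort([2, 1, 2]))          # [1, 2]: wrong, one element is lost
```

Here the failure is easy to spot because the code is three lines long; the worry in the bullets above is the version where the program is incomprehensible and the missed input class is something nobody thought to test.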

For programs specifically, if it's simple and passes a relevant distribution of unit tests, we can be highly confident it in fact sorts correctly, but what's the equivalent for "plan that maintains human values"? Let's say John succeeds and finds what we think to be the generators of human values, would it be comprehensible enough to verify it?

Applying the argument again but to John's proposed solution, the vast majority of [AIs trained in human environments with what we think are the simple generators of human values]'s plans & behaviors may look good but not actually be good. Or the weights are incomprehensible, so we use unit tests to verify, and it could still fail.

Counter-counterargument: I can imagine these generators being simple enough that we can indeed be confident they do what we want. Since it should be human-value-equivalent, it should also be human-interpretable (under reflection?). 

This sounds like a good idea overall, but I wouldn't bet my life on it. It'd be nice to have necessary and sufficient conditions for this possible solution.

I think a lot of the values we care about are cultural, not just genetic. A human raised without culture isn't even clearly going to be generally intelligent (in the way humans are), so why assume they'd share our values?

Estimations of the information content of this part are discussed by Eric Baum in What is Thought?, although I do not recall the details.

I find that plausible, a priori. Mostly doesn't affect the stuff in the talk, since that would still come from the environment, and the same principles would apply to culturally-derived values as to environment-derived values more generally. Assuming the hardwired part is figured out, we should still be able to get an estimate of human values within the typical-human-value-distribution-for-a-given-culture from data which is within the typical-human-environment-distribution-for-that-culture.

Thanks a lot for posting this! A minor point about the 2nd intuition pump (100-timesteps, 4 actions: Take $1, Do Nothing, Buy Apple, Buy Banana; the point being that most action sequences take the Take $1 action a lot rather than the Do Nothing action): the "goal" of getting 3 apples seems irrelevant to the point, and may be misleading if you think that that goal is where the push to acquire resources comes from. A more central source seems to me to be the "rule" of not ending with a negative balance: this is what prunes paths through the tree that contain more "do nothing" actions.

Yup! More generally, key pieces for modeling a "resource": amounts of the resource are additive, and more resources open up more actions (operationalized by the need for a positive balance in this case). If there's something roughly like that in the problem space, then the resource-seeking argument kicks in.
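Here is a minimal sketch of that counting argument for the toy game. The details are assumptions: each fruit costs $1, the starting balance is $0, and a sequence counts as valid only if the balance never goes negative (a slightly stronger rule than "don't end negative"). The code counts valid action sequences exactly, by dynamic programming, and tallies how often each action is used across all of them.

```python
from collections import defaultdict

# Hypothetical encoding of the game: action name -> change in balance.
# The $1 fruit prices are an assumption, not something stated in the talk.
ACTIONS = {"take_dollar": +1, "do_nothing": 0, "buy_apple": -1, "buy_banana": -1}


def action_totals(horizon):
    """Return (number of valid sequences, total uses of each action across them)."""
    # completions[t][b]: valid ways to play steps t..horizon-1 starting from balance b
    completions = [defaultdict(int) for _ in range(horizon + 1)]
    for b in range(horizon + 1):
        completions[horizon][b] = 1
    for t in range(horizon - 1, -1, -1):
        for b in range(t + 1):  # balance after t steps is at most t
            completions[t][b] = sum(
                completions[t + 1][b + d] for d in ACTIONS.values() if b + d >= 0
            )
    prefixes = {0: 1}  # prefixes[b]: valid ways to reach balance b after t steps
    totals = dict.fromkeys(ACTIONS, 0)
    for t in range(horizon):
        nxt = defaultdict(int)
        for b, ways in prefixes.items():
            for name, d in ACTIONS.items():
                if b + d < 0:
                    continue  # cannot buy without funds
                totals[name] += ways * completions[t + 1][b + d]
                nxt[b + d] += ways
        prefixes = nxt
    return completions[0][0], totals


n_valid, totals = action_totals(100)
for name, total in totals.items():
    print(name, total / n_valid)  # average uses per uniformly sampled valid sequence
```

No goal appears anywhere in this code, yet "take_dollar" is used more often than "do_nothing": replacing a "do_nothing" with a "take_dollar" always keeps a sequence valid, while the reverse can break it.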

Cheers for posting! I've got a question about the claim that optimizers compress by default, due to the entropy maximization-style argument given around 20:00 (apologies if you covered this, it's not easy to check back through a video):

Let's say that we have a neural network of width 100, trained on a dataset that a network of width only 30 could fit to perfect accuracy. If training compresses the solution into only 30 weights, there's a 70-dimensional space of free parameters, and we should expect a randomly selected solution to be of this kind.

I agree that if we randomly sample zero-loss weight configurations, we end up with this kind of compression, but it seems that any kind of learning we know how to do is dependent on the paths that one can take to reach it, and that abstracting this away can give very different results to any high-dimensional optimization that we actually know how to do. 

Assuming that the network is parameterized by, say, float16s, maximal compression of the data would result in the output of the network being sensitive to the final bit of the weights in as many cases as possible, thereby leaving the largest number of free bits, so 16 bits of info would be compressed into one weight, rather than spread among 3-4.

My intuition is that these highly compressed arrangements would be very sensitive to perturbations, which would render them incredibly difficult to reach in practice (they would also have a big problem with unknown examples, and are therefore screened off by techniques like dropout and regularization). There is therefore a competing incentive towards minima which are easy to land on - probably flat minima surrounded by areas of relatively good performance. Further, I expect that these kinds of minima tend to leverage the whole network for redundancy and flatness (not needing to depend tightly on the final bit of weights).

The properties of the resulting minima would be not just compression but some combination of compression and smoothness (smoothness being sort of a variant of compression where the final bits don't matter much), which would not result in some subset of the parameters having all the useful information.

If you agree that this is what happens, in what sense is there really compression, if the info is spread among multiple bits? Perhaps given the structure of NNs, we should expect to be able to compress by removing the last bits of weights as these are the easiest to leave free given the structure of training?

If you disagree, I'd be curious to know where. I sense that Mingard et al. share your conclusion, but I don't yet understand the claimed empirical demonstration.

tldr: optimization may compress by default, but learning seems to counteract this by choosing easy-to-find minima.

>it seems that any kind of learning we know how to do is dependent on the paths that one can take to reach it, and that abstracting this away can give very different results to any high-dimensional optimization that we actually know how to do.

This is where Mingard et al come in. One of their main results is that SGD training on neural nets does quite well approximate just-randomly-sampling-an-optimal-point. Turns out our methods are not actually very path-dependent in practice!

>My intuition is that these highly compressed arrangements would be very sensitive to perturbations, and render them incredibly difficult to reach in practice... There is therefore a competing incentive towards minima which are easy to land on - probably flat minima surrounded by areas of relatively good performance.

There is a mismatch between your intuition and the implications of "flat minima surrounded by areas of relatively good performance".

Remember, the whole point of the "highly compressed arrangements" is that we only need to lock in a few parameter values in order to get optimal behavior; once those few values are locked in, the rest of the parameters can mostly vary however they want without screwing stuff up. "Flat minimum surrounded by areas of relatively good performance" is synonymous with compression: if we can vary the parameters in lots of ways without losing much performance, that implies that all the info needed for optimal performance has been compressed into whatever-we-can't-vary-without-losing-performance.

Now, your intuition is correct in the sense that info may be spread over many parameters; the relevant "ways to vary things" may not just be "adjust one param while holding others constant". For instance, it might be more useful to look at parameter variation along local eigendirections of the Hessian. Then the claim would be something like "flat optimum = performance is flat along lots of eigendirections, therefore we can project the parameter-values onto the non-flat eigendirections and those projections are the 'compressed info'". (Tbc, I still don't know what the best way is to characterize this sort of thing, but eigendirections are an obvious approximation which will probably work.)
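A toy sketch of the eigendirection picture, under stated assumptions: the two-parameter loss below is invented so that the information is spread over both parameters, the Hessian is estimated by finite differences, and the 2x2 eigenvalues use the standard closed form for symmetric matrices.

```python
import math


def loss(w1, w2):
    # Toy loss (an assumption for illustration): zero along the whole line w1 + w2 = 1.
    return (w1 + w2 - 1.0) ** 2


def hessian(f, w1, w2, h=1e-3):
    """Central finite-difference Hessian of a two-parameter function."""
    f0 = f(w1, w2)
    d11 = (f(w1 + h, w2) - 2 * f0 + f(w1 - h, w2)) / h**2
    d22 = (f(w1, w2 + h) - 2 * f0 + f(w1, w2 - h)) / h**2
    d12 = (f(w1 + h, w2 + h) - f(w1 + h, w2 - h)
           - f(w1 - h, w2 + h) + f(w1 - h, w2 - h)) / (4 * h**2)
    return d11, d12, d22


def eigenvalues_2x2(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]], smallest first."""
    mean = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    return mean - radius, mean + radius


# Evaluate at one of the (infinitely many) optima.
flat, stiff = eigenvalues_2x2(*hessian(loss, 0.3, 0.7))
print(flat, stiff)  # approximately 0 and 4
```

Neither parameter is free on its own (changing only `w1` hurts the loss), yet one eigendirection, along (1, -1), is completely flat. The "compressed info" is the single number `w1 + w2`, i.e. the projection onto the stiff direction (1, 1).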

>Turns out our methods are not actually very path-dependent in practice!

Yeah, I get that that's what Mingard et al. are trying to show, but the meaning of their empirical results isn't clear to me. I'll try to properly read the actual paper, rather than the blog post, before saying any more in that direction.

"Flat minimum surrounded by areas of relatively good performance" is synonymous with compression. if we can vary the parameters in lots of ways without losing much performance, that implies that all the info needed for optimal performance has been compressed into whatever-we-can't-vary-without-losing-performance.

I get that a truly flat area is synonymous with compression - but I think being surrounded by areas of good performance is anti-correlated with compression because it indicates redundancy and less-than-maximal sensitivity. 

I agree that viewing it as flat eigendimensions in parameter space is the right way to think about it. I still worry that the same concern applies: maximal compression in this space is traded against ease of finding what would be a flat plain in many dimensions but a maximally steep ravine in all of the other directions. I can imagine this could be investigated with some small experiments, or they may well already exist, but I can't promise I'll follow up. If anyone is interested, let me know.

Bump re/ my question about trying to make an AI do any specifiable thing at all vs. specifying some good thing to do; still curious what you think. 

Regarding generators of human values: say we have the gene information that encodes human cognition, what does that mean? Equivalent of a simulated human? Capabilities secret-sauce algorithm, right? I'm unsure if you can take the body out of a person and still have the same values, because I have felt senses in my body that tell me information about the world and how I relate to it.

Assume it works as a simulated person and ignore mindcrime, how do you algorithmically end up in a good enough subset of human values (because not all human values are meta-good)? Or, how do you use this to create a simulated long reflection? (ie what humans would decide ethics to be if they thought about it for [1000] years)

You could first figure out meta-preferences and bootstrap that in for figuring out preferences. Though, I'm unsure if there is a "correct" set of meta-preferences, with my main confusion being the blank spot in my map where "enlightenment" is.