Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.


I: Intro

I am extremely sympathetic to the program of AI safety via value learning. Because of that sympathy, Stuart Armstrong's post along those same lines prompted more thoughts from me than usual.

Stuart's post mostly deals with "partial preferences," which are like simple statements of binary preference (A is better than B), but associated with a context - supposedly the "human's model" the human was using when they exhibited or stated that preference. Then the post says that you should sort these partial preferences according to meta-levels and aggregate them from the top down, updating your procedure after you finish each meta-level, eventually producing a utility function over world-histories.

Broadly, I'd say that my opinion is sort of like the bitter lesson. The bitter lesson in, say, image recognition is that people wanted to do it with a bunch of human-designed features, formal reasoning, and human-understandable internal moving parts; they tried that for a long time, and what worked instead was way bigger models, way more computing power, far fewer human-understandable internal parts, and almost no human-designed features.

I like Stuart's outline more than most value learning proposals. But it still strikes me as primarily a list of human-designed features and human-understandable internal moving parts. We might be better off throwing away some of the details and abstracting in a way that allows for some of these problems to be solved by big models and computing power.

It's like the just-so story about ResNets, which is that they're a fix for humans thinking the insides of neural nets should look too much like human logic[^1]. I think speculating about the human-sized logical relationships between speculative parts inside the AI is easier but less useful than speculating about the algorithm that will connect your inputs to your outputs with a big model and lots of computing power, which may or may not have your logical steps as emergent features.

II: A long analogy about dams

If you want to design a dam, you don't draw the blueprint of the dam first and figure out what materials it should be made of later - first you learn a lot about hydrology and materials science so you know how steel and concrete and earth and water interact, then you draw the high-level design, then you fill in the details that weren't dictated either by physics or by your design goals. I'm claiming that we don't yet know much about the steel and water of value learning.

Here's a long digression as an example. Suppose you're trying to work out how to model human values the way humans do, even given lots of computing power and data. If you want to locate values within a model of humans, you can't just train the model for predictive power, because human values only appear in a narrow zone of abstraction, more abstract than biology and less abstract than population statistics, and an AI scored only on prediction will be pressured to go to a lower level of abstraction.

If you train an AI on a shared input of sensory data and a text channel from humans, will it learn a shared model of the world and the text channel that effectively solves the symbol grounding problem? Can you then activate desired concepts through the text channel, "cheating" a solution to lots of value learning problems?

No. Consider what happens in the limit of lots of resources, particularly if we are training this model for predictive power - it will be pressured towards a lower level of abstraction. Once it starts encoding the world differently than we do, it won't have the generalization properties we want - we'd be caught cheating, as it were. And if we could solve the training problem for verbal models, it seems like we could just solve the training problem to learn the concepts we want to learn. But maybe there's still some way to "cheat" in practice.

Another way to think of this problem is as building an "artificial intentional stance." But we have to remember that the intentional stance is not just a single model (and definitely not just the assumption that humans are like Homo economicus). It's a family of strategies used to learn about humans, model humans, and model interacting with humans. Stances aren't just an assumption about how to model one thing within a fixed model of the world; they're part of complete languages for talking about the world.

I want to know how to design an AI that not only develops approximate ways of understanding the world, but matches some of those ways of understanding to what it sees humans use. But even to do this, we don't really know how to talk in a principled way about what it is that it's supposed to be matching. So we've got to think about that.

This is an example of the sort of consideration that I think is urgent and interesting - and you can't always leave it as a detail to be filled in later, because depending on the base materials, the best design might be quite different.

III: Miscellaneous specific comments

Now some more specific comments about the proposal.

- How much of the hidden detail is in eliciting partial preferences? I've sort of been implying that it's a lot. Does it require a general artificial intentional stance to extract not just binary preferences but also the model the human is using to express those preferences?

- How much of the hidden detail is in doing meta-reasoning? If I don't trust an AI, more steps of meta-reasoning make me trust it even less - humans often say things about meta-reasoning that would be disastrous if implemented. What kind of amazing faculties would be required for an AI to extract partial preferences about meta-reasoning that actually made things better rather than worse? If I were better at understanding what the details actually are, maybe I'd pick on meta-reasoning more.

I do agree that the meta-reasoning step is necessary for this scheme, but I think that's because this scheme doesn't involve the AI building an explicit model of humans to provide consistency - it's repeatedly outsourcing the modeling job to amnesiac single-shot modules. If humans were reliable sources about meta-reasoning principles for combining binary preferences, this would work great, but since they aren't it won't - a low-level practical concern dictating higher-level design.

- The "symbol grounding module"'s job seems to be to take the partial preferences inside the partial preferences' contextual models and translate them into full preferences in the AI's native ontology. This seems like it requires the AI to have a really trustworthy grasp on the intentional stance and its variations - maybe I should imagine this as coming from the same process that originates those contextual models for partial preferences in the first place. This is a bit different than the symbol grounding I normally think about (grounding of internal symbols by their causal relationship to reality), but I agree it's an important part of the artificial intentional stance.



[^1]: The story goes something like this: When people first thought of neural networks, they thought of them as if each neuron were a logical node making a human-sized step in reasoning. And so they optimized the initialization of weights and the nonlinearity for each individual neuron functioning like a discriminator. But after many years of work, people realized that the "neurons are doing human-sized logical steps" model wasn't the best, and a better picture is that the neural network is massaging the input manifold around in a higher-dimensional space until eventually the input space gets transformed into something that's easy to classify. And so people developed ResNets, which are specialized for this gradual massaging of the input into the output, and they work great.
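
To make the "gradual massaging" picture concrete, here's a minimal residual block in the spirit of ResNets - a sketch in a PyTorch-style API, my own illustration rather than anything from the post, and omitting details like batch normalization:

```python
# A minimal residual block sketch (PyTorch assumed; illustration only).
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        # Compute a small correction to x and add it back, so the identity
        # mapping is the default and each block only nudges the representation
        # rather than producing a fresh, human-legible reasoning step.
        return self.relu(x + self.conv2(self.relu(self.conv1(x))))
```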

Comments

If you want to locate values within a model of humans, you can't just train the model for predictive power, because human values only appear in a narrow zone of abstraction, more abstract than biology and less abstract than population statistics, and an AI scored only on prediction will be pressured to go to a lower level of abstraction.

I don't understand why you're so confident. It doesn't seem to me that my values are divorced from biology (I want my body to stay healthy) or population statistics (I want a large population of people living happy lives). And if the AI knows the level of abstraction below my values, that sounds great, if it's found a good way to express my values in terms of those lower-level abstractions. Indeed, this is exactly what I'd expect a functioning FAI to do!

If you're worried that the AI will generalize poorly, that's a reasonable worry! But, everyone in machine learning knows generalization is an extremely important problem, so maybe also say whether you think "better machine learning" will solve the problem, and if not, why not.

Once it starts encoding the world differently than we do, it won't have the generalization properties we want - we'd be caught cheating, as it were.

Are you sure?

...brains, by contrast to the kinds of program we typically run on our computers, do not use standardized data storage and representation formats. Rather, each brain develops its own idiosyncratic representations of higher-level content. Which particular neuronal assemblies are recruited to represent a particular concept depends on the unique experiences of the brain in question (along with various genetic factors and stochastic physiological processes). Just as in artificial neural nets, meaning in biological neural networks is likely represented holistically in the structure and activity patterns of sizeable overlapping regions, not in discrete memory cells laid out in neat arrays.

From Superintelligence, p. 46.

I think your claim proves too much. Different human brains have different encodings, and yet we are still able to learn the values of other humans (for example, when visiting a foreign country) reasonably well when we make an honest effort.

BTW, I think it's harmful to confidently dismiss possible approaches to Friendly AI based on shaky reasoning, especially if those approaches are simple. Simple approaches are more likely to be robust, and if the AI safety community has a bias towards assigning credence to pessimistic statements over optimistic statements (even when the pessimistic statements are based on shaky reasoning or reasoning which hasn't been critically analyzed by others), that may cause us to neglect approaches which could actually be fruitful.

I don't understand why you're so confident. It doesn't seem to me that my values are divorced from biology (I want my body to stay healthy) or population statistics (I want a large population of people living happy lives).

When I say your preference is "more abstract than biology," I'm not saying you're not allowed to care about your body, I'm saying something about what kind of language you're speaking when you talk about the world. When you say you want to stay healthy, you use a fairly high-level abstraction ("healthy"), you don't specify which cell organelles should be doing what, or even the general state of all your organs.

This choice of level of abstraction matters for generalization. At our current level of technology, an abstract "healthy" and an organ-level description might have the same outcomes, but at higher levels of technology, maybe someone who preferred to be healthy would be fine becoming a cyborg, while someone who wanted to preserve some lower-level description of their body would be against it.

"Once it starts encoding the world differently than we do, it won't have the generalization properties we want - we'd be caught cheating, as it were."
Are you sure?

I think the right post to link here is this one by Kaj Sotala. I'm not totally sure - there may be some way to "cheat" in practice - but my default view is definitely that if the AI carves the world up along different boundaries than we do, it won't generalize in the same way we would, given the same patterns.

Nice find on the Bostrom quote btw.

I think your claim proves too much. Different human brains have different encodings, and yet we are still able to learn the values of other humans (for example, when visiting a foreign country) reasonably well when we make an honest effort.

I would bite this bullet, and say that when humans are doing generalization of values into novel situations (like trolley problems, or utopian visions), they can end up at very different places even if they agree on all of the everyday cases.

If you succeed at learning the values of a foreigner, so well that you can generalize those values to new domains, I'd suspect that the simplest way for you to do it involves learning about what concepts they're using well enough to do the right steps in reasoning. If you just saw a snippet of their behavior and couldn't talk to them about their values, you'd probably do a lot worse - and I think that's the position many current value learning schemes place AI in.

Each of your three responses talks about generalization. As I mentioned, generalization is one of the central problems of machine learning. For example, in his critique of deep learning, Gary Marcus writes:

What we have seen in this paper is that challenges in generalizing beyond a space of training examples persist in current deep learning networks, nearly two decades later. Many of the problems reviewed in this paper — the data hungriness, the vulnerability to fooling, the problems in dealing with open-ended inference and transfer — can be seen as extension of this fundamental problem.

If Gary Marcus is right, then we'll need algorithms which generalize better if we're going to get to AGI anyway. So my question is: Let's suppose better generalization is indeed a prerequisite for AGI, which means we can count on algorithms which generalize well being available at the time we are trying to construct our FAI. What other problems might we encounter? Is FAI mainly a matter of making absolutely sure that the algorithms do in fact generalize, or are there still other worries?

BTW, note that perfect generalization is not actually needed. It's sufficient for the system to know when its models might not apply in a particular situation (due to "distributional shift") and ask for clarification at that point in time. See also this previous thread where I claimed that "I haven't yet seen an FAI problem which seems like it can't somehow be reduced to calibrated learning." (Probably hyperbolic.)
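
As a rough sketch of what "knowing when its models might not apply" could look like in code - my own illustration, with hypothetical names like `ask_human` and `ensemble`, using ensemble disagreement as one possible uncertainty signal rather than the only one:

```python
# Sketch: defer to a human when the model's own uncertainty signal fires.
# `ensemble` and `ask_human` are hypothetical stand-ins supplied by the caller.
import numpy as np

def predict_or_ask(ensemble, x, ask_human, disagreement_threshold=0.2):
    """ensemble: list of callables mapping an input x to P(positive class)."""
    probs = np.array([model(x) for model in ensemble])
    if probs.std() > disagreement_threshold:  # members disagree: x is likely off-distribution
        return ask_human(x)                   # fall back to asking for clarification
    return float(probs.mean() > 0.5)          # otherwise act on the ensemble's answer
```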

I don't think whether labels are provided by humans fundamentally changes the nature of the generalization problem. Humans providing labels is very typical in mainstream ML work.

I'm not excited about FAI work which creates a lot of additional complexity in order to get around the fact that current algorithms don't generalize well, if algorithms are going to have to generalize better in order to get us to AGI anyway.

Yes, I agree that generalization is important. But I think it's a bit too reductive to think of generalization ability as purely a function of the algorithm.

For example, an image-recognition algorithm trained with dropout generalizes better, because dropout acts like an extra goal telling the training process to search for category boundaries that are smooth in a certain sense. And the reason we expect that to work is because we know that the category boundaries we're looking for are in fact usually smooth in that sense.

So it's not like dropout is a magic algorithm that violates a no-free-lunch theorem and extracts generalization power from nowhere. The power that it has comes from our knowledge about the world that we have encoded into it.

(And there is a no free lunch theorem here. How to generalize beyond the training data is not uniquely encoded in the training data, every bit of information in the generalization process has to come from your model and training procedure.)
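
For instance, here is a minimal sketch (assuming a PyTorch-style API; the architecture is arbitrary) of the point that dropout is just one extra design choice in the model definition, and whatever generalization it buys comes from the assumption we built in rather than from nowhere:

```python
import torch.nn as nn

# Dropout as an explicit design choice: randomly zeroing hidden units during
# training pushes the model toward decision boundaries that don't hinge on
# any single feature - a smoothness prior we supplied, not free information.
classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(256, 10),
)
```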

For value learning, we want the AI to have a very specific sort of generalization skill when it comes to humans. It has to not only predict human actions, it has to make a very particular sort of generalization ("human values"), and single out part of that generalization to make plans with. The information to pick out one particular generalization rather than another has to come from humans doing hard, complicated work, even if it gets encoded into the algorithm.

The power that it has comes from our knowledge about the world that we have encoded into it.

This knowledge could also come from other sources, e.g. transfer learning.

We know that human children are capable of generalizing from many fewer examples than ML algorithms. That suggests human brains are fundamentally better at learning in some sense. I think we'll be able to replicate this capability before we get to AGI.

For value learning, we want the AI to have a very specific sort of generalization skill when it comes to humans. It has to not only predict human actions, it has to make a very particular sort of generalization ("human values"), and single out part of that generalization to make plans with.

As Ian Goodfellow puts it, machine learning people have already been working on alignment for decades. If alignment is "learning and respecting human preferences", object recognition is "learning human preferences about how to categorize images", and sentiment analysis is "learning human preferences about how to categorize sentences".

I've never heard anyone in machine learning divide the field into cases where we're trying to generalize about human values and cases where we aren't. It seems like the same set of algorithms, tricks, etc. work either way.

BTW, Claude Shannon once wrote:

Suppose that you are given a problem to solve, I don’t care what kind of a problem — a machine to design, or a physical theory to develop, or a mathematical theorem to prove, or something of that kind — probably a very powerful approach to this is to attempt to eliminate everything from the problem except the essentials; that is, cut it down to size. Almost every problem that you come across is befuddled with all kinds of extraneous data of one sort or another; and if you can bring this problem down into the main issues, you can see more clearly what you’re trying to do and perhaps find a solution. Now, in so doing, you may have stripped away the problem that you’re after. You may have simplified it to a point that it doesn’t even resemble the problem that you started with; but very often if you can solve this simple problem, you can add refinements to the solution of this until you get back to the solution of the one you started with.

In other words, I think trying to find the "essential core" of a problem is a good problem-solving strategy, including for a problem like friendliness. I have yet to see a non-handwavey argument against the idea that generalization is the "essential core" of friendliness.

The information to pick out one particular generalization rather than another has to come from humans doing hard, complicated work, even if it gets encoded into the algorithm.

I actually think the work humans do can be straightforward and easy. Something like: Have the system find every possible generalization which seems reasonable, then synthesize examples those generalizations disagree on. Keep asking the humans about those synthesized examples until you've narrowed down the number of possible generalizations the human plausibly wants to the point where you can be reasonably confident about the human's desired behavior in a particular circumstance.

I think this sort of approach is typically referred to as "active learning" or "machine teaching" by ML practitioners. But it's not too different from the procedure that you would use to learn about someone's values if you were visiting a foreign country.
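
Roughly, the loop I have in mind looks like this - a hypothetical sketch with made-up helper names like `synthesize_disagreement` and `ask_human`, not any existing library:

```python
# Sketch of disagreement-based querying: keep candidate generalizations,
# synthesize examples they disagree on, and let human answers prune the pool.
def narrow_down(hypotheses, synthesize_disagreement, ask_human, target, max_queries=100):
    """hypotheses: candidate models, each a callable from example -> label.
    synthesize_disagreement(h1, h2): an example the two candidates label differently."""
    for _ in range(max_queries):
        if not hypotheses:                   # human answers ruled out every candidate
            return None
        answers = {h(target) for h in hypotheses}
        if len(answers) == 1:                # surviving generalizations agree on the target
            return answers.pop()
        some_answer = next(iter(answers))
        h1 = next(h for h in hypotheses if h(target) == some_answer)
        h2 = next(h for h in hypotheses if h(target) != some_answer)
        x = synthesize_disagreement(h1, h2)  # a case the two generalizations handle differently
        label = ask_human(x)                 # the human's answer falsifies some candidates
        hypotheses = [h for h in hypotheses if h(x) == label]
    return None                              # still ambiguous after the query budget
```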

Ah, but I don't trust humans to be a trusted source when it comes to what an AI should do with the future lightcone. I expect you'd run into something like Scott talks about in The Tails Coming Apart As Metaphor For Life, where humans are making unprincipled and contradictory statements, with not at all enough time spent thinking about the problem.

As Ian Goodfellow puts it, machine learning people have already been working on alignment for decades. If alignment is "learning and respecting human preferences", object recognition is "learning human preferences about how to categorize images", and sentiment analysis is "learning human preferences about how to categorize sentences".

I somewhat agree, but you could equally well call them "learning human behavior at categorizing images," "learning human behavior at categorizing sentences," etc. I don't think that's enough. If we build an AI that does exactly what a human would do in that situation (or what action they would choose as correct when assembling a training set), I would consider that a failure.

So this is two separate problems: one, I think humans can't reliably tell an AI what they value through a text channel, even with prompting, and two, I think that mimicking human behavior, even human behavior on moral questions, is insufficient to deal with the possibilities of the future.

I've never heard anyone in machine learning divide the field into cases where we're trying to generalize about human values and cases where we aren't. It seems like the same set of algorithms, tricks, etc. work either way.

It also sounds silly to say that one can divide the field into cases where you're doing model-based reinforcement learning, and cases where you aren't. The point isn't the division, it's that model-based reinforcement learning is solving a specific type of problem.

Let me take another go at the distinction: Suppose you have a big training set of human answers to moral questions. There are several different things you could mean by "generalize well" in this case, which correspond to solving different problems.

The first kind of "generalize well" is where the task is to predict moral answers drawn from the same distribution as the training set. This is what most of the field is doing right now for Ian Goodfellow's examples of categorizing images or categorizing sentences. The better we get at generalizing in this sense, the more reproducing the training set corresponds to reproducing the test set.

Another sort of "generalize well" might be inferring a larger "real world" distribution even when the training set is limited. For example, if you're given handwritten numbers 0-20 labeled with their binary representations, can you give the correct binary output for 21? How about 33? In our moral questions example, this would be like predicting answers to moral questions spawned by novel situations not seen in training. The better we get at generalizing in this sense, the more reproducing the training set corresponds to reproducing examples later drawn from the real world.

Let's stop here for a moment and point out that if we want generalization in the second sense, algorithmic advances in the first sense might be useful, but they aren't sufficient. For the classifier to output the binary for 33, it probably has to be deliberately designed to learn flexible representations, and probably get fed some additional information (e.g. by transfer learning). When the training distribution and the "real world" distribution are different, you're solving a different problem than when they're the same.
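
To make the numbers example concrete (a hypothetical toy of my own, abstracting away the handwriting part): a softmax head over the 21 classes seen in training cannot even express the answer for 33, however well it fits the training distribution, whereas an output head with one unit per bit at least has the answer in its output space - a design choice, not a training-signal improvement.

```python
import numpy as np

def to_bits(n, width=6):
    """Binary code for n, least significant bit first."""
    return np.array([(n >> i) & 1 for i in range(width)], dtype=int)

train_numbers = np.arange(21)                    # numbers 0-20 seen during training
train_targets = np.stack([to_bits(n) for n in train_numbers])

# A 21-way classification head can only ever output one of the labels it was
# trained on, so 33 is unreachable by construction.
reachable_by_class_head = set(train_numbers.tolist())
assert 33 not in reachable_by_class_head

# A per-bit head (6 independent binary outputs) can at least represent 33's
# code - whether it actually extrapolates correctly is a further question
# about the model and training procedure, not just the data.
assert to_bits(33).tolist() == [1, 0, 0, 0, 0, 1]
```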

A third sort of "generalize well" is to learn superhumanly skilled answers even if the training data is flawed or limited. Think of an agent that learns to play Atari games at a superhuman level, from human demonstrations. This generalization task often involves filling in a complex model of the human "expert," along with learning about the environment - for current examples, the model of the human is usually hand-written. The better we get at generalizing in this way, the more the AI's answers will be like "what we meant" (either by some metric we kept hidden from the AI, or in some vague intuitive sense) even if they diverge from what humans would answer.

(I'm sure there are more tasks that fall under the umbrella of "generalization," but you'll have to suggest them yourself :) )

So while I'd say that value learning involves generalization, I think that generalization can mean a lot of different tasks - a rising tide of type 1 generalization (which is the mathematically simple kind) won't lift all boats.

Ah, but I don't trust humans to be a trusted source when it comes to what an AI should do with the future lightcone.

First, let's acknowledge that this is a new objection you are raising which we haven't discussed yet, eh? I'm tempted to say "moving the goalposts", but I want to hear your best objections wherever they come from; I just want you to acknowledge that this is in fact a new objection :)

I expect you'd run into something like Scott talks about in The Tails Coming Apart As Metaphor For Life, where humans are making unprincipled and contradictory statements, with not at all enough time spent thinking about the problem.

Scott is describing distributional shift in that essay. Here's a quote:

The further we go toward the tails, the more extreme the divergences become. Utilitarianism agrees that we should give to charity and shouldn’t steal from the poor, because Utility, but take it far enough to the tails and we should tile the universe with rats on heroin. Religious morality agrees that we should give to charity and shouldn’t steal from the poor, because God, but take it far enough to the tails and we should spend all our time in giant cubes made of semiprecious stones singing songs of praise. Deontology agrees that we should give to charity and shouldn’t steal from the poor, because Rules, but take it far enough to the tails and we all have to be libertarians.

The "distribution" is the set of moral questions that we find ourselves pondering in our everyday lives. Each moral theory (Utilitarianism, religious morality, etc.) is an attempt to make sense of our moral intuitions in a variety of different situations and "fit a curve" through them somehow. The trouble comes when we start considering unusual "off-distribution" moral situations and asking what our moral intuitions say in those situations.

So this isn't actually a different problem. As Shannon said, once you pare away the extraneous data, you get a simplified problem which represents the core of what needs to be accomplished.

humans are making unprincipled and contradictory statements, with not at all enough time spent thinking about the problem.

Yep. I address this in this comment; search for "The problem is that the overseer has insufficient time to reflect on their true values."

I somewhat agree, but you could equally well call them "learning human behavior at categorizing images," "learning human behavior at categorizing sentences," etc.

Sure, so we just have to learn human behavior at categorizing desired/undesired behavior from our AGI. Approval-direction, essentially.

If we build an AI that does exactly what a human would do in that situation (or what action they would choose as correct when assembling a training set), I would consider that a failure.

Eliezer Yudkowsky wrote:

If the subject is Paul Christiano, or Carl Shulman, I for one am willing to say these humans are reasonably aligned; and I'm pretty much okay with somebody giving them the keys to the universe in expectation that the keys will later be handed back.

Do you agree or disagree with Eliezer? (In other words, do you think a high-fidelity upload of a benevolent person will result in a good outcome?)

If you disagree, it seems that we have no hope of success whatsoever. If no human can be trusted to act, and AGI is going to arise through our actions, then we can't be trusted to build it right. So we might as well just give up now.

It also sounds silly to say that one can divide the field into cases where you're doing model-based reinforcement learning, and cases where you aren't. The point isn't the division, it's that model-based reinforcement learning is solving a specific type of problem.

Sure. So my point is, so far, it hasn't really proven useful to develop methods to generalize specifically in the case where we are learning human preferences. We haven't really needed to develop special methods to solve this specific type of problem. (Correct me if I'm wrong.) So this all suggests that it isn't actually a different problem, fundamentally speaking.

By the way, everything I've been saying is about supervised learning, not RL.

I agree with the rest of your comment. I'm focused on the second kind of generalization. As you say, work on the first kind may or may not be useful. I think you can get from the second kind (correctly replicating human labels) to the third kind ("superhuman" labels that the overseer wishes they had thought of themselves) based on active learning, as I described earlier.

"I don't trust humans to be a trusted source when it comes to what an AI should do with the future lightcone."
First, let's acknowledge that this is a new objection you are raising which we haven't discussed yet, eh? I'm tempted to say "moving the goalposts", but I want to hear your best objections wherever they come from; I just want you to acknowledge that this is in fact a new objection :)

Sure :) I've said similar things elsewhere, but I suppose one must sometimes talk to people who haven't read one's every word :P

We're being pretty vague in describing the human-AI interaction here, but I agree that one reason why the AI shouldn't just do what it would predict humans would tell it to do (or, if below some threshold of certainty, ask a human) is that humans are not immune to distributional shift.

There are also systematic factors, like preserving your self-image, that sometimes make humans say really dumb things about far-off situations because of more immediate concerns.

Lastly, figuring out what the AI should do with its resources is really hard, and figuring out which of two complicated choices to call "better" can be hard too, and humans will sometimes do badly at it. In the worst case, humans answer hard questions with apparent certainty, or conversely the questions the AI is most uncertain about gradually devolve into handing humans hard questions and treating their answers as strong evidence.

I think the AI should actively take this stuff into account rather than trying to stay in some context where it can unshakeably trust humans. And by "take this into account," I'm pretty sure that means model the human and treat preferences as objects in the model.

Skipping over the intervening stuff I agree with, here's that Eliezer quote:

Eliezer Yudkowsky wrote: "If the subject is Paul Christiano, or Carl Shulman, I for one am willing to say these humans are reasonably aligned; and I'm pretty much okay with somebody giving them the keys to the universe in expectation that the keys will later be handed back."
Do you agree or disagree with Eliezer? (In other words, do you think a high-fidelity upload of a benevolent person will result in a good outcome?)
If you disagree, it seems that we have no hope of success whatsoever. If no human can be trusted to act, and AGI is going to arise through our actions, then we can't be trusted to build it right. So we might as well just give up now.

I think Upload Paul Christiano would just go on to work on the alignment problem, which might be useful but is definitely passing the buck.

Though I'm not sure. Maybe Upload Paul Christiano would be capable of taking over the world and handling existential threats before swiftly solving the alignment problem. Then it doesn't really matter if it's passing the buck or not.

But my original thought wasn't about uploads (though that's definitely a reasonable way to interpret my sentence), it was about copying human decision-making behavior in the same sense that an image classifier copies human image-classifying behavior.

Though maybe you went in the right direction anyhow, and if all you had was supervised learning the right thing to do is to try to copy the decision-making of a single person (not an upload, a sideload). What was that Greg Egan book - Zendegi?

so far, it hasn't really proven useful to develop methods to generalize specifically in the case where we are learning human preferences. We haven't really needed to develop special methods to solve this specific type of problem. (Correct me if I'm wrong.)

There are some cases where the AI specifically has a model of the human, and I'd call those "special methods." Not just IRL, the entire problem of imitation learning often uses specific methods to model humans, like "value iteration networks." This is the sort of development I'm thinking of that helps AI do a better job at generalizing human values - I'm not sure if you meant things at a lower level, like using a different gradient descent optimization algorithm.

I agree with your concerns. I'm glad Stuart is doing this work as it moves us much further along than we have been, but it also falls short in a number of ways.

I'm trying to find time to write up all of my own current thinking on this, but the short version is that I suspect modeling human preferences in the way Stuart does is too leaky an abstraction to work. Humans don't have "preferences" per se; they instead have valences over mental actions, and it's how those valences interact with the actions we take, observable to outsiders or not, that produces events we can reason about using the preference model (that is, by treating those phenomena as if they were preferences).

It would help if we had better neuroscience than we do today, but I guess we'll have to make do with what we've got for the time being, which means unfortunately our models can't (yet) be totally grounded in what's happening physically.

I'd definitely be interested in your thoughts about preferences when you get them into a shareable shape.

In some sense, what humans "really" have is just atoms moving around, all talk of mental states and so on is some level of convenient approximation. So when you say you want to talk about a different sort of approximation from Stuart, my immediate thing I'm curious about is "how can you make your way of talking about humans convenient for getting an AI to behave well?"

You can get some clues on my thoughts, I think. I used to take an approach much like Stuart's, but I now think that's the wrong abstraction. The thing I've recently written that most points towards my thinking is "Let Values Drift", which I wrote mostly because it was the first topic that really started to catalyze my thinking about human values.

I found this to be interesting/valuable commentary after just reading through Stuart's agenda.

I think speculating about the human-sized logical relationships between speculative parts inside the AI is easier but less useful than speculating about the algorithm that will connect your inputs to your outputs with a big model and lots of computing power, which may or may not have your logical steps as emergent features.

With this more compute/less abstraction approach you're suggesting, do you mean that it may produce a model of the preferences that's inscrutable to humans? If so, that could be an issue for getting the human's buy-in. He talks about this some in section 4.5, that there's "the human tendency to reject values imposed upon them, just because they are imposed upon them" and the AI may need to involve the human in construction of the utility function.

- How much of the hidden detail is in doing meta-reasoning? If I don't trust an AI, more steps of meta-reasoning make me trust it even less - humans often say things about meta-reasoning that would be disastrous if implemented. What kind of amazing faculties would be required for an AI to extract partial preferences about meta-reasoning that actually made things better rather than worse? If I were better at understanding what the details actually are, maybe I'd pick on meta-reasoning more.

Which part of his post are you referring to by "meta-reasoning"? Is it the "meta-preferences"?