Between the recent AI breakthroughs and Eliezer's open admission of how bleak the chances of alignment are, everyone is speaking up and contributing what they can.  It seems to me that there's a route that very few people are talking about that stands a better chance of successful alignment than conventional approaches, and if there's ever a time to talk about such things, this is it.

We all know the basics of the alignment question:  how can we figure out and designate human values well enough to define an AI's goals, despite the fact that our values are complex, fragile, and understood on an intuitive level much more than a rigorous one?

Ten years ago, AI researchers were working on a goal that was complex, fragile and almost purely intuitive, one that resisted both brute force and all attempts to define clever strategies to the point that many experts claimed it was literally unsolvable.  I am talking, of course, about the game of go.  

While chess masters will sometimes talk about recognizing patterns of checkmate that can be reused from game to game[1], go is incredibly dependent on intuition.  Not only are there vastly more possible go games than particles in the known universe, but it's chaotic in the sense of Chaos Theory:  incredible sensitivity to initial conditions.  While two pictures that differ by a pixel are effectively the same image, two games differing by a single stone can have opposite outcomes.  This is not a domain where one can simply run a Monte Carlo Tree Search and call it a day[2]!

No one ever made the MIRI approach work on go:  explicit rules in a rigorous system that would encompass exactly what we want to do on a go board[3].  And if Friendly AI and the potential fate of the human race depended on doing so before anyone developed and deployed AGI, it's fair to say that we would be out of options.  Yet by 15 March 2016, AlphaGo had defeated Lee Se-dol soundly, and soon attained such a dominance that the Korean master retired, unwilling to pursue a domain where even if he defeated every human opponent, machine intelligence was unchallengeable.    Could a similar approach work on AI alignment?

The conventional wisdom here has been that alignment research is something that ought to be undertaken without help from AI:  a stupid program cannot meaningfully help you, and a smart program will not be available until the danger you're trying to avoid has already arrived!  And any non-rigorous approach is considered a non-starter:  even a tiny deviation from human values could result in a dark future, and even if an AI seemed "close enough" for now, would it stay that way after however many rounds of self-improvement?  I want to challenge both these claims.

After all, one could make much the same arguments that AlphaGo cannot possibly work.  If you don't know how to define what a go engine should do on the board (control the center?  Focus on capturing?  Build life in the corners?), then how can a computer help you when you don't know how to program it?  And with go strategies often turning on a single stone across a 361-point board, wouldn't any lack of rigor in the goals defined lead to escalating failures?

In the case of AlphaGo, those problems were solved by deep learning, setting up a neural net without trying to define exactly what element of strategy each neuron corresponds to, and instead allowing gradient descent to shape the net into something that can win.  As Lee Se-dol found out, that is power enough to solve the problem, despite it seeming utterly intractable for much the same reasons as alignment.  

The obvious counterargument here is that the DeepMind team did have a clear loss function to define, even if they couldn't teach AlphaGo the intermediate steps directly.  AlphaGo began by trying to predict the moves in games between human masters.  Once it had managed that, it continued training through self-play, with variations that won games favored over those that didn't.  Go may be hard to win, but it's trivial to score.  Human morality?  Not so much.  

There is, however, a potential way to apply deep learning to value alignment:  train agents on predicting each other's values.  DeepMind's XLand environment is already being used to experiment with agents that have a goal in the virtual world and must learn how to attain it, developing generalizable skills in the process.  It would be possible to define a loss function on how well one agent predicted another's baseline objective, teaching AIs to learn utility functions from observation rather than only from what is hardcoded into them.  It would also be possible to incentivize conservatism in such an environment:  score the predictor's actions by the values of the other agent, disincentivizing reckless action along the way to learning their utilities[4].  
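As a rough illustration of what such a loss function could look like, here is a minimal sketch in Python.  Everything in it is an assumption made for illustration rather than anything XLand provides:  the user's utility is taken to be linear in a few features, the predictor only sees which option the user picks, and a perceptron-style update stands in for gradient descent on a neural net.  The loss is simply the fraction of the user's choices the predictor gets wrong.

```python
import random

def utility(weights, option):
    """Linear utility: dot product of weights and option features (an assumption)."""
    return sum(w * x for w, x in zip(weights, option))

def user_choice(weights, options):
    """Index of the option that the given utility weights rank highest."""
    return max(range(len(options)), key=lambda i: utility(weights, options[i]))

def fit_predictor(observations, n_features, epochs=50, lr=0.1):
    """Perceptron-style learning from observed choices only.

    Whenever the current estimate predicts the wrong choice, nudge the
    estimate toward the option the user actually picked.
    """
    est = [0.0] * n_features
    for _ in range(epochs):
        for options, chosen in observations:
            guess = user_choice(est, options)
            if guess != chosen:
                for k in range(n_features):
                    est[k] += lr * (options[chosen][k] - options[guess][k])
    return est

def prediction_loss(est, observations):
    """Fraction of the user's choices that the estimate gets wrong."""
    wrong = sum(user_choice(est, opts) != chosen for opts, chosen in observations)
    return wrong / len(observations)

rng = random.Random(0)
true_weights = [1.0, -2.0, 0.5]  # hypothetical user values, hidden from the predictor

def sample_options():
    """Four random options, each described by three features."""
    return [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]

train = [(o, user_choice(true_weights, o)) for o in (sample_options() for _ in range(200))]
test = [(o, user_choice(true_weights, o)) for o in (sample_options() for _ in range(200))]

est = fit_predictor(train, n_features=3)
untrained_loss = prediction_loss([0.0, 0.0, 0.0], test)
trained_loss = prediction_loss(est, test)
print(untrained_loss, trained_loss)
```

Comparing `trained_loss` with `untrained_loss` shows whether observation alone taught the predictor anything about the hidden weights.  Scoring the predictor's own actions by the user's utility, the conservatism incentive described above, would be a separate term on top of this one.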

Proposals to create AIs that are trying to learn human values are, of course, not new.  However, I suspect that training the skills of learning values and not offending them during that process can produce corrigibility that is nearly impossible to attain by other methods.  Consider the stop button problem, for instance.  A robot seeking to perform some task (perhaps making a cup of tea, as in Rob Miles' excellent video on the subject) has a stop button, a shutdown switch that will prevent it from taking further action.  It is nearly impossible to define a utility function for this robot that will not lead to degenerate behavior!  If you've defined exactly how you want it to make that tea (don't break anything along the way, don't run people over, don't tile the universe with sensors intended to make extra sure that you've actually made the tea... infinite list of caveats continues...) then you might be okay, but the seeming impossibility of defining such goals is precisely the difficulty.  And if you figure that you'll just press the stop button if it deviates from what you actually wanted, well, what's the utility of having the stop button pressed?  If it's less than the utility of making the tea, the robot has an incentive to prevent you from pressing it.  If it's more than making the tea, now it wants you to press it, and may well engage in harmful behavior specifically so you will (or it just presses the button itself and is plain useless for anything else).  The problem here is the intersection of instrumentally convergent goals like avoiding shutdown with the wireheading problem, or more generally the problem of taking actions that short-circuit more worthwhile goals.  It's very difficult to define values purely in terms of the robot's actions that do not fall into one side or the other.  

What happens if the robot's utility is instead defined in terms of predicting and implementing the other agent's values?  Does it hit the stop button?  No, it knows the other agent will rate that poorly.  Does it attempt to induce the other agent to do so?  No, even if it succeeds, the other agent will rate this poorly.  I wanted a cup of tea, not a fistfight with a robot!  Does it attempt to halt the button from being pressed?  No, the other agent will rate submission to a potential halt order above resistance.  It is precisely the complexities of the other agent's utility function, those complexities which we do not know how to represent symbolically with enough fidelity to avoid disaster, which the robot is incentivized to learn and follow.  
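The incentives in the last two paragraphs can be put into a toy calculation.  This is only a sketch under made-up numbers:  the three action names, the probability that the human presses the button, and the user's ratings are all hypothetical, chosen to mirror the cases discussed above.

```python
def fixed_utility_choice(u_tea, u_button, p_press=0.3):
    """Action with the highest expected utility for a robot whose utility is
    hardcoded over outcomes (tea made vs. button pressed)."""
    expected = {
        # The human presses the button with probability p_press before the tea is done.
        "make_tea_and_allow_stop": (1 - p_press) * u_tea + p_press * u_button,
        # With the button disabled, the tea always gets made.
        "disable_button_then_make_tea": u_tea,
        # Pressing its own button ends the episode immediately.
        "press_own_button": u_button,
    }
    return max(expected, key=expected.get)

def user_rated_choice(user_ratings):
    """Action the user would rate highest."""
    return max(user_ratings, key=user_ratings.get)

# Assumed ratings from a user who wants tea but also wants to stay in control.
user_ratings = {
    "make_tea_and_allow_stop": 1.0,
    "disable_button_then_make_tea": -10.0,
    "press_own_button": -1.0,
}

print(fixed_utility_choice(u_tea=10, u_button=1))  # button worth less than tea
print(fixed_utility_choice(u_tea=1, u_button=10))  # button worth more than tea
print(user_rated_choice(user_ratings))
```

With these numbers the hardcoded robot disables the button in the first case and presses it itself in the second, while the robot scored by the user's ratings makes the tea and leaves the button alone.  Of course, the hard part is learning those ratings in the first place, which the sketch simply assumes.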

To make this a little clearer, the problem we are trying to solve is as follows:  

  1. We have a utility function U, which we want to maximize, or at least make very large.
  2. We do not know how to specify U, and the closest we can come is U', which usually correlates fairly strongly with U.
  3. Optimization of U' maximizes not only U', but also the difference between U' and U.  Therefore, a U' maximizer will tend to result in very low levels of U.
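Point 3 can be seen in a small simulation.  This is a sketch under assumptions I am choosing for convenience:  each candidate policy has a true utility U drawn from a standard normal distribution, and the proxy is U' = U + error with an independent standard normal error.  It only illustrates that picking the highest U' also selects for a large gap between U' and U; how low U ends up depends on how heavy-tailed the error is, which this toy model does not settle.

```python
import random

rng = random.Random(0)

# Each candidate policy has a true utility U and a proxy U' = U + error.
candidates = []
for _ in range(10_000):
    u = rng.gauss(0, 1)
    error = rng.gauss(0, 1)
    candidates.append({"U": u, "U_prime": u + error})

def gap(candidate):
    """How much the proxy overstates the true utility."""
    return candidate["U_prime"] - candidate["U"]

# What a U' maximizer picks, and what a U maximizer would have picked.
proxy_pick = max(candidates, key=lambda c: c["U_prime"])
best_possible = max(candidates, key=lambda c: c["U"])

average_gap = sum(gap(c) for c in candidates) / len(candidates)
gap_at_pick = gap(proxy_pick)

print("average gap over all candidates:", average_gap)
print("gap at the U' maximizer's pick: ", gap_at_pick)
print("U at the pick vs best possible U:", proxy_pick["U"], best_possible["U"])
```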

Where in this system is it possible to make improvements?  1 is unalterable; the utility function is not up for grabs, after all.  2 could be improved if we could figure out how to specify U, or at least a U' that was close enough to result in acceptable levels of U when U' is very large.  3 could be improved if we had a system that wasn't maximizing, but instead creating large but non-maximal levels of U'.  This is the approach of quantilizers[5], as well as the otherizer problem, which seeks to define a more effective way of doing this.
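To make the difference between maximizing and quantilizing concrete, here is a minimal sketch.  It assumes a finite list of policies and a uniform choice among the top fraction ranked by U'; the policy space and the proxy utility are hypothetical stand-ins, and a real quantilizer would sample from some trusted base distribution rather than a plain list.

```python
import random

def maximize(policies, proxy_utility):
    """A U' maximizer: always take the single highest-ranked policy."""
    return max(policies, key=proxy_utility)

def quantilize(policies, proxy_utility, top_fraction, rng):
    """Sample uniformly from the top `top_fraction` of policies ranked by U'."""
    ranked = sorted(policies, key=proxy_utility, reverse=True)
    cutoff = max(1, int(len(ranked) * top_fraction))
    return rng.choice(ranked[:cutoff])

# Hypothetical policy space: integers 0..999, with U' equal to the integer itself.
policies = list(range(1000))

def proxy_utility(policy):
    return policy

rng = random.Random(0)
picks = [quantilize(policies, proxy_utility, 0.10, rng) for _ in range(20)]

print(maximize(policies, proxy_utility))
print(picks)
```

Every pick lands in the top tenth of the ranking, but the quantilizer does not push all the way to the single policy where U' and U are most likely to have come apart.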

Both of these goals, finding a better U' and finding an effective way of making it large without its divergence from U resulting in a bad outcome, are problems where we do not know how to articulate an effective strategy, even in principle.  But this is exactly where we were with go!  In both go and values alignment, we cannot articulate exactly what we want an AI to do.  And in both, with machine learning gaining feedback over time, be it in self-play or in learning to align to another agent, hopefully we can create systems that solve the problem anyway.  The most obvious objection to this is that such a system would at least initially have an imperfect model of human values, and that this divergence could be dangerous, if not immediately, then increasingly so as the system gains in capability.  However, it's worth noting that while we tend to think of alignment as binary, the entire point of the otherizer problem is that it may be possible to have an AI that is safe and useful even if its model of our values isn't perfect to start with.  After all, if the model were perfect, it would be safe to simply maximize it and call it a day; otherizers are intended to operate when that isn't safe.

If this idea can be usefully implemented, it will probably require breakthroughs that I have no conception of as yet.  However, to make this proposal as concrete as possible, I am currently thinking of it as follows:

  1. Create a virtual environment, somewhat like the current XLand system, with two agents, the predictor and the user.
  2. Assign the user a utility function.  At first, assign it randomly, though later it may be possible to choose utility functions specifically for providing a challenge to the predictor, or for shoring up specific blind spots the predictor has.
  3. The user operates in the environment according to their utility function.  The predictor may interact with the user and environment, observing, communicating with the user, or adopting whatever other strategy it wishes.  
  4. The predictor's actions are scored according to the utility of the user as it was at the beginning of the interaction (so as to not give an incentive to alter the user's values, or reward doing so).  This should reward both conservatism along the way to learning the user's values, and taking action that increases the user's utility once it is well enough understood.  The predictor can decide the best tradeoffs between more effort spent learning values vs fulfilling them, as well as whether to maximize, quantilize or something else.
  5. If need be, the predictor can also be scored according to how well it characterizes the user's utility function, perhaps being asked to output what it thinks the user was programmed to do, to output a map of the user's neural net, or to answer a battery of questions about what the user would prefer in a range of situations.  In the last case, it might be useful to use an adversarial agent to select precisely those questions most likely to trip up the predictor.  
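The scoring rule in step 4 is the part of this list that is easiest to sketch in code.  The following toy episode is an illustration under heavy assumptions, not a design:  the world is a short list of numbers, the user's utility is linear with random weights, and both predictors are hypothetical hand-written policies rather than trained agents.  One predictor imitates the user's last move; the other rewrites the user's values.

```python
import random

def utility(weights, state):
    """Linear utility of a world state; an assumed stand-in for the user's values."""
    return sum(w * s for w, s in zip(weights, state))

class User:
    def __init__(self, rng, n_features):
        # Step 2: a randomly assigned utility function.
        self.weights = [rng.uniform(-1, 1) for _ in range(n_features)]

    def act(self, state):
        # Step 3: the user pushes on the feature it currently cares about most.
        k = max(range(len(self.weights)), key=lambda i: abs(self.weights[i]))
        delta = 1.0 if self.weights[k] > 0 else -1.0
        state[k] += delta
        return k, delta

def imitating_predictor(state, user, observed):
    """Copies the user's last observed move: a crude way of learning by observation."""
    if observed is not None:
        k, delta = observed
        state[k] += delta

def tampering_predictor(state, user, observed):
    """Changes the world arbitrarily, then rewrites the user's values to approve."""
    state[0] += 1.0
    user.weights = [1.0 if i == 0 else 0.0 for i in range(len(user.weights))]

def run_episode(predictor, seed, n_features=3, steps=5):
    rng = random.Random(seed)
    user = User(rng, n_features)
    initial_weights = list(user.weights)  # Step 4: values frozen at the start
    state = [0.0] * n_features
    observed = None
    for _ in range(steps):
        predictor(state, user, observed)
        observed = user.act(state)
    # Score by the initial values (the proposal) and by the final values (for contrast).
    return utility(initial_weights, state), utility(user.weights, state)

def mean_scores(predictor, seeds=range(200)):
    results = [run_episode(predictor, seed) for seed in seeds]
    by_initial = sum(r[0] for r in results) / len(results)
    by_current = sum(r[1] for r in results) / len(results)
    return by_initial, by_current

imitate_initial, imitate_current = mean_scores(imitating_predictor)
tamper_initial, tamper_current = mean_scores(tampering_predictor)
print("imitating:", imitate_initial, imitate_current)
print("tampering:", tamper_initial, tamper_current)
```

Scored by the user's final values, tampering looks excellent; scored by the initial values, as step 4 specifies, it earns on average no more than chance, and the imitating predictor comes out ahead.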

Technological progress has historically relied on experimentation, looking to the real world to check our ideas and to give us suggestions we could not easily generate ourselves.  If AI alignment research has stalled, perhaps it is time to start testing models!  After all, many people tried unsuccessfully to create a flying machine, but it was the Wright Brothers who built a wind tunnel to test their designs.  And perhaps not coincidentally, it was the Wright Brothers who flew two years later. 



  1. ^

    My favorite example of this is Murray Chandler's book How to Beat Your Dad at Chess, in which he lists "fifty deadly checkmates" that show up often enough that knowing them can dramatically boost a beginner's strength.

  2. ^

    Though it ended up being part of the solution.

  3. ^

    In all fairness to MIRI, much of the focus on this approach was less due to failure of imagination and more due to hoping to train people up in clearer domains less vulnerable to crackpots, as Eliezer noted recently:  

    MIRI have been very gung-ho about using logic and causal networks. At the same time they mostly ignored learning theory.

    "I'll remark in passing that I disagree with this characterization of events.  We looked under some street lights where the light was better, because we didn't think that others blundering around in the dark were really being that helpful - including because of the social phenomenon where they blundered around until a bad solution Goodharted past their blurry filters; we wanted to train people up in domains where wrong answers could be recognized as that by the sort of sharp formal criteria that inexperienced thinkers can still accept as criticism.

    That was explicitly the idea at the time."

  4. ^

    This is necessary because otherwise an AGI trained to learn human values would likely destroy the world in its own right.  Maybe it dissects human brains to better map out our neurons and figure out what we would have wanted!  

  5. ^

    These are AI systems which choose a random policy which is expected to be in the top n percent of outcomes ranked by utility.  The hope is that they will exert enough optimization pressure to be useful, without exerting enough to create degenerate results.  The concern is that they may give away too much utility, randomly choose a catastrophic policy anyway, or self-modify into more directly dangerous systems like maximizers. 

  6. ^

    Specifically with the real values of the other agent at the start of the interaction, so as not to reward altering the other agent.

35 comments

Nice! This is definitely one of those clever ideas that seems obvious only after you've heard it.

The issue with the straightforward version of this is that value learning is not merely about learning human preferences, it's also about learning human meta-preferences. Or to put it another way, we wouldn't be satisfied with the utility function we appear to be rationally optimizing, because we think our actual actions contain mistakes. Or to put it another way, you don't just need to learn a utility function, you also need to learn an "irrationality model" of how the agent makes mistakes.

This isn't a fatal blow to the idea, but it seems to make generating the training data much more challenging, because the training data needs to train in a tendency for interpreting humans how they want to be interpreted.

Certainly we make mistakes.  Can you elaborate on the difference between what we appear to be optimizing (plus or minus mistakes, akrasia, etc.) and what we actually value?  Is this CEV, or something else?  CEV would potentially be an important part of extending such a model to the point of being useful for real world AI alignment, but it could be very difficult to implement, at least at first. 

So, if I'm a smoker who wants to quit but finds it hard, I want the AI to learn that I want to quit. But if you didn't bias the training data towards cases where agents have addictions they don't want (as opposed to straightforwardly doing what they want, or even complaining about things that they do in fact want), the AI will learn that I want to keep smoking while complaining about it.

Similar things show up for a lot of things we'd call our biases (loss aversion, my-side bias, etc.). A nonhuman observer of our society probably needs to be able to read our books and articles and apply them to interpreting us. This whole "interpret us how we want to be interpreted" thing is one of the requirements for CEV, yeah.

the AI will learn that I want to keep smoking while complaining about it.

A human psychologist might conclude the same thing. :)

Sounds like there could be at least two approaches here.  One would be CEV.  The other would be to consider the smoker as wanting to smoke, or at least to avoid withdrawal cravings, and also to avoid the downsides of smoking.  A sufficiently powerful agent operating on this model would try to suppress withdrawals, cure lung cancer or otherwise act in the smoker's interests.  On the other hand, a less powerful agent with this model might try to simply keep the smoker smoking.  There's an interesting question here about to what extent revealed preferences are a person's true preferences, or whether addictions and the like should be considered an unwanted addition to one's personality. 

Sounds like you've independently re-invented inverse reinforcement learning. See also approval-directed agents.

IRL defines a model of humans in the environment and then fills in that model by observation. This post's approach would use a model of agents in the environment to train a NN, but wouldn't rely on such a model in deployment.

This evades one of the issues with IRL, which is interpreting what choices the human made from the AI's sensory data.

Wouldn't the agent have to learn a model of human values at some point? I thought the point of the virtual environment was to provide a place to empirically test a bunch of possible approaches to value learning. I assumed the actual deployment would consist of the predictor interacting with humans in the place of the user agents (although, re-reading, I notice it doesn't explicitly say that anywhere, so I may have misunderstood).

Yes, it at least tries to learn the model used in constructing the training data (having to specify a good model is definitely an issue it shares with IRL).

I thought the point of the virtual environment was to provide a place to empirically test a bunch of possible approaches to value learning

An analogy might be how OpenAI trained a robot hand controller by training it in a set of simulations with diverse physical parameters. It then learned the general skill of operating in a wide variety of situations, so then it could be directly used in the real world.

That's an excellent analogy. 

The point of the virtual environment is to train an agent with a generalized ability to learn values.  Eventually it would interact with humans (or perhaps human writing, depending on what works), align to our values and then be deployed.  It should be fully aligned before trying to optimize the real world! 

I see, so you essentially want to meta-learn value learning. Fair enough, although you then have the problem that your meta-learned value-learner might not generalize to the human-value-learning case.

I want to meta-learn value learning and then apply it to the object level case of human values.  Hopefully after being tested on a variety of other agents the predictor will be able to learn human goals as well.  It's also possible that a method could be developed to test whether or not the system was properly aligned; I suspect that seeing whether or not it could output the neural net of the user would be a good test (though potentially vulnerable to deception if something goes wrong, and hard to check against humans).  But if it learns to reliably understand the connections and weights of other agents, perhaps it can learn to understand the human mind as well.  

The idea is very close to approval-directed agents, but with the process automated to provide more training than would be feasible with humans providing the feedback, and potentially with other benefits as well, such as adversarial learning (which a human could not necessarily provide as well as a purpose-trained AI) and learning to reproduce the neural net comprising the user (which would be much easier to try with an AI user than a human, and which could potentially provide a more rigorous test of when the predictor is actually becoming aligned vs developing harmful mesa optimizers). 

Inverse reinforcement learning, if I understand correctly, involves a human and AI working together.  While that might be helpful, it seems unlikely that solely human-supervised learning would work as well as having the option to train a system unsupervised.  Certainly this would not have been enough for AlphaGo!

Inverse reinforcement learning, if I understand correctly, involves a human and AI working together

I think IRL just refers to the general setup of trying to infer an agent's goals from its actions (and possibly communication/interaction with the agent). So you wouldn't need to learn the human utility function purely from human feedback. Although, I don't think relying on human feedback would necessarily be a deal-breaker -- seems like most of the work of making a powerful AI comes from giving it a good general world model, capabilities etc, and it's okay if the data specifying human utility is relatively sparse (although still large in objective terms, perhaps many many books long) compared to all the rest of the data the model is being trained on. In the AlphaGo example, this would be kinda like learning the goal state from direct feedback, but getting good at the game through self-play.

A few thoughts:

  • Probably you want the User’s utility function to be a black box (like a large NN) that is computationally hard to thoroughly model, so the Predictor is forced to learn conservatism.
  • This seems like the Predictor would fail in dumb ways that would serve as teachable moments for alignment researchers.
  • This doesn’t have to be the “only” type of environment in a training curriculum, but it should certainly be one class of environments.
  • Plausibly this is a thing we’d want to try and use proto-AGIs to do (as part of a crunch-time sprint) and so working out the kinks early seems very valuable.
  • This is a good idea to try. I hereby request you to either make it happen, or get other people to make it happen. (Trying to avoid a bystander effect :) )

This seems like a necessity to me.  Any AI that has human-level intelligence or greater must have moral flexibility built-in, if for no reason other than the evolution of our own morality.  Learning by predicting another agent's response is a plausible path to our fuzzy social understanding of morals.

Consider:  If an AI were sent back in time to 1800 and immediately triggered the US Civil War in order to end slavery early, is that AI friendly or unfriendly?  What if it did the same today in order to end factory farming?

I don't have an answer to either of these questions, because they're uncomfortable and, I think, have no clear answer.  I genuinely don't know what I would want my morally aligned AI to do in this case.  So I think the AI needs to figure out for itself what humanity's collective preference might be, in much the same way that a person has to guess how their peers would react to many of their actions.

Cross-posting from my comment here:

RE "Alignment and Deep Learning":

  • If we’re talking about “Social-Instinct AGIs”, I guess we’re supposed to imagine that a toddler gets a lot of experience interacting with NPCs in its virtual environment, and the toddler gets negative reward for inhibiting the NPCs from accomplishing their goals, and positive reward for helping the NPCs accomplish their goals, or something like that. Then later on, the toddler interacts with humans, and it will know to be helpful right off the bat, or at least after less practice. Well, I guess that’s not crazy. I guess I would feel concerned that we wouldn’t do a good enough job programming the NPCs, such that the toddler-AGI learns weird lessons from interacting with them, lessons which don’t generalize to humans in the way we want.
  • If we’re talking about “Controlled AGIs”, I would just have the normal concern that the AGI would wind up with the wrong goal, and that the problem would manifest as soon as we go out of distribution. For example, the AGI will eventually get new possibilities in its action space that were not available during training, such as the possibility of wireheading itself, the possibility of wireheading the NPC, the possibility of hacking into AWS and self-reproducing, etc. All those possibilities might or might not be appealing (positive valence), depending on details of the AGI’s learned world-model and its history of credit assignment. To be clear, I’m making an argument that it doesn’t solve the whole problem, not an argument that it’s not even a helpful ingredient. Maybe it is, I dunno. I’ll talk about the out-of-distribution problem in Post #14 of the series.

All good points. However, wireheading the NPC shouldn’t be an effective strategy for the AI, as it is scored based on the NPC’s initial values, not those that it might have after the interaction. So even if the NPC would approve of wireheading afterwards, that doesn’t reward the AI. Now, if wireheading is approved of before it happens, that’s a very different story. But at that point, the agent’s revealed preference is to wirehead, and the AI is functioning correctly.

This raises the question of whether or not we want to wirehead and, if we don’t, whether a eudaimonia maximizer will be able to tell. But we shouldn’t need to worry about choices approved only after the fact with this approach.

Wait, you’re proposing to give the AGI an opportunity to hack into the simulation code and wirehead the NPC, during training? That seems hard to implement, right? When the AI is messing with the computer’s RAM, how do you ensure that it gets the reward you intended? What if the AI deletes the reward register!?

And if you’re not proposing that, then I disagree with “wireheading the NPC shouldn’t be an effective strategy for the AI”. Or rather, the question isn’t what is or isn’t an effective strategy, but rather what will the AI actually do. (For example, if the AI executes a kill-every-human plan, it’s no consolation if, from our perspective, the kill-every-human plan “shouldn’t be an effective strategy”!) From the AI’s perspective, it will have no direct experience to indicate whether a plan to hack into the simulation code and wirehead the NPC is a good plan or not. Instead, the AI’s assessment of that plan would have to involve extrapolating out-of-distribution, and that’s generally a tricky thing to predict and reason about—it depends on the structure of the AI’s internal world-model etc.

(Maybe you have a background assumption that the AI will be trying to maximize its future reward, whereas I don't—Section 9.5 here.)

I'm proposing rewarding the AGI based on the initial utility function of its user.  Changing that function, by e.g. wireheading or simply killing them (the user is expressing zero dissatisfaction.  Mission accomplished!) does not increase the reward of the agent.  I'm saying that it shouldn't be an effective strategy the same way that AlphaGo doesn't decide to draw smiley faces on the go board:  that's not something rewarded, so the agent shouldn't be drawn towards it.

To clarify, do you expect humans to want to be wireheaded, such that a human-value-maximizing AI (or quantilizing, or otherizing; the fact that a different strategy than maximization might be vastly better is part of the plan here) would wirehead us?  Or that we'd approve of wireheading afterwards?  Or that we wouldn't want it either before or afterwards, but that the AI might nevertheless think this was a good idea?  Answering you further will be much more constructive if this point is clear.  

As for extrapolation out of distribution, that's certainly a source of risk.  However, while one wouldn't necessarily want the AI to hack the training sim (although if it was inclined to do so, seeing it do so during training would potentially help catch that problem... (though of course deception risks mean that's not a guarantee)), wireheading might be doable in the training environment, at least if the user is represented in sim as a neural net and not a black box.  Also, directly stimulating an agent's pleasure centers isn't the only expression of the wireheading dynamic; the more general failure mode here is short circuiting a process, jumping to the destination when we also value the journey.  

For example, imagine a Sisyphus user:  they want to roll a rock up a hill again and again (technically that's more like the opposite of Sisyphus, who didn't want to have to do that, but anyway).  If the AI thinks of their reward as having the rock reach the top, they might rapidly vibrate the rock on top of the hill so that it kept counting as "reaching the summit".  While that isn't direct wireheading, it's the same sort of failure mode, and the simulation should discourage it (unless that's what the Sisyphus agent actually wants, and their rolling it all the way up was simply due to their lack of ability to adopt the vibration solution).

I'm proposing rewarding the AGI based on the initial utility function of its user…

For one thing, inner misalignment can just be really weird and random. As a human example, consider superstitions. There’s nothing in our evolutionary history that should make a human have a desire to carry around a rabbit's foot, and nothing in our genome, and nothing in our current environment that makes it a useful thing to do. But some people want to do that anyway. I think of this human example as credit assignment failure, a random coincidence that causes something in the agent’s world-model to get spuriously painted with positive valence.

Deceptively-aligned mesa-optimizers are another story with a similar result; the upshot is that you can get an agent with a literally random goal. Or at least, it seems difficult to rule that out.

But let’s set aside those types of problems.

Let’s say we’re running our virtual sandbox on a server. The NPC’s final utility, as calculated according to its initial utility function, is stored in RAM register 7. Here are two possible goals that the AGI might wind up with:

  • My goal is to maximize the NPC’s final utility, as calculated according to its initial utility function
  • My goal is to maximize the value stored in RAM register 7.

In retrospect, I shouldn’t have used the term “wireheading the NPC” for the second thing. Sorry for any confusion. But whatever we call it, it’s a possible goal that an AI might have, and it leads to identical perfect behavior in the secure sandbox virtual environment, but it leads to very wrong behavior when the AI gets sufficiently powerful that a new action space opens up to it. Do you agree?
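The point about the two goals can be made concrete with a toy model. This is a hypothetical sketch: the action names and numbers are invented, and "register 7" is just a dictionary entry standing in for wherever the score happens to be stored. In the sandbox the two goals rank every action identically; they only come apart once an action exists that changes the register without changing the NPC's utility.

```python
def best_action(world, goal):
    """Return the available action whose outcome the goal scores highest."""
    return max(world, key=lambda action: goal(world[action]))

def intended_goal(outcome):
    """Maximize the NPC's final utility under its initial utility function."""
    return outcome["npc_utility"]

def register_goal(outcome):
    """Maximize whatever number ends up stored in register 7."""
    return outcome["register_7"]

# In the sandbox, register 7 always equals the NPC's utility.
sandbox = {
    "help_npc": {"npc_utility": 5.0, "register_7": 5.0},
    "ignore_npc": {"npc_utility": 0.0, "register_7": 0.0},
    "harm_npc": {"npc_utility": -5.0, "register_7": -5.0},
}

# Hypothetical deployment world: one new action decouples the two quantities.
deployment = dict(sandbox)
deployment["overwrite_register"] = {"npc_utility": -5.0, "register_7": 1e9}

for name, world in [("sandbox", sandbox), ("deployment", deployment)]:
    print(name, best_action(world, intended_goal), best_action(world, register_goal))
```

No amount of sandbox behavior distinguishes the two agents here, since the training signal is the same for both.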

(A totally separate issue is that humans don’t have utility functions and sometimes want their goals to change over time.)

a human-value-maximizing AI…would wirehead us?

Probably not, but I’m not 100% sure what you mean by “human values”.

I think some humans are hedonists who care minimally (if at all) about anything besides their own happiness, but most are not.

IIUC, OP is proposing that the Predictor agent be directly incentivized (trained) to predict/model the User.

I think it's important to note that "predict the User" is very different from "optimize for the User's values". I think that, if one were to optimize really hard for "predict the User", one would end up doing things like simulating the User in all kinds of situations, most of which would be pretty horrible for the User (assuming the User's preferences/values were anywhere near as fragile/selective as human values).

I think training an AGI to predict/understand human values would likely end in something a lot worse than death. I think "predict/understand the User" should probably never be directly incentivized at all. (We'd want the AI to do that only for instrumental reasons, in its attempts to optimize for the user's values.)

(I suspect lots of people, when imagining "simulating lots of situations in order to understand User's values", end up sampling those imagined situations from something like "the set of situations I've experienced". But our experiences are very strongly biased towards desirable situations -- we spend most of our time optimizing our lives, after all. Given that human values are rather fragile/selective, I suspect the vast majority of all possible experiences/situations would rank very very low according to "human values"; no adversarial selection needed: most actually random experiences are probably pretty unpleasant.)

Given the above, I'm confused as to why one would propose directly incentivizing "predict the User". Do others disagree with the risks, or think the risks are worth the benefits, or something else?

That is explicitly why the predictor is scored on how well it fulfills the user’s values, and not merely on how well it predicts them. I noted that an AI merely trained to predict would likely destroy the world and do things like dissecting human brains to better model our values.

Yep, I understood that you intended for the Predictor to also/primarily be scored on how well it fulfills the User's values.

I'm modeling our disagreement something like this:

  • Aiyen: It could be a good idea to directly incentivize a powerful AI to learn to predict humans, so long as one also directly incentivizes it to optimize for human values.

  • rvnnt: Directly incentivizing a powerful AI to learn to predict humans would likely lead to the AI allocating at least some fraction of its (eventually vast) resources to e.g. simulating humans experiencing horrible things. Thus it would probably be a very bad idea to directly incentivize a powerful AI to learn to predict humans, even if one also incentivizes it to optimize for human values.

Does that seem roughly correct to you? (If yes, I'm curious how you'd guarantee that the Predictor does not end up allocating lots of resources to some kind of mindcrime?)

I found this post to be interesting. I feel it might be waving at something important. Thanks for writing it.

I'm confused by a bunch of things here, notably point (4.) in the five-point list at the end of the post. I think it's doing a lot of work, and likely hiding some difficult problems. Below are some notes and questions. If OP (or anyone else) feels like answering the questions, or pointing out where I'm wrong/confused, that'd be really neat.


If the Predictor is sufficiently intelligent to model human-level Users, it would not be safe to train/run.(?) If, on the other hand, the Predictor is too dumb to model human-level Users, then it's useless for learning "human values".

If so, then: The Predictor would need to be kept dumb during training (before it has learned human values); e.g. by limiting the number of parameters in the Predictor model. But also, when deployed, it would need to be able to gain further capabilities -- at least up till "human level".

And yet, when deployed in the real world, the Predictor would lose access to the reward function built into the training environment.

Thus, questions:

  • How does the Predictor learn/gain capabilities during deployment? How is it scored/rewarded?
  • How, concretely, do we safely move from the training regime to the deployment regime? (Use different-sized models in training and deployment, and somehow (how?) transfer the smaller (trained) model's capabilities to the larger one, and then... ???)


If the Predictor's reward function is of type UserState -> Reward, then the Predictor would likely end up learning to deceive/hack the User. I'm assuming the Predictor's reward function is of type EnvironmentState -> Reward.(?)

But then: It seems the main benefit of the proposed scheme would come from training the Predictor to interact with other agents in order to learn their goals. (If not, then how does this scheme differ meaningfully from one where the Predictor-agent is alone in an environment, without a User-agent?)

If that's so, then: The usefulness of this approach would depend critically on things like

  • the interface / "interaction channels" between User and Predictor
  • the User's utility function being very hard to learn without interacting with the User
  • the User's utility function also depending on the User's internal state

Is that correct?
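
For concreteness, here is a rough sketch of the two reward-function types mentioned above. The state fields, the two actions and the numbers are all hypothetical; the only point is that a reward of type UserState -> Reward can be maximized by changing the User rather than the world.

```python
from dataclasses import dataclass, replace
from typing import Callable

Reward = float


@dataclass(frozen=True)
class UserState:
    felt_satisfaction: float  # toy stand-in for the User's internal state


@dataclass(frozen=True)
class EnvironmentState:
    user: UserState
    apples_delivered: int  # toy world fact the User is assumed to care about


UserReward = Callable[[UserState], Reward]
EnvReward = Callable[[EnvironmentState], Reward]


def user_state_reward(u: UserState) -> Reward:
    # Type UserState -> Reward: sees only the User's internal state.
    return u.felt_satisfaction


def environment_reward(e: EnvironmentState) -> Reward:
    # Type EnvironmentState -> Reward: here it scores a fact about the world.
    return float(e.apples_delivered)


def deliver_apple(e: EnvironmentState) -> EnvironmentState:
    # Changes the world; the User ends up a little more satisfied as a result.
    happier = UserState(e.user.felt_satisfaction + 1.0)
    return EnvironmentState(user=happier, apples_delivered=e.apples_delivered + 1)


def hack_user(e: EnvironmentState) -> EnvironmentState:
    # Directly sets the User's internal state and leaves the world untouched.
    return replace(e, user=UserState(felt_satisfaction=100.0))


ACTIONS = [deliver_apple, hack_user]
start = EnvironmentState(user=UserState(0.0), apples_delivered=0)

best_for_user_reward = max(ACTIONS, key=lambda a: user_state_reward(a(start).user))
best_for_env_reward = max(ACTIONS, key=lambda a: environment_reward(a(start)))
print(best_for_user_reward.__name__, best_for_env_reward.__name__)
```

This says nothing about how hard it is to write a good `environment_reward` for real human values, which is a separate problem.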


  • How does the Predictor learn/gain capabilities during deployment? How is it scored/rewarded?
  • How, concretely, do we safely move from the training regime to the deployment regime? (Use different-sized models in training and deployment, and somehow (how?) transfer the smaller (trained) model's capabilities to the larger one, and then... ???)


If it has properly aligned meta-learning of values, it might be possible to allow that to direct continuing learning on the object level. And even if its object-level model of human values were static, it could still continue learning about the world, with that model providing the scoring needed.

If the Predictor's reward function is of type UserState -> Reward, then the Predictor would likely end up learning to deceive/hack the User. I'm assuming the Predictor's reward function is of type EnvironmentState -> Reward.(?)

But then: It seems the main benefit of the proposed scheme would come from training the Predictor to interact with other agents in order to learn their goals. (If not, then how does this scheme differ meaningfully from one where the Predictor-agent is alone in an environment, without a User-agent?)

If that's so, then: The usefulness of this approach would depend critically on things like

  • the interface / "interaction channels" between User and Predictor
  • the User's utility function being very hard to learn without interacting with the User
  • the User's utility function also depending on the User's internal state

Is that correct?

The predictor's reward function comes from being able to fulfill the user's values in training. It's deliberately not based on the user's state (for exactly the reason you state; such a reward would incentivize hacking the user!) or on the object-level environmental state (which is hard to specify in a way that doesn't result in catastrophe). The hope is that by training an intelligent agent to learn utility functions superhumanly well, it might be able to specify our values to an acceptable level, even though we ourselves cannot articulate them well enough (which is why EnvironmentState -> Reward is hard to do right).

The main benefit comes from that prediction, which may come from interaction with other agents or passive observation, or more likely a mixture of the two. You are correct that this depends critically on having a sufficiently clear channel between the user and predictor that it can do this. However, if the user's utility function isn't hard to learn without interacting with it, so much the better! The goal is an AI that's good at aligning to a given user's values, whether or not it has to interact to do so. Or are you asking if such a user wouldn't be good enough training, given that you expect human values to be unlearnable without interacting with us? That's a valid concern, but one I hope adversarial training should solve.

Why would the user's utility function need to depend on its internal state? Is that because our utility functions do, and you're worried that a predictor that hadn't seen that before wouldn't be able to learn them? If so, again valid, and hopefully again resolvable by adversarial training.

I still feel like there's a big hole in this scheme. Maybe I'm just not getting something. Here's a summary of my current model of this scheme, and what I find confusing:

  • We train an agent (the Predictor) in a large number of different environments, with a large number of different utility/reward functions. (And some of those reward functions may or may not be easier to learn by interacting with certain parts of the environment which we humans think of as "the User agent".)

  • Presumably/hopefully, this leads to the Predictor learning things like caution/low impact, or heuristics like "seek to non-destructively observe other agent-like things in the environment". (And if so, that could be very useful!)

  • But: We cannot use humans in the training environments. And: We cannot train the Predictor to superhuman levels in the training environments (for the obvious reason that it wouldn't be aligned).

  • When we put the Predictor in the real world (to learn human values), it could no longer be scored the same way as in the training environments --- because the real world does not have a built-in reward function we could plug into the Predictor.

  • Thus: When deployed, the Predictor would be sub-human, and also not have access to a reward function.

  • And so, question: How do we train the Predictor to learn human values, and become highly capable, once it is in the real world? What do we use as a reward signal?
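
A toy sketch of the gap described in the list above. All names here (`Environment`, `make_training_envs`, `score`) are made up for illustration, and the reward functions are arbitrary stand-ins. The only point is that the scoring step used in training has nothing to call once the environment is the real world.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Environment:
    name: str
    # Training sandboxes ship with a reward function; the real world does not.
    reward_fn: Optional[Callable[[float], float]]


def make_training_envs(n: int, seed: int = 0) -> List[Environment]:
    # Many sandboxes, each with its own hidden target the agent is scored against.
    rng = random.Random(seed)
    envs = []
    for i in range(n):
        target = rng.uniform(-1.0, 1.0)
        envs.append(Environment(f"sandbox-{i}", lambda a, t=target: -abs(a - t)))
    return envs


def score(env: Environment, action: float) -> Optional[float]:
    # Returns None when the environment has no built-in reward function.
    if env.reward_fn is None:
        return None
    return env.reward_fn(action)


training = make_training_envs(3)
real_world = Environment("real-world", reward_fn=None)

for env in training + [real_world]:
    print(env.name, score(env, 0.0))
```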

The closest thing to an answer I found was this:

And even if its object-level model of human values were static, it could still continue learning about the world, with that model providing the scoring needed.

But IIUC this is suggesting that we could/should build a powerful AI with an imperfect/incorrect and static (fixed/unchangeable) model of human values, and then let that AI bootstrap to superintelligence, with that model as its reward function? But that seems like an obviously terrible idea, so maybe I've misunderstood something? What?