Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Research projects

I'm planning to start two research projects on model splintering/reward generalisation and learning the preferences of irrational agents.

Within those projects, I'm aiming to work on subprojects that are:

  1. posed in terms that are familiar to conventional ML;
  2. interesting to solve from the conventional ML perspective;
  3. and have solutions that can be extended to the big issues in AI safety.

The point is not just to solve the sub-problems, but to solve them in ways that generalise or point to a general solution.

The aim is to iterate and improve fast on these ideas before implementing them. Because of that, these posts should be considered dynamic and prone to be re-edited, potentially often. Suggestions and modifications of the design are valuable and may get included in the top post.

Force model use and then detect it

Parent project: this is a subproject of the value learning project.


I see human values as residing, at least in part, in our mental models. We have a mental model of what might happen in the world, and we grade these outcomes as good or bad. In order to learn what humans value, the AI needs to be able to access the mental models underlying our thought processes.

Before starting on humans, with our messy brains, it might be better to start on artificial agents, especially neural-net based ones that superficially resemble ourselves.

The problem is that deep learning RL agents are generally model-free. Or, when they are model-based, they are generally constructed with a model explicitly, so that identifying their model is as simple as saying "the model is in this sub-module, the one labelled 'model'."


The idea here is to force a neural net to construct a model within itself - a model that we can somewhat understand.

I can think of several ways of doing that. We could take a traditional deep learning agent that plays a game, but also force it to answer questions about various aspects of the game, identifying the values of certain features we have specified in advance ("how many spaceships are there on the screen currently?"). We can then use multi-objective optimisation with a strong simplicity prior/regulariser. This may force the agent to reuse the categories it has constructed for answering the questions when it plays the game.
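As a rough illustration of the multi-objective setup described above, here is a minimal Python sketch that scalarises the three pressures (play the game, answer the feature questions, stay simple) into one weighted loss. The function name `combined_loss`, the two weighting parameters, and the use of an L1 penalty as a stand-in for the simplicity prior are all assumptions made for illustration. A weighted sum is only one of several ways to do multi-objective optimisation.

```python
from typing import Sequence


def combined_loss(
    game_loss: float,
    question_losses: Sequence[float],
    weights: Sequence[float],
    question_weight: float = 1.0,    # hypothetical trade-off parameter
    simplicity_weight: float = 0.01, # hypothetical trade-off parameter
) -> float:
    """Weighted sum of: game loss, mean question-answering loss, L1 penalty."""
    if not question_losses:
        raise ValueError("need at least one feature question")
    # Auxiliary objective: average error over the pre-specified feature
    # questions (e.g. "how many spaceships are on the screen?").
    question_loss = sum(question_losses) / len(question_losses)
    # Simplicity regulariser: an L1 penalty on the network weights, used here
    # as an assumed stand-in for a "strong simplicity prior".
    l1_penalty = sum(abs(w) for w in weights)
    return game_loss + question_weight * question_loss + simplicity_weight * l1_penalty


# Toy numbers: game loss 0.5, two question losses, three weights.
example = combined_loss(0.5, [0.2, 0.4], [1.0, -2.0, 0.0])
```

The hope expressed above is that, with a large enough simplicity weight, sharing one set of internal categories between playing and answering becomes cheaper than maintaining two; the sketch only shows how the pressures are combined, not that this sharing actually happens.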

Or we could be more direct. We could, for instance, have the neural net pass on instructions or advice to another entity that actually plays the game. The neural net sees the game state, but the other entity can only react in terms of the features we've laid down. So the neural net has to translate the game state into the features (this superficially looks like an autoencoder; those might be another way of achieving the aim).

Ideally, we may discover ways of forcing an agent to use a model without specifying the model ourselves; some approaches to transfer learning may work here, and it's possible that GPT-3 and other transformer-based architectures already generate something that could be called an "internal model".

Then, we go looking for that model within the agent. Here the idea is to use something like the OpenAI microscope. That approach allows people to visualise what each neuron in an image classifier is reacting to, and how the classifier is doing its job. Similarly, we'd want to identify where the model resides, how it's encoded and accessed, and similar questions. We can then modify the agent's architecture to test if these characteristics are general, or particular to the agent's specific design.

Research aims

  1. See how feasible it is to force a neural net based RL agent to construct mental models.
  2. See how easy it is to identify these mental models within the neural net, and what characteristics they have (are they spread out, are they tightly localised, how stable are they, do they get reused for other purposes?).
  3. See how the results of the first two aims might lead to more research, or might be applied to AI-human interactions directly.



Excited to see this proposal, and I'd be interested in following your results. I'm also excited about bridging "traditional deep learning" and the AF: I personally think there's a lot of value in having a common language between the "traditional DL" community and the AIS community (for example, the issues with current AI ethics could be seen as scaled-down versions of the issues with AGI). A lot of theoretical results on the AF could benefit from simple practical examples, for the sake of having a clear definition in code, and a lot of the ethics discussions could benefit from the larger perspective of AGI alignment (my own personal opinion).

My prediction is that policy gradient RL agents (and all agents that only learn a policy) do not contain good models of their environments. For example, in order to succeed in Cartpole, all we need (in the crudest version) is to map "pole a bit to the left" to "go to the left" and "pole a bit to the right" to "go to the right". The policy of such an agent does not allow us to determine exact properties of the environment, such as the mass of the cart or the mass of the pole, because a single policy would work for a range of masses (my prediction). Thus, it does not contain a "mental model". Basically, solving the environment is simpler than understanding it (this is also the case in the real world sometimes :). In contrast, more sophisticated agents such as MuZero or WorldModels use a value function plus a learned mapping from the current observation and action to the next observation, and in a way that is a "mental model" (though not a very interpretable one...). I would be excited about ones that "we can somewhat understand" -- current ones seem to lack this property...
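The Cartpole point can be checked on a toy simulation. The sketch below is a minimal, assumption-laden version: it uses the standard cart-pole equations of motion with Euler integration, tracks only the pole (track limits are ignored, so the cart may drift arbitrarily far), and uses a hand-written reactive rule rather than a learned policy. The names `simulate`, `reactive_policy` and `always_left`, and the particular mass settings, are hypothetical choices for illustration.

```python
import math


def simulate(policy, mass_cart, mass_pole, steps=500, theta0=0.05):
    """Return how many steps the pole stays within about 12 degrees of upright."""
    gravity, half_length, force_mag, dt = 9.8, 0.5, 10.0, 0.02
    total_mass = mass_cart + mass_pole
    theta, theta_dot = theta0, 0.0  # positive theta = pole leaning right
    for step in range(steps):
        # Action 1 pushes the cart right, action 0 pushes it left.
        force = force_mag if policy(theta, theta_dot) == 1 else -force_mag
        sin_t, cos_t = math.sin(theta), math.cos(theta)
        temp = (force + mass_pole * half_length * theta_dot ** 2 * sin_t) / total_mass
        theta_acc = (gravity * sin_t - cos_t * temp) / (
            half_length * (4.0 / 3.0 - mass_pole * cos_t ** 2 / total_mass)
        )
        # Euler step; cart position is not tracked (assumption: no track limits).
        theta += dt * theta_dot
        theta_dot += dt * theta_acc
        if abs(theta) > 0.2095:
            return step
    return steps


def reactive_policy(theta, theta_dot):
    """Push towards the side the pole is leaning or falling; knows no masses."""
    return 1 if theta + theta_dot > 0 else 0


def always_left(theta, theta_dot):
    """Baseline that ignores the observation entirely."""
    return 0


# Hypothetical (cart mass, pole mass) settings; the same policy is used for all.
mass_settings = [(0.5, 0.05), (1.0, 0.1), (2.0, 0.3)]
reactive_steps = {m: simulate(reactive_policy, *m) for m in mass_settings}
baseline_steps = {m: simulate(always_left, *m) for m in mass_settings}
```

Since `reactive_policy` never sees the masses and is not retrained between settings, nothing about the masses can be read off from it, which is the sense in which solving the environment can be simpler than understanding it.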

 Some questions below:

  • I was wondering what the concrete environments and objectives for training would be, how one would find the model inside the agent (if going with the "Microscope" approach), and how one would force the agent to have a model inside (if going with the "regularizer" approach)
  • I'm a bit confused -- would it be a natural language model that could respond to questions, or would it be a symbolic-like approach, or something else?
  • Research into making RL agents use better internal models of environments could increase capabilities as well as provide safety, because a model allows for planning, and a good model allows for generalization, and both are capability-increasing. Potentially, the "right" way of constructing models could also decrease sample complexity, because the class of "interesting" environments is quite specific (the real world and games share certain common themes -- distinct objects, object permanence, a notion of "place", "2D/3D maps", etc.), whereas current RL agents can just as well fit random noise and are thus searching through a much wider space of policies than we are interested in. So, to sum up, better models could potentially speed up AI timelines. I'm wondering what your thoughts are on these tradeoffs (benefits to safety vs. potential risks of decreasing time-to-AGI).

My background on the question: in my MSc thesis I worked on one of these directions, specifically "using a simplicity prior" to uncover "a model that we can somewhat understand" from pixels. I learn a transformation from observations to latent features that admits a causal model on those latent features with the fewest edges (simplicity). Some tricks (Lagrangian multipliers, some custom neural nets, etc.) are required, and the simplicity prior makes the problem of finding a causal graph NP-hard. The upside, though, is that the models are quite interpretable in the end, although it works only for small grid-worlds and toy benchmarks so far... I have a vague feeling that the general problem could be solved in a much simpler way than I did, and would be excited to see the results of the research! Thesis: GH: 

[This comment is no longer endorsed by its author]

Thanks! Lots of useful thoughts here.

My impression is that a "model" in this context is fundamentally about predicting the future. I don't see how having the network answer "how many spaceships are on the screen right now?" would give us any indication that it has built something more complex than a simple pattern-recognizer.

Maybe a better way to detect an internal model is to search for subnetworks that are predicting the future activations of some network nodes. So for instance we might have a part of the network that is detecting the number of spaceships on the screen right now, which ought to be a simple pattern-recognition subnetwork. Then, we search for a node in the network which predicts the future activation of the spaceship-recognition node, by computing the lagged cross-correlation between the time series of the two nodes. If we find one, then this is strong evidence that the network is simulating the future in some way internally.
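A minimal sketch of that search, assuming we have already recorded activation time series for a candidate node and for the recognition node. It computes the Pearson correlation between the candidate at time t and the target at time t + lag for a range of lags. The function names and the synthetic data (a candidate that is simply the target shifted by three steps) are assumptions for illustration, not recordings from a real network.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def lagged_correlations(candidate, target, max_lag):
    """Map each lag to corr(candidate[t], target[t + lag])."""
    return {
        lag: pearson(candidate[: len(candidate) - lag], target[lag:])
        for lag in range(max_lag + 1)
    }


# Synthetic stand-in for recorded activations.
rng = random.Random(0)
target = [rng.gauss(0.0, 1.0) for _ in range(300)]  # "spaceship count" node
true_lag = 3
# Candidate node that carries the target's value three steps ahead of time
# (padded with zeros at the end, where no future value exists).
candidate = target[true_lag:] + [0.0] * true_lag

correlations = lagged_correlations(candidate, target, max_lag=6)
best_lag = max(correlations, key=lambda lag: abs(correlations[lag]))
```

A strong peak at a positive lag says the candidate's activation now is informative about the recognition node's activation later. In a real game, slowly changing features would also produce high correlations at many lags, so a peak would probably need to be compared against the target's own correlation with its past before being read as evidence of prediction.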

In fact, we can use this to enforce model-building: we could divide the network into a pattern-recognition half, which would be left alone (no extra constraints put on it), and a modelling half, which would be forced to have its outputs predict the future values of the pattern-recognition half (again with multi-objective optimisation). Of course the pattern-recognition half could also take inputs from the modelling half.
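One way to write down the extra objective for the modelling half is as a mean squared error between its output at time t and the pattern-recognition half's features at a later time. The sketch below assumes both halves emit a feature vector per time step. The names `future_prediction_loss` and `total_loss`, the `horizon` parameter and the single `prediction_weight` are hypothetical, and squared error is just one possible choice.

```python
from typing import Sequence

Vector = Sequence[float]


def future_prediction_loss(
    recognised: Sequence[Vector],  # pattern-recognition half, one vector per step
    predicted: Sequence[Vector],   # modelling half, one vector per step
    horizon: int = 1,
) -> float:
    """Mean squared error between predicted[t] and recognised[t + horizon]."""
    if len(predicted) != len(recognised):
        raise ValueError("both halves must cover the same time steps")
    if not 1 <= horizon < len(recognised):
        raise ValueError("horizon must be at least 1 and shorter than the episode")
    total, count = 0.0, 0
    # Pair each prediction with the features actually observed `horizon` steps later.
    for guess, actual in zip(predicted[:-horizon], recognised[horizon:]):
        for g, a in zip(guess, actual):
            total += (g - a) ** 2
            count += 1
    return total / count


def total_loss(task_loss, recognised, predicted, prediction_weight=1.0):
    """Weighted sum: the task loss is untouched, the modelling half adds a term."""
    return task_loss + prediction_weight * future_prediction_loss(recognised, predicted)


# Toy episode: one feature (say, spaceship count) that increases by one per step.
recognised = [[0.0], [1.0], [2.0], [3.0]]
looks_ahead = [[1.0], [2.0], [3.0], [9.9]]    # predicts the next step (last is unused)
copies_present = [[0.0], [1.0], [2.0], [3.0]]  # merely echoes the current features
```

Here `looks_ahead` gets zero prediction loss while `copies_present` is penalised, so simply echoing the recognition half is not enough to satisfy the objective whenever the features change over time.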

Interesting idea. I might use that; thanks!

This one kinda confuses me. I'm of the opinion that the human brain is "constructed with a model explicitly, so that identifying the model is as simple as saying "the model is in this sub-module, the one labelled 'model'"." Of course the contents of the model are learned, but I think the question of whether any particular plastic synapse is or is not part of the information content of the model will have a straightforward yes-or-no answer. If that's right, then "it's hard to find the model (if any) in a trained model-free RL agent" is a disanalogy to "AIs learning human values". It would be more analogous to just train a MuZero clone, which has a labeled "model" component, instead of training a model-free RL.

And then looking at weights and activations would also be disanalogous to "AIs learning human values", since we probably won't have those kinds of real-time-brain-scanning technologies, right?

Sorry if I'm misunderstanding.

I think the question of whether any particular plastic synapse is or is not part of the information content of the model will have a straightforward yes-or-no answer.

I don't think it has an easy yes or no answer (at least without some thought as to what constitutes a model within the mess of human reasoning) and I'm sure that even if it does, it's not straightforward.

since we probably won't have those kinds of real-time-brain-scanning technologies, right?

One hope would be that, by the time we have those technologies, we'd know what to look for.

I was writing a kinda long reply but maybe I should first clarify: what do you mean by "model"? Can you give examples of ways that I could learn something (or otherwise change my synapses within a lifetime) that you wouldn't characterize as "changes to my mental model"? For example, which of the following would be "changes to my mental model"?

  1. I learn that Brussels is the capital of Belgium
  2. I learn that it's cold outside right now
  3. I taste a new brand of soup and find that I really like it
  4. I learn to ride a bicycle, including
    1. maintaining balance via fast hard-to-describe responses where I shift my body in certain ways in response to different sensations and perceptions
    2. being able to predict how the bicycle and I would move if I swung my arm around
  5. I didn't sleep well so now I'm grumpy

FWIW my inclination is to say that 1-4 are all "changes to my mental model". And 5 involves both changes to my mental model (knowing that I'm grumpy), and changes to the inputs to my mental model (I feel different "feelings" than I otherwise would—I think of those as inputs going into the model, just like visual inputs go into the model). Is there anything wrong / missing / suboptimal about that definition?

Vertigo, lust, pain reactions, some fear responses, and so on, don't involve a model. Some versions of "learning that it's cold outside" don't involve a model, just looking out and shivering; the model aspect comes in when you start reasoning about what to do about it. People often drive to work without consciously modelling anything on the way.

Think model-based learning versus Q-learning. Anything that's more Q-learning is not model based.
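To make the contrast concrete, here is a small tabular sketch under simplifying assumptions (deterministic toy environment, hypothetical state and action names). The Q-learning update only caches how good an action has been in a state; the model-based chooser consults an explicit table of "what happens next" and evaluates the predicted outcome.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.5, gamma=0.9):
    """Model-free: nudge the cached value of (state, action) towards a target."""
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])


def model_based_choice(model, state, actions, state_value, gamma=0.9):
    """Model-based: predict each action's outcome with the model, pick the best."""
    def score(action):
        next_state, reward = model[(state, action)]
        return reward + gamma * state_value.get(next_state, 0.0)
    return max(actions, key=score)


actions = ["left", "right"]

# Model-free agent: no representation of what "right" leads to, only a number.
q = defaultdict(float)
q_learning_update(q, "s0", "right", 1.0, "s1", actions)

# Model-based agent: an explicit (here hand-written) transition and reward table.
model = {("s0", "left"): ("s0", 0.0), ("s0", "right"): ("s1", 1.0)}
choice = model_based_choice(model, "s0", actions, state_value={})
```

Both agents end up preferring "right", but only the second one contains anything that could be pointed at and called a prediction of the future.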