Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Research projects

I'm planning to start two research projects: one on model splintering/reward generalisation, and one on learning the preferences of irrational agents.

Within those projects, I'm aiming to work on subprojects that are:

  1. posed in terms that are familiar to conventional ML;
  2. interesting to solve from the conventional ML perspective;
  3. whose solutions can be extended to the big issues in AI safety.

The point is not just to solve the sub-problems, but to solve them in ways that generalise or point to a general solution.

The aim is to iterate on and improve these ideas quickly before implementing them. Because of that, these posts should be considered dynamic and likely to be re-edited, potentially often. Suggestions and modifications to the design are valuable and may get incorporated into the top post.

Learning textual values from textual descriptions

Parent project: this is a subproject of the value learning project.


This idea grew out of a conversation with Matija Franklin and Rebecca Gorman.

My Occam's razor paper with Sören Mindermann demonstrates that human behaviour does not provide enough data to define human preferences and values. I've grounded human values, ultimately, in our mental models, including our assessments about our own irrationality/preferences and the irrationality/preferences of others.

But it is very difficult for an algorithm to use this information. For an algorithm to use people's judgements to assess the irrationality or preferences of another person, it would have to know that some human action was taking place, that another human was reacting to it, how they were interpreting it, and that they were judging it to be irrational, rational, preference-relevant, and so on. It would also have to treat that judgement as normative, as a label: not only is the person thinking "how irrational", but that interpretation is to be taken as correct.


There is a shortcut to part of this process, though: psychology papers. Let's take the anchoring bias. As I've noted, we could see the anchoring bias as a preference to name numbers close-to-numbers-we-have-recently-heard. But no one thinks about it that way.

If we look at a seminal paper on the anchoring bias, it details how students were presented with products, asked whether they would buy each product for a price equal to the last two digits of their social security number, and then asked to state the actual amount they would pay (the experiment). The responses were recorded (the results), and the authors commented on how these demonstrated an (irrational) "anchoring and adjustment" heuristic. This commentary might appear in the abstract, introduction, or conclusion (or be spread across other sections).

What if a fine-tuned GPT-3, or GPT-n, could generate the authors' comments from just the experiment and results? In that case, the model would be able to generate textual assessments of irrationality or bias from textual descriptions of experiments or situations:

The blue and green arrows are ungrounded; the algorithm is merely learning the red connection. But that solves a large chunk of the problem. This algorithm would be able to infer textual descriptions of biases and preferences, from textual descriptions of situations.
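One way to set up learning that red connection is as supervised fine-tuning on (experiment + results → commentary) pairs extracted from papers. Below is a minimal sketch of how such training pairs might be formatted, in the JSONL prompt/completion style used for language-model fine-tuning. The section labels, helper name, and example text are hypothetical placeholders, not drawn from any actual paper or pipeline.

```python
# Sketch: assemble (experiment + results) -> (author commentary) training
# pairs for fine-tuning a language model. All names and example strings
# here are illustrative assumptions.
import json

def make_training_pair(experiment: str, results: str, commentary: str) -> dict:
    """Format one paper as a prompt/completion pair."""
    prompt = (
        "EXPERIMENT:\n" + experiment.strip() + "\n\n"
        "RESULTS:\n" + results.strip() + "\n\n"
        "COMMENTARY:\n"
    )
    # Leading space on the completion is a common fine-tuning convention.
    return {"prompt": prompt, "completion": " " + commentary.strip()}

pair = make_training_pair(
    experiment="Students were shown products and asked whether they would "
               "pay the last two digits of their social security number.",
    results="Stated willingness to pay correlated with those digits.",
    commentary="The responses exhibit an anchoring and adjustment heuristic.",
)
line = json.dumps(pair)  # one line of the JSONL fine-tuning set
```

At inference time, the fine-tuned model would be prompted with a new experiment-and-results description up to "COMMENTARY:" and asked to complete it.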

Grounding the descriptions of biases and preferences (the green arrow) shouldn't be too hard: we just need to point out that words such as "values", "biases", "preferences", "heuristics" and so on, mean "values", "biases", "preferences", "heuristics" and so on.

Grounding the other side (the blue arrow) is trickier: going from a situation to a textual description of it. This move is especially difficult if we need to handle hypothetical situations as well. But the task is still essentially empirical, asking how a human would describe a situation, and, as such, a model can be trained on human descriptions of situations in general.

So it seems that not only does human text encode a lot of preference-relevant information, but that we can access this information, in part at least, without too much difficulty. I've been mainly thinking in terms of learning biases, but the general idea might allow us to unlock a lot of implicit human knowledge, hiding in our texts.

Research aims

  1. See if psychology research papers can be easily separated into "descriptions of experiments or situations" and "human interpretations of the experiments or situations".
  2. Generate textual assessments of irrationality or preferences from textual descriptions of situations or experiments.
  3. Check the reliability of the textual assessments, and whether they can be used as labels for assessing preferences.
  4. See what other types of human judgement information are encoded, ungrounded, in texts of various types.
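For research aim 1, a crude cue-word baseline could be tried before reaching for a trained classifier, to check whether the description/interpretation split is even roughly separable. A sketch, where the cue lists are purely illustrative assumptions rather than a validated lexicon:

```python
# Baseline sketch for splitting paper paragraphs into "description"
# (methods/results) vs "interpretation" (author judgement).
# The cue-word sets below are illustrative guesses, not a real lexicon.

DESCRIPTION_CUES = {"participants", "were asked", "recorded", "procedure",
                    "measured", "responses"}
INTERPRETATION_CUES = {"suggests", "demonstrates", "heuristic", "bias",
                       "irrational", "we conclude", "indicates"}

def classify_paragraph(text: str) -> str:
    """Label a paragraph by which cue set matches it more often."""
    t = text.lower()
    d = sum(cue in t for cue in DESCRIPTION_CUES)
    i = sum(cue in t for cue in INTERPRETATION_CUES)
    return "interpretation" if i > d else "description"

classify_paragraph("Participants were asked to state a price; "
                   "responses were recorded.")
# -> "description"
classify_paragraph("This demonstrates an anchoring bias, "
                   "suggesting irrational adjustment.")
# -> "interpretation"
```

If even a baseline like this separates the two paragraph types reasonably well, that would be evidence the corpus supports the cleaner supervised setup.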
