Research projects

I'm planning to start two research projects: one on model splintering/reward generalisation, and one on learning the preferences of irrational agents.

Within those projects, I'm aiming to work on subprojects:

that are posed in terms familiar to conventional ML;
that are interesting to solve from the conventional ML perspective;
and whose solutions can be extended to the big issues in AI safety.

The point is not just to solve the sub-problems, but to solve them in ways that generalise or point to a general solution.

The aim is to iterate on and improve these ideas quickly before implementing them. Because of that, these posts should be considered dynamic and liable to be re-edited, potentially often. Suggestions for, and modifications of, the design are valuable and may be incorporated into the top post.

Learning textual values from textual descriptions

Parent project: this is a subproject of the value learning project.

Background

This idea grew out of a conversation with Matija Franklin and Rebecca Gorman.

My Occam's razor paper with Sören Mindermann demonstrates that human behaviour alone does not provide enough data to define human preferences and values. I've grounded human values, ultimately, in our mental models, including our assessments of our own irrationality/preferences and of the irrationality/preferences of others.

But it is very difficult for an algorithm to use this information. In order for an algorithm to use people's judgements to assess the irrationality or preferences of another person, the algorithm would have to know that some human action was going on, that another human was reacting to it, how they were interpreting it, that they were judging it to be irrational/rational/preference-relevant/etc., and that this judgement was to be taken as normative or as a label. So it is not enough that the person is thinking "how irrational"; that interpretation must also be taken to be correct.

Setup

There is a shortcut to part of this process, though: psychology papers. Let's take the anchoring bias. As I've noted, we could see the anchoring bias as a preference for naming numbers close to numbers we have recently heard. But no-one thinks about it that way.

If we look at a seminal paper on anchoring bias, it details how students were presented with products, asked if they would buy those products for a price given by the last two digits of their social security numbers, and were then asked to stat...

AI Safety Subprojects
