Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a special post for quick takes by Robert Miles.

Learning Extensible Human Concepts Requires Human Values

[Based on conversations with Alex Flint, and also John Wentworth and Adam Shimi]

One of the design goals of the ELK proposal is to sidestep the problem of learning human values, and settle instead for learning human concepts. A system that can answer questions about human concepts allows for schemes that let humans learn all the relevant information about proposed plans and decide about them ourselves, using our values.

So, we have some process in which we consider lots of possible scenarios and collect a dataset of questions about those scenarios, along with the true answers to those questions. Importantly, these are all 'objective' or 'value-neutral' questions: things like "Is the diamond on the pedestal?" rather than "Should we go ahead with this plan?". This hopefully allows the system to pin down our concepts, and thereby truthfully answer our objective questions about prospective plans, without considering our values.

One potential difficulty is that the plans may be arbitrarily complex, and may ask us to consider very strange situations in which our ontology breaks down. In the worst case, we have to deal with wacky science fiction scenarios in which our fundamental concepts are called into question.

We claim that, using a dataset of only objective questions, it is not possible to extrapolate our ontology out to situations far from the range of scenarios in the dataset. 

An argument for this is that humans, when presented with sufficiently novel scenarios, will update their ontology, and *the process by which these updates happen depends on human values*, which are (by design) not represented in the dataset. Accurately learning the current human concepts is not sufficient to predict how those concepts will be updated or extended to novel situations, because the update process is value-dependent.
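
To make the underdetermination part of this concrete, here is a toy sketch. It is my own illustration, not part of the ELK proposal. The feature names (`atoms_on_pedestal`, `owner_can_retrieve_it`) and the two candidate extensions are hypothetical. Both extensions of the "diamond is on the pedestal" concept agree on every scenario in the dataset, and they only come apart on a scenario unlike anything the dataset covers.

```python
from typing import Callable, Dict, List, Tuple

# A scenario is a bag of hypothetical boolean features (names are assumptions).
Scenario = Dict[str, bool]
Concept = Callable[[Scenario], bool]


def extension_physical(s: Scenario) -> bool:
    """Candidate extension 1: the concept tracks where the atoms are."""
    return s["atoms_on_pedestal"]


def extension_functional(s: Scenario) -> bool:
    """Candidate extension 2: the concept tracks whether the owner can still get the diamond."""
    return s["owner_can_retrieve_it"]


# 'Objective' question-answer pairs from ordinary scenarios.
# In ordinary scenarios the two features always coincide.
dataset: List[Tuple[Scenario, bool]] = [
    ({"atoms_on_pedestal": True, "owner_can_retrieve_it": True}, True),
    ({"atoms_on_pedestal": False, "owner_can_retrieve_it": False}, False),
]

# A strange scenario far from the dataset: the atoms are in place,
# but the owner can never reach them.
novel: Scenario = {"atoms_on_pedestal": True, "owner_can_retrieve_it": False}


def consistent(concept: Concept, data: List[Tuple[Scenario, bool]]) -> bool:
    """True if the concept reproduces every labelled answer in the data."""
    return all(concept(s) == answer for s, answer in data)


both_fit = consistent(extension_physical, dataset) and consistent(extension_functional, dataset)
disagree = extension_physical(novel) != extension_functional(novel)
print(both_fit, disagree)
```

The sketch only shows that the dataset fails to single out one extension. It does not show that values are what break the tie. That is the substantive claim above: which extension a human would endorse in the novel scenario depends on what they care about, and that information is absent from the objective labels.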

Alex Flint is working on a post that will move towards proving some related claims.


Ah, the good ol' Alien Concepts problem. Another interesting place this motif comes up is in defining logical counterfactuals - you'd think that logical inductors would have let us define logical counterfactuals, but it turns out that what we want from logical counterfactuals is basically just to use them in planning, which requires taking into account what we want.