Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

What are some substantial critiques of the agent foundations research agenda?

Where by agent foundations I am referrring the area of research referred to by Critch in this post, which I understand as developing concepts and theoretical solutions for idealized problems related to AI safety such as logical induction.

New Answer
Ask Related Question
New Comment

3 Answers sorted by

Some that come to mind (note: I work at MIRI):

I'd also include arguments of the form 'we don't need to solve agent foundations problems, because we can achieve good outcomes from AI via alternative method X and it's easier to just do X'. E.g., (from 2015) Paul Christiano's Abstract Approval-Direction.

Also, some overviews that aren't trying to argue against agent foundations may still provide useful maps of where people disagree (though I don't think e.g. Nate would 100% endorse any of these), like:

Nostalgebraist (2019) sees it as equivalent to solving large parts of philosophy: a noble but quixotic quest. (He also argues against short timelines but that's tangential here.)

Here is what this ends up looking like: a quest to solve, once and for all, some of the most basic problems of existing and acting among others who are doing the same. Problems like “can anyone ever fully trust anyone else, or their future self, for that matter?” In the case where the “agents” are humans or human groups, problems of this sort have been wrestled with for a long time using terms like “coordination problems” and “Goodhart’s Law”; they constitute much of the subject matter of political philosophy, economics, and game theory, among other fields.

The quest for “AI Alignment” covers all this material and much more. It cannot invoke specifics of human nature (or non-human nature, for that matter); it aims to solve not just the tragedies of human coexistence, but the universal tragedies of coexistence which, as a sad fact of pure reason, would befall anything that thinks or acts in anything that looks like a world.

It sounds misleadingly provincial to call such a quest “AI Alignment.” The quest exists because (roughly) a superhuman being is the hardest thing we can imagine “aligning,” and thus we can only imagine doing so by solving “Alignment” as a whole, once and forever, for all creatures in all logically possible worlds. (I am exaggerating a little in places here, but there is something true in this picture that I have not seen adequately talked about, and I want to paint a clear picture of it.)

There is no doubt something beautiful – and much raw intellectual appeal – in the quest for Alignment. It includes, of necessity, some of the most mind-bending facets of both mathematics and philosophy, and what is more, it has an emotional poignancy and human resonance rarely so close to the surface in those rarefied subjects. I certainly have no quarrel with the choice to devote some resources, the life’s work of some people, to this grand Problem of Problems. One imagines an Alignment monastery, carrying on the work for centuries. I am not sure I would expect them to ever succeed, much less to succeed in some specified timeframe, but in some way it would make me glad, even proud, to know they were there.

I do not feel any pressure to solve Alignment, the great Problem of Problems – that highest peak whose very lowest reaches Hobbes and Nash and Kolomogorov and Gödel and all the rest barely began to climb in all their labors...

#scott wants an aligned AI to save us from moloch; i think i'm saying that alignment would already be a solution to moloch

Stretching the definition of 'substantial' further:

Beth Zero was an ML researcher and Sneerclubber with some things to say. Her blog is down unfortunately but here's her collection of critical people. Here's a flavour of her thoughtful Bulverism. Her post on the uselessness of Solomonoff induction and the dishonesty of pushing it as an answer outside of philosophy was pretty good.

Sadly most of it is against foom, against short timelines, against longtermism, rather than anything specific about the Garrabrant or Demski or Kosoy programmes.