Has anyone else noticed that this paper is much clearer on definitions and much more readable than the vast majority of AI safety literature, including much of what it draws on? It has a lot of definitions that could be put in an "encyclopedia for friendly AI," so to speak.

Some extra questions:

1. How much time/effort did it take you to write all this? What was the hardest part?
2. Are most systems today unintentionally corrigible simply b/c they are not complex enough to represent "being turned off" as a strong negative in their reward function?
3. Are Newcombian problems rarely found in the real world, but much more likely to be found in the AI world (esp. b/c the AI has a modeler that should model what it would do)?

[-]scasper5y*30

It's really nice to hear that the paper seems clear! Thanks for the comment.

I've been working on this since March, but at a very slow pace, and I took a few hiatuses. Most days when I'd work on it, it was for less than an hour. After coming up with the initial framework to tie things together, the hardest part was trying and failing to think of interesting ways in which most of the Achilles Heels presented could be used as novel containment measures. I discuss this a bit in the discussion section.

For 2-3, I can give some thoughts, but these aren't necessarily thought through much more than those of many other people one could ask.

2. I would agree with this. For an agent to even have a notion of being turned off, it would need some sort of model that accounts for this but which isn't learned via experience in a typical episodic learning setting (clearly, because you can't learn after you're dead). This would require a world model more sophisticated than anything that the model-based RL techniques I know of would be capable of by default.
3. I also would agree. The most straightforward way for these problems to emerge is if a predictor has access to the agent's source code, though sometimes they can occur if the predictor has access to some other means of prediction which cannot be confounded by the choice of what source code the agent runs. I write a little about this in this post. https://www.lesswrong.com/posts/xoQRz8tBvsznMXTkt/dissolving-confusion-around-functional-decision-theory
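
To make the source-code case concrete, here is a minimal sketch of a Newcomb-style setup. Everything in it is an illustrative assumption rather than something from the paper: the names `predict` and `play`, the conventional $1,000 / $1,000,000 payoffs, and the idea that the predictor "reads the source" by simply running the agent function.

```python
from typing import Callable

# An agent is modeled as a zero-argument function returning its choice.
Agent = Callable[[], str]

BOX_A = 1_000       # transparent box, always filled (assumed conventional value)
BOX_B = 1_000_000   # opaque box, filled only if one-boxing is predicted


def one_boxer() -> str:
    return "one-box"


def two_boxer() -> str:
    return "two-box"


def predict(agent: Agent) -> str:
    """Hypothetical predictor with source-code access: it just runs the agent."""
    return agent()


def play(agent: Agent) -> int:
    """Fill the opaque box based on the prediction, then let the agent choose."""
    prediction = predict(agent)
    opaque = BOX_B if prediction == "one-box" else 0
    choice = agent()
    # One-boxers take only the opaque box; two-boxers take both.
    return opaque if choice == "one-box" else opaque + BOX_A


payoffs = {agent.__name__: play(agent) for agent in (one_boxer, two_boxer)}
```

Because the prediction and the choice both come from the same code, the agent cannot pick its source and its action independently, which is what gives the problem its Newcombian character.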

[-]avturchin5y20

I have thought of a similar idea: "philosophical landmines" (PL) to stop unfriendly AI. PL are tasks which are simple in formulation but could halt an AI because they require an infinite amount of computation to solve. Examples include the Buridan's ass problem, the problem of whether the AI is real or just possible, the problem of being in a simulation or not, other anthropic riddles, and Pascal's mugging-like stuff.

The best such problems should not be published, as they could be used as our last defence against UFAI.

[-]Daniel Kokotajlo5y90

I think that an AI capable of being nerd-sniped by these landmines will probably be nerd-sniped by them (or by other ones we haven't thought of) on its own, without our help. The kind of AI that I find more worrying (and more plausible) is the kind that isn't significantly impeded by these landmines.

[-]avturchin5y30

Yes, landmines are the last level of defence, and have a very low probability of working (like 0.1 per cent). However, if an AI is stable against all possible philosophical landmines, it is a very stable agent and has a higher chance of keeping its alignment and not failing catastrophically.

[-]scasper5y30

Thanks for the comment. +1 to it. I also agree that this is an interesting concept: using Achilles Heels as containment measures. There is a discussion related to this on page 15 of the paper. In short, I think that this is possible and useful for some Achilles Heels, and would be a cumbersome containment measure for others, where the same thing could be accomplished more simply via bribes of reward.

The Achilles Heel Hypothesis for AI
Pitfalls for AI Systems via Decision Theoretic Adversaries