[Note: This post was written by Peter N. Salib. Dan H assisted me in posting to Alignment Forum, but no errors herein should be attributed to him. This is a shortened version of a longer working paper, condensed for better readability in the forum-post format. This version assumes familiarity with standard arguments around AI alignment and self-improvement. The full 7,500 word working paper is available here. Special thanks to the Center for AI Safety, whose workshop support helped to shape the ideas below.]
Introduction
Many accounts of existential risk (xrisk) from AI involve self-improvement. The argument is that, if an AI gained the ability to self-improve, it would. Improved capabilities are, after all, useful for...
Hi Seth--the other author of the paper here.
I think there are two things to say in response to your question. The first is that, in one sense, we agree. There are no guarantees here. Conditions could evolve such that there is no longer any positive-sum trade possible between humans and AGIs. In that case, the economic-interactions model will no longer provide humans any benefits.
BUT, we think that there will be scope for positive-sum trade substantially longer than is currently assumed. Most people thinking about this (including, I think, your question above) treat the most important question as: Can AI automate all tasks, and perform them more efficiently (with fewer inputs) than humans? This, we...