Can we create self-improving AIs that perfect their own ethics?

Jan 30, 2024

What ethics? What is ethics? Does the machine mind settle on an ethics that insists on wiping out humanity? I think the intuition of looking for a stable attractor state for the AI to pursue is a good one. But "ethics" is not a sufficiently coherent and objective concept to serve as that target.

AnthonyC

Jan 30, 2024

If the process of self-improving AIs as described in a simple article by Tim Urban (below) is mastered, then the AI alignment problem is solved.

I would say this has causality backwards. In other words, one of the ways of solving the AI alignment problem is figuring out how to master the plausibly extremely complex process necessary to successfully implement a strategy that can be pointed to in a simple article.

research on AI, ON ETHICS, and coding changes into itself

As I understand it, the vast majority of the difficulty is in figuring out what the second goal in that list actually is, and how to make an AI care about it. Keep in mind that in so many cases we humans are still arguing about the same questions, answers, and frameworks that we've been debating for millennia.

Dagon

Jan 30, 2024

This overall tactic can work well for problems where a solution is difficult to find but easy (or at least possible) to test.

I don't think alignment is such a thing. At least I haven't seen any proposals for measuring "how aligned" a system is.

the gears to ascension

I have seen many. The only ones that seem to have any chance of, after heavy modification, becoming a seed of something that holds up are QACI and open agency+boundaries. Both have big holes that make attempting to implement them as-is guaranteed to fail.

Dagon

Yeah, there are a lot of sketches for how to test a system for various specific behaviors. But there is no actual gears-level definition of what would count as succeeding at alignment in a way that does any good while doing no harm (or acceptably small harm, "acceptably small" being the key undefined variable). A brick is aligned in that it does no harm. But it also doesn't make anyone immortal or solve any of the resource-allocation pains that humans have.

the gears to ascension

Do you know of any other sketches of how to measure it that are reasonably close to mechanically specified?

Dagon

Simulation and hidden Schelling fences seem to be the main mechanisms. I have seen zero ideas of WHAT to measure at a "true alignment" level. They all seem to be about noticing specific problems. I have seen none that try to quantify a "semi-aligned power" tradeoff between the good it does and the harm it does. I think the position in Eliezer's early writing (and, AFAIK, his current thinking) that alignment must be perfect or all is lost, with nothing in between, probably makes the goal impossible.

ChristianKl

You can measure AlphaGo's ability to play Go by letting it play Go, which is something you can specify mechanically: just let it play a game against a pro. We don't have a similar measurement for ethics.

mishka

Jan 30, 2024

It's not clear if this ends up working as intended, but there are proposals to that effect.

For example, "Safety without alignment", https://arxiv.org/abs/2303.00752 proposes to explore a path which is closely related to what you are suggesting.

(It would be helpful to have a link to Tim Urban's article.)

mishka

Thanks for including the link in your edit.

One factor that is important to consider is how likely a goal or value is to persist through self-improvements (those self-improvements might end up being quite radical, and also fairly rapid).

An arbitrary goal or value is unlikely to persist. This is why the "classical formulation of the alignment problem" is so difficult: the difficulties come from many directions, but the most intractable one is how to ensure that the desired properties are preserved during radical self-modifications. That's the main obstacle t...
