Great post, Peter. I think a lot about whether it even makes sense to use the term "aligned AGI", as powerful AGIs may break human intention for a number of reasons (https://www.lesswrong.com/posts/3broJA5XpBwDbjsYb/agency-engineering-is-ai-alignment-to-human-intent-enough).

I see you didn't refer to AIs becoming self-driven (as in Omohundro: https://selfawaresystems.files.wordpress.com/2008/01/ai_drives_final.pdf). Is there a reason you don't view this as part of the college kid problem?

[-]TAG3y31

Alignment needs something to align with, but it's far from proven that there is a coherent set of values shared by all humans.

[-]Peter S. Park3y21

Thank you so much for your kind words! I really appreciate it.

One definition of alignment is: Will the AI do what we want it to do? And as your post compellingly argues, "what we want it to do" is not well-defined, because it is something that a powerful AI could influence. In many settings, a term that is easier to pin down rigorously, like safe AI, trustworthy AI, or corrigible AI, could be more useful.

I would definitely count the AI's drive towards self-improvement as a part of the College Kid Problem! Sorry if the post did not make that clear.

[-][anonymous]3y21

You neglect possibility (4). This is what a modern-day engineer would do, and the method is used frequently.

If the environment, Epost, is out of distribution - measurable because there is high prediction error for many successive frames - our AI system is failing. It cannot operate effectively if its predictions are often incorrect*.
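A minimal sketch of that check, assuming a scalar prediction per frame and an absolute-error metric. The class name, the threshold, and the window size are all hypothetical choices for illustration, not anything a specific system uses:

```python
from collections import deque


class OutOfDistributionMonitor:
    """Flags the environment as out of distribution after a run of high-error frames."""

    def __init__(self, error_threshold: float, window: int):
        self.error_threshold = error_threshold  # assumed per-frame error limit
        self.window = window                    # assumed number of successive frames
        self.recent = deque(maxlen=window)      # True where a frame exceeded the limit

    def update(self, predicted: float, observed: float) -> bool:
        """Record one frame; return True only if the last `window` frames all had high error."""
        error = abs(predicted - observed)
        self.recent.append(error > self.error_threshold)
        return len(self.recent) == self.window and all(self.recent)
```

A single bad frame does not trip the monitor; only an unbroken run of them does, which is the "many successive frames" condition.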

What do we do if the system is failing?

One concept is that of a "limp mode". Numerous real-life embedded systems use exactly this, from aircraft to hybrid cars. Waymo autonomous vehicles, which are arguably a prototype AI control system, have this. "Limp mode" is a reduced-functionality mode where you enable just minimal features. For example, in a Waymo it might use a backup power source, a single camera, and a redundant braking and steering controller to bring the vehicle to a controlled, semi-safe halt.

Key note: the limp mode controllers have authority. That is, they are activated by things like interruptions in a watchdog message from the main system, or by a stream of 'health' messages from the main system. If the main system reports it is unhealthy (such as successive frames where predictions are misaligned with observed reality), the backup system physically takes away control. (This is done in a variety of ways, from cutting power to the main system to just ignoring its control messages.)
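As a rough sketch of that authority handoff, assuming the "ignore its control messages" variant and a latched handoff (my assumption: once the backup takes over, the main system does not get control back on its own). All names and the timeout are hypothetical, and the clock is passed in explicitly so the logic is easy to test:

```python
class LimpModeSupervisor:
    """Backup-side supervisor that decides whose control messages are obeyed."""

    def __init__(self, timeout_s: float, now: float = 0.0):
        self.timeout_s = timeout_s        # assumed max gap between health messages
        self.last_heartbeat = now
        self.main_has_authority = True

    def on_health_message(self, healthy: bool, now: float) -> None:
        """Handle one health message from the main system."""
        self.last_heartbeat = now
        if not healthy:
            # Main system reports it is unhealthy: take control away.
            self.main_has_authority = False

    def tick(self, now: float) -> None:
        """Called periodically; a missing heartbeat also revokes authority."""
        if now - self.last_heartbeat > self.timeout_s:
            self.main_has_authority = False

    def select_command(self, main_cmd: str, backup_cmd: str) -> str:
        """Forward the main command only while the main system holds authority."""
        return main_cmd if self.main_has_authority else backup_cmd
```

The point of the design is that the decision lives entirely on the backup side: the main system can lose authority by saying it is unhealthy or by going silent, but it has no way to veto the handoff.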

For AGI there appears to be a fairly easy and obvious way to have limp mode. The model in control in a given situation can be the one that scored the best in the training environment. We can allocate enough silicon to have more than one full-featured model hosted in the AI system. So we just switch control authority over to the one making the best predictions in the current situation.

We could even make this seamless and switch control authority multiple times a second - a sort of 'mixture of experts'. Some of the models in the mixture will have simpler, more general policies that will be safer in a larger array of situations.
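A minimal sketch of the switching rule, assuming each hosted model's recent prediction errors are already being tracked and that "best predictions" means lowest mean recent error. The function name and the model names in the usage note are hypothetical:

```python
def pick_controller(recent_errors: dict[str, list[float]]) -> str:
    """Return the name of the model with the lowest mean recent prediction error.

    `recent_errors` maps a model name to its non-empty list of recent per-frame errors.
    Ties go to whichever model appears first in the dict.
    """
    def mean_error(name: str) -> float:
        errors = recent_errors[name]
        return sum(errors) / len(errors)

    return min(recent_errors, key=mean_error)
```

For example, `pick_controller({"full": [0.9, 1.2], "simple": [0.4, 0.5]})` hands control to `"simple"`: in a strange environment the simpler, more general policy may be the one still predicting well. Calling this several times a second gives the seamless version.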

A mixture of experts system could easily be made self-improving, where models are being upgraded by automated processes all the time. The backend that decides who gets control authority, provides the training and evaluation framework to decide whether a model is better, and so on does not get automatically upgraded, of course.

*You could likely formally show this: intelligence is simply modeling the future probability distribution contingent on your actions and taking the action that results in the most favorable distribution. In a new and strange environment, your model will fail.
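One way to write that down, as a sketch rather than a proof. The symbols are my own assumptions: $s$ is the current state, $A$ the available actions, $\hat{P}$ the system's learned model of what happens next, and $U$ a score for how favorable an outcome is:

$$a^{*} = \arg\max_{a \in A} \; \mathbb{E}_{s' \sim \hat{P}(\cdot \mid s, a)}\left[ U(s') \right]$$

The chosen action is only as good as $\hat{P}$. If the true distribution $P(\cdot \mid s, a)$ is far from $\hat{P}(\cdot \mid s, a)$, as in an out-of-distribution environment, the maximization is carried out over the wrong distribution and $a^{*}$ can be arbitrarily poor under the real one.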
