I think a main focus of alignment work should be on redesigning AI from the ground up. In doing so, I think we should keep in mind a set of desirable characteristics to aim for in a better AI.

Most of my ideas here aren't original to me, so I'll add some links to sources. For lots more background material, check out the posts associated with the tags on this post. If you're familiar with AInotkilleveryoneism background material, you'll probably already have come across the content of these links. What I haven't seen is all these ideas placed together as a set of goals for a research program to pursue, which is why I'm writing this post.

These are meant to be pointers towards more true and fundamental ideals, not complete descriptions without edge cases. Also, I feel sure that there are desirable qualities missing from this list. Hopefully readers will think of some and put them in the comments. Iteratively improving these descriptions would be an important aspect of the research.

What isn't covered: I don't think that even a model with all these characteristics would be safe for use by a malicious human or group. This post presumes wise, careful, kind operators with good intentions towards all of humanity. That's a big, fragile assumption, and something that I think needs more work.


Background: Theories of Impact of Interpretability, How Interpretability can be impactful, World-Model Interpretability

By interpretable, I mean that the model is fundamentally easy for human observers to understand. Observers should need as few special tools or techniques as possible to understand what's going on in the model's reasoning processes. I wouldn't include this on the list if I didn't think it was feasible. I believe, from years of studying neuroscience, that we can design a different kind of model which is inherently more interpretable. I will make another post talking about these hypotheses.


Background: Paul Christiano on corrigibility 

Corrigibility is, to my mind, a sort of wise and genuine obedience. A corrigible assistant is one that seeks to understand the spirit of your requests, not just the letter, and obeys both as far as it is able. Where there is confusion, it asks for clarification. When it thinks it cannot achieve what has been asked of it, it lets you know, and perhaps suggests alternative plans which might meet a similar set of goals. Where it foresees danger, it warns you before proceeding. If ordered to proceed anyway, it does so cautiously, trying to minimize risks and harm. This is the opposite of an Evil Genie that twists your expressed wishes in order to maximize your regret for making a wish. It is also very different from a Selfish Genie, which technically fulfills your wish, but uses the path it chooses as an excuse to fulfill its own aims (e.g. pursuit of Instrumental Goals).

A big part of this is allowing itself to be turned off or edited. Edits are inherently dangerous, though, since they might weaken the safety mechanisms of the model. Therefore, edits should be pursued cautiously and with lots of warnings. Corrigibility is, by its nature, not resistant to unwise or malicious actors trying to remove the corrigibility.



Background: Deceptive Alignment

By honest, I mean honesty in the straightforward sense used between humans, and also that the model reports its true motivations to the operators. It's most important of all that the model not deceive the operators, since deception of the operators is very dangerous. A secondary priority should be that it minimizes subterfuge towards other humans. This is an active sort of honesty: the model tries to give warnings ahead of time whenever it thinks the operators might be missing a critical piece of information. This goal of warning the operators is connected with the concept of 'cautious' discussed below.


Background: Avoiding Side Effects, «Boundaries/Membranes» and AI safety compilation

Being cautious is about trying to actively foresee and avoid risks. It also means respecting boundaries, and preferring actions with reversible effects so as to maintain optionality. To do this well, you need an accurate world model. A cautious actor notices possible harms that could come to itself and to those it values (e.g. humanity) from action or inaction. Where uncertain, a cautious actor should slightly prefer inaction. It avoids negative side effects. It also comes up with useful warnings to inform the operators of consequences they might be unaware of, but without flooding them with useless false positives.


Background: AI Safety in a World of Vulnerable Machine Learning Systems

By robust, I mean that the model minimizes vulnerability to adversarial inputs, whether engineered or coincidental. By coincidental adversarial input, I mean some sort of out-of-distribution input which would cause the model to behave drastically out of sync with the operators' expectations. By engineered adversarial input, I mean the sort of thing that is typically meant in the ML field by 'adversarial input': communication sent to the model which was deliberately shaped to try to make the model behave against the intent of the legitimate operators.

Robust also means being stable in the face of learning new information. I want an AI to have a deployment mode where it does 'careful online learning'. I'd like it to be able to learn new things in a way that does not risk introducing deep changes. It should not be vulnerable to 'brainwashing' or manipulation from adversarial inputs. Think of an adult with a stable personality, who can read some unusual news in the newspaper and still behave in a reasonable and predictable way. Another aspect of this is reliability over time: a consistency of performance, including behaving in cautious, reasonable ways in reaction to out-of-distribution inputs. Compare a polynomial fit (unreliable, prone to dramatic failure outside the range of its data) with a moving average. If you want a system that never deviates dramatically from the norm, a moving average is a much better bet.
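
To make the polynomial versus moving average comparison concrete, here is a minimal sketch in plain Python. The data, the function names (`lagrange_predict`, `moving_average_predict`), and the window size are all made up for illustration. The polynomial here is an exact interpolating polynomial through every point, which is the extreme case of a high-degree polynomial fit rather than a typical least-squares fit.

```python
def lagrange_predict(xs, ys, x_new):
    """Evaluate the unique degree-(n-1) polynomial through all points at x_new."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x_new - xj) / (xi - xj)
        total += term
    return total


def moving_average_predict(ys, window=3):
    """Predict the mean of the most recent observations, ignoring the input."""
    recent = ys[-window:]
    return sum(recent) / len(recent)


# Hypothetical observations: a steady norm of 10 with one slightly unusual reading.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [10.0, 10.0, 10.0, 10.0, 10.0, 11.0]

x_far = 8.0  # an out-of-distribution input, beyond anything observed

poly_out = lagrange_predict(xs, ys, x_far)  # roughly 66: far outside the observed range
ma_out = moving_average_predict(ys)         # roughly 10.33: stays close to the norm

print(f"polynomial at x={x_far}: {poly_out:.2f}")
print(f"moving average:        {ma_out:.2f}")
```

Every observation lies between 10 and 11, yet a single mild outlier is enough to push the polynomial's out-of-range prediction to about 66. The moving average can never leave the range of the values it has recently seen, which is the kind of stability I'd want in deployment. The cost is that it cannot track a genuine trend, and that is the same trade-off as accepting a less adaptive model in exchange for predictability.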

There will still need to be a training mode which is less stable, because you need to be able to create the desired 'personality' in the first place. I see stability as a desired characteristic of deployment, not training. If the world changed enough that the stability of your current model was feeling like a disadvantage, that would be a good signal that you'd need to retrain.


In conclusion, I think there's a lot to be gained from focusing less on what has worked in the past and focusing more on how we could do things better. In An attempt to steelman OpenAI's alignment plan, I discussed the setup I would design to proceed safely with alignment research done with the assistance of an approximately-AGI Alignment Researcher. Such a setup could potentially greatly accelerate the search for novel architectures which are Pareto improvements on these desiderata. I wouldn't recommend setting these as training objectives, since I think there's too much potential to deceptively Goodhart them. Rather, I'd recommend using a neutral, standardized training regime, and then evaluating after training how well the resulting AI fits these desiderata. Put the emphasis on the underlying algorithms and architecture rather than on the training regime.


(Thank you to <anonymous> AI alignment researcher who gave feedback and suggestions on this post.)
