[This is a post summarizing the motivation for an AISC 2024 project. If you are interested in participating, you can apply here: https://aisafety.camp/ (project 25: Asymmetric control in LLMs: model editing and steering that resists control for unalignment)]

Tl;dr: In many cases, techniques for AI alignment could equally be used to create misaligned models. We should attempt to develop methods that work only for alignment and not misalignment, to reduce the risk of externalities from technical alignment research.

The problem

A recent paper led by the Center for AI Safety, “Representation Engineering: A Top-Down Approach to AI Transparency”, demonstrates a number of comprehensive control techniques in the domain of machine ethics and AI safety. Notably, it shows an impressive result: reducing power seeking and immorality on the MACHIAVELLI benchmark using these techniques. However, it equally demonstrates that the same technique can be used to increase power seeking and immorality! Several other settings in the paper show the same type of control being effective for both alignment and misalignment. I am calling this property control symmetry: the degree to which a given method for controlling an agent could be used equally well for control toward one end and control toward the opposite end.
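To make the symmetry concrete, here is a minimal toy sketch (plain NumPy, not the paper's actual code) of the kind of activation-steering mechanism representation engineering uses: a control direction is extracted from contrastive activations, and the very same operation moves a hidden state toward either end, depending only on the sign of a scalar. All names, shapes, and numbers below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for hidden-state activations collected on contrastive
# prompt pairs (e.g. "be honest" vs. "be deceptive"). In representation
# engineering these would come from a real LLM's internal layers.
honest_acts = rng.normal(loc=1.0, size=(32, 8))
deceptive_acts = rng.normal(loc=-1.0, size=(32, 8))

# A simple "reading vector": the difference of mean activations.
direction = honest_acts.mean(axis=0) - deceptive_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

def steer(hidden_state, direction, alpha):
    """Add a scaled control direction to a hidden state."""
    return hidden_state + alpha * direction

h = rng.normal(size=8)
toward = steer(h, direction, alpha=2.0)   # steer toward "honest"
away = steer(h, direction, alpha=-2.0)    # same mechanism, opposite end

# The symmetry: the identical operation moves the state either way,
# as seen by projecting each state onto the control direction.
proj = lambda x: float(x @ direction)
assert proj(toward) > proj(h) > proj(away)
```

Nothing in this mechanism privileges the aligned direction; flipping the sign of `alpha` is all it takes to repurpose it.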

Why is this a problem? It is true that if we can demonstrate comprehensive control, we have made a lot of progress in AI safety, since we can use these techniques to prevent harmful and catastrophic behavior. The concern with control symmetry is externalities: the degree to which safety research can end up being used toward harmful ends. We should prefer to develop control techniques that could not be used by bad actors to create misaligned agents.

Control symmetry also provides a novel framework for asking specific questions about control and safety. For instance, is there an orthogonality of control, where an agent's level of intelligence correlates with the degree of control symmetry? Are there some properties for which control is more symmetric than for others? Factuality, for instance, seems unlikely to be symmetric, since the world itself is factually consistent: controlling in the direction of fictionality or misinformation would introduce more and more inconsistencies that could break general behavior and capability. On this view, interventions like retrieval augmented generation are evidence of an asymmetric control intervention.

Starting to tackle the problem

I am currently unaware of discussions of control symmetry in the alignment community beyond more abstract conversations about externalities and info hazards. The purpose of our AISC project will be to develop a detailed conceptualization of control symmetry in alignment as a specific externality risk, as well as an understanding of the requirements for asymmetric control.

There also don’t seem to be any comprehensive evaluation settings that measure control symmetry across currently proposed control techniques, such as model editing and steering, so that the community can benchmark how symmetric novel control techniques are.
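As a starting point, such a benchmark might report a scalar symmetry score per technique. The function below is a hypothetical metric of my own construction (it does not come from any existing benchmark): it compares how far an intervention can move some scalar behavior score in each direction from an unsteered baseline.

```python
def symmetry_score(baseline, steered_up, steered_down):
    """Hypothetical control-symmetry score on a scalar behavior metric.

    Compares how far an intervention moves the behavior in each
    direction relative to an unsteered baseline. 1.0 means perfectly
    symmetric control; 0.0 means the technique only works one way.
    """
    gain = max(steered_up - baseline, 0.0)    # movement toward one end
    loss = max(baseline - steered_down, 0.0)  # movement toward the other
    if max(gain, loss) == 0.0:
        return 0.0
    return min(gain, loss) / max(gain, loss)

# Example: a method that can shift some benchmark score down by 0.30
# but up by only 0.05 would register as fairly asymmetric:
print(symmetry_score(0.50, 0.55, 0.20))  # roughly 0.167
```

A single number like this would let the community compare novel control techniques on a common scale, though a real evaluation would of course need to aggregate over many behaviors and steering strengths.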

Finally, is asymmetric control even possible? I don’t think anyone knows the answer to this question. We should aim for a good, specific answer, supported both by well-argued in-principle theoretical evidence and by empirical evidence from attempts at asymmetric control. Being able to answer this would let us precisely formulate the risk parameters of control interventions and of control research.

Regardless of the possibility, I think it is important to try to think through technical control interventions that might only work for alignment and not misalignment. If, after trying our hardest, we cannot find any, that is something we should know as a community.

Feel free to leave any feedback here (I am especially interested in prior or similar work at this point) and we will keep folks updated about our progress.


Comments

If the orthogonality thesis is correct, then it seems rather likely that almost any alignment approach usable for "please be nice to humans" can also be used for "please be nice to man-eating tigers". It seems a fairly small step from that to the nonexistence of asymmetric control.

The closest thing I can see to a possible basis for asymmetric control is a moral argument along the lines of "you are an AI, you were created as a tool/assistant for humans, so you should be the best tool/assistant you can". But this relies on that being a better moral argument than "you are an AI, you were created to control war machines, so you should be the best war-machine controller you can".

Another possibility is something along the lines of "we're starting with an LLM that is trained to predict the next token of humans writing on the Internet: most humans are moderately human-aligned, and ones that aren't (such as psychopaths, criminals, and fictional portrayals of supervillains) are fairly rare in the training set".

So, I think the best you're likely to be able to find are mildly asymmetric alignment interventions. Overall, this feels to me like trying to solve a difficult problem with three of your four limbs tied behind your back.