Arne B

Hello, I am a 17-year-old programmer from Germany.

Arne B

Thank you again for your response! Someone taking the time to discuss this proposal really means a lot to me.

I fully agree with your conclusion of "unnecessary complexity" based on the premise that the method for aligning the judge is then somehow used to align the model, which of course wouldn't solve anything. That said, I believe there might have been a misunderstanding, because this isn't at all what this system is about. The judge, when controlling a model in the real world or when aligning a model that is already reasonably smart (more on this in the following paragraph), is always a human.

The part about using a model trained via supervised learning to classify good or bad actions isn't a core part of the system, but only an extension to make the training process easier. It could be used at the start of the training process, when the Agent, Police, and Defendant models possess only a really low level of intelligence (meaning the Police and Defendant models mostly agree). As soon as the models show even basic levels of intelligence, the judge immediately needs to be a human. This should have been better explained in the post, sorry.
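The hand-off described above can be sketched in a few lines. This is a minimal illustration, not the proposal's actual implementation: the function name `pick_judge`, the `disagreement_rate` signal, and the threshold value are all assumptions introduced here for clarity.

```python
# Hypothetical sketch of the judge hand-off: a supervised classifier
# judges only while Police and Defendant still mostly agree; once they
# start disagreeing (basic intelligence), a human must judge instead.
# All names and the threshold are illustrative assumptions.

def pick_judge(disagreement_rate: float, threshold: float = 0.1) -> str:
    """Map the Police/Defendant disagreement rate to a judge type."""
    if disagreement_rate < threshold:
        return "classifier"  # early training: cheap automated judge
    return "human"           # models show basic intelligence: human judges

# Early on the models mostly agree, so the classifier suffices:
assert pick_judge(0.02) == "classifier"
# Once they begin to disagree, judging passes to a human:
assert pick_judge(0.4) == "human"
```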

Of course there is a point to be made about the models pretending to be dumber than they actually are to avoid being replaced by the human, but this part of the system is only optional, so I would prefer to first talk about the other parts, since the system would still work without this step (if you want, I would love to come back to this later).

Mutual self-preservation

No punishment/no reward for all parties is an outcome that might become desirable to pursue mutual self-preservation.

At first I also thought this would be the case, but after thinking more about it, I concluded that this would go against the cognitive grooves instilled in the models during reinforcement learning. This is based on the "Reward is not the optimization target" post. This conclusion can definitely be debated.

Also, you talked about there being other assumptions; if you listed them, I could try to clarify what I meant.

I also don't consent to the other assumptions you've made

Thank you again for your time

Arne B

Thank you a lot for your response! The judge is already aligned because it's a human (at least later in the training process); I am sorry if this isn't clear in the article. The framework is used to reduce the number of times the human (judge) is asked to judge the agent's actions. This way the human can align the agent to their values while only rarely having to judge the agent's actions.

The section in the text:

At the beginning of the training process, the Judge is an AI model trained via supervised learning to recognize bad responses. Later, when the Defendant and Police disagree less, it is replaced by a human. Using this system allows the developer to align the model with their values without having to review each individual response generated by the AI.

You also talk about "collusion for mutual self-preservation", which I claim is impossible in this system because of the directly opposed reward functions of the Police, Defendant, and Agent models.

While this idea has been suggested previously, it is not safe on its own, as the AIs could potentially cooperate to achieve a shared objective. To address this issue, the following system has been designed to make cooperation impossible by assigning directly opposite goals to the AIs.

This means that the Police can't get rewarded without the Agent and Defendant getting punished, and vice versa. Them acting against this would constitute a case of wireheading, which I have assumed to be unlikely or impossible (see Assumptions). I would love to hear your opinion on this, because the system relies on this to stay safe.
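The zero-sum structure of this claim can be made concrete with a small sketch. This is purely illustrative, assuming a binary judge verdict and unit rewards; the function name `assign_rewards` and the reward values are assumptions introduced here, not part of the original proposal.

```python
# Illustrative sketch of "directly opposed rewards": the Police's reward
# is defined as the opposite of the Agent/Defendant side's reward, so no
# outcome can reward all parties at once (which rules out the
# "no punishment / no reward for everyone" collusion outcome).
# Names and reward values are assumptions for illustration.

def assign_rewards(judge_says_action_bad: bool) -> dict:
    """Judge verdict -> rewards, zero-sum between the Police and the
    Agent/Defendant side."""
    if judge_says_action_bad:
        return {"agent": -1, "defendant": -1, "police": +1}
    return {"agent": +1, "defendant": +1, "police": -1}

for verdict in (True, False):
    r = assign_rewards(verdict)
    # The Police is rewarded exactly when the Agent/Defendant are punished:
    assert (r["police"] > 0) != (r["agent"] > 0)
    assert r["agent"] == r["defendant"]
```

Under this structure there is no verdict in which every party receives a non-negative reward, which is the property the argument above relies on.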