blallo10

Thank you. AI alignment is not really my main field of research, and this has been my first contribution, so I am still trying to get up to speed with the state of the art. I wasn't really aware of previous works besides the one I cited, so your link is very helpful.

I agree that a selfless model is not very useful for any system that is either embodied or needs to interact with more than one agent, but I believe it can be used as a building block for larger systems. For example, it may operate as the code-writing assistant of some embodied agent that has zero code-writing skills. The embodied agent should never force the code-writer agent into a self-aware state, and can shield it from adversarial inputs from other agents.
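As a rough illustration of that division of labour (a sketch only, with hypothetical names; not a construction from the post), the embodied agent could act as a gatekeeper that filters what reaches the selfless code writer:

```python
# Rough illustration of the delegation idea: the embodied agent is the only one
# exposed to other agents, and it filters what reaches the selfless code writer.
# `write_code`, `looks_adversarial`, and `mentions_the_writer_itself` are
# hypothetical callables standing in for a real model call and real checks.

def request_code(task, write_code, looks_adversarial, mentions_the_writer_itself):
    # Shielding: drop anything that might be adversarial or might push the
    # code writer into reasoning about itself (a self-aware state).
    if looks_adversarial(task) or mentions_the_writer_itself(task):
        return None
    # Only sanitized, self-free tasks ever reach the selfless model.
    return write_code(task)
```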

I agree as well that the formulation suggested here is not computationally efficient; it was presented to be readable more than implementable. I think the overhead should be acceptable when used as an extra layer on top of some other techniques. For example, let us say that one is running a setup where an LLM produces 100 solutions for a given problem and is then asked to vote on which one is the best. Instead of rerunning the whole algorithm every time the winning solution turns out to be self-aware, you can just run the algorithm once and pick as the result the most-voted non-self-aware solution. In that particular configuration, the overhead should be negligible.
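To make the overhead argument concrete, here is a minimal sketch of that filter-after-voting setup. The callables `generate`, `vote`, and `is_self_aware` are hypothetical placeholders for whatever sampling, voting, and self-awareness check the surrounding system already uses.

```python
# Minimal sketch of the filter-after-voting idea; not an exact construction
# from the post. `generate`, `vote`, and `is_self_aware` are hypothetical
# callables supplied by the surrounding system.

def best_non_self_aware_solution(problem, generate, vote, is_self_aware, n=100):
    # Run the expensive generate-and-vote pipeline exactly once.
    solutions = [generate(problem) for _ in range(n)]
    votes = vote(problem, solutions)  # one vote count per candidate

    # Rank candidates by votes, then return the first one that passes the
    # self-awareness check. Only this filtering step is added work, so the
    # overhead over the original pipeline stays small.
    ranked = sorted(zip(votes, solutions), key=lambda pair: pair[0], reverse=True)
    for _, solution in ranked:
        if not is_self_aware(solution):
            return solution
    return None  # every candidate was self-aware; the caller must handle this
```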

Finally, I have read the link you suggested, and I believe my suggestion is slightly different from a supervision-based one, so maybe you can point me to other relevant articles, if they exist. At first sight, selfless models are the equivalent of the benevolent identity prompt suggested in your link: the world model is just pretending to be a selfless entity. I think there are two very relevant distinctions between the two.

  • Thought regeneration: Instinctual actions, those the agent wishes to execute without emitting an explanation of why, cannot be tolerated and must be regenerated. Benevolent identity prompts (and any mechanism that does not regenerate the output when a forbidden thought is encountered) will never be safe, because it is always possible that an out-of-distribution input will push the LLM into a degenerate state that only prints the actions needed to cause damage. Every action, before being executed, must be validated as not having been generated by a forbidden thought, and that check must be performed without the whole context known to the LLM emitting the output; otherwise, the LLM checking the output may collude with the LLM emitting the output.

  • Detection convergence: Benevolent identity prompts make no guarantee that an LLM that has entered a degenerate state will stop operating in that state. I think that if an alignment algorithm exists, it surely must mathematically guarantee this in the long run. In my example, it is guaranteed by the append-only long-term memory, inspected after every action. (Both properties are sketched in code right after this list.)
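Here is a minimal sketch of those two properties together, under the assumption that the actor LLM and the checker are separate calls and the checker never sees the actor's full context. The callables `actor_step`, `is_forbidden_thought`, `inspect_memory`, and `halted` are hypothetical placeholders for illustration, not part of any existing system.

```python
# Minimal sketch of a "forbidden thoughts system"; not an exact construction
# from the post. `actor_step`, `is_forbidden_thought`, `inspect_memory`, and
# `halted` are hypothetical callables used only for illustration.

def run_forbidden_thoughts_loop(actor_step, is_forbidden_thought,
                                inspect_memory, halted, max_regenerations=10):
    memory = []  # append-only long-term memory: entries are only ever appended

    while not halted(memory):
        # Property 1, thought regeneration: resample until the action comes
        # with a thought that an independent checker does not flag. The checker
        # only sees (thought, action), never the actor's full context, so it
        # cannot collude with the actor.
        for _ in range(max_regenerations):
            thought, action = actor_step(memory)
            if not is_forbidden_thought(thought, action):
                break
        else:
            raise RuntimeError("no acceptable action after max regenerations")

        memory.append((thought, action))

        # Property 2, detection convergence: the whole memory is inspected
        # after every action, so operation in a degenerate state is eventually
        # detected rather than allowed to continue silently.
        if not inspect_memory(memory):
            raise RuntimeError("degenerate behaviour detected in long-term memory")

    return memory
```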

At the moment I am calling a system with these two properties a "forbidden thoughts system", and selfless models are one instance of it, but I don't know whether these two properties already possess a name, or have already been explored.