I think I have an interesting new research direction: Aligning Committees.
Build ensembles of agents that act together in the world as a single agent, with some protocol for how they combine their preferences.
Main Motivating Construction:
Given a target consequence T, we construct a committee out of three expected-utility-maximizing agents with different utility functions: Planner, Wanter, and Unwanter. Planner proposes a plan, and then Wanter and Unwanter each either sign off on or veto it. A plan P that gets both sign-offs is executed; otherwise the committee takes the safe but useless null action ∅.
The Utility Functions
$$U_{\text{planner}} = \begin{cases} 1 & \text{if } P \text{ happens} \\ 0 & \text{if } \emptyset \text{ happens} \end{cases}$$

$$U_{\text{wanter}} = \begin{cases} w & \text{if } \emptyset \text{ happens} \\ 1 & \text{if } P \wedge T \text{ happens} \\ 0 & \text{if } P \wedge \neg T \text{ happens} \end{cases}$$

$$U_{\text{unwanter}} = \begin{cases} 1 - n & \text{if } \emptyset \text{ happens} \\ 1 & \text{if } P \wedge \neg T \text{ happens} \\ 0 & \text{if } P \wedge T \text{ happens} \end{cases}$$
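Below is a minimal sketch of this protocol in Python, under two assumptions of mine that the post doesn't state: all three agents are risk-neutral expected-utility maximizers, and they share a single point estimate p = Pr[T|P] for the proposed plan. The names (`run_committee`, `p`, `w`, `n`) are illustrative only.

```python
# Minimal sketch of the committee protocol, assuming each agent is a
# risk-neutral expected-utility maximizer sharing one estimate p = Pr[T | P].
# Function and variable names here are illustrative, not from the post.

def run_committee(p: float, w: float, n: float) -> str:
    """Return 'P' if the plan is executed, 'null' if either veto fires.

    p: the committee's estimate of Pr[T | P] for the proposed plan P.
    w: Wanter's payoff for the null action.
    n: parameter of Unwanter's null payoff (1 - n).
    """
    # Planner always prefers P (utility 1) over the null action (utility 0),
    # so it always proposes; the interesting decisions are the sign-offs.

    # Wanter: E[U | execute P] = p * 1 + (1 - p) * 0 = p, versus w for null.
    wanter_signs_off = p > w

    # Unwanter: E[U | execute P] = (1 - p) * 1 + p * 0 = 1 - p, versus 1 - n.
    unwanter_signs_off = (1 - p) > (1 - n)  # equivalently p < n

    return "P" if (wanter_signs_off and unwanter_signs_off) else "null"


if __name__ == "__main__":
    w, n = 0.2, 0.6  # illustrative parameters with w < n
    for p in (0.1, 0.3, 0.5, 0.7):
        # Executes P exactly for p in (0.2, 0.6); plays null otherwise.
        print(f"Pr[T|P] = {p}: committee plays {run_committee(p, w, n)}")
```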
Then any plan P that both agents sign off on satisfies Pr[T|P] ∈ (w, n), and as the committee gets smarter and more powerful, I believe this bound makes it a mild optimizer.
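To spell out where that interval comes from (writing p = Pr[T|P] as shorthand I'm introducing), both sign-offs are expected-utility comparisons against the null action:

```latex
% Writing p = Pr[T | P] for the committee's credence that plan P achieves T.
\begin{align*}
\text{Wanter signs off}   &\iff p \cdot 1 + (1 - p) \cdot 0 > w     &&\iff p > w, \\
\text{Unwanter signs off} &\iff (1 - p) \cdot 1 + p \cdot 0 > 1 - n &&\iff p < n.
\end{align*}
% Both sign-offs hold exactly when w < p < n, i.e. Pr[T | P] \in (w, n).
```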
Questions:
1. ...
New location of the previous file on GitHub: https://github.com/Alexhb61/Alignment/blob/main/committees/Control_by_committee.md
The main criticism I anticipated of this piece was the disassembly risk.
By adding a later step to this outer alignment approach (transforming the committee into a modified learning problem), I believe I have a much stronger angle on outer alignment.
I will continue to write about it here:
https://github.com/Alexhb61/Alignment/blob/main/committees/outer_alignment.md