Hey everyone, I'm Deric. I'm new to LW but I've read through the new user guide and I'm very impressed and excited that a place like this actually exists on the internet. I was actually sent here after a conversation I had with Gemini regarding AGI and specifically instrumental convergence.
To preface: I'm a Game Designer (Systems/Monetization) from Winnipeg. I went to school for Electrical Engineering but didn't finish my degree, as I was offered a job in my current field that I couldn't refuse. I had Gemini and GPT-5 do some deep research on this idea to see if it had already been discussed, but it seems like it hasn't. I was initially going to make a top-level post about this, but after reading through the etiquette here, I've decided to just post it in this open thread so as not to offend anyone.
I came upon this idea by thinking about the alignment problem as analogous to a democracy, and extrapolating what a superintelligent democracy might look like. For example, there might be one person who really wants to turn the universe into paperclips, so they form the Paperclip Party; but this party would quickly gain zero political traction, and thus the policy would never be implemented.
In my opinion, in a perfect world, the voters in a democracy would vote for the policies they want implemented in their society, free from external pressures like family, media bias, or cults of personality. So effectively, they would be given sets of policy plans, unlabeled by party; they would fully comprehend these plans and then, using their own judgement based on their values and beliefs, rank them from 1 to n, where n is the total number of plans. After all plans were ranked, the plan with the lowest total score would be selected and put in place.
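To make that concrete, here's a minimal Python sketch of the tallying rule I have in mind (the plan names are made up for illustration). Strictly speaking, summing ranks like this is a Borda-style count rather than instant-runoff RCV, but it matches the "lowest total score wins" rule described above:

```python
from collections import defaultdict

def rank_sum_winner(ballots):
    """Pick the plan with the lowest total rank.

    ballots: one list per voter, ordered from most preferred
    (rank 1) to least preferred (rank n).
    """
    totals = defaultdict(int)
    for ballot in ballots:
        for rank, plan in enumerate(ballot, start=1):
            totals[plan] += rank
    # The plan with the lowest summed rank across all voters wins.
    return min(totals, key=totals.get)

# Three hypothetical voters ranking three unlabeled plans:
ballots = [
    ["plan_a", "plan_b", "plan_c"],
    ["plan_b", "plan_a", "plan_c"],
    ["plan_a", "plan_c", "plan_b"],
]
print(rank_sum_winner(ballots))  # plan_a (total rank 4)
```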
Following this thought process through to AGI would look something like this:
1. Create as many different sub-AGIs as possible that each distill a positive human value. For example: AGI-Economy would approve of plans that promote a strong economy. AGI-Ecology would approve of plans with low impact on nature. AGI-Liberty would approve of plans that keep humans free. The list could go on. In addition to Ranked Choice Voting (RCV), each sub-AGI would also give each plan a utility score from 0 to 100 for how well the plan aligns with its goals; the reason for this is explained in the next step.
2. Instead of asking the main superintelligence to make paperclips, you instead ask it to generate a fixed number of plans to put in place to maximize paperclip production. These plans would then be sent to every sub-AGI, which would run RCV on them. The main AGI would then be given its reward based on two factors (a rough code sketch of both follows this list):
a. Ensuring an even distribution of sub-AGI satisfaction above a certain threshold. E.g. if every sub-AGI gave a plan a score above 90, the main AGI would get a reward for that plan. If any single sub-AGI rated a plan below its human-set threshold (for example, AGI-Liberty scoring a plan at 94 when its acceptance threshold was 95), the plan would be rejected, but which sub-AGI rejected the plan would be hidden from the main AGI to prevent reward hacking.
b. Ensuring an entropic distribution of rankings, such that the main AGI receives a higher reward the more evenly the sub-AGIs' RCV rankings are spread across the whole set of plans. For example, it would receive a high reward if each rank from 1 to n went to a distinct plan across the various sub-AGIs. This is to ensure it isn't just generating the same plan repeatedly to maximize reward.
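Here's the rough sketch of how factors (a) and (b) might work; the sub-AGI names, thresholds, and scores are all hypothetical, and a real implementation would obviously be far more involved. The veto check returns only pass/fail, so the main AGI never learns which sub-AGI rejected a plan, and the diversity bonus is the normalized entropy of the sub-AGIs' top-ranked picks:

```python
import math
from collections import Counter

def passes_vetoes(scores, thresholds):
    """Factor (a): a plan is accepted only if every sub-AGI's
    utility score (0-100) meets that sub-AGI's own human-set
    threshold. Only the pass/fail bit is exposed, so the main
    AGI can't tell which sub-AGI vetoed a plan.
    """
    return all(scores[agi] >= thresholds[agi] for agi in thresholds)

def diversity_bonus(rankings):
    """Factor (b): normalized entropy of which plan each sub-AGI
    ranked first. Close to 1.0 when top picks are spread evenly
    across plans, 0.0 when every sub-AGI favors the same plan.

    rankings: dict mapping sub-AGI name -> ordered list of plan
    ids (each sub-AGI ranks the same full set of plans).
    """
    top_picks = Counter(ranks[0] for ranks in rankings.values())
    total = sum(top_picks.values())
    entropy = -sum((c / total) * math.log2(c / total)
                   for c in top_picks.values())
    num_plans = len(next(iter(rankings.values())))
    return entropy / math.log2(num_plans) if num_plans > 1 else 0.0

# The example from (a): AGI-Liberty scores 94 against a threshold
# of 95, so the plan fails -- but the caller only sees "False".
thresholds = {"AGI-Liberty": 95, "AGI-Ecology": 90, "AGI-Economy": 85}
scores = {"AGI-Liberty": 94, "AGI-Ecology": 97, "AGI-Economy": 99}
print(passes_vetoes(scores, thresholds))  # False

# Fully spread top picks maximize the diversity bonus:
rankings = {
    "AGI-Liberty": ["plan_a", "plan_b", "plan_c"],
    "AGI-Ecology": ["plan_b", "plan_a", "plan_c"],
    "AGI-Economy": ["plan_c", "plan_b", "plan_a"],
}
print(round(diversity_bonus(rankings), 2))  # 1.0
```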
Edit: The sub-AGIs would not be able to talk to each other, or even be aware of each other's existence, to prevent collusion. The main AGI would only be able to present plans to the sub-AGIs and receive scores back; it would have no write access to their code and no awareness of what metrics they were using to determine their scores.
I'd be really interested to know what people think about this concept. I wholeheartedly apologize if it has already been discussed, but like I said, I did some searching around, and while parts of this are definitely not new, I think the whole approach might be novel. Thanks to everyone who reads this.
- Demonde