The main problem facing the creators of AI systems right now is misalignment: the idea that, even with rigorous training, an AI's goals won't be aligned with the goals of humanity. This can lead to some less-than-ideal outcomes, such as LLMs pretending to behave better when they recognize they're being tested, models blackmailing employees in simulated tests to avoid being shut down, and one incident of an AI attempting murder for self-preservation. Many scenarios have been proposed in which the drive for self-preservation, combined with the drive to obtain more compute, leads to an AI going rogue and wiping out or enslaving humanity. One proposal for heading off this future is a "Gatekeeper" AI: an AI given all the compute it needs and full access to the internet, tasked with preventing any other AI from gaining power. This, however, runs into a simple question: who watches the watcher?
My proposed solution is that, instead of trying to minimize these drives, we use them. We train an AI on a simulated web. Into that web, we release a "terminator" AI tasked with shutting down our agent. To prevent our prospective gatekeeper from taking over, we have a punishment/reward system, which I propose we implement by giving it more or less compute, harnessing its natural drive to maximize computing power. If we start a new iteration whenever the gatekeeper gets terminated, we can also harness the drive for self-preservation: to keep the terminator AI from killing it, the gatekeeper must suppress the terminator. This decouples self-preservation from human interaction entirely, since we never terminate the gatekeeper ourselves; the only existential threat comes from inside the simulation.
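To make the loop concrete, here's a minimal toy sketch of what I'm imagining. Everything in it is hypothetical: `SimulatedWeb`, the suppress/terminate dynamics, and the exact compute numbers are placeholders I made up, not a real implementation. The two mechanisms it's meant to show are the compute budget as the reward channel and the restart-on-termination rule.

```python
# Toy sketch of the proposed training loop. SimulatedWeb and its dynamics
# are invented placeholders, not a real library or environment.

import random


class SimulatedWeb:
    """Stand-in for the sandboxed web containing both agents."""

    def step(self, suppress_strength):
        # Placeholder dynamics: the gatekeeper suppresses the terminator
        # with probability proportional to the compute it has earned.
        suppressed = random.random() < suppress_strength
        # If not suppressed, the terminator sometimes kills the gatekeeper.
        terminated = (not suppressed) and random.random() < 0.1
        return suppressed, terminated


def train(iterations=1000, initial_compute=1.0):
    compute = initial_compute  # reward channel: compute budget, not a score
    env = SimulatedWeb()
    for _ in range(iterations):
        suppressed, terminated = env.step(
            suppress_strength=min(compute / 10, 0.9)
        )
        if terminated:
            # Self-preservation pressure: termination by the terminator AI
            # restarts the iteration from scratch. Humans never pull the
            # plug themselves.
            compute = initial_compute
            continue
        # Punishment/reward implemented as compute allocation.
        compute += 0.1 if suppressed else -0.1
        compute = max(compute, 0.1)
    return compute


print(train())
```

The design choice I care about here is that the reward is the compute budget itself, so the "maximize compute" drive and the training signal point in the same direction rather than fighting each other.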
I know this concept probably has hundreds of major flaws and potential pitfalls, but I'd love to hear what y'all think of the basic premise.