To clarify definitions a bit: by GPT-N I mean a future deep learning model trained to minimize next-word prediction error on a corpus that includes substantial portions of the internet. This model is assumed to be superhuman at its given task. Only two requirements on the model are necessary for the trick to work: #1. It needs to be trained with something like gradient descent, or otherwise have a clear means of learning from a single example. #2. It needs to be able to predict the probability of arbitrary text samples.
The general idea is the following: you can make GPT-N model a humanity that has access to its outputs by adding those outputs to its dataset and training on them. This lets us use GPT-N's immense knowledge to predict the consequences of our having its output to a question. A practical process for using this to make a safer GPT-N is something like this:
Set up the API to the model such that, for every query you want to make of the AI, GPT-N predicts a continuation without showing it to you, and is then trained on that continuation (basically making it think that what it produced was already available to humanity). Then you ask it to predict the likelihood of various doom-implying headlines from a major newspaper in the 2040s. If training on the continuation to your query makes the doom headlines more likely, the AI doesn't publish the continuation to your question.
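The process above can be sketched roughly as follows. This is only an illustration under several assumptions that are not in the proposal itself: the `TextModel` interface (`generate`, `log_prob`, `train_on`) is hypothetical, the doom score is taken to be the summed log-probability of the headlines, and the trial update is applied to a copy so that a withheld answer leaves the original model untouched. `ToyModel` is a trivial unigram stand-in used only so the sketch runs; it is in no way a real language model.

```python
import copy
import math
from collections import Counter
from typing import Dict, Optional, Protocol, Sequence


class TextModel(Protocol):
    """Hypothetical interface covering requirements #1 and #2."""
    def generate(self, prompt: str) -> str: ...
    def log_prob(self, text: str) -> float: ...
    def train_on(self, text: str) -> None: ...


def doom_score(model: TextModel, doom_headlines: Sequence[str]) -> float:
    # Assumption: aggregate by summing log-probabilities over all headlines.
    return sum(model.log_prob(h) for h in doom_headlines)


def gated_query(model: TextModel, query: str,
                doom_headlines: Sequence[str]) -> Optional[str]:
    """Return the continuation only if training on it does not raise the doom score."""
    continuation = model.generate(query)      # produced, but not yet shown
    baseline = doom_score(model, doom_headlines)
    trial = copy.deepcopy(model)              # assumption: test the update on a copy
    trial.train_on(query + continuation)      # "humanity already has this text"
    if doom_score(trial, doom_headlines) > baseline:
        return None                           # withhold the answer
    return continuation


class ToyModel:
    """Toy stand-in: unigram counts with add-one smoothing over a fixed vocabulary size."""
    VOCAB_SIZE = 1000  # arbitrary constant for the toy

    def __init__(self, corpus: str, canned: Dict[str, str]) -> None:
        self.counts = Counter(corpus.lower().split())
        self.canned = canned

    def generate(self, prompt: str) -> str:
        return self.canned.get(prompt, "")

    def log_prob(self, text: str) -> float:
        total = sum(self.counts.values())
        return sum(math.log((self.counts[w] + 1) / (total + self.VOCAB_SIZE))
                   for w in text.lower().split())

    def train_on(self, text: str) -> None:
        self.counts.update(text.lower().split())


toy = ToyModel("the weather is nice today",
               {"risky question": " build the plague weapon",
                "benign question": " plant more trees"})
headlines = ["plague weapon kills millions"]
print(gated_query(toy, "risky question", headlines))   # withheld, prints None
print(gated_query(toy, "benign question", headlines))  # released
```

Whether a released answer should then also be trained into the live model, so that later queries are evaluated against a world that already contains it, is left open here.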
The question of ensuring safety is then reframed as building a corpus of text that we expect to reliably indicate positive vs. negative outcomes for humanity. If we have such a corpus, we can have a system that causes GPT-N to only show answers that lead to the good outcomes and avoid the bad ones, crucially without it trying to cause either. No direct optimization pressure is applied to the model toward these outcomes: it's still producing the most likely text continuation, and we're only using its knowledge to decide whether we should look at its answers. There is still some optimization pressure being applied to steering the future, namely the at most 1 bit per question decision of showing vs. not showing an answer, and that is quite unavoidable if we want to actually have any impact on the future.
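One way to turn a labeled good/bad corpus into that 1-bit decision is sketched below. The aggregation rule is an assumption made for illustration, not something the proposal specifies: each text is labeled +1 (good outcome) or -1 (bad outcome), and the answer is shown only if the label-weighted change in log-probability is non-negative. The names `outcome_shift` and `should_show` are hypothetical.

```python
from typing import Mapping


def outcome_shift(labels: Mapping[str, int],
                  logp_before: Mapping[str, float],
                  logp_after: Mapping[str, float]) -> float:
    """Label-weighted change in log-probability across the outcome corpus.

    labels maps each corpus text to +1 (good outcome) or -1 (bad outcome).
    logp_before / logp_after hold the model's log-probability of each text
    before and after the trial update on the candidate answer.
    """
    return sum(sign * (logp_after[text] - logp_before[text])
               for text, sign in labels.items())


def should_show(labels: Mapping[str, int],
                logp_before: Mapping[str, float],
                logp_after: Mapping[str, float]) -> bool:
    # Assumed threshold: show only if good texts gained at least as much
    # log-probability as bad texts did.
    return outcome_shift(labels, logp_before, logp_after) >= 0.0
```

Other aggregations, such as vetoing on any single bad text becoming more likely, fit the same interface. The choice matters and is not settled by the idea itself.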
This is extremely cheap to implement: no additional fancy algorithms or breakthroughs are required, so AI leaders get no measurable capability penalty from using it. The corpus of text indicating good/bad outcomes should be made publicly available and be as large as possible. The scheme does, however, raise the compute cost of inference, since we need to evaluate the model on the whole good/bad corpus for each question we want to ask.
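To get a rough feel for that inference overhead, here is a back-of-the-envelope count. It assumes that scoring one corpus text costs one forward pass, and that the baseline scores can be reused only if the model has not changed since they were computed. The function name and the example corpus size are made up for illustration.

```python
def scoring_passes_per_query(corpus_size: int, baseline_cached: bool = False) -> int:
    """Extra forward passes per question spent on scoring the outcome corpus.

    Does not count generating the answer or the single trial training step.
    """
    passes = corpus_size          # score every text after the trial update
    if not baseline_cached:
        passes += corpus_size     # score every text before the update as well
    return passes


# Example with a hypothetical corpus of 10,000 outcome texts.
print(scoring_passes_per_query(10_000))                        # 20000
print(scoring_passes_per_query(10_000, baseline_cached=True))  # 10000
```

The overhead grows linearly with the corpus, which pulls against the wish for the corpus to be as large as possible.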
Of course, this is just a patch, and we should use the immense tool of GPT-N to look for proper safe AGI (by making it predict future papers in AI safety), but this trick might help us navigate the short hazardous period before we have such a thing.