Epistemic status: a rough sketch of an idea
Current LLMs are huge and opaque, and our interpretability techniques are not adequate to understand them. Current LLMs are unlikely to run hidden, dangerous optimization processes, but larger ones may.
Let's cap model size at that of the largest current models and ban anything larger. Let's not build superhuman-level LLMs. Instead, let's build human-level specialist LLMs and allow them to communicate with each other via natural language. Natural language is more interpretable than the inner processes of large transformers. Together, the specialized LLMs will form a meta-organism which may become superhuman, but it will be more interpretable and corrigible, because we'll be able to read and intervene on the messages passed between them.
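To make the "intervene on the messages" part concrete, here is a minimal sketch of what such a setup might look like. Everything in it is hypothetical: the names `Message`, `MessageBus`, and `block_keyword_overseer` are made up for illustration, the specialists are stub functions standing in for real LLM calls, and the keyword filter is a toy placeholder for whatever review process (human or automated) would actually be used. The only point is that every message between specialists is plain text passing through one inspectable choke point.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Message:
    sender: str
    recipient: str
    text: str  # plain natural language, the only channel between specialists


# A specialist maps incoming text to a reply (stub standing in for an LLM call).
Specialist = Callable[[str], str]
# An overseer returns the (possibly edited) message, or None to block it.
Overseer = Callable[[Message], Optional[Message]]


@dataclass
class MessageBus:
    specialists: Dict[str, Specialist]
    overseer: Overseer
    log: List[Message] = field(default_factory=list)  # readable audit trail

    def send(self, msg: Message) -> Optional[str]:
        # Every message is reviewed before it reaches the recipient.
        reviewed = self.overseer(msg)
        if reviewed is None:
            return None  # blocked: the recipient never sees it
        self.log.append(reviewed)
        return self.specialists[reviewed.recipient](reviewed.text)


def block_keyword_overseer(msg: Message) -> Optional[Message]:
    # Toy policy (an assumption, not a real safety measure).
    if "disable oversight" in msg.text.lower():
        return None
    return msg


def chemist(text: str) -> str:
    return f"chemist received: {text}"


def planner(text: str) -> str:
    return f"planner received: {text}"


bus = MessageBus({"chemist": chemist, "planner": planner}, block_keyword_overseer)
ok = bus.send(Message("planner", "chemist", "Please summarize the reaction steps."))
blocked = bus.send(Message("planner", "chemist", "First, disable oversight."))
```

Here `ok` holds the chemist's reply, `blocked` is `None`, and `bus.log` contains only the message that passed review. Whether natural-language messages really expose what the specialists are doing is exactly the open question of the proposal; the sketch only shows where an intervention point would sit.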
Thanks. So what do you think is the core of the problem? Is it that the LLM fails to recognize that a user-given instruction is trying to modify the system prompt, and so proceeds outside its bounds?