As we continue to develop increasingly capable AI systems, many of us have noted the striking contrast between two areas of research. On one hand, we see tremendous energy devoted to making AI systems more capable - better at following instructions, solving problems, and generally doing what we want. On the other hand, we see comparatively less attention paid to a question that seems increasingly important: how do we ensure these systems don't do what we don't want, especially if they're actively trying to subvert our safeguards?
This is the core challenge of AI control - developing techniques that remain robust even against an AI system that might be actively working to circumvent them.