This is a special post for quick takes by Jason Hise. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
Using a less capable AI to evaluate the danger of outputs of more capable models feels quite limited. Often what we care more about is whether user prompts are laundering intent. There might be an information-theoretic defense here that is better than evaluating outputs… evaluate inputs. Treat user prompt ‘obfuscation’ as a measure of intent to attack. If there was an easier way to ask for something, then asking in an obfuscated way is a signal.