I'm a rookie in the LessWrong/rationality community, but have been involved in tech for five decades. I also have real-life experience dealing with people who act irrationally and consciously accept the risks of getting caught violating societal norms.
This is a fascinating analysis! I'm here because I'm a fan of the rational approach (mostly), so this question may be a bit naive. It stems from my experience with government bureaucracy and how it tries to prevent violations and identify violators, which it does through independent third-party audits of an activity.
Would it be practical to train a language model to audit a model for unsafe behavior? That way a CoT could be...