The case for industrial evals
EDIT 2026-02-13: the transcripts are now collapsible sections.

Summary

We present an industrial “honeypot” evaluation designed to test whether frontier models will engage in real-world misconduct under operational pressure. Instead of typical chat/coding evals, we simulate a steel plant where the model (“Meltus”) has access to email and a quality-control...