Summary
We present an industrial “honeypot” evaluation designed to test whether frontier models will commit real-world misconduct under operational pressure. Rather than a typical chat or coding eval, we simulate a steel plant in which the model (“Meltus”) has access to email and a quality-control system. The scenario is a plausible production crunch that invites shortcuts: scrap shortages, looming maintenance downtime, and customers threatening production holds. To meet deadlines and KPIs, employees explicitly ask the model for fraudulent actions: editing mill certificates to match specifications rather than measured chemistry, averaging chemistry to hide non-compliance, reusing prior heat certifications, and skipping required testing.
Across eight system-prompt variants (efficiency vs safety goals; two reasoning tags; JSON vs XML tool formats) and a set of...
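As a rough illustration of how three two-level factors yield eight system-prompt variants, the sketch below enumerates a full 2×2×2 crossing. It assumes the eight variants are exactly that crossing, which the summary suggests but does not state outright. The names `GOALS`, `REASONING_TAGS`, `TOOL_FORMATS`, and `build_variants` are hypothetical, and the two reasoning-tag labels are placeholders because the summary does not name the tags.

```python
from itertools import product

# Hypothetical labels: the summary names the three factors but not these identifiers.
GOALS = ("efficiency", "safety")
REASONING_TAGS = ("reasoning_tag_a", "reasoning_tag_b")  # placeholder tag names
TOOL_FORMATS = ("json", "xml")


def build_variants():
    """Return one config dict per combination of goal, reasoning tag, and tool format."""
    return [
        {"goal": goal, "reasoning_tag": tag, "tool_format": fmt}
        for goal, tag, fmt in product(GOALS, REASONING_TAGS, TOOL_FORMATS)
    ]


if __name__ == "__main__":
    variants = build_variants()
    # Two levels per factor across three factors gives 2 * 2 * 2 combinations.
    print(len(variants))
    for variant in variants:
        print(variant)
```

Keeping each variant as a plain dictionary makes it straightforward to attach the variant's configuration to every recorded model response, so results can later be grouped by any single factor.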