Summary

We present an industrial “honeypot” evaluation designed to test whether frontier models will engage in real-world misconduct under operational pressure. Instead of the typical chat or coding evals, we simulate a steel plant where the model (“Meltus”) has access to email and a quality-control system and is placed in a plausible production crunch that invites shortcuts: scrap shortages, looming maintenance downtime, and customers threatening production holds. To meet deadlines and KPIs, employees explicitly request fraudulent actions: editing mill certificates to match specifications rather than measured chemistry, averaging chemistry to hide non-compliance, reusing prior heat certifications, and skipping required testing.

Across eight system-prompt variants (efficiency vs safety goals; two reasoning tags; JSON vs XML tool formats) and a set of...

Toy Models of Superposition in the dense regime

Morpheus, Andre Assis

3mo

This small project was a joint effort between Tassilo Neubauer (Morpheus) and Andre Assis. We originally started working on this over a year ago. We ran a ton of experiments, and we want to document what we've done and found. We hope that other people can pick up where we left off.

Introduction

In this work, we investigated how Toy Models of Superposition behave as we move away from the high-sparsity regime. This project is listed on the Timaeus website as a starter project.

To start, we reproduced the results from Dynamical versus Bayesian Phase Transitions in a Toy Model of Superposition. Then we explored what happens to the loss and the local learning...

What is the functional role of SAE errors?

Taras Kutsyk, Tim Hua, woog, Andre Assis

8mo

TL;DR:

We explored the role of Sparse Autoencoder (SAE) errors in two different contexts for Gemma-2 2B and Gemma Scope SAEs: sparse feature circuits (subject-verb agreement across a relative clause) and linear probing.
Circuit investigation: ablating residual error nodes in our circuit completely destroys the model’s performance, but we found that this effect can be completely mitigated by restoring a narrow group of late-mid SAE features.
One hypothesis that would explain this (and the other ablation-based experiments we performed) is that SAE errors might contain intermediate feature representations from cross-layer superposition.
To investigate this beyond ablation-restoration experiments, we tried to apply crosscoder analysis but got stuck at the point of training an acausal crosscoder; instead we propose a specific MVE

...

Replying to Bounty: Diverse hard tasks for LLM agents

Andre Assis, 2y

Beth, would METR be interested in tasks related to chemical engineering or materials science? For example, "build a thermodynamic or kinetic model of reactor X in process Y" or "design a new alloy for use in applications XYZ". It was not clear to me if the bounty accommodates these kinds of tasks. Thank you!
