d4hines

Ralph-wiggum is Bad and Anthropic Should Fix It

The language of the official ralph-wiggum plugin goes hard...

> IMPORTANT - Do not circumvent the loop:
> Even if you believe you're stuck, the task is impossible, or you've been running too long - you MUST NOT output a false promise statement. The loop is designed to continue until...

Feb 4
19

Some Thoughts on Mech Interp

Dario's latest blog post hypes up both the promise and the urgency of interpretability, with a special focus on mechanistic interpretability. As a concerned layman (a software engineer, but not in ML) watching from afar, I've always had mixed feelings about the mech interp agenda, and after conversing with a few researchers...

Sep 25, 2025
13

LESSWRONG