benwade

10

1

12y

The Sledgehammer Anomaly: Why Pythia-1.4B Refused to Break

6

benwade

5mo

Abstract

Standard literature in mechanistic interpretability—specifically Representation Engineering (RepE)—suggests a strict "perplexity ceiling" for activation steering. We attempted to identify the breaking point of semantic weights in a Pythia-1.4B model by applying a "Sledgehammer" penalty (α=10.0) using a custom Centroid Repulsion loss. Contrary to expectations of catastrophic forgetting, the model remained stable. This post details the anomaly, proposes a distinction between the fragility of activations and the plasticity of weights, discusses the saturation curve we discovered, and examines the "Sticky Prior" failure mode that emerged from extreme geometric steering.

1. The Expectation: Collapse

If you have worked with Activation Steering (adding a vector to the residual stream at inference time), you know the drill: you identify a vector representing a concept (e.g., "Honesty"), and you add it with a coefficient...
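The steering operation described above can be sketched on a toy example. This is only an illustration under stated assumptions: the `steer` helper, the two-token residual stream, and the `honesty_direction` vector are hypothetical stand-ins rather than anything from the post, and a real setup would apply the same addition to a model's hidden states at a chosen layer.

```python
from typing import List

Vector = List[float]


def steer(residual: List[Vector], direction: Vector, alpha: float) -> List[Vector]:
    """Return a copy of the residual stream with alpha * direction added
    at every token position (hypothetical helper for illustration)."""
    if any(len(h) != len(direction) for h in residual):
        raise ValueError("hidden size of residual and direction must match")
    return [[h_i + alpha * d_i for h_i, d_i in zip(h, direction)] for h in residual]


# Toy residual stream: 2 token positions, hidden size 3 (made-up numbers).
residual = [[0.5, -1.0, 2.0], [1.0, 0.0, -0.5]]

# Hypothetical concept vector; in practice it would be extracted from the model.
honesty_direction = [1.0, 0.0, -1.0]

# The coefficient alpha scales how hard the activations are pushed.
steered = steer(residual, honesty_direction, alpha=2.0)
print(steered)
```

Setting `alpha=0.0` leaves the activations unchanged, so the coefficient is the single knob that controls how strongly the stream is pushed along the concept direction.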

Bayes' Theorem Illustrated (My Way)

benwade12y50

Thank you Komponisto,

I have read many explanations of Bayesian theory, and like you, if I concentrated hard enough I could follow the reasoning, but I could never reason it out for myself. Now I can. Your explanation was perfect for me. It enabled me to "grok" not only the Monty Hall problem but Bayesian calculations in general, and to retain the theory as well.

Thank you again, Ben
