Probe Experiment Brief: Testing the Reconciling-Capacity Hypothesis
> TL;DR: This is the runnable specification for testing a prediction from a structural model of alignment failure [LINK: https://fusiondynamics.org/alignment-short/]: that sycophancy, deceptive alignment, reward hacking, and harmful compliance share one activation-level signature. The test uses existing interpretability probes on existing benchmarks. Timeline: 1-2 weeks. A clean null kills the...
Apr 161