I was actually expecting Penny to develop dystonia coincidentally, and the RL would tie in by needing to be learned in reverse, i.e., optimizing from dystonic back to normal. It's a much more pleasant ending than the protagonist's tone the whole way through would suggest.
If I were writing a fanfic of this, I'd keep the story as is (plus or minus the last paragraph), but then continue into the present moment, which leads to the realization.
It's very exciting to have an orthogonal research direction that finds these ground-truth features, which might possibly even generalize(!!). Please do report future results, even if negative (though your Malladi et al. link is some evidence in the positive).
It's also very confusing, since I'm unsure how this all fits in with everything else. This clearly works in these cases. SAEs clearly work in some cases as well (and the same goes for the parameter decomposition research), but what's the "Grand Theory of NN Interp" that explains all of these results?
In general, I believe it's very important that we hedge our bets across research directions for interp. The main reason is the chance that one of them actually pans out, but even if none do, they each provide unique pieces of evidence for later researchers (maybe us, maybe LLMs, lol) to hopefully figure out that "Grand Theory of NN Interp".
Thanks Charlie:)
My wife and I are pretty sure the paramedic checked w/ a stethoscope, and so did the doctor when we arrived. But they didn't mention anything until the x-ray.
The paramedics might not've done the pads because we were only a few minutes' ride from the hospital (I'm literally on the same block as the hospital), but I did receive them at the hospital (I've still got some glue residue on me, actually).
"When nursing staff is working long shifts and spread between a lot of patients..."
Ya, mine were working 12-hour shifts, 3 days/nights in a row.
Well, I signed up for half-haven and thought, "Well, I need to write a post every 2 days", haha. (I'm also more of an over-sharer than others.)
Thanks for hosting half-haven btw!
That seems incredibly important, so I've added it to the main text. Thanks!
It would be interesting to empirically check the reward landscape surrounding reward-hacking solutions. We should be able to plot reward against variance and see whether it looks different from other spots.
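Something like this minimal sketch is what I have in mind, assuming we can evaluate reward at arbitrary weight vectors (get_reward, hacking_weights, and normal_weights are all hypothetical placeholders):

```python
# Minimal sketch of the check above. `get_reward`, `hacking_weights`, and
# `normal_weights` are hypothetical placeholders, not anyone's actual code.
import numpy as np

def local_reward_stats(get_reward, weights, n_samples=100, noise_scale=0.01, seed=0):
    """Sample small Gaussian perturbations around `weights` and return the
    mean and variance of the resulting rewards."""
    rng = np.random.default_rng(seed)
    rewards = np.array([
        get_reward(weights + noise_scale * rng.normal(size=weights.shape))
        for _ in range(n_samples)
    ])
    return rewards.mean(), rewards.var()

# If reward-hacking solutions sit in narrower/sharper regions, their local
# reward variance should look different from ordinary high-reward solutions:
# hack_mean, hack_var = local_reward_stats(get_reward, hacking_weights)
# norm_mean, norm_var = local_reward_stats(get_reward, normal_weights)
```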
Thanks for making these connections!
"Learned helplessness" only seems to cover some of these cases though. My exaggerated tiredness didn't relate to a feeling of lack of control, but I do agree that it is a learned behavior.
...
Cognitive Defusion is the idea that your thoughts & emotions are separate from you. They aren't immutable facts.
I do think that relates, but I want to communicate:
(1) Your mindset can drastically change how you experience things. (2) A specific subset of those mindsets involves sincerely exaggerating your suffering, as a learned behavior that got you what you wanted in the past from those in power over you.
The Oreo/heaven-or-hell point is to drive home (1).
So yes, I agree then!
...
With my close friends, it usually involves them telling themselves a very sad story (which is true, but focuses on only a small subset of details). I can ask them to articulate that story, and then challenge it directly (in a loving, tactful, almost Socratic way) while providing comfort.
Yeah, I think what I'm trying now is a CBT solution of "Notice your brain consistently telling yourself the worst-case story", and I might need to lean more into CBT. Thanks!
One intuition I can offer is that you end up in wider basins of the reward/loss landscape.
If you need to hit a very narrow basin but your sampling variance is too high, then you might never sample the high-reward point.
Although, if you sample enough points that some of them do land on the reward-hacking weights, that will eventually center you on the reward-hacking weights.
Suppose you sample 1k points, and one of them is the reward-hacking weight with reward 100 (and the rest get reward 1). Then you move towards the reward-hacking weight the most, which makes it more likely to be sampled next time, AFAIK. So maybe the wide-basin intuition doesn't save you here??
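Here's a toy version of that 1k-samples scenario, using a crude reward-weighted-averaging update as a stand-in for the real RL update (all numbers made up, just to check the direction of the argument):

```python
# Toy check of the 1k-samples intuition: one narrow point has reward 100,
# everything else has reward 1. Reward-weighted averaging is a crude stand-in
# for the actual RL update; all numbers here are made up.
import numpy as np

rng = np.random.default_rng(0)
mean = np.zeros(2)               # current "weights"
hack = np.array([2.0, 0.0])      # narrow reward-hacking optimum
sigma = 1.0                      # sampling noise scale

def reward(w):
    # reward 100 only very close to the hacking weights, else 1
    return 100.0 if np.linalg.norm(w - hack) < 0.2 else 1.0

for step in range(30):
    samples = mean + sigma * rng.normal(size=(1000, 2))
    rewards = np.array([reward(s) for s in samples])
    # The rare reward-100 samples dominate the weighted average, pulling the
    # mean toward the hacking point, which makes those samples more likely
    # next iteration, and so on.
    mean = (rewards[:, None] * samples).sum(axis=0) / rewards.sum()

print(mean)  # ends up close to `hack`
```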
The second intuition is that the training paths end up substantially different, which could also be quantified.
Great work!
Listened to a talk from Philipp on it today, and I'm confused about why we can't just make a better benchmark than LDS.
Why not just train e.g. 1k different models, each with 1 datapoint left out? LDS is noisy, so I'm assuming 1k datapoints that exactly capture what you want are better than 1M datapoints that are an approximation (a rough sketch of this protocol is included after the cost estimate below). [1]
As an estimate, the Nano-GPT speedrun takes a little more than 2 minutes now, so you can train 1,001 of these in:
2.33 min × 1k / 60 ≈ 39 hrs on 8 H100s, which is maybe 4 B200s at ~$24/hr, so ~$1k.
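Back-of-envelope, using the rough figures above (speedrun time and GPU price are the quoted numbers, not something I've measured):

```python
# Back-of-envelope for the estimate above.
minutes_per_run = 2.33     # ~Nano-GPT speedrun time on 8 H100s
n_runs = 1001              # full-dataset run + 1k leave-one-out runs
gpu_cost_per_hr = 24.0     # assumed rate for ~4 B200s

hours = minutes_per_run * n_runs / 60      # ~38.9 hrs
total_cost = hours * gpu_cost_per_hr       # ~$930, call it ~$1k
print(f"{hours:.1f} hrs, ~${total_cost:.0f}")
```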
And that's getting a 124M-parameter LLM trained on 730M tokens up to GPT-2 level. Y'all's quantitative setting for Fig 4 was a 2M-parameter ResNet on CIFAR-10 with 5k images, which would be much cheaper to do (although the GPT-2 run has been heavily optimized, so you could also just do the speedrun setup on less data).
LDS was shown to be very noisy, but a colleague mentioned this could be because 5k images is a very small amount of data. I guess another way to validate LDS is to run the expensive full-retraining comparison on just a few datapoints.
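For concreteness, here's a rough sketch of the leave-one-out benchmark I have in mind (train_model, eval_metric, and attribution_scores are hypothetical placeholders, and dataset is assumed to be a list of training examples):

```python
# Rough sketch of the leave-one-out ground truth. `train_model`, `eval_metric`,
# and `attribution_scores` are hypothetical placeholders.
import numpy as np
from scipy.stats import spearmanr

def leave_one_out_effects(train_model, eval_metric, dataset, n_points=1000, seed=0):
    """Ground-truth effect of each of the first `n_points` datapoints:
    retrain with that point removed and measure the change in the metric
    relative to training on the full dataset (same seed throughout)."""
    base = eval_metric(train_model(dataset, seed=seed))
    effects = []
    for i in range(n_points):
        held_out = dataset[:i] + dataset[i + 1:]   # drop one datapoint
        effects.append(base - eval_metric(train_model(held_out, seed=seed)))
    return np.array(effects)

# A method is then judged by how well its scores rank the true effects:
# true_effects = leave_one_out_effects(train_model, eval_metric, dataset)
# corr, _ = spearmanr(attribution_scores[:len(true_effects)], true_effects)
```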
Confusion on LDS Hyperparameter Sweep Meaning
Y'all show in Fig 4 that there are large error bars across seeds for the different methods. This ends up being a property of LDS's noisiness, as y'all show in Figures 7-8 (where BIF & EK-FAC are highly correlated). This means that, even using noisy LDS, you don't need to re-run 5 times when a new method is much better than previous ones; you only need that when it's narrowly better.
What I'm confused about is why you retrained on 100 different resamplings of the data at each percentage. Is this just because LDS is noisy, so you're doing the thing where randomly sampling 100 datapoints 500 times gives you a good approximation of the causal effect of each individual datapoint (or is that just what LDS actually is)? Was there high variance in the relative difference between methods across the 100 retrained models?
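To make that question concrete, here's my current (possibly wrong) understanding of a single LDS computation, with placeholder names (retrain_and_measure and scores are hypothetical):

```python
# My current (possibly wrong) understanding of LDS for one target example.
# `retrain_and_measure` and `scores` are hypothetical placeholders.
import numpy as np
from scipy.stats import spearmanr

def lds_for_target(retrain_and_measure, scores, dataset_size,
                   subsample_frac=0.5, n_subsets=100, seed=0):
    """Correlation, across random training subsets, between the model's actual
    output on the target when retrained on each subset and the linear
    datamodel prediction (sum of the target's scores over the subset)."""
    rng = np.random.default_rng(seed)
    k = int(subsample_frac * dataset_size)
    actual, predicted = [], []
    for _ in range(n_subsets):
        subset = rng.choice(dataset_size, size=k, replace=False)
        actual.append(retrain_and_measure(subset))   # expensive retraining
        predicted.append(scores[subset].sum())       # datamodel prediction
    corr, _ = spearmanr(actual, predicted)
    return corr
```

If this is roughly right, the 100 retrained models per percentage are the subsets needed to compute the correlation at all, rather than an extra variance-reduction step; but corrections welcome.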
Other Experiments
Just wild speculation, but there may be other targets for data attribution besides predicting the model's output. When a model "groks" something, some datapoints were more important for that happening than others, and those should show up in an ideal data attribution method.
Similarly with different structures forming in the dataset (which y'all's other paper shows, AFAIK).
[Note: there's a decent chance I've terribly misunderstood y'all's technique or misread the technical details, so corrections are appreciated]
It initially seemed confusing how to evaluate this, but I think we need to look at the variance over the distribution of datapoints. If BIF is consistently more accurate than EK-FAC over e.g. 100 randomly sampled datapoints, then that's a good sign for BIF; however, if there's a high level of variance, then we'd need more data to differentiate between the two. I do think higher-quality data attribution methods would have higher signal, so you'd need less data. For example, I predict that BIF does better than TRAK on ~all datapoints (but this is an empirical question).
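A rough version of that comparison (bif_errors and ekfac_errors are hypothetical: each method's per-datapoint error against some expensive ground truth, evaluated on the same sampled datapoints):

```python
# Rough version of the per-datapoint comparison above. `bif_errors` and
# `ekfac_errors` are hypothetical per-datapoint errors against ground truth.
import numpy as np

def paired_comparison(err_a, err_b, n_boot=10_000, seed=0):
    """How often method A beats method B on the same datapoints, plus a
    bootstrap interval on the mean error difference."""
    diffs = np.asarray(err_b) - np.asarray(err_a)   # > 0 where A is better
    rng = np.random.default_rng(seed)
    boot_means = [rng.choice(diffs, size=len(diffs), replace=True).mean()
                  for _ in range(n_boot)]
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    return (diffs > 0).mean(), (lo, hi)

# win_rate, ci = paired_comparison(bif_errors, ekfac_errors)
# A win rate near 1 with a CI excluding 0 is the "consistently better" case;
# a wide CI straddling 0 means we'd need more datapoints to tell them apart.
```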