PhD student at UCL DARK doing RL, OOD Robustness and safety. Interested in self improvement.
A quick technical question: In the comparison to fine-tuning results in Section 6 where you stack CAA with fine-tuning, do you find a new steering vector after each fine-tune, or are you using the same steering vector for all fine-tuned models? My guess is you're doing the former as it's likely to be more performant, but I'd be interested to see what happens if you try to do the latter.
I think a point of no return exists if you only use small LRs. I think if you can use any LR (or any LR schedule) then you can definitely jump out of the loss basin. You could imagine just choosing a really large LR to basically resent to a random init and then starting again.
I do think that if you want to utilise the pretrained model effectively, you likely want to stay in the same loss basin during fine-tuning.
To check I understand this, is another way of saying this that in the scaling experiments, there's effectively a confounding variable which is model performance on the task:
If you had a way of measuring the faithfulness conditioned on a given performance for a given model scale then you could measure how scaling up changes faithfulness. Maybe for a given model size you can plot performance vs faithfulness (with each dataset being a point on this plot), measure the correlation for that plot and then use that as a performance-conditioned faithfulness metric? Or there's likely some causally correct way of measuring the correlation between model scale and faithfulness while removing the confounder of model performance.
If you train on infinite data, I assume you'd not see a delay between training and testing, but you'd expect a non-monotonic accuracy curve that looks kind of like the test accuracy curve in the finite-data regime? So I assume infinite data is also cheating?
I've been using "delayed generalisation", which I think is more precise than "grokking", places the emphasis on the delay rather the speed of the transition, and is a short phrase.
I think the hyperlink for "conv nets without residual streams" is wrong? It's https://www.westernunion.com/web/global-service/track-transfer for me
This feels kind of like a semantic disagreement to me. To ground it, it's probably worth considering whether further research on the CCS-style work I posted would also be useful for self-driving cards (or other applications). I think that would depend on whether the work improves the robustness of the contrastive probing regardless of what is being probed for (which would be generically useful), or whether it improves the probing specifically for truthfulness in systems that have a conception of truthfulness , possibly by improving the constraints or adding additional constraints (less useful for other systems). I think both would be good, but I'm uncertain which would be more useful to pursue if one is motivated by reducing x-risk from misalignment.
I think that "don't kill humans" can't chain into itself because there's not a real reason for its action-bids to systematically lead to future scenarios where it again influences logits and gets further reinforced, whereas "drink juice" does have this property.
I'm trying to understand why the juice shard has this propety. Which of these (if any) are the the explanation for this:
The first seems at least plausibly to also to apply to "avoid moldy food", if it requires multiple steps of planning to avoid moldy food (throwing out moldy food, buying fresh ingredients and then cooking them, etc.)
The second does seem to be more specific to juice than mold, but it seems to me that's because getting juice is rare, and is something we can better and better at, whereas avoiding moldy food is something that's fairly easy to learn, and past that there's not much reinforcement to happen. If that's the case, then I kind of see that as being covered by the rare-states explanation in my previous comment, or maybe an extension of that to "rare states and skills in which improvement leads to more reward".
Having just read tailcalled comment, I think that is in some sense another of phasing what I was trying to say, where rare (but not too rare) states are likely to mean that policy-caused variance is high on those decisions. Probably policy-caused variance is more fundamental/closer as an explanation to what's actually happening in the learning process, but maybe states of certain rarity which are high-reward/reinforcement is one possibly environmental feature that produces policy-caused variance.
So I struggle to think of an example of a tool that is good for finding insidious misaligning bugs but not others.
One example: A tool that is designed to detect whether a model is being truthful (correctly representing what it knows about the world to the user), perhaps based on specific properties of truthfulness (for example see https://www.alignmentforum.org/posts/L4anhrxjv8j2yRKKp/how-discovering-latent-knowledge-in-language-models-without) doesn't seem like it would be useful for improving the reliability of self-driving cars, as likely self-driving cars aren't misaligned in the sense that they could drive perfectly safely but choose not to, but rather are just unable to drive perfectly safely because some of their internal (learned) systems aren't sufficiently robust.
How would you imagine doing this? I understand your hypothesis to be "If a model generalises as if it's a mesa-optimiser, then it's better-described as having simplicity bias". Are you imagining training systems that are mesa-optimisers (perhaps explicitly using some kind of model-based RL/inference-time planning and search/MCTS), and then trying to see if they tend to learn simple cross-episode inner goals which would be implied by a stronger implicity bias?