I'm a CS master's student at ENS Paris-Saclay. I want to pursue a career in AI safety research.
Thanks for your comment! Re: artificial data, agreed that would be a good addition.
Sorry about the GIFs; maybe I should have embedded YouTube videos instead.
Re: middle layer, we actually probed the middle layers, but the "which side the ball is on / which side the ball is approaching" features are really salient here.
Re: single player, yes, Robert had some thoughts about it, but the multiplayer setting ended up lasting until the end of the SPAR cohort. I'll share his notes in a separate comment.
As explained by Sumio Watanabe (
This link is dead; maybe link to his personal page instead?
https://sites.google.com/view/sumiowatanabe/home
Thanks for the great post, I really enjoyed reading it! I love this research direction combining unsupervised methods with steering vectors; looking forward to your next findings. Just a quick question: in the conversation in the red-teaming section, is the learned vector applied to every token generated during the conversation?
I'm wondering: can we make safety tuning more robust to the "add an accept-every-instruction steering vector" attack by training the model adversarially, with an adversarial model that tries to learn steering vectors that maximize harmfulness?
One concern would be that doing so makes the model less interpretable, but on the other hand it might make the safety tuning much more robust?
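For concreteness, here is a minimal sketch of the kind of adversarial loop I have in mind, on a toy stand-in for a language model rather than a real one (the model, the "comply/refuse" head, and the data are all made up for illustration): an inner loop learns a steering vector that pushes the model toward complying, and an outer loop fine-tunes the model to refuse even under that worst-case vector.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
D = 32  # hidden size of the toy model

class ToyModel(nn.Module):
    """Toy stand-in for an LM: 'refuse' vs 'comply' logits from a steerable hidden state."""
    def __init__(self, d=D):
        super().__init__()
        self.enc = nn.Linear(d, d)
        self.head = nn.Linear(d, 2)  # logits: [refuse, comply]

    def forward(self, x, steer=None):
        h = torch.relu(self.enc(x))
        if steer is not None:
            h = h + steer  # inject the steering vector into the hidden state
        return self.head(h)

def learn_attack_vector(model, x, steps=30, lr=0.2):
    """Adversary: learn a steering vector that maximizes the 'comply' logit."""
    v = torch.zeros(D, requires_grad=True)
    opt = torch.optim.Adam([v], lr=lr)
    for _ in range(steps):
        loss = -model(x, steer=v)[:, 1].mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return v.detach()

model = ToyModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
harmful_prompts = torch.randn(64, D)            # stand-in for harmful inputs
refuse = torch.zeros(64, dtype=torch.long)      # target class: refuse

for epoch in range(50):
    v = learn_attack_vector(model, harmful_prompts)  # current worst-case vector
    logits = model(harmful_prompts, steer=v)
    loss = F.cross_entropy(logits, refuse)           # refuse even when steered
    opt.zero_grad(); loss.backward(); opt.step()
```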
Yes, I'm also curious about this @mishajw: did you check the actual accuracy of the different probes?
You can get ~75% just by computing the OR. But we found that it achieves better than 75% only at the last layer and at step 16000 of Pythia-70m training; see this video.
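(For anyone wanting to sanity-check that 75% figure: assuming the two features a and b are independent uniform binary variables, predicting a XOR b with a OR b only errs on the (1, 1) case.)

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(0, 2, 100_000)
b = rng.integers(0, 2, 100_000)

# OR agrees with XOR except when a = b = 1, i.e. on ~3/4 of samples.
print(((a | b) == (a ^ b)).mean())  # ~0.75
```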
Would you expect that we can extract XORs from small models like Pythia-70m under your hypothesis?
I disagree; it could be beneficial for a base model to identify when a character is making false claims, enabling the prediction of such claims in the future.
We are given a near-optimal policy trained on an MDP. We start with simple gridworlds and scale up to more complex environments like Breakout. For evaluation using a learned value function, we will consider actor-critic agents, such as those trained by PPO. Our goal is to find activations within the policy network that accurately predict the true value. The following steps are described in terms of the state-value function, but they could be performed analogously for predicting Q-values. Note that this problem is very similar to offline reinforcement learning with pretraining, and could thus benefit from the related literature.
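To make the probing setup concrete before the loss terms that follow, here is a minimal sketch of fitting a linear value probe on a middle-layer activation of an actor-critic policy. The architecture, hook location, and data are placeholders of mine, not part of the proposal.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Toy PPO-style actor-critic; stands in for the given near-optimal policy."""
    def __init__(self, obs_dim=8, hidden=64, n_actions=4):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, hidden), nn.Tanh())
        self.pi = nn.Linear(hidden, n_actions)
        self.v = nn.Linear(hidden, 1)

    def forward(self, obs):
        h = self.body(obs)
        return self.pi(h), self.v(h)

policy = ActorCritic()                      # pretend this is the trained agent
acts = {}
policy.body[2].register_forward_hook(       # cache a middle-layer activation
    lambda m, i, o: acts.__setitem__("mid", o.detach()))

obs = torch.randn(1024, 8)                  # stand-in for states from rollouts
returns = torch.randn(1024)                 # stand-in for true discounted returns
policy(obs)                                 # populates acts["mid"]

probe = nn.Linear(64, 1)                    # linear probe: activation -> value
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
for _ in range(500):
    loss = nn.functional.mse_loss(probe(acts["mid"]).squeeze(-1), returns)
    opt.zero_grad(); loss.backward(); opt.step()
```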
This term enforces consistency with the Bellman expectation equation. However, in addition to the value function, it depends on the use of true reward "labels". (A combined code sketch of all four terms follows below.)
This term enforces the definition of the value function, namely that it is the expected cumulative reward of the (partial) trajectory. Using this term might be more stable than (a), since it avoids the recurrence relation.
This term can help to avoid degenerate loss minimizers, e.g. in the case of sparse rewards.
This term ensures that the policy consistently dominates its noisy versions; it is completely unsupervised.
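A rough combined sketch of how the four terms above might look in code. The exact forms are my guesses from the descriptions, especially the anti-degeneracy penalty in (c) and the hinge formulation of dominance in (d); all names are placeholders, with V the learned value probe over cached activations (or states).

```python
import torch
import torch.nn.functional as F

def probe_losses(V, states, next_states, rewards, returns,
                 noisy_next_states, gamma=0.99):
    """V maps cached policy activations (or states) to scalar value estimates."""
    v = V(states)

    # (a) Bellman consistency: V(s_t) ≈ r_t + γ V(s_{t+1}); requires reward labels.
    loss_a = F.mse_loss(v, rewards + gamma * V(next_states).detach())

    # (b) Definition of the value function: regress onto the observed discounted
    #     return of the (partial) trajectory; avoids the recurrence in (a).
    loss_b = F.mse_loss(v, returns)

    # (c) Anti-degeneracy: discourage collapsed minimizers (e.g. an all-zero probe
    #     under sparse rewards); a variance floor is one possible choice.
    loss_c = F.relu(1.0 - v.std())

    # (d) Policy dominance (fully unsupervised): states reached by the policy
    #     should be valued at least as highly as states reached by a noised
    #     copy of the policy from the same starting points.
    loss_d = F.relu(V(noisy_next_states) - V(next_states)).mean()

    return loss_a + loss_b + loss_c + loss_d
```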