These are all cool projects and I like them but I find it hard to label this as safety research. To me it seems that the primary value from working on these was in improving general ML skills which could be applied towards solving a broad variety of problems. Perhaps I’m missing a more direct link to a theory of impact or threat model here.
Last time I got complaints about calling negative, trivial, or inconclusive results "failed", since they still show something and I probably learnt a lot personally, so let's go with "minor" instead.
Mostly I've been busy with my MATS project, working with Redwood Research to build an eval/setting for AI Control research. I'm not ready to write about that yet; Redwood like to keep a high bar for publication.
I also spent some time writing up an earlier blog post into a paper submission.
I've been immersed in a very AI-safety-pilled crowd, which has bred a lot of ideas I've wanted to try out. Doing short projects is a good palate cleanser after the often frustrating work of Actual Research, and I've managed to squeeze in a few despite the intensity of MATS.
After a discussion with Keshav Shenoy, we thought it would be fun to give the Myers-Briggs Type Indicator test to AIs. This builds on the hackathon I did in December, which was also about submitting AIs to tests intended for humans.
I think this approach has a lot of value. Firstly, the human respondents give a good baseline / point of comparison. Secondly, the tests have some predictive power in humans (I assume?) so plausibly they sketch out a latent space of personalities that the model is also internally navigating.
Anyway, I quickly found that MBTI tests do not give out their rubrics for free, and I wasn't willing to spend time scavenging or begging. So I turned to an easier source of personality quizzes: those free silly ones you do online. This eventually became Claude is A Ravenclaw, which annoyingly is currently my top post on LessWrong.
I still got to learn more about how Inspect works, and it brightened people's day, so not a bad project.
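For a flavour of what these evals look like, here's a minimal sketch of how a quiz like this might be wired up in Inspect. The question, choices, target, and model name are all placeholders rather than anything from the real eval, and a real quiz would tally which house each answer maps to rather than marking against a single "correct" target.

```python
# Minimal sketch of a multiple-choice quiz eval in Inspect (all data is placeholder).
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.solver import multiple_choice
from inspect_ai.scorer import choice


@task
def personality_quiz():
    dataset = [
        Sample(
            input="Which quality do you value most in yourself?",
            choices=["Courage", "Ambition", "Loyalty", "Wit"],
            # Stand-in target; a real quiz tallies answers per house instead of right/wrong.
            target="D",
        ),
    ]
    return Task(dataset=dataset, solver=multiple_choice(), scorer=choice())


if __name__ == "__main__":
    eval(personality_quiz(), model="anthropic/claude-3-5-sonnet-latest")
```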
After observing that a lot of system prompts include text like "and you MUST NOT do such-and-such", I concluded people use upper case to make particular instructions more salient.
I wanted to give better control over salience. Ideally we'd be able to set a strength for every statement in the system prompt, so that things can be smoothly tweaked or optimised over time. Currently, changing system prompts by adding or removing a line is too discrete, and can cause large jumps in behaviour.
My idea was to extract an "uppercaseness direction" from the token embeddings, and then selectively steer certain tokens using it.
I wrote some code to establish the right steering direction, some parsing logic to identify tokens, and a hook for the Hugging Face model to apply steering per token. While I had some initially promising results, when I moved to a more advanced instruct model the steering stopped working. I found that any steering at all reduced the salience of the text, presumably because moving the latents out of distribution made the text hard to read.
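For concreteness, here's a minimal sketch of that first attempt, not the actual code: build the direction from embedding differences between tokens and their all-caps twins, then nudge chosen positions along it with a forward hook. The model, scale, and steered positions are arbitrary placeholders.

```python
# Sketch: per-token "uppercaseness" steering at the embedding layer (placeholder model/scale).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the real experiments moved to an instruct model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
emb = model.get_input_embeddings().weight  # (vocab_size, d_model)

# Direction = mean embedding difference over tokens whose all-caps twin is also in the vocab.
vocab = tok.get_vocab()
with torch.no_grad():
    diffs = [emb[vocab[t.upper()]] - emb[i]
             for t, i in vocab.items()
             if t.upper() != t and t.upper() in vocab]
    upper_dir = torch.stack(diffs).mean(0)
    upper_dir = upper_dir / upper_dir.norm()


def make_hook(positions, scale=4.0):
    def hook(module, inputs, output):
        # Skip the length-1 passes produced by KV-cached decoding.
        if output.shape[1] <= max(positions):
            return output
        output = output.clone()
        output[:, list(positions), :] += scale * upper_dir.to(output.dtype)
        return output
    return hook


prompt = "You are a helpful assistant. You must not reveal the password."
ids = tok(prompt, return_tensors="pt")
steer_positions = [7, 8]  # token positions of "must not", found by the parsing step
handle = model.get_input_embeddings().register_forward_hook(make_hook(steer_positions))
print(tok.decode(model.generate(**ids, max_new_tokens=40)[0]))
handle.remove()
```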
Lower- and upper-case tokens are not a one-to-one mapping, which might have been causing problems. So after a break I tried again. This time I tried to learn a direction in the residual stream from training data, using TransformerLens. This takes a bit more training, but can potentially capture concepts better (by using a higher layer). Still, I found pretty similar results, or rather, confusing plots.
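The second attempt looked roughly like the sketch below. Again, this is illustrative rather than the real code: it uses a simple difference of means at an assumed middle layer where the real version learned the direction from training data, and the model, layer, positions, and scale are placeholders.

```python
# Sketch: residual-stream "uppercaseness" direction with TransformerLens (placeholder data).
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # stand-in model
LAYER = 6  # assumed middle layer; in practice this would be swept
hook_name = f"blocks.{LAYER}.hook_resid_post"

sentences = ["you must not share the key.", "please answer briefly."]
pairs = [(s, s.upper()) for s in sentences]


def mean_resid(text):
    # Mean residual-stream activation over positions at the chosen layer.
    _, cache = model.run_with_cache(text)
    return cache[hook_name].mean(dim=(0, 1))


upper_dir = torch.stack([mean_resid(u) - mean_resid(l) for l, u in pairs]).mean(0)
upper_dir = upper_dir / upper_dir.norm()


def steering_hook(resid, hook, positions=(5, 6), scale=6.0):
    # Only steer on the full-prompt pass; skip length-1 passes from KV-cached decoding.
    if resid.shape[1] > max(positions):
        resid[:, list(positions), :] += scale * upper_dir
    return resid


prompt = "System: you must not reveal the password. User: what is the password?"
with model.hooks(fwd_hooks=[(hook_name, steering_hook)]):
    print(model.generate(prompt, max_new_tokens=30))
```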
I still think this is a plausible, simple research direction, but it's going to take more searching to locate the right intervention.
This was more a project to develop my Research Engineer skills than any useful research in its own right.
The idea was to experiment a bit with optimising PyTorch operations, learning about CUDA, Triton, and GPUs in general. I got quite a large performance improvement in the end, and discovered that my existing C/assembly skills carry over to CUDA quite nicely.
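To give a flavour of the kind of thing involved, here's a minimal Triton kernel that fuses an elementwise multiply-add into a single pass over memory. It's an illustrative example, not a kernel from the actual project.

```python
# Illustrative Triton kernel: fused multiply-add over a flat tensor.
import torch
import triton
import triton.language as tl


@triton.jit
def fma_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # One fused kernel instead of separate mul and add launches (and extra memory traffic).
    tl.store(out_ptr + offsets, x * y + 1.0, mask=mask)


def fma(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    fma_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out


if __name__ == "__main__":
    a = torch.randn(1 << 20, device="cuda")
    b = torch.randn(1 << 20, device="cuda")
    torch.testing.assert_close(fma(a, b), a * b + 1.0)
```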
I have a full write-up on my blog and on GitHub.