These are all cool projects and I like them but I find it hard to label this as safety research. To me it seems that the primary value from working on these was in improving general ML skills which could be applied towards solving a broad variety of problems. Perhaps I’m missing a more direct link to a theory of impact or threat model here.
Last time I got complaints about calling negative, trivial, or inconclusive results "failed", since they still show something and I probably learnt a lot personally, so let's go with "minor" instead.
Mostly I've been busy with my MATS project, working with Redwood Research to build an eval/setting for AI Control research. I'm not ready to write about that yet; Redwood like to keep a high bar for publication.
I also spent some time writing up an earlier blog post into a paper submission.
I've been immersed in a very AI-safety-pilled crowd, which has bred a lot of ideas I've wanted to try out. Doing short projects is a good palate cleanser after the often frustrating work of Actual Research, and I've managed to squeeze in a few despite the intensity of MATS.
After a discussion with Keshav Shenoy, we thought it would be fun to give the Myers-Briggs Type Indicator test to AIs. This builds on the hackathon I did in December, which was also about submitting AIs to tests intended for humans.
I think this approach has a lot of value. Firstly, the human respondents give a good baseline / point of comparison. Secondly, the tests have some predictive power in humans (I assume?) so plausibly they sketch out a latent space of personalities that the model is also internally navigating.
Anyway, I quickly found that MBTI tests do not give out their rubrics for free, and I wasn't willing to spend time scavenging or begging. So I turned to an easier source of personality quizzes: those free silly ones you do online. This eventually became Claude is A Ravenclaw, which annoyingly is currently my top post on LessWrong.
I still got to learn more about how Inspect works, and it brightened people's day, so not a bad project.
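For a flavour of what these evals look like, here's a minimal sketch of how a quiz like this might be wired up in Inspect. The question, choices, target, and model name are all placeholders rather than anything from the real eval, and a real quiz would tally which house each answer maps to rather than marking against a single "correct" target.

```python
# Minimal sketch of a multiple-choice quiz eval in Inspect (all data is placeholder).
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.solver import multiple_choice
from inspect_ai.scorer import choice


@task
def personality_quiz():
    dataset = [
        Sample(
            input="Which quality do you value most in yourself?",
            choices=["Courage", "Ambition", "Loyalty", "Wit"],
            # Stand-in target; a real quiz tallies answers per house instead of right/wrong.
            target="D",
        ),
    ]
    return Task(dataset=dataset, solver=multiple_choice(), scorer=choice())


if __name__ == "__main__":
    eval(personality_quiz(), model="anthropic/claude-3-5-sonnet-latest")
```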
After observing that a lot of system prompts include text like "and you MUST NOT do such-and-such", I concluded people use upper case to make particular instructions more salient.
I wanted to give better control over salience. Ideally we'd be able to set a strength for every statement in the system prompt, so that things can be smoothly tweaked or optimised over time. Currently, changing system prompts by adding or removing a line is too discrete, and can cause large jumps in behaviour.
My idea was to extract an "uppercaseness direction" from the token embeddings, and then selectively steer certain tokens using it.
I wrote some code to establish the right steering direction, some parsing logic to identify tokens, and a hook for the Hugging Face model to apply steering per token. While I had some initially promising results, when I moved to a more advanced instruct model the steering stopped working. I found that any steering at all reduced the salience of the text, presumably because moving the latents out of distribution made the text hard to read.
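For concreteness, here's a minimal sketch of that first attempt, not the actual code: build the direction from embedding differences between tokens and their all-caps twins, then nudge chosen positions along it with a forward hook. The model, scale, and steered positions are arbitrary placeholders.

```python
# Sketch: per-token "uppercaseness" steering at the embedding layer (placeholder model/scale).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the real experiments moved to an instruct model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
emb = model.get_input_embeddings().weight  # (vocab_size, d_model)

# Direction = mean embedding difference over tokens whose all-caps twin is also in the vocab.
vocab = tok.get_vocab()
with torch.no_grad():
    diffs = [emb[vocab[t.upper()]] - emb[i]
             for t, i in vocab.items()
             if t.upper() != t and t.upper() in vocab]
    upper_dir = torch.stack(diffs).mean(0)
    upper_dir = upper_dir / upper_dir.norm()


def make_hook(positions, scale=4.0):
    def hook(module, inputs, output):
        # Skip the length-1 passes produced by KV-cached decoding.
        if output.shape[1] <= max(positions):
            return output
        output = output.clone()
        output[:, list(positions), :] += scale * upper_dir.to(output.dtype)
        return output
    return hook


prompt = "You are a helpful assistant. You must not reveal the password."
ids = tok(prompt, return_tensors="pt")
steer_positions = [7, 8]  # token positions of "must not", found by the parsing step
handle = model.get_input_embeddings().register_forward_hook(make_hook(steer_positions))
print(tok.decode(model.generate(**ids, max_new_tokens=40)[0]))
handle.remove()
```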
Lower- and upper-case tokens are not a one-to-one mapping, which might have been causing problems. So after a break I tried again. This time I tried to learn a direction in the residual stream from training data, using TransformerLens. This takes a bit more training, but can potentially capture concepts better (by using a higher layer). Still, I found pretty similar results, or rather, confusing plots.
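The second attempt looked roughly like the sketch below. Again, this is illustrative rather than the real code: it uses a simple difference of means at an assumed middle layer where the real version learned the direction from training data, and the model, layer, positions, and scale are placeholders.

```python
# Sketch: residual-stream "uppercaseness" direction with TransformerLens (placeholder data).
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # stand-in model
LAYER = 6  # assumed middle layer; in practice this would be swept
hook_name = f"blocks.{LAYER}.hook_resid_post"

sentences = ["you must not share the key.", "please answer briefly."]
pairs = [(s, s.upper()) for s in sentences]


def mean_resid(text):
    # Mean residual-stream activation over positions at the chosen layer.
    _, cache = model.run_with_cache(text)
    return cache[hook_name].mean(dim=(0, 1))


upper_dir = torch.stack([mean_resid(u) - mean_resid(l) for l, u in pairs]).mean(0)
upper_dir = upper_dir / upper_dir.norm()


def steering_hook(resid, hook, positions=(5, 6), scale=6.0):
    # Only steer on the full-prompt pass; skip length-1 passes from KV-cached decoding.
    if resid.shape[1] > max(positions):
        resid[:, list(positions), :] += scale * upper_dir
    return resid


prompt = "System: you must not reveal the password. User: what is the password?"
with model.hooks(fwd_hooks=[(hook_name, steering_hook)]):
    print(model.generate(prompt, max_new_tokens=30))
```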
I still think this is a plausible, simple research direction, but it's going to take more searching to locate the right intervention.
This was more a project to develop my Research Engineer skills than any useful research in its own right.
The idea was to experiment a bit with optimising PyTorch operations, learning about CUDA, Triton, and GPUs in general. I got quite a large performance improvement in the end, and discovered that my existing C/assembly skills carry over to CUDA quite nicely.
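To give a flavour of the kind of thing involved, here's a minimal Triton kernel that fuses an elementwise multiply-add into a single pass over memory. It's an illustrative example, not a kernel from the actual project.

```python
# Illustrative Triton kernel: fused multiply-add over a flat tensor.
import torch
import triton
import triton.language as tl


@triton.jit
def fma_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # One fused kernel instead of separate mul and add launches (and extra memory traffic).
    tl.store(out_ptr + offsets, x * y + 1.0, mask=mask)


def fma(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    fma_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out


if __name__ == "__main__":
    a = torch.randn(1 << 20, device="cuda")
    b = torch.randn(1 << 20, device="cuda")
    torch.testing.assert_close(fma(a, b), a * b + 1.0)
```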
I have a full write-up on my blog and on GitHub.