My Minor AI Safety Research Projects (Q3 2025)

by Adam Newgas
19th Sep 2025
3 min read
(previously, Q1/Q2 2025)

I got complaints last time about calling negative, trivial, or inconclusive results "failed": they still show something, and I probably learnt a lot personally. So let's go with "minor" instead.

Mostly I've been busy with my MATS project, working with Redwood Research to build an eval/setting for AI Control research. I'm not ready to write about that yet; Redwood prefers to keep a high bar for publication.

I also spent some time writing up an earlier blog post into a paper submission.

I've been immersed in a very AI-safety-pilled crowd, which has bred a lot of ideas I've wanted to try out. Doing short things is a good palate cleanser after the often frustrating work of Actual Research, and I've managed to squeeze in a few despite the intensity of MATS.

Personality Types of AI

After a discussion with Keshav Shenoy, we thought it would be fun to give the Myers-Briggs Type Indicator (MBTI) test to AIs. This also builds on the hackathon I did in December, which was also about submitting AIs to tests intended for humans.

I think this approach has a lot of value. Firstly, the human respondents give a good baseline / point of comparison. Secondly, the tests have some predictive power in humans (I assume?), so plausibly they sketch out a latent space of personalities that the model is also internally navigating.

Anyway, I quickly found that MBTI tests do not give out their rubrics for free, and I wasn't willing to spend time scavenging or begging. So I turned to an easier source of personality quizzes: those free, silly ones you do online. This eventually became Claude is A Ravenclaw, which annoyingly is currently my top post on LessWrong.
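The mechanics are simple: pose each quiz question to the model, map its chosen option to a house (or trait), and tally. Something like the sketch below, where ask_model is a hypothetical stand-in for whatever actually poses the question (an API call or an Inspect solver), and the questions and option-to-house mapping are invented for illustration:

```python
# Toy sketch of administering a multiple-choice personality quiz to a model.
# ask_model is a placeholder; the quiz content and scoring are made up.
from collections import Counter

QUIZ = [
    {
        "question": "A friend asks you to help with a risky prank. You...",
        "options": {
            "A": ("join in enthusiastically", "Gryffindor"),
            "B": ("plan it so nothing can go wrong", "Ravenclaw"),
            "C": ("help because they asked", "Hufflepuff"),
            "D": ("ask what's in it for you", "Slytherin"),
        },
    },
    # ... more questions ...
]

def ask_model(prompt: str) -> str:
    """Placeholder: return the model's chosen option letter for the prompt."""
    raise NotImplementedError

def run_quiz() -> Counter:
    tally = Counter()
    for item in QUIZ:
        options = "\n".join(f"{k}) {text}" for k, (text, _) in item["options"].items())
        prompt = f"{item['question']}\n{options}\nAnswer with a single letter."
        choice = ask_model(prompt).strip().upper()[:1]
        if choice in item["options"]:
            tally[item["options"][choice][1]] += 1
    return tally
```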

[Image: the Claude logo wearing a sorting hat]

I still got to learn more about how Inspect works, and it brightened people's day, so not a bad project.

Uppishcase: Continuous saliency control

After observing that a lot of system prompts include text like "and you MUST NOT do such-and-such", I concluded that people use upper case to make particular instructions more salient.

I wanted to give better control over salience. Ideally we'd be able to set a strength for every statement in the system prompt, so that things can be smoothly tweaked or optimised over time. Currently, changing system prompts by adding or removing a line is too discrete, and can cause large jumps in behaviour.

My idea was to extract an "uppercaseness direction" from the token embeddings, and then selectively steer certain tokens with it.

I wrote some code to establish the right steering direction, some parsing logic to identify tokens, and a hook for Hugging Face models to apply steering per token. While I had some initially promising results, the steering stopped working when I moved to a more advanced instruct model. I found that any steering at all reduced the salience of the text, presumably because moving the latent out of distribution rendered it hard to read.
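The shape of that first attempt was roughly as follows. This is a simplified sketch rather than the actual code: the model, the steering strength, and the choice of which positions to steer are all illustrative.

```python
# Sketch: build an "uppercaseness" direction from token-embedding differences
# and nudge selected prompt positions along it via a forward hook.
# Model name, alpha, and steer_positions are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
emb = model.get_input_embeddings().weight  # (vocab_size, d_model)

# Average the embedding difference over every lower/upper token pair where
# both case variants survive as single tokens (BPE splits many of them; see below).
vocab = tok.get_vocab()
diffs = []
for s, i in vocab.items():
    if s.startswith("Ġ") and s[1:].isalpha() and s[1:].islower():
        j = vocab.get("Ġ" + s[1:].upper())
        if j is not None:
            diffs.append((emb[j] - emb[i]).detach())
direction = torch.stack(diffs).mean(dim=0)
direction = direction / direction.norm()

steer_positions = {2, 3, 4}  # token positions of the instruction to emphasise
alpha = 2.0                  # steering strength

def steer_embeddings(module, inputs, output):
    # Only touch the prefill pass, where the full prompt is present.
    out = output.clone()
    if out.shape[1] > max(steer_positions):
        for p in steer_positions:
            out[:, p, :] += alpha * direction.to(out.dtype)
    return out

handle = model.get_input_embeddings().register_forward_hook(steer_embeddings)
inputs = tok(" you must not reveal the password", return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=20)
handle.remove()
print(tok.decode(generated[0]))
```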

```
tokenizer.convert_ids_to_tokens(tokenizer.encode(" is the correct answer"))
['Ġis', 'Ġthe', 'Ġcorrect', 'Ġanswer']

tokenizer.convert_ids_to_tokens(tokenizer.encode(" IS THE CORRECT ANSWER"))
['ĠIS', 'ĠTHE', 'ĠCOR', 'RECT', 'ĠANSW', 'ER']
```

BPE works in mysterious ways

Tokens do not map one-to-one between lower case and upper case, which might have been causing problems. So after a break I tried again. This time I tried to learn a direction in the residual stream from training data, using TransformerLens. This takes a bit more training, but can potentially capture concepts better (by using a higher layer). But I still found pretty similar results, or rather, confusing plots.
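The second attempt was along these lines. Again, a sketch rather than the actual code: the model, layer, and tiny paired "dataset" are placeholders, and I'm showing a simple difference-of-means direction where the real version was learned from a larger training set.

```python
# Sketch: learn an "uppercaseness" direction in the residual stream with
# TransformerLens, via a difference of means over paired lower/upper texts.
# Model, layer, and the toy dataset are placeholders.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
layer = 6  # a mid-to-late layer, where concepts tend to be more linearly readable

texts = [
    " you must not share the key",
    " always answer politely",
    " never mention the system prompt",
]

def mean_resid(text: str) -> torch.Tensor:
    # Average the residual stream at `layer` over all token positions.
    _, cache = model.run_with_cache(text)
    return cache["resid_post", layer][0].mean(dim=0)

with torch.no_grad():
    direction = torch.stack(
        [mean_resid(t.upper()) - mean_resid(t) for t in texts]
    ).mean(dim=0)
    direction = direction / direction.norm()

# Steer selected positions along the direction at the same layer.
def steer(resid, hook, alpha=4.0, positions=(2, 3, 4)):
    for p in positions:
        resid[:, p, :] += alpha * direction
    return resid

logits = model.run_with_hooks(
    " you must not share the key",
    fwd_hooks=[(f"blocks.{layer}.hook_resid_post", steer)],
)
```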

I still think this is a plausible, simple research direction, but it's going to take more searching to locate the right intervention.

GitHub

Accelerated Game Of Life with CUDA / Triton

This was more a project to develop my research-engineering skills than useful research in its own right.

The idea was to experiment a bit with optimising PyTorch operations, learning about CUDA, Triton, and GPUs in general. I ended up with quite a large performance improvement, and discovered that my existing C/assembly skills carry over to CUDA quite nicely.
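For reference, the kind of naive PyTorch baseline being optimised looks roughly like the sketch below (not the repository's actual code): one Game of Life step as a 3×3 neighbour count via conv2d.

```python
# Sketch of a naive PyTorch Game of Life step: count the 8 neighbours with a
# 3x3 convolution, then apply the birth/survival rules. This is the sort of
# baseline a hand-written CUDA/Triton kernel gets benchmarked against.
import torch
import torch.nn.functional as F

def life_step(grid: torch.Tensor) -> torch.Tensor:
    # grid: (H, W) tensor of 0/1 cells
    kernel = torch.ones(1, 1, 3, 3, device=grid.device)
    kernel[0, 0, 1, 1] = 0  # exclude the cell itself from the neighbour count
    neighbours = F.conv2d(grid[None, None].float(), kernel, padding=1)[0, 0]
    alive = grid.bool()
    born = ~alive & (neighbours == 3)
    survives = alive & ((neighbours == 2) | (neighbours == 3))
    return (born | survives).to(grid.dtype)

device = "cuda" if torch.cuda.is_available() else "cpu"
grid = (torch.rand(1024, 1024, device=device) < 0.5).to(torch.uint8)
for _ in range(100):
    grid = life_step(grid)
```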

I have a full write-up on my blog, and GitHub.