I'm a research scientist at the UK AI Security Institute (AISI), working on white box control, sandbagging, low-incrimination control, training-based mitigations, and model organisms.
Previously: Working on lie-detector probes and black box monitors, and training sandbagging model organisms in order to stress-test them.
Before this I was interning at the Center for Human-Compatible Artificial Intelligence under Erik Jenner. We were developing mechanistic anomaly detection techniques to automatically flag jailbreaks and backdoors at runtime, by detecting unusual patterns of activations. We also focused on fine tuning backdoored LLMs which shed their harmfulness training in various trigger circumstances, in order to test these anomaly detection methods.
See my post on graphical tensor notation for interpretability. I also attended MATS 5.0 under Lee Sharkey and Dan Braun (see our paper: Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning), attended the Machine Learning for Alignment Bootcamp in Berkeley in 2022, did a machine learning/ neuroscience internship in 2020/2021, and also wrote a post exploring the potential counterfactual impact of AI safety work.
I’ve also recently finished my PhD thesis at the University of Queensland, Australia, under Ian McCulloch. I’ve been working on new "tensor network" algorithms, which can be used to simulate entangled quantum materials, quantum computers, or to perform machine learning. I've also proposed a new definition of wavefunction branches using quantum circuit complexity.
My website: https://sites.google.com/view/jordantensor/
Contact me: jordantensor [at] gmail [dot] com Also see my CV, LinkedIn, or Twitter.
See the methods section. They were generated with Gemini 2.5 Pro, with iterative feedback from Claude Sonnet 4. I wasn't involved in this - I just used transcripts from the upcoming "STRIDE" pipeline by Rich Barton Cooper et al.
Hypotheses like "this might be a prompt injection attack" were raised by Claude 4.5 Opus, which is kinda similar to noticing that something prefill-like is going on. I didn't see an explicit mention of prefill manipulation, but they could easily be there without me seeing them. I didn't run scanners for verbalised pre-fill awareness, and I only looked through a small fraction of the flagged verbalised eval awareness transcripts, so it's hard to say.
Not sure! I'm not aware of anything on this topic, maybe someone else knows something relevant.
The closest thing that comes to mind is Emergent Introspective Awareness in Large Language Models, but that's not about jailbreaking. Stuff like LLM Evaluators Recognize and Favor Their Own Generations is more like a stylistic preference thing than a prefill awareness thing too IMO, and that's not related to jailbreaks either.
ELK was framed around looking for worst-case solutions. IMO it's also good to look for approaches which are useful on the margin, or might be useful at controlling or aligning roughly human-level AIs (which can then hopefully be leveraged to do more work).
If you truly have a trusted process for creating honorable AIs, why not keep churning out and discarding different honorable AIs until one says "I promise I'm aligned" or whatever?
I suppose you're not assuming our ability to make AIs honorable like this will be robust to selection pressure?
60% of the non-thinking examples had to be removed as they accidentally included reasoning tokens even when instructed not to.
Do you also run CoT monitors for verbalised eval awareness, to sanity check that the model isn't reasoning about eval awareness anywhere else, e.g. outside of <think> blocks or in comments?
still a 404 on that link - it has a full stop at the end which should be removed
Agreed, and I'm also excited about SDF + finetuning MOs for this. Though we also already have two prompt-distilled MOs (Tarun and Willow), which I think probably conceptualise themselves as sandbagging. These weren't a part of the final game, but I'd be excited to see some targeted investigations on them! Willow should be harder to detect that Tarun, as it went through an additional RL training phase. I should get around to uploading it to the huggingface collection soon. It's now available on huggingface and in the github repo.
Here's my rough model of what's going on, in terms of gradient pressure:
Suppose the training data consists of a system prompt instructing the model to take bad actions, followed by demonstrations of good or bad actions.
So I expect a failure case when training on too much "Bad system prompt → Good action" data will be to cause the model to ignore those system prompts, which then makes "Bad system prompt → Bad action" inoculation training less effective.
This should be avoidable if the prompts are sufficiently well targeted at narrowly increasing the likelihood of the bad actions, without decreasing the likelihood of the good actions in the training data (e.g. the narrowly targeted backdoor prompts), or perhaps if there is another component of the training process which ensures that the model does not learn to ignore the bad system prompts.
Interesting potential follow-up work:
You could also make a similar eval by inserting misaligned actions into real transcripts, instead of using entirely synthetic transcripts.