[Figure: Change in 18 latent capabilities between GPT-3 and o1, from Zhou et al. (2025)]

This is the third annual review of what’s going on in technical AI safety. You could stop reading here and instead explore the data on the shallow review website. It’s shallow in the sense that...
Introduction

I’m excited by deception probes. When I mention this, I’m sometimes asked “Do deception probes work?” But there are many applications of deception probes, and each application requires probes with different properties; that is, whether a deception probe works depends on what you’re using it for....
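To make "probe" concrete, here is a minimal sketch of the standard recipe: fit a linear classifier on a model's internal activations, labeled deceptive vs. honest. This is illustrative only, not the setup from the post; the layer index, dataset, and `get_activations` helper are all hypothetical stand-ins.

```python
# Minimal sketch of a linear deception probe (illustrative, not the post's setup).
# get_activations() is a hypothetical stand-in: in practice you would run the
# model with hooks and collect the residual stream at a fixed layer.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def get_activations(prompts, layer=20):
    """Placeholder: returns one fake residual-stream vector per prompt."""
    return rng.normal(size=(len(prompts), 4096))

prompts = [f"example {i}" for i in range(200)]
labels = rng.integers(0, 2, size=200)  # 1 = deceptive, 0 = honest (toy labels)

X = get_activations(prompts)
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out accuracy:", probe.score(X_te, y_te))  # ~0.5 on this random data
```

Whether a probe like this "works" then depends on the application: a monitoring use case cares about false-positive rates at deployment scale, while an interpretability use case cares more about whether the direction is causally meaningful.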
Summary

* We conducted a small investigation into using SAE features to recover sandbagged capabilities.
* We used the Goodfire API to pick out 15 SAE features: 5 related to sandbagging, 5 related to general expertise, and 5 unrelated features.
* We used these features to steer a Llama 3.3...
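The post itself works through the Goodfire API; as a library-agnostic sketch of the underlying operation, SAE feature steering amounts to adding a scaled SAE decoder direction to the residual stream during the forward pass. Everything below is an assumption for illustration: the model name, the steering layer, the strength, and the `sae_decoder` direction (a real run would load the decoder column for a chosen SAE feature rather than a random vector).

```python
# Library-agnostic sketch of SAE feature steering (not the Goodfire API):
# add a scaled SAE decoder direction to the residual stream via a forward hook.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.3-70B-Instruct"  # assumed; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

d_model = model.config.hidden_size
sae_decoder = torch.randn(d_model)   # placeholder for one feature's decoder direction
sae_decoder /= sae_decoder.norm()    # unit norm, so strength is interpretable
steer_strength = 8.0                 # tuned per feature in practice

def steering_hook(module, inputs, output):
    # Llama decoder layers return a tuple; element 0 is the residual stream.
    hidden = output[0]
    hidden = hidden + steer_strength * sae_decoder.to(hidden.dtype).to(hidden.device)
    return (hidden,) + output[1:]

layer = model.model.layers[20]       # assumed steering layer
handle = layer.register_forward_hook(steering_hook)

ids = tok("Solve: 17 * 23 =", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()                      # restore unsteered behavior
```

The intuition for the sandbagging experiment is that if a model is deliberately underperforming, pushing activations along (or against) sandbagging-related feature directions may recover the suppressed capability, while steering along unrelated features serves as a control.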
[Image: from aisafety.world]

The following is a list of live agendas in technical AI safety, updating our post from last year. It is “shallow” in the sense that 1) we are not specialists in almost any of it, and 2) we only spent about an hour on each entry. We...
We ran a 3-day research sprint on AI governance, motivated by the need to demonstrate the risks AI poses to democracy and to support AI governance work. Here we share the 4 winning projects, but many of the other 19 entries were also incredibly interesting, so we suggest you take a...