TLDR: SAEs can complement and enhance LLM-as-a-judge scalable oversight for uncovering hypotheses over large datasets of LLM outputs. Abstract: Large language models (LLMs) are increasingly trained in long-horizon, multi-agent environments, making it difficult to understand how behavior changes over training. We apply pretrained SAEs, alongside...
Do AIs feel anything? It's hard to tell, but interpretability can give us some clues. Using Anthropic's persona vectors codebase, we extracted 7 vectors from Qwen3-14B representing joy, love, sadness, surprise, disgust, fear, and anger. During inference, we remove the correlated directions between each emotion, project the activations from...
I was using Neuronpedia's steering feature and was curious: How much does it cost to run? How does one do all the networking and expose the endpoints to the internet with a fancy domain? The plan: 1. Make a project with a small open-weight model 2. Choose a GPU...
LLMs are so boring, corporate, and sane these days. What if we could control the emotions of LLMs to make them more interesting? The plan is: 1. Use Anthropic's persona vectors codebase to generate steering vectors for different emotions 2. Use EasySteer to serve a chat endpoint that exposes activations gathering...
For some reason, toy companies think it's a great idea to stuff GPT-4o into a box, into a bear, and into the hands of 4-year-olds. Recently there's been significant backlash against one of these companies after their product told kids how to find knives and participate in...
TLDR: Voice AIs aren't that much cheaper in the year 2025. My friend runs a voice agent startup in Canada for walk-in clinics. The AI takes calls and uses tools to book appointments in the EMR (electronic medical record) system. In theory, this helps the clinic hire less front desk...
I often read interpretability papers and I come away thinking “ok, but what’s the point? What problem does this help us solve?” So last winter, I organized a MATS/Pivotal stream to build examples of deceptive models (aka “model organisms”). The goal was to build a diverse ‘zoo’ of these model...