Aspiring AI safety researcher. Currently doing my PhD at Fraunhofer HHI in Berlin, focusing on LLM interpretability. Interested in the internal structure underlying safety-relevant behaviors in LLMs: prompt injections, jailbreaks, deception.