Esben Kran

Results from the interpretability hackathon

We ran a mechanistic interpretability hackathon (original post) with 25 projects submitted by ~70 participants. Here we share the winning projects, but many of the others were also incredibly interesting. In summary:

* An algorithm to automatically make the activations of a neuron in a Transformer much more interpretable.
* Backup name mover heads from “Interpretability in the Wild” have backup heads, and all of these are robust to the ablation distribution.
* The specificity benchmark in the ROME and MEMIT memory editing papers does not represent specificity well. A simple modulation shows that factual association editing bleeds into related texts, representing "loud facts".
* Applying TCAV to an RL agent for a Connect Four game allows its neural activations to be compared to the provably best solution, as a pilot for comparing learned activations more generally to human-made solutions.

Thank you to Sabrina Zaki, Fazl Barez, Thomas Steinthal, Joe Hardie, Erin Robertson, Richard Annilo, Itay Yona, other jam site organizers and all the participants for making it all possible.

Investigating Neuron Behaviour via Dataset Example Pruning and Local Search

By Alex Foote

Abstract: This report presents methods for pruning and diversifying dataset examples that strongly activate neurons in a language model, to facilitate research into understanding the behaviour of these neurons. The pruning algorithm takes a dataset example that strongly activates a specific neuron and extracts the core sentence before iteratively removing words, to find the shortest substring that preserves a similar pattern and magnitude of neuron activation. This removes extraneous information, providing a much more concise input that is easier to reason about. The extracted substring, referred to as a Minimal Activating Example (MAE), is then used as a seed for local search in the input space. Using BERT, each word in the MAE is replaced by its most probable substitutes, and neuron activation is re-assess
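The pruning step described in the abstract above can be illustrated with a minimal sketch. This is not the author's implementation: it is a greedy word-deletion variant, and `prune_example`, `toy_activation`, and the `keep_ratio` threshold are hypothetical names and parameters introduced only for illustration. A real version would query a language model neuron rather than the toy function used here.

```python
from typing import Callable, List


def prune_example(
    words: List[str],
    activation_fn: Callable[[List[str]], float],
    keep_ratio: float = 0.8,  # assumed threshold, not taken from the report
) -> List[str]:
    """Greedily drop words while activation stays >= keep_ratio * original."""
    baseline = activation_fn(words)
    current = list(words)
    changed = True
    while changed and len(current) > 1:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if activation_fn(candidate) >= keep_ratio * baseline:
                current = candidate
                changed = True
                break  # restart the scan on the shorter example
    return current


def toy_activation(words: List[str]) -> float:
    """Hypothetical stand-in for a neuron: fires when 'river' precedes 'bank'."""
    if "river" in words and "bank" in words:
        if words.index("river") < words.index("bank"):
            return 1.0
    return 0.1 if "bank" in words else 0.0


example = "we walked along the river to the old bank yesterday".split()
pruned = prune_example(example, toy_activation)
print(pruned)
```

With the toy neuron, every word other than "river" and "bank" can be removed without lowering the activation, so the pruned example keeps only those two words. Swapping `toy_activation` for a function that runs a model and reads one neuron's activation would give a rough analogue of the minimal-example search, under the same greedy assumption.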

81 | Nov 17, 2022


Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities

This blog post was published by Jonathan Ng, Andrey Anurin, Connor Axiotes, and Esben Kran. Apart Research's newest paper, Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities (website), creates a novel cyber offense capability benchmark that engages with issues of legibility, coverage, and generalization in cyber offense benchmarks....

Nov 5, 2024 | 9

Can startups be impactful in AI safety?

With Lakera's strides in securing LLM APIs, Goodfire AI's path to scaling interpretability, and 20+ model evaluation startups, among much else, there is a rising number of technical startups attempting to secure the model ecosystem. Of course, they have varying levels of impact on superintelligence containment and security and even with...

Sep 13, 2024 | 15

Finding Deception in Language Models

This June, Apart Research and Apollo Research joined forces to host the Deception Detection Hackathon. The event brought together students, researchers, and engineers from around the world to tackle a pressing challenge in AI safety: preventing AI from deceiving humans and overseers. The hackathon took place both online and in multiple physical...

Aug 20, 2024 | 20

Results from the AI x Democracy Research Sprint

We ran a 3-day research sprint on AI governance, motivated by the need for demonstrations of the risks to democracy by AI, supporting AI governance work. Here we share the 4 winning projects but many of the other 19 entries were also incredibly interesting so we suggest you take a...

Jun 14, 2024 | 13

Demonstrate and evaluate risks from AI to society at the AI x Democracy research hackathon

TL;DR: Participate online or in person on the weekend of 3rd to 5th May in an exciting and fun AI safety research hackathon focused on demonstrating and extrapolating risks to democracy from real-life threat models. We invite researchers, cybersecurity professionals, and governance experts to join, but it is open to everyone, and...

Apr 19, 2024 | 5

Join the AI Evaluation Tasks Bounty Hackathon

How do we test when autonomous AI might become a catastrophic risk? One approach is to assess the capabilities of current AI systems in performing tasks relevant to self-replication and R&D. METR (formerly ARC Evals), a research group focused on this question, has:

* developed a Task Standard, a standardized...

Mar 18, 2024 | 12

Identifying semantic neurons, mechanistic circuits & interpretability web apps

15 research projects on interpretability were submitted to the mechanistic interpretability Alignment Jam in January hosted with Neel Nanda. Here, we share the top projects and results. In summary:

* Activation patching works on singular neurons, token vector and neuron output weights can be compared, and a high mutual congruence...

Apr 13, 2023 | 18

Safety timelines: How long will it take to solve alignment?

Newsletter for Alignment Research: The ML Safety Updates

AI Safety Ideas: A collaborative AI safety research platform
