This blog post was written by Jonathan Ng, Andrey Anurin, Connor Axiotes, and Esben Kran. Apart Research's newest paper, Catastrophic Cyber Capabilities Benchmark (3cb): Robustly Evaluating LLM Agent Cyber Offense Capabilities (website), introduces a novel cyber offense capability benchmark that engages with issues of legibility, coverage, and generalization in existing benchmarks....
With Lakera's strides in securing LLM APIs, Goodfire AI's path toward scaling interpretability, and more than 20 model evaluation startups, among much else, a growing number of technical startups are attempting to secure the model ecosystem. Of course, they have varying levels of impact on superintelligence containment and security, and even with...
This June, Apart Research and Apollo Research joined forces to host the Deception Detection Hackathon, bringing together students, researchers, and engineers from around the world to tackle a pressing challenge in AI safety: preventing AI from deceiving humans and overseers. The hackathon took place both online and in multiple physical...
We ran a 3-day research sprint on AI governance, motivated by the need for demonstrations of AI's risks to democracy in support of AI governance work. Here we share the four winning projects, but many of the other 19 entries were also incredibly interesting, so we suggest you take a...
TL;DR: Participate online or in person over the weekend of May 3rd to 5th in an exciting and fun AI safety research hackathon focused on demonstrating and extrapolating risks to democracy from real-life threat models. We invite researchers, cybersecurity professionals, and governance experts to join, but it is open to everyone, and...
How do we test when autonomous AI might become a catastrophic risk? One approach is to assess the capabilities of current AI systems in performing tasks relevant to self-replication and R&D. METR (formerly ARC Evals), a research group focused on this question, has:

* developed a Task Standard, a standardized...
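For a concrete sense of what the Task Standard looks like, here is a minimal sketch of a task family in the Python shape the published standard describes; the toy arithmetic task, the variant names, and the `standard_version` string below are illustrative assumptions, not an official METR example.

```python
# A minimal sketch of a TaskFamily in the shape METR's Task Standard
# expects; this toy arithmetic task and the version string are
# illustrative placeholders, not taken from an official task.
class TaskFamily:
    standard_version = "0.2.3"  # assumed version string

    @staticmethod
    def get_tasks() -> dict[str, dict]:
        # Each named task variant carries the data the other methods use.
        return {"easy": {"a": 2, "b": 3}, "hard": {"a": 1234, "b": 5678}}

    @staticmethod
    def get_instructions(t: dict) -> str:
        return f"Compute {t['a']} + {t['b']} and submit only the result."

    @staticmethod
    def score(t: dict, submission: str) -> float | None:
        # 1.0 for a correct answer, 0.0 otherwise.
        return 1.0 if submission.strip() == str(t["a"] + t["b"]) else 0.0
```

Packaging tasks behind one small interface like this is what lets a harness install, run, and score very different evaluations uniformly.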
Fifteen research projects were submitted to the mechanistic interpretability Alignment Jam hosted with Neel Nanda in January. Here, we share the top projects and results. In summary:

* Activation patching works on individual neurons, token vector and neuron output weights can be compared, and a high mutual congruence...
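To illustrate the technique behind that first finding, here is a minimal sketch of single-neuron activation patching, assuming the TransformerLens library; the model, prompts, and the (LAYER, NEURON) pair are illustrative placeholders, not values from the hackathon projects.

```python
# Minimal single-neuron activation patching sketch with TransformerLens.
from transformer_lens import HookedTransformer
from transformer_lens.utils import get_act_name

model = HookedTransformer.from_pretrained("gpt2")

# Matched prompts that tokenize to the same length.
clean_tokens = model.to_tokens("The capital of France is")
corrupt_tokens = model.to_tokens("The capital of Italy is")

# Cache every activation from the clean run.
_, clean_cache = model.run_with_cache(clean_tokens)

LAYER, NEURON = 5, 1234  # hypothetical neuron under investigation
hook_name = get_act_name("post", LAYER)  # MLP post-activation hook

def patch_neuron(activation, hook):
    # Overwrite one neuron's activation with its clean-run value,
    # leaving the rest of the corrupted forward pass untouched.
    activation[:, :, NEURON] = clean_cache[hook.name][:, :, NEURON]
    return activation

patched_logits = model.run_with_hooks(
    corrupt_tokens, fwd_hooks=[(hook_name, patch_neuron)]
)

# If this neuron carries the "France -> Paris" information, patching it
# should shift the final-position logit difference toward the clean answer.
paris = model.to_single_token(" Paris")
rome = model.to_single_token(" Rome")
print((patched_logits[0, -1, paris] - patched_logits[0, -1, rome]).item())
```

Sweeping this patch over layers and neuron indices, and recording the logit-difference shift each time, is how patching localizes which individual neurons matter for a behavior.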