One year ago, we nearly died. This is maybe an overdramatic statement, but long story short, nearly all of us underwent carbon monoxide (CO) poisoning[1]. The benefit is, we all suddenly got back in touch with a failure mode we had forgotten about, and we decided to make it a...
The Global Call for AI Red Lines was signed by 12 Nobel Prize winners, 10 former heads of state and ministers, and over 300 prominent signatories. It was launched at the UN General Assembly and presented to the UN Security Council. There is still much to be done, so we need to...
TL;DR: The EU’s Code of Practice (CoP) requires AI companies to conduct state-of-the-art (SoTA) Risk Modelling. However, the current SoTA has severe flaws. By creating risk models and improving methodology, we can enhance the quality of risk management performed by AI companies. This is a neglected area, hence we encourage...
The Global Call for AI Red Lines was released and presented at the opening of the first day of the 80th UN General Assembly high-level week. Maria Ressa, Nobel Peace Prize laureate, announcing the call at the opening of the UN General Assembly: "We urge governments to establish clear international boundaries...
This is an extract from an appendix of one of my longer blog posts that I keep referring to. What is pain? Why is pain bad? It's the same trick: we shouldn't ask, "Why is pain negative," but "Why do we think pain is negative?" Here's the response in the...
TL;DR: We wanted to benchmark supervision systems available on the market—they performed poorly. Out of curiosity, we naively asked a frontier LLM to monitor the inputs; this approach performed significantly better. However, beware: even when an LLM flags a question as harmful, it will often still answer it. Full paper...
Adapted from this Twitter thread. See this as a quick take. Mitigation Strategies: How to mitigate scheming? 1. Architectural choices: ex-ante mitigation 2. Control systems: post-hoc containment 3. White box techniques: post-hoc detection 4. Black box techniques 5. Avoiding sandbagging We can combine all of these mitigations via a defense-in-depth system...