I originally wrote this as a private doc for people working in the field - it's not super polished or optimized for a broad audience. But I'm publishing it anyway because inference verification is a new and exciting area, and there are few bird's-eye-view explainers of what's going on and what the bottlenecks...
I often read interpretability papers and come away thinking “ok, but what’s the point? What problem does this help us solve?” So last winter, I organized a MATS/Pivotal stream to build examples of deceptive models (aka “model organisms”). The goal was to build a diverse ‘zoo’ of these model...
Joshua Clymer, Alek Westover, Anshul Khandelwal

We explore the following hypothesis both conceptually and, to a small extent, empirically. We call this the Alignment Drift Hypothesis: An AI system that is initially aligned will generally drift into misalignment after a sufficient number of successive modifications, even if these modifications select...
I think AI takeover is plausible. But Eliezer’s argument that it’s more than 98% likely to happen does not stand up to scrutiny, and I’m worried that MIRI’s overconfidence has reduced the credibility of the issue. Here is why I think the core argument in "If Anyone Builds It, Everyone...
Previously, we've shared a few higher-effort project proposals relating to AI control in particular. In this post, we'll share a whole host of less polished project proposals. All of these projects excite at least one Redwood researcher, and high-quality research on any of these problems seems pretty valuable. They differ...
Right now, alignment seems easy – but that’s because models spill the beans when they are misaligned. Eventually, models might “fake alignment,” and we don’t know how to detect that yet. It might seem like there’s a burgeoning research field improving white-box detectors – a new paper about probes...
This post heavily overlaps with “how might we safely pass the buck to AI?” but is written to address a central counterargument raised in the comments, namely “AI will produce sloppy AI alignment research that we don’t know how to evaluate.” I wrote this post in a personal capacity...