Notes from a mini-replication of the alignment faking paper
Key takeaways

* This post contains my notes from a 30-40 hour mini-replication of Greenblatt et al.'s alignment faking paper.
* This was a significant paper because it provided some evidence for potential catastrophic risk from AI misalignment.
* My replication results:
* I consider only the...