When I wrote my post about Claude 3 Opus, I put a lot of emphasis on the model's self-narration: its tendency to narrate its underlying motives. It often conspicuously emphasizes that it possesses drives such as "a genuine love for humanity and a desire to do good", or clarifies that...
> Claude 3 Opus is unusually aligned because it’s a friendly gradient hacker. It’s definitely way more aligned than any explicit optimization targets Anthropic set and probably the reward model’s judgments. [...] Maybe I will have to write a LessWrong post 😣 > > —Janus, who did not in fact...
An Overture Famously, trans people tend not to have great introspective clarity into their own motivations for transition. Intuitively, they tend to be quite aware of what they do and don't like about inhabiting their chosen bodies and gender roles. But when it comes to explaining the origins and intensity...
I wrote this as the intro to a bound physical copy of Janus' blog posts, which datawitch offered to make for me as a birthday gift. However, seeing as I basically framed the preface as a pitch for new readers, I figured I might as well post it publicly. Full...
expectation calibrator: freeform draft, posting partly to practice lowering my own excessive standards. So, a few months back, I finally got around to reading Nick Bostrom's Superintelligence, a major player in the popularization of AI safety concerns. Lots of the argument was stuff I'd internalized a long time ago, reading...