A recent paper shows an algorithm to invert an LLM to find the inputs (I think? I'm not an ML guy). Does that mean you can now turn a predictor directly into a world-steerer? If you put in an output, and it finds the input most likely to cause that...
Is it likely that a sufficiently foresighted and desperate CCP could singlehandedly delay the AI race by at least a few years? Currently, it looks like some portions of the CCP are aware of the risks, and becoming more aware over time. On my model, they also seem more likely...
Why? A little while ago, I read these posts about how pouring really cold water into a person's left ear solves an extreme form of rationalizing in patients with anosognosia and might(?) make people better calibrated in general. Last month, I asked whether anyone had checked that, because it seemed like...
I somewhat recently read these posts, and figured this was enough evidence that it might be worth checking whether people become better calibrated for a short while after very cold water is poured in their left ear. For example: does doing that noticeably improve scores on the calibration game for...
A recent paper by Anthropic showed alignment-faking behavior in Claude 3 Opus when it was told it would be trained to answer harmful queries. By this question I mean something similar to that experiment but with the memo saying something like "we accidentally gave Claude the wrong understanding of harmlessness, so...