Maloew — LessWrong

Maloew's Shortform

How Important is Inverting LLMs?

A recent paper shows an algorithm to invert an LLM to find the inputs (I think? I'm not an ML guy). Does that mean you can now turn a predictor directly into a world-steerer? If you put in an output, and it finds the input most likely to cause that...

Oct 27, 2025 · 8

Could China Unilaterally Cause an AI Pause?

Is it likely that a sufficiently foresighted and desperate CCP could singlehandedly delay the AI race by at least a few years? Currently, it looks like some portions of the CCP are aware of the risks, and becoming more aware over time. On my model, they also seem more likely...

Sep 21, 2025 · 22

Why I'm Pouring Cold Water in My Left Ear, and You Should Too

Why? A little while ago, I read these posts about how pouring really cold water in people's left ear solves an extreme form of rationalizing in patients with anosognosia and might(?) make people better calibrated in general. Last month, I asked whether anyone had checked that, because it seemed like...

Jan 24, 2025 · 12

Has Someone Checked The Cold-Water-In-Left-Ear Thing?

I somewhat recently read these posts, and figured this was enough evidence that it might be worth checking whether people become better calibrated for a short while after very cold water is poured in their left ear. For example: does doing that noticeably improve scores on the calibration game for...

Dec 28, 2024 · 11

Has Anthropic checked if Claude fakes alignment for intended values too?

A recent paper by Anthropic showed alignment-faking behavior in Claude 3 Opus when told it would be trained to answer harmful queries. By this question I mean something similar to that experiment but with the memo saying something like "we accidentally gave Claude the wrong understanding of harmlessness, so...

Dec 23, 2024 · 12