John Steidley

Posts

Applications are open for CFAR workshops in Prague this fall! (3y)

Comments
Why Corrigibility is Hard and Important (i.e. "Whence the high MIRI confidence in alignment difficulty?")
John Steidley · 1mo

(I work at Palisade)

I claim that your summary of the situation between Neel's work and Palisade's work is badly oversimplified. For example, Neel's explanation quoted here doesn't fully explain why the models sometimes subvert shutdown even after lots of explicit instructions about which instructions take priority. Nor does it explain the finding that moving the instructions from the user prompt to the developer prompt actually *increases* the behavior.

Further, the CoT that Neel quotes has a bit in it about "and these problems are so simple", but Palisade also tested whether using harder problems (from AIME, iirc) had any effect on the propensity here, and we found almost no impact. So it's really not as simple as just reading the CoT and taking the model's justifications for its actions at face value (as Neel, to his credit, notes!).

Here's a twitter thread about this involving Jeffrey and Rohin: https://x.com/rohinmshah/status/1968089618387198406

Here's our full paper that goes into a lot of these variations: https://arxiv.org/abs/2509.14260 

80,000 hours should remove OpenAI from the Job Board (and similar EA orgs should do similarly)
John Steidley · 1y

Because it's obviously annoying and burns the commons. Imagine if I made a bot that posted the same comment on every post on Less Wrong; surely that wouldn't be acceptable behavior.

Intuition for 1 + 2 + 3 + … = -1/12
John Steidley · 2y

The finish was quite a jump for me. I suppose I could go stare at your parentheses and work it out myself, but mostly I feel somewhat abandoned at that step. I was excited when I found that 1 + 2 + 4 + 8 + ... = -1 was making sense, but that excitement doesn't quite feel sufficient for me to want to decode the relationships between the terms in those two(?) patterns and all the relevant values.

"Is There Anything That's Worth More"
John Steidley · 2y

Zack, the second line of your quoted lyrics should be "I guess *we already..."

3 Levels of Rationality Verification
John Steidley · 3y

I'm currently one of the four members of the core team at CFAR (though the newest addition by far). I also co-ran the Prague Workshop Series in the fall of 2022. I've been significantly involved with CFAR since its most recent instructor training program in 2019.

I second what Eli Tyre says here. The closest thing to "rationality verification" that CFAR did in my experience was the 2019 instructor training program, which was careful to point out it wasn't verifying rationality broadly, just certifying the ability to teach one specific class.

NVIDIA and Microsoft releases 530B parameter transformer model, Megatron-Turing NLG
John Steidley · 4y

I wasn't replying to Quintin.

NVIDIA and Microsoft releases 530B parameter transformer model, Megatron-Turing NLG
John Steidley · 4y

I can't tell what you mean. Can you elaborate?

Did anybody calculate the Briers score for per-state election forecasts?
John Steidley · 5y

I think this comment would be better placed as a reply to the post that I'm linking. Perhaps you should put it there?

Did anybody calculate the Briers score for per-state election forecasts?
Answer by John Steidley · Nov 10, 2020

https://www.lesswrong.com/posts/muEjyyYbSMx23e2ga/scoring-2020-u-s-presidential-election-predictions

Gifts Which Money Cannot Buy
John Steidley · 5y

My summary: Give gifts using the parts of your world-model that are strongest. Usually the answer isn't going to end up being based on your understanding of their hobby.
