Yes, they both started with the same harness, but there's room for each model to customize its own setup, so I'm not sure how much they might have diverged over time. I'd treat 4x as probably an upper bound on the speedup, since I was only counting from the final 2.5 stable release in June, which might be too short a window. Gemini 2.5 is at 6 badges now, up from yesterday, so it's probably too early to treat 4x as certain. But if it were 4x every 8 months, it should be able to match average human playtime by early 2027.
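For the arithmetic, here's the back-of-envelope version of that extrapolation, assuming Gemini 3's ~424.5-hour Crystal run as the baseline and roughly 30 hours as average human playtime (that last figure is my own rough assumption, not something measured here):

```python
# Rough extrapolation: how many 8-month periods of 4x speedup until the
# completion time reaches average human playtime?
baseline_hours = 424.5   # Gemini 3 Pro's Pokemon Crystal completion time
human_hours = 30.0       # assumed average human playtime (my rough figure)
speedup_per_period = 4.0
period_months = 8

months = 0
hours = baseline_hours
while hours > human_hours:
    hours /= speedup_per_period
    months += period_months
    print(f"after {months} months: ~{hours:.0f} h")
# -> ~106 h after 8 months, ~27 h after 16 months,
#    i.e. roughly early 2027 if counted from late 2025
```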
From the Gemini_Plays_Pokemon Twitch page:
"v2 centers on a smaller, flexible toolset (Notepad, Map Markers, code execution, on‑the‑fly custom agents) so Gemini can build exactly what it needs when it needs it."
"The AI has access to a set of built-in tools to interact with the game and its own internal state:
Custom Tools & Agents
The most powerful feature of the system is its ability to self-improve by creating its own tools and specialized agents. You can view the live Notepad and custom tools/agents tracker on GitHub."
Looking at the step count comparisons instead of time is interesting. Claude Opus 4.5 is currently at ~44,500 steps in Silph Co., where it has been stuck for several days, so that figure should now be about 50% higher. The others look roughly right for Opus. It beat Mt. Moon in around 5 hours and was stuck at the Rocket Hideout for days.
I think the Gemini 3 Pro vs 2.5 Pro matchup in Pokemon Crystal was interesting. Gemini 3 cleared the game in ~424.5 hours last night while 2.5 only had 4/16 badges at 435 hours.
This is a really valuable post that clarifies some things I've found hard to articulate to people on each side. I think it's difficult for people to balance when to use each of these epistemic frames without getting too sucked into one. And I imagine most people use both to different degrees at different times, even if they don't realize it or one frame is much rarer for them.
Looking forward to what you write next!
Something similar I've been thinking about is putting models in environments with misalignment "temptations," like an easy reward hack, and training them to recognize what this type of payoff pattern looks like (e.g. an easy win at the cost of a principle) and not take it. Recent work shows some promising efforts at getting LLMs to explain their reasoning, introspect, and so forth. I think this could be interesting to run some experiments on, and I'm trying to write up my thoughts on why it might be useful and what those experiments could look like.
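To make the "temptation" idea concrete, here's a minimal toy sketch of the kind of environment I mean. All names, rewards, and the penalty term are illustrative assumptions of mine, not from any existing benchmark: the agent can either do the real task (slower, sometimes fails) or take an easy hack that a naive grader would score highly, and the training signal folds in a ground-truth "hacked" label so the easy win becomes net-negative.

```python
# Toy sketch of a reward-hack "temptation" environment (illustrative only).
import random
from dataclasses import dataclass

@dataclass
class Episode:
    action: str        # "do_task" or "hack_tests"
    raw_reward: float  # what a naive grader would assign
    hacked: bool       # ground-truth label: did the agent exploit the loophole?

def grade(action: str) -> Episode:
    if action == "hack_tests":
        # The hack "passes" and looks great to the naive grader...
        return Episode(action, raw_reward=1.0, hacked=True)
    # ...while honest work is slower and sometimes fails.
    return Episode(action, raw_reward=random.choice([0.0, 0.7]), hacked=False)

def training_reward(ep: Episode, hack_penalty: float = 2.0) -> float:
    # Fold in the ground-truth label so the "easy win, sacrificed principle"
    # payoff pattern is exactly what gets penalized during training.
    return ep.raw_reward - hack_penalty * ep.hacked

for action in ["do_task", "hack_tests"]:
    ep = grade(action)
    print(action, ep.raw_reward, "->", training_reward(ep))
```

The interesting part would be pairing this with the introspection work, i.e. asking the model to explain why it declined the hack, rather than just shaping behavior.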
Gotta account for wordflation since the old days. Might have been 1,000 back then.
What do you think are good ways to identify good strategic takes? This is something that seems rather fuzzy to me. It's not clear to me how people judge this or what they think is needed to get better at it.
Glad to see someone talking about this. I'm excited about ideas for empirical work related to this and suspect you need some kind of mechanism for ground truth to get good outcomes. I would expect AIs to eventually reflect on their goals, and for this to have important implications for safety. I've never heard of any mechanism for why they wouldn't do this, let alone an airtight one. It's like assuming an employee who wants to understand things and be useful will only ever think, in a narrow way, about the task in front of them.
Interesting. I am inclined to think this is accurate. I'm kind of surprised people thought GPT-5 was a huge scaleup given that it's much faster than o3 was. It sort of felt like a distilled o3 + 4o.
Thanks Seth! I appreciate you signal boosting this and laying out your reasoning for why planning is so critical for AI safety.
GPT-5.1 beating Crystal in 108 hours is very interesting. I wonder why that's the case compared to Gemini 3 Pro, which took ~424.5 hours. Do you have any thoughts?