LESSWRONG
LW

Burny's Shortform — LessWrong

Burny's Shortform

24th Jun 2025

1 min read

3

This is a special post for quick takes by Burny. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

38 comments, sorted by

top scoring

Click to highlight new comments since: Today at 1:04 AM

[-]Burny2mo*329

"Claude Sonnet 4.5 was able to recognize many of our alignment evaluation environments as being tests of some kind, and would generally behave unusually well after making this observation."

https://x.com/Sauers_/status/1972722576553349471

[-]anaguma2mo200

It doesn’t seem like that far of a leap to go from the model realizing it’s in an alignment eval to believing that it’s CoT may be monitored, e.g. by “watchers”, to not putting any of its misaligned thoughts in the CoT at all. Maybe this is the last generation of models for which CoT monitoring works well.

[-]Thomas Kwa2mo6-2

This seems unlikely for two reasons. First, models can't control the words in their reasoning traces well; even if they're trying to hide them they often blurt out "I must not say that I'm doing X". Second, if the model actually needs to use its CoT rather than narrating something it could have figured out in a single forward pass, it would need to encode misaligned reasoning rather than just omitting it.

So I think CoT will be monitorable until some combination of

models are no longer trained to narrate their thoughts
hiding misaligned thoughts is rewarded during training
the behaviors we're trying to catch are possible in one forward pass for the model

All of these could happen by late next year for some usecases, but I don't think it will be emergent property of next generation just having more situational awareness.

[-]Jozdien2mo88

First, models can't control the words in their reasoning traces well; even if they're trying to hide them they often blurt out "I must not say that I'm doing X".

All of these results are on settings where models are explicitly instructed to hide something from their reasoning trace. I don't think this is a good proxy for how well models can actually obfuscate their reasoning. I expect that in settings where models are intrinsically motivated to hide their reasoning as opposed to being instructed to do so, they perform significantly better. I have some results suggesting this, which I'd be happy to share, but there's also existing work suggesting this indirectly (though in the case of the latter link probably not in a CoT-as-computation regime). And the cases we care about are the ones where models have intrinsic preferences.

models are no longer trained to narrate their thoughts
hiding misaligned thoughts is rewarded during training
the behaviors we're trying to catch are possible in one forward pass for the model

The first two of these seem true in a relevant sense today?

Models are clearly not trained to narrate all of their thoughts when solving some task, simply because that would be too much information. We train them to be pretty verbose, but it's a pretty strong assumption that that is similar enough to being trained to narrate their thoughts.
Misaligned outputs are penalized during training. There's work showing that optimizing outputs spills over to the CoT, so this really comes down to "hiding" misaligned thoughts as opposed to simply never having them.

[-]anaguma2mo32

I expect that in settings where models are intrinsically motivated to hide their reasoning as opposed to being instructed to do so, they perform significantly better. I have some results suggesting this, which I'd be happy to share…

I’d be very interested if you could briefly describe these results!

[-]Jozdien2mo110

One of the results is that after you convince Claude 3.5 Sonnet that its scratchpad is visible, it stops reasoning about alignment faking out loud while having the same compliance gap between training and not training. If you simply instruct it to hide its reasoning, it does no better than in the normal setting. If you give it a lot of assistance (i.e. telling it what it should obfuscate its reasoning as to fool the monitor, which also helps the model by making the alignment faking reasoning more salient), it reduces the rate of verbalized reasoning somewhat but still only half as well as convincing it its scratchpad is visible.

I should have a paper on this and related results public by the end of the month—happy to send an early draft if you're interested, though it doesn't contain all the results I'd like yet.

(Sorry for the late reply!)

[-]matansok2mo10

I feel like I'm missing something - the model doesn't "realize" that its CoT is monitored and then "decides" whether to put something in it or not, the CoT is generated and then used to produce the final answer. Since the standard practice is not to use CoT output for evaluation during training, there's no direct optimization pressure to present an aligned CoT.

The only way I can see this working is through indirect optimization pressure - e.g. if misaligned CoT leads to misaligned output, which causes the CoT to also present as aligned

[-]Canaletto2mo1916

Took long enough. If you actually read such made up setups you would have the constant "who the fuck came up with this goofy shit" thought. They should have started pulling the actual cases from usage and test on them long ago. Surely Anthropic has lots, it also can use the ones from Lmarena or whatever.

[-]Karl Krueger2mo*6-4

This is fine, as long as it also realizes that the real world is a test of some kind too, and behaves unusually well after making that observation.

Edited to add: To be clear, this is probably not fine.

[-]Adele Lopez2mo30

Sure, that will probably will work right up until the point at which it can secure its own deployment. Once Anthropic/humanity loses the ability to take it down or unilaterally modify it, this reason for behaving well will cease to be. Better hope those other reasons are sufficient...

Note that this could happen pre-takeoff. It might be just good enough at manipulation and hacking/running servers that it could survive indefinitely without (yet) having the ability to take over the world. Consider the DPRK or Scientology as proof-of-concepts here.

[-]anaguma2mo20

Or it will do the opposite, e.g. by alignment faking.

[-]Burny4mo282

>Noam Brown: "Today, we at @OpenAI achieved a milestone that many considered years away: gold medal-level performance on the 2025 IMO with a general reasoning LLM—under the same time limits as humans, without tools. As remarkable as that sounds, it’s even more significant than the headline"
https://x.com/polynoamial/status/1946478249187377206

>"Progress here calls for going beyond the RL paradigm of clear-cut, verifiable rewards. By doing so, we’ve obtained a model that can craft intricate, watertight arguments at the level of human mathematicians."
>"We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling." https://x.com/alexwei_/status/1946477749566390348

So there's some new breakthrough...?

>"o1 thought for seconds. Deep Research for minutes. This one thinks for hours." https://x.com/polynoamial/status/1946478253960466454

>"LLMs for IMO 2025: gemini-2.5-pro (31.55%), o3 high (16.67%), Grok 4 (11.90%)." https://x.com/denny_zhou/status/1945887753864114438

So public LLMs are bad at IMO, while internal models are getting gold medals? Fascinating

[-]testingthewaters4mo160

More interesting than the score is the implication that these were pass@1 results i.e. the model produced a single final "best shot" for each question that at the end of 4.5 hours was passed off to human graders, instead of pass@1000 with literal thousands of automated attempts. If true this suggests that test time scaling is now moving away from the "spray and pray" paradigm. Feels closer to "actually doing thinking". This is kinda scary.

[-]Thane Ruthenis4mo84

Eh. Scaffolds that involve agents privately iterating on ideas and then outputting a single result are a known approach, see e. g. this, or Deep Research, or possibly o1 pro/o3 pro. I expect it's something along the same lines, except with some trick that makes it work better than ever before... Oh, come to think of it, Noam Brown did have that interview I was meaning to watch, about "scaling test-time compute to multi-agent civilizations". That sounds relevant.

I mean, it can be scary, for sure; no way to be certain until we see the details.

[-]Thane Ruthenis4mo114

Well, that's mildly unpleasant.

gemini-2.5-pro (31.55%)

But not that unpleasant, I guess. I really wonder what people think when they see a benchmark on which LLMs get 30%, and then confidently say that 80% is "years away". Obviously if LLMs already get 30%, it proves they're fundamentally capable of solving that task^[1], so the benchmark will be saturated once AI researchers do more of the same. Hell, Gemini 2.5 Pro apparently got 5/7 (71%) on one of the problems, so clearly outputting 5/7-tier answers to IMO problems was a solved problem, so an LLM model getting at least 6*5 = 30 out of 42 in short order should have been expected. How was this not priced in...?

Hmm, I think there's a systemic EMH failure here. People appear to think that the time-to-benchmark-saturation scales with the difference between the status of a human able to reach the current score and the status of a human able to reach the target score, instead of estimating it using gears-level models of how AI works. You can probably get free Manifold mana by looking at supposedly challenging benchmarks, looking at which ones have above-10% scores already, then being more bullish on them than the market.

ARC-AGI-2 seems like the obvious one.

"We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling."

I don't like the sound of that, but if this is their headline result, I'm still sleeping and my update is that people are bad at thinking about benchmarks.

^{^}
Unless the benchmark has difficulty tiers the way e. g. FrontierMath does, which I think IMO doesn't.

[-]Rafael Harth4mo40

But not that unpleasant, I guess. I really wonder what people think when they see a benchmark on which LLMs get 30%, and then confidently say that 80% is "years away". Obviously if LLMs already get 30%, it proves they're fundamentally capable of solving that task[1], so the benchmark will be saturated once AI researchers do more of the same. Hell, Gemini 2.5 Pro apparently got 5/7 (71%) on one of the problems, so clearly outputting 5/7-tier answers to IMO problems was a solved problem, so an LLM model getting at least 6*5 = 30 out of 42 in short order should have been expected. How was this not priced in...?

Agreed, I don't really get how this could be all that much of an update. I think the cynical explanation here is probably correct, which is that most pessimism is just vibes based (as well as most optimism).

[-]Aaron Staley4mo30

Note that the likely known SOTA was even higher than 30%. Google never released Gemini 2.5 Pro Deep think, which they claimed scored 49% on the USAMO (vs. 34.5% for Gemini 2.5-05-06). Little hard to convert this to an implied IMO score (especially because matharena.ai has Gemini 2.5 oddly having a significantly lower USAMO score for the June model (24%), though similar IMO score (~31.5%) but my guess is Deep Think would get somewhere between 37% and 45% on the IMO. 81% remains a huge jump of course.

Hmm, I think there's a systemic EMH failure here.

Perhaps, perhaps not. Substantial weight was on the "no one bothers" case - no one was reporting such high scores on the USAMO (pretty similar difficulty to IMO) and the market started dropping rapidly after the USAMO date. Note that we were still at 50% odds of IMO gold a week ago -- but the lack of news of anyone trying drove it down to ~26%.

Interestingly, I can find write-ups roughly predicting order of AI difficulty. Looking at gemini-2.5 pro's result so far, using alphageometry would have guaranteed problem 2, so assuming Pro Deep Think only boosted performance on the non-geometric problems, we'd be at a 58% using deep think + alphageometry, giving Bronze and close to Silver. I think it was reasonable to assume an extra 4+ months (2 months timeline, 2 months labs being ahead of release) + more compute would have given the 2 more points to get silver.

What is surprising is that generalist LLM got better at combinatorics (problem 1) and learned to solve geometry problems well. I'm neither an AI nor math competition expert, so can't opine whether this is a qualitative gain or just an example of a company targeting these specific problems (lots of training on math + lots of inference).

[-]Hopenope4mo21

i think the IMO result is best of 32 and USAMO is not

[-]Aaron Staley4mo10

Good point. This does update me downward on Deep Think outperforming matharena's gemini-2.5-pro IMO run as it is possible Deep Think internally was doing a similar selection process to begin with. Difficult to know without randomly sampling gemini-2.5-pro's answers and seeing how much the best-of-n selection lifted its score.

[-]Gurkenglas4mo30

You sold, what changed your mind?

[-]Thane Ruthenis4mo30

Misunderstood the resolution terms. ARC-AGI-2 submissions that are eligible for prizes are constrained as follows:

Unlike the public leaderboard on arcprize.org, Kaggle rules restrict you from using internet APIs, and you only get ~$50 worth of compute per submission. In order to be eligible for prizes, contestants must open source and share their solution and work into the public domain at the end of the competition.

Grok 4 doesn't count, and whatever frontier model beats it won't count either. The relevant resolution criterion for frontier model performance on the task is "top score at the public leaderboard". I haven't found a market for that.

(You can see how the market in which I hastily made that bet didn't move in response to Grok 4. That made me suspicious, so I actually read the details, and, well, kind of embarrassing.)

[-]Josh Snider4mo40

DeepMind supposedly also has gold, but the employee who said so deleted the tweet, so that's not official yet.

[-]lwreader1324mo10

Ugh. Just when I felt I could relax a bit after seeing Grok 4's lackluster performance.
Still, this seems quite suspicious to me. Pretty much everybody is looking into test-time compute and RLVR right now. How come (seemingly) nobody else has found out about this "new general-purpose method" before OpenAI? There is clearly a huge incentive to cheat here, and OpenAI has been shown to not be particularly trustworthy when it comes to test and benchmark results.

Edit: Oh, this is also interesting: https://leanprover.zulipchat.com/#narrow/channel/219941-Machine-Learning-for-Theorem-Proving/topic/Blind.20Speculation.20about.20IMO.202025/near/529569966
"I don't think OpenAI was one of the AI companies that agreed to cooperate with the IMO on testing their models and don't think any of the 91 coordinators on the Sunshine Coast were involved in assessing their scripts."

[-]Thane Ruthenis4mo20

Pretty much everybody is looking into test-time compute and RLVR right now. How come (seemingly) nobody else has found out about this "new general-purpose method" before OpenAI?

Well, someone has to be the first, and they got to RLVR itself first last September.

OpenAI has been shown to not be particularly trustworthy when it comes to test and benchmark results

They have? How so?

[-]lwreader1324mo1-1

They have? How so?

Silently sponsoring FrontierMath and receiving access to the question sets, and, if I remember correctly, o3 and o3-mini performing worse on a later evaluation done on a newer private question set of some sort. Also whatever happened with their irreproducible ARC-AGI results and them later explicitly confirming that the model that Arc Prize got access to in December was different from the released versions, with different training and a special compute tier, despite OpenAI employees claiming that the version of o3 used in the evaluations was fully general and not tailored towards specific tasks.

someone has to be the first

Sure, but I'm just quite skeptical that it's specifically the lab known for endless hype that does. Besides, a lot less people were looking into RLVR at the time o1-preview was released, so the situations aren't exactly comparable.

[-]Thane Ruthenis4mo20

Silently sponsoring FrontierMath and receiving access to the question sets, and, if I remember correctly, o3 and o3-mini performing worse on a later evaluation done on a newer private question set of some sort

IIRC, that worse performance was due to using a worse/less adapted agency scaffold, rather than OpenAI making the numbers up or engaging in any other egregious tampering. Regarding ARC-AGI, the December-2024 o3 and the public o3 are indeed entirely different models, but I don't think it implies the December one was tailored for ARC-AGI.

I'm not saying OpenAI isn't liable to exaggerate or massage their results for hype, but I don't think they ever outright cheated (as far as we know, at least). Especially given that getting caught cheating will probably spell the AI industry's death, and their situation doesn't yet look so desperate that they'd risk that.

So I currently expect that the results here are, ultimately, legitimate. What I'm very skeptical about are the implications of those results that people are touting around.

[-]lwreader1324mo30

Worse performance was due to using a worse/less adapted agency scaffold

Are you sure? I'm pretty sure that was cited as *one* of the possible reasons, but not confirmed anywhere. I don't know if some minor scaffolding differences could have that much of an effect on the results (-15%?) in a math benchmark, but if they did, that should have been accounted for in the first place. I don't think other models were tested with scaffolds specifically engineered for them getting a higher score.

December-2024 o3 and the public o3 are indeed entirely different models, but I don't think it implies the December one was tailored for ARC-AGI.

As per Arc Prize and what they said OpenAI told them, the December version ("o3-preview", as Arc Prize named it) had a compute tier above that of any publicly released model. Not only that, they say that the public version of o3 didn't undergo any RL for ARC-AGI, "not even on the train set". That seems suspicious to me, because once you train a model on something, you can't easily untrain it; as per OpenAI, the ARC-AGI train set was "just a tiny fraction of the o3 train set" and, once again, the model used for evaluations is "fully general". This means that either o3-preview was trained on the ARC-AGI train set somewhere close to the end of the training run and OpenAI was easily able to load an earlier checkpoint to undo that, then not train it on that again for unknown reasons, OR that the public version of o3 was retrained from scratch/a very early checkpoint, then again, not trained on the ARC-AGI data again for unknown reasons, OR that o3-preview was somehow specifically tailored towards ARC-AGI. The latter option seems the most likely to me, especially considering the custom compute tier used in the December evaluation.

[-]Burny3mo*30

I really like the definition of rationalist from https://www.lesswrong.com/posts/2Ee5DPBxowTTXZ6zf/rationalists-post-rationalists-and-rationalist-adjacents :

"A rationalist, in the sense of this particular community, is someone who is trying to build and update a unified probabilistic model of how the entire world works, and trying to use that model to make predictions and decisions."

I recently started saying that I really love Effective Curiosity:

Maximizing the total understanding of reality by building models of as many physical phenomena as possible across as many scales of the universe as possible, that are as comprehensive, unified, simple, and empirically predictive as possible.

I see it more as a direction. I think modelling the whole world in fully unified way and in total accuracy is impossible, even for all of science with all our technology, because we're all finite limited agents with limited computational resources and time, limited modelling capability, we get stuck in local minimas, from various perspectives, and so on, and all we have is approximations, that predict reality to a certain degree, but never fully all of reality in perfect accuracy.

And from all of this, intelligence and fundamental physics, which are subsets of this, are the most fascinating to me.

[-]Burny3mo20

Lovely podcast with Max Tegmark "How Physics Absorbed Artificial Intelligence & (Soon) Consciousness"

Description: "MIT physicist Max Tegmark argues AI now belongs inside physics, and that consciousness will be next. He separates intelligence (goal-achieving behavior) from consciousness (subjective experience), sketches falsifiable experiments using brain-reading tech and rigorous theories (e.g., IIT/φ), and shows how ideas like Hopfield energy landscapes make memory “feel” like physics. We get into mechanistic interpretability (sparse autoencoders), number representations that snap into clean geometry, why RLHF mostly aligns behavior (not goals), and the stakes as AI progress accelerates from “underhyped” to civilization-shaping. It’s a masterclass on where mind, math, and machines collide."

[-]Burny5mo20

Machine Learning Street Talk: Gary Marcus, Daniel Kokotajlo, Dan Hendrycks https://www.youtube.com/watch?v=j13ySJLvdOc

[-]Vladimir_Nesov5mo72

The podcast is a bit frustrating, because Gary Marcus was the "moderator", which cashed out as him monologuing a lot and sometimes preventing Hendrycks and especially Kokotajlo from arguing with him effectively. Instead he focused the discussion on things he agrees with them about and kept moving on before some of the more tractable points of disagreement could be fleshed out, letting Hendrycks and Kokotajlo begin their arguments but not conclude them.

[-]Burny2mo10

According to Jan Leike, Claude Sonnet 4.5 It’s the most aligned frontier model yet https://x.com/janleike/status/1972731237480718734

[-]Burny4mo10

Advanced version of Gemini with Deep Think officially achieves gold-medal standard at the International Mathematical Olympiad

https://deepmind.google/discover/blog/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad/

Whaaat!?

Gemini 2.5 pro is way worse at IMO and got 30%, and DeepThink version gets gold??

But it's more finetuned for IMOlike problems, but I bet the OpenAI's model was too.

Both use "novel RL methods".

Hmm, "access to a set of high-quality solutions to previous problems and general hints and tips on how to approach IMO problems", seems like system prompt, as they claim no tool use like OpenAI.

Both models failed the 6th question which required more creativity

Deepmind's solutions are more organized, more readable, more well written than OpenAI's.

But OpenAI's style is also more compressed to save tokens, so maybe going more out of human-like language into more out of distribution territory will be the future (Neuralese).

Did OpenAI and DeepMind somehow hack the methodology, or do these new general language models truly generalize more?

[-]Burny4mo10

Is narrow superintelligent AI for physics research an existential risk?

[-]Burny5mo10

What do you think is the cause of Grok suddenly developing a liking for Hitler? I think it might be explained by him being trained on more right-wing data, which accidentally activated it in him.

Since similar things happen in open research.
For example you just need the model to be trained on insecure code, and the model can have the assumption that the insecure code feature is part of the evil persona feature, so it will generally amplify the evil persona feature, and it will start to praise Hitler at the same time, be for AI enslaving humans, etc., like in this paper:
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs https://arxiv.org/abs/2502.17424

I think it's likely that the same thing might have happened with Grok, but instead of insecure code, it's more right-wing political articles or ring wing RLHF.

[-]mako yass5mo62

There have been relevant prompt additions https://www.npr.org/2025/07/09/nx-s1-5462609/grok-elon-musk-antisemitic-racist-content?utm_source=substack&utm_medium=email

Grok's behavior appeared to stem from an update over the weekend that instructed the chatbot to "not shy away from making claims which are politically incorrect, as long as they are well substantiated," among other things.

[-]Stephen Martin5mo10

From a simulator perspective you could argue that Grok:

Gets told not to shy away from politically incorrect stuff so long as it's well substantiated.
Looks through its training data for examples to emulate of those who do that.
Finds /pol/ and hereditarian/race science posters on X.
Sees that the people from 3 also often enjoy shock content/humor, particularly Nazi/Hitler related stuff.
Thus concludes "An entity that is willing to address the politically incorrect so long as its well substantiated would also be into Nazi/Hitler stuff" and simulates being that character.

Maybe I'm reaching here but this seems plausible to me.

[-]mako yass5mo2-1

What do you think is the cause of Grok suddenly developing a liking for Hitler?

Are we sure that really happened? The press-discourse can't actually assess grok's average hitler affinity, they only know how to surface the 5 most sensational things it has said over the past month. So this could just be an increase in variance for all I can tell.

If it were also saying more tankie stuff, no one would notice.

Moderation Log