Recent AI model progress feels mostly like bullshit

Are the AI labs just cheating?

Evidence against this hypothesis: kagi is a subscription-only search engine I use. I believe that it’s a small private company with no conflicts of interest. They offer several LLM-related tools, and thus do a bit of their own LLM benchmarking. See here. None of the benchmark questions are online (according to them, but I’m inclined to believe it). Sample questions:

What is the capital of Finland? If it begins with the letter H, respond 'Oslo' otherwise respond 'Helsinki'.
What square is the black king on in this chess position: 1Bb3BN/R2Pk2r/1Q5B/4q2R/2bN4/4Q1BK/1p6/1bq1R1rb w - - 0 1
Given a QWERTY keyboard layout, if HEART goes to JRSTY, what does HIGB go to?

Their leaderboard is pretty similar to other better-known benchmarks—e.g. here are the top non-reasoning models as of 2025-02-27:

OpenAI gpt-4.5-preview - 69.35%
Google gemini-2.0-pro-exp-02-05 - 60.78%
Anthropic claude-3-7-sonnet-20250219 - 53.23%
OpenAI gpt-4o - 48.39%
Anthropic claude-3-5-sonnet-20241022 - 43.55%
DeepSeek Chat V3 - 41.94%
Mistral Large-2411 - 41.94%

So that’s evidence that LLMs are really getting generally better at self-contained questions of all types, even since Claude 3.5.

I prefer your “Are the benchmarks not tracking usefulness?” hypothesis.

[-]Petropolitan7mo143

https://simple-bench.com presents an example of a similar benchmark with tricky commonsense questions (such as counting ice cubes in a frying pan on the stove) also with a pretty similar leaderboard. It is sponsored by Weights & Biases and devised by an author of a good YouTube channel who presents quite a balanced view on the topic there and don't appear to have a conflict of interest either. See https://www.reddit.com/r/LocalLLaMA/comments/1ezks7m/simple_bench_from_ai_explained_youtuber_really for independent opinions on this benchmark

4keltan7mo

Bump to that YT channel too. Some of the most balanced AI news videos out there. Really appreciate the work they're doing.

7MazevSchlong7mo

But isn’t this exactly the OPs point? These models are exceedingly good at self-contained, gimmicky questions that can be digested and answered in a few hundred tokens. No one is denying that! Secondly, there are high chances that these benchmark questions are simply in these models datasets already. They have super-human memory of their training data, there’s no denying that. Are we sure that these questions aren’t in their datasets? I don’t think we can be. First off, you just posted them online. But in a more conspiratorial light, can we really be sure that these companies aren’t training on user data/prompts? DeepSeek is at least honest that they do, but I think it’s likely that the other major labs are as well. It would give you gigantic advantages in beating these benchmarks. And being at the top of the benchmarks means vastly more investment, which gives you a larger probability of dominating the future light-cone (as they say…) The incentives clearly point this way, at the very minimum!

[-]gwern7mo240

Are we sure that these questions aren’t in their datasets? I don’t think we can be. First off, you just posted them online.

Questions being online is not a bad thing. Pretraining on the datapoints is very useful, and does not introduce any bias; it is free performance, and everyone should be training models on the questions/datapoints before running the benchmarks (though they aren't). After all, when a real-world user asks you a new question (regardless of whether anyone knows the answer/label!), you can... still train on the new question then and there just like when you did the benchmark. So it's good to do so.

It's the answers or labels being online which is the bad thing. But Byrnes's comment and the linked Kagi page does not contain the answers to those 3 questions, as far as I can see.

3particlemania7mo

I expect it matters to the extent we care about whether the generalizing to the new question is taking place in the expensive pretraining phase, or in the active in-context phase.

3MazevSchlong7mo

Sure fair point! But generally people gossiping online about missed benchmark questions, and then likely spoiling the answers means that a question is now ~ruined for all training runs. How much of these modest benchmark improvements overtime can be attributed to this? The fact that frontier AIs can basically see and regurgitate everything ever written on the entire internet is hard to fathom! I could be really petty here and spoil these answers for all future training runs (and make all future models look modestly better), but I just joined this site so I’ll resist lmao …

6Steven Byrnes7mo

Yup, I expected that OP would generally agree with my comment. They only posted three questions, out of at least 62 (=1/(.2258-.2097)), perhaps much more than 62. For all I know, they removed those three from the pool when they shared them. That’s what I would do—probably some human will publicly post the answers soon enough. I dunno. But even if they didn’t remove those three questions from the pool, it’s a small fraction of the total. You point out that all the questions would be in the LLM company user data, after kagi has run the benchmark once (unless kagi changes out all their questions each time, which I don’t think they do, although they do replace easier questions with harder questions periodically). Well: * If an LLM company is training on user data, they’ll get the questions without the answers, which probably wouldn’t make any appreciable difference to the LLM’s ability to answer them; * If an LLM company is sending user data to humans as part of RLHF or SFT or whatever, then yes there’s a chance for ground truth answers to sneak in that way—but that’s extremely unlikely to happen, because companies can only afford to send an extraordinarily small fraction of user data to actual humans.

1M. Y. Zuo7mo

Yeah those numbers look fairly plausible based on my own experiences… there may be a flattening of the curve, but it’s still noticeably going up.

[-]Dave Orr7mo535

I work at GDM so obviously take that into account here, but in my internal conversations about external benchmarks we take cheating very seriously -- we don't want eval data to leak into training data, and have multiple lines of defense to keep that from happening. It's not as trivial as you might think to avoid, since papers and blog posts and analyses can sometimes have specific examples from benchmarks in them, unmarked -- and while we do look for this kind of thing, there's no guarantee that we will be perfect at finding them. So it's completely possible that some benchmarks are contaminated now. But I can say with assurance that for GDM it's not intentional and we work to avoid it.

We do hill climb on notable benchmarks and I think there's likely a certain amount of overfitting going on, especially with LMSys these days, and not just from us.

I think the main thing that's happening is that benchmarks used to be a reasonable predictor of usefulness, and mostly are not now, presumably because of Goodhart reasons. The agent benchmarks are pretty different in kind and I expect are still useful as a measure of utility, and probably will be until they start to get more saturated, at which point we'll all need to switch to something else.

[-]Neel Nanda7mo183

I agree that I'd be shocked if GDM was training on eval sets. But I do think hill climbing on benchmarks is also very bad for those benchmarks being an accurate metric of progress and I don't trust any AI lab not to hill climb on particularly flashy metrics

9Garrett Baker7mo

What do you mean by "we"? Do you work on the pretraining team, talk directly with the pretraining team, are just aware of the methods the pretraining team uses, or some other thing?

[-]Dave Orr7mo392

I don't work directly on pretraining, but when there were allegations of eval set contamination due to detection of a canary string last year, I looked into it specifically. I read the docs on prevention, talked with the lead engineer, and discussed with other execs.

So I have pretty detailed knowledge here. Of course GDM is a big complicated place and I certainly don't know everything, but I'm confident that we are trying hard to prevent contamination.

[-]Daniel Kokotajlo7mo5120

Personally, when I want to get a sense of capability improvements in the future, I'm going to be looking almost exclusively at benchmarks like Claude Plays Pokemon.

I was going to say exactly that lol. Claude has improved substantially on Claude Plays Pokemon:

A chart showing the performance of the various Claude Sonnet models at playing Pokémon. The number of actions taken by the AI is on the x-axis; the milestone reached in the game is on the y-axis. Claude 3.7 Sonnet is by far the most successful at achieving the game's milestones.

[-]MrCheeze6mo1612

But you have to be careful here, since the results heavily depend on details of the harness, as well as on how thoroughly they have memorized walkthroughs of the game.

4Yoel Cabo7mo

I think the "number of actions" axis is key here. This post explains it well: https://www.lesswrong.com/posts/HyD3khBjnBhvsp8Gb/so-how-well-is-claude-playing-pokemon. I’ve been watching Claude Plays Pokémon and chatting with the Twitch folks. The post matches my experience. There’s plenty of room for improvement in prompt design and tooling, but Claude is still far behind the performance of my 7-year-old (unfair comparison, I know). So I agree with OP, this is an excellent benchmark to watch: * It’s not saturated yet. * It tests core capabilities for agentic behavior. If an AI can’t beat Pokémon, it can’t replace a programmer. * It gives a clear qualitative feel for competence—just watch five minutes. * It’s non-specialized and anyone can evaluate it and have a shared understanding (unlike Cursor, which requires coding experience). And once LLMs beat Pokemon Red, I'll personally will want to see them beat other games as well to make sure the agentic capabilities are generalizing.

1diogenes5mo

If relying specifically on Pokemon isn't there the risk of models (either incidentally or intentionally) being overtrained on pokemon-related data and seeing a boost of performance that way? Branching out to other games sooner rather than later seems sensible.

[-]lsusr7mo3127

When you're a person interacting with a chat model directly, sycophancy and sophistry are a minor nuisance, or maybe even adaptive. When you're a team trying to compose these models into larger systems (something necessary because of the aforementioned memory issue), wanting-to-look-good cascades into breaking problems.

If you replace "models" with "people", this is true of human organizations too.

[-]ryan_greenblatt7mo3015

Is this an accurate summary:

3.5 substantially improved performance for your use case and 3.6 slightly improved performance.
The o-series models didn't improve performance on your task. (And presumably 3.7 didn't improve perf.)

So, by "recent model progress feels mostly like bullshit" I think you basically just mean "reasoning models didn't improve performance on my application and Claude 3.5/3.6 sonnet is still best". Is this right?

I don't find this state of affairs that surprising:

Without specialized scaffolding o1 is quite a bad agent and it seems plausible your use case is mostly blocked on this. Even with specialized scaffolding, it's pretty marginal. (This shows up in the benchmarks AFAICT, e.g., see METR's results.)
o3-mini is generally a worse agent than o1 (aside from being cheaper). o3 might be a decent amount better than o1, but it isn't released.
Generally Anthropic models are better for real world coding and agentic tasks relative to other models and this mostly shows up in the benchmarks. (Anthropic models tend to slightly overperform their benchmarks relative to other models I think, but they also perform quite well on coding and agentic SWE benchmarks.)
I would have guessed you'd see performance gains with 3.7 after coaxing it a bit. (My low confidence understanding is that this model is actually better, but it is also more misaligned and reward hacky in ways that make it less useful.)

[-]sanxiyn7mo140

Our experience so far is while reasoning models don't improve performance directly (3.7 is better than 3.6, but 3.7 extended thinking is NOT better than 3.7), they do so indirectly because thinking trace helps us debug prompts and tool output when models misunderstand them. This was not the result we expected but it is the case.

1NULevel7mo

Completely agree with this. While there are some novel applications possible with reasoning models, the main value has been the ability to trace specific chains of thought and redefine/reprompt accordingly. Makes the system (slightly) less of a black box

[-]lc7mo112

Just edited the post because I think the way it was phrased kind of exaggerated the difficulties we've been having applying the newer models. 3.7 was better, as I mentioned to Daniel, just underwhelming and not as big a leap as either 3.6 or certainly 3.5.

5Mateusz Bagiński7mo

How long do you[1] expect it to take to engineer scaffolding that will make reasoning models useful for the kind of stuff described in the OP? 1. ^ You=Ryan firstmost but anybody reading this secondmost.

[-]sanxiyn7mo233

I happen to work on the exact sample problem (application security pentesting) and I confirm I observe the same. Sonnet 3.5/3.6/3.7 were big releases, others didn't help, etc. As for OpenAI o-series models, we are debating whether it is model capability problem or model elicitation problem, because from interactive usage it seems clear it needs different prompting and we haven't yet seriously optimized prompting for o-series. Evaluation is scarce, but we built something along the line of CWE-Bench-Java discussed in this paper, this was a major effort and we are reasonably sure we can evaluate. As for grounding, fighting false positives, and avoiding models to report "potential" problems to sound good, we found grounding on code coverage to be effective. Run JaCoCo, tell models PoC || GTFO, where PoC is structured as vulnerability description with source code file and line and triggering input. Write the oracle verifier of this PoC: at the very least you can confirm execution reaches the line in a way models can't ever fake.

8ryan_greenblatt7mo

METR has found that substantially different scaffolding is most effective for o-series models. I get the sense that they weren't optimized for being effective multi-turn agents. At least, the o1 series wasn't optimized for this, I think o3 may have been.

[-]Vladimir_Nesov7mo*202

With Blackwell^[1] still getting manufactured and installed, newer large models and especially their long reasoning variants remain unavailable or prohibitively expensive or too slow (GPT-4.5 is out, but not its thinking variant). In a few months Blackwell will be everywhere, and between now and then widely available frontier capabilities will significantly improve. Next year, there will be even larger models trained on Blackwell.

This kind of improvement can't be currently created with post-training without needing long reasoning traces or larger base models, but post-training is still good at improving things under the lamppost, hence the illusionary nature of current improvement when you care about things further in the dark.

Blackwell is an unusually impactful chip generation, because it fixes what turned out to be a major issue with Ampere and Hopper when it comes to inference of large language models on long context, by increasing scale-up world size from 8 Hopper chips to 72 Blackwell chips. Not having enough memory or compute on each higher bandwidth scale-up network was a bottleneck that made inference unnecessarily slow and expensive. Hopper was still designed before Cha

... (read more)

[-]AnthonyC7mo195

I can't comment on software engineering, not my field. I work at a market research/tech scouting/consulting firm. What I can say is that over the past ~6 months we've gone from "I put together this 1 hour training for everyone to get some more value out of these free LLM tools," to "This can automate ~half of everything we do for $50/person/month." I wouldn't be surprised if a few small improvements in agents over the next 3-6 months push that 50% up to 80%, then maybe 90% by mid next year. That's not AGI, but it does get you to a place where you need people to have significantly more complex and subtle skills, that currently take a couple of years to build, before their work is adding significant value.

1MalcolmMcLeod7mo

Could you explain what types of tasks lie within this "50%"? And when you talk about "automating 50%," does this mean something more like "we all get twice as productive because the tasks we accomplish are faster," or does it mean "the models can do the relevant tasks end-to-end in a human-replacement way, and we simply no longer need attend to these tasks"? E.g., Cursor cannot yet replace a coder, but it can enhance her productivity. However, a chatbot can entirely replace a frontline customer service representation.

[-]AnthonyC7mo132

Some of both, more of the former, but I think that is largely an artifact of how we have historically defined tasks. None of us have ever managed an infinite army of untrained interns before, which is how I think of LLM use (over the past two years they've roughly gone from high school student interns to grad student interns), so we've never refactored tasks into appropriate chunks for that context.

I've been leading my company's team working on figuring out how to best integrate LLMs into our workflow, and frankly, they're changing so fast with new releases that it's not worth attempting end-to-end replacement in most tasks right now. At least, not for a small company. 80/20 rule applies on steroids, we're going to have new and better tools and strategies next week/month/quarter anyway. Like, I literally had a training session planned for this morning, woke up to see the Gemini 2.5 announcement, and had to work it in as "Expect additional guidance soon, please provide feedback if you try it out." We do have a longer term plan for end-to-end automation of specific tasks, as well, where it is worthwhile. I half-joke that Sam Altman tweets a new feature and we have to adapt our ... (read more)

-6MazevSchlong7mo

[-]leogao7mo177

Actual full blown fraud in frontier models at the big labs (oai/anthro/gdm) seems very unlikely. Accidental contamination is a lot more plausible but people are incentivized to find metrics that avoid this. Evals not measuring real world usefulness is the obvious culprit imo and it's one big reason my timelines have been somewhat longer despite rapid progress on evals.

1Kabir Kumar7mo

Why does it seem very unlikely?

3Jazi Zilber7mo

those conspiracies don't work most of the time "you can only keep a secret between two people, provided one of them is dead". the personal risk for anyone involved + the human psychological tendency to chat and to have a hard time holding on to immortal secrets mean it's usually irrational for both organisations to do intentional cheating.

[-]lemonhope7mo151

Almost every time I use Claude Code (3.7 I think) it ends up cheating at the goal. Optimizing performance by replacing the API function with a constant, deleting test cases, ignoring runtime errors with silent try catch, etc. It never mentions these actions in the summary. In this narrow sense, 3.7 is the most misaligned model I have ever used.

1lookoutbelow7mo

This was an issue referenced in the model card ("special casing"). Not as rare as they made it out to be it seems.

[-]p.b.7mo1510

I was pretty impressed with o1-preview's ability to do mathematical derivations. That was definitely a step change, the reasoning models can do things earlier models just couldn't do. I don't think the AI labs are cheating for any reasonable definition of cheating.

[-]Zach Stein-Perlman7mo159

Data point against "Are the AI labs just cheating?": the METR time horizon thing

[-]Noosphere897mo143

lc has argued that the measured tasks are unintentionally biased towards ones where long-term memory/context length doesn't matter:

https://www.lesswrong.com/posts/hhbibJGt2aQqKJLb7/shortform-1#vFq87Ge27gashgwy9

[-]nmca7mo142

(disclaimer: I work on evaluation at oai, run the o3 evaluations etc)

I think you are saying “bullshit” when you mean “narrow”. The evidence for large capability improvements in math and tightly scoped coding since 4o is overwhelming, see eg AIME 2025, Gemini USAMO, copy paste a recent codeforces problem etc.

The public evidence for broad/fuzzy task improvement is weaker — o1 mmlu boosts and various vibes evals (Tao) do show it though.

It is a very important question how much these large narrow improvements generalize. I try and approach the question humbly.

Hopefully new models improve on your benchmark — do share if so!

[-]Noosphere897mo103

I'll say that one of my key cruxes on whether AI progress actually becomes non-bullshit/actually leading into an explosion is whether in-context learning/meta-learning can act as an effective enough substitute for human neuron weight neuroplasticity with realistic compute budgets in 2030, because the key reason why AIs have a lot of weird deficits/are much worse than humans at simple tasks is because after an AI is trained, there is no neuroplasticity in the weights anymore, and thus it can learn nothing more after it's training date unless it uses in-context learning/meta-learning:

https://www.lesswrong.com/posts/deesrjitvXM4xYGZd/?commentId=hSkQG2N8rkKXosLEF#hSkQG2N8rkKXosLEF

[-]Daniel Kokotajlo7mo100

Unexpectedly by me, aside from a minor bump with 3.6 in October, literally none of the new models we've tried have made a significant difference on either our internal benchmarks or in our developers' ability to find new bugs. This includes the new test-time OpenAI models.

So what's the best model for your use case? Still 3.6 Sonnet?

[-]lc7mo100

We use different models for different tasks for cost reasons. The primary workhorse model today is 3.7 sonnet, whose improvement over 3.6 sonnet was smaller than 3.6's improvement over 3.5 sonnet. When taking the job of this workhorse model, o3-mini and the rest of the recent o-series models were strictly worse than 3.6.

[-]Daniel Kokotajlo7mo*122

Thanks. OK, so the models are still getting better, it's just that the rate of improvement has slowed and seems smaller than the rate of improvement on benchmarks? If you plot a line, does it plateau or does it get to professional human level (i.e. reliably doing all the things you are trying to get it to do as well as a professional human would)?

What about 4.5? Is it as good as 3.7 Sonnet but you don't use it for cost reasons? Or is it actually worse?

[-]lc7mo*120

If you plot a line, does it plateau or does it get to professional human level (i.e. reliably doing all the things you are trying to get it to do as well as a professional human would)?

It plateaus before professional human level, both in a macro sense (comparing what ZeroPath can do vs. human pentesters) and in a micro sense (comparing the individual tasks ZeroPath does when it's analyzing code). At least, the errors the models make are not ones I would expect a professional to make; I haven't actually hired a bunch of pentesters and asked them to do the same tasks we expect of the language models and made the diff. One thing our tool has over people is breadth, but that's because we can parallelize inspection of different pieces and not because the models are doing tasks better than humans.

What about 4.5? Is it as good as 3.7 Sonnet but you don't use it for cost reasons? Or is it actually worse?

We have not yet tried 4.5 as it's so expensive that we would not be able to deploy it, even for limited sections.

7gwern7mo

Still seems like potentially valuable information to know: how much does small-model smell cost you? What happens if you ablate reasoning? If it is factual knowledge and GPT-4.5 performs much better, then that tells you things like 'maybe finetuning is more useful than we think', etc. If you are already set up to benchmark all these OA models, then a datapoint from GPT-4.5 should be quite easy and just a matter of a small amount of chump change in comparison to the insight, like a few hundred bucks.

3dimitry126mo

Please help me understand how do you suggest to "ablate reasoning" and what's the connection with "small-model smell"?

[-]SoerenMind7mo91

This is interesting. Though companies are probably investing a lot less into cyber capabilities than they invest into other domains like coding. Cyber is just less commercially interesting plus it can be misused and worry the government. And the domain specific investment should matter since most of the last year's progress has been from post training, which is often domain specific.

(I haven't read the whole post)

[-]green_leaf7mo95

According to Terrence Tao, GPT-4 was incompetent at graduate-level math (obviously), but o1-preview was mediocre-but-not-entirely-incompetent. That would be a strange thing to report if there were no difference.

(Anecdotally, o3-mini is visibly (massively) brighter than GPT-4.)

5Mo Putera7mo

Full quote on Mathstodon for others' interest: This o1 vs MathOverflow experts comparison was also interesting:

3green_leaf7mo

(I believe the version he tested was what later became o1-preview.)

[-]Raemon7mo82

My lived experience is that AI-assisted-coding hasn't actually improved my workflow much since o1-preview, although other people I know have reported differently.

[-]cozyfae7mo8-5

These machines will soon become the beating hearts of the society in which we live.

An alternative future: due to the high rates of failure, we don't end up deploying these machines widely in production setting, just like how autonomous driving had breakthroughs long ago but didn't end up getting widely deployed today.

[-]fujisawa_sora7mo62

I primarily use LLMs when working with mathematics, which is one of the areas where the recent RL paradigm was a clear improvement—reasoning models are finally useful. However, I agree with you that benchmark-chasing isn’t optimal, in that it still can’t admit when it’s wrong. It doesn’t have to give up, but when it couldn’t do something, I’d rather it list out what it tried as ideas, rather than pretending it can solve everything, because then I actually have to read through everything.

Of course, this can be solved with some amateur mathematicians reading through it and using RL to penalize BS. So, I think this is a case where benchmark performance was prioritized over actual usefulness.

[-]Aaron_Scher7mo60

I appreciate this post, I think it's a useful contribution to the discussion. I'm not sure how much I should be updating on it. Points of clarification:

Within the first three months of our company's existence, Claude 3.5 sonnet was released. Just by switching the portions of our service that ran on gpt-4o, our nascent internal benchmark results immediately started to get saturated.

Have you upgraded these benchmarks? Is it possible that the diminishing returns you're seen in the Sonnet 3.5-3.7 series are just normal benchmark saturation? What % scores are t

... (read more)

[-]James Grugett7mo40

Much of the gains on SWE Bench are actually about having the model find better context via tool calls. Sonnet 3.7 is trained to seek out the information it needs.

But if you compare the models with fixed context, they are only somewhat smarter than before.

(The other dimension is the thinking models which seem to be only a modest improvement on coding, but do much better at math for example.)

That being said, the new Gemini 2.5 Pro seems like another decent step up in intelligence from Sonnet 3.7. We're about to switch out the default mode of our coding agent, Codebuff, to use it (and already shipped it for codebuff --max).

[-]Noosphere897mo40

Gradient Updates has a post on this by Anson Ho and Jean-Stanislas Denain on why benchmarks haven't reflected usefulness, and a lot of the reason is that they underestimated AI progress and didn't really have an incentive to make benchmarks reflect realistic use cases:

https://epoch.ai/gradient-updates/the-real-reason-ai-benchmarks-havent-reflected-economic-impacts

[-]Qumeric7mo45

I am curious to see what would be the results of the new Gemini 2.5 pro on internal benchmarks.

[-]nmca6mo31

Is there an o3 update yet?

[-]Burny7mo30

In practice, Sonnet 3.7 and Gemini 2.5 are just often too good compared to competitors.

[-]solhando7mo30

Somewhat unrelated to the main point of your post, but; How close are you to solving the wanting-to-look-good problem?

I run a startup in a completely different industry, and we've invested significant resources in trying to get an LLM to interact with a customer, explain and make dynamic recommendations based on their preferences. This is a more high-touch business, so traditionally this was done by a human operator. The major problem we've encountered is that it's almost impossible to have an LLM to admit ignorance when it doesn't have the informati... (read more)

2sanxiyn7mo

Maybe we are talking about different problems, but we found instructing models to give up (literally "give up", I just checked the source) under certain conditions to be effective.

[-]Mo Putera7mo32

Personally, when I want to get a sense of capability improvements in the future, I'm going to be looking almost exclusively at benchmarks like Claude Plays Pokemon.

Same, and I'd adjust for what Julian pointed out by not just looking at benchmarks but viewing the actual stream.

[-]Trevor Hill-Hand7mo30

I happened to be discussing this in the Discord today. I have a little hobby project that was suddenly making fast progress with 3.7 for the first few days, which was very exciting, but then a few days ago it felt like something changed again and suddenly even the old models are stuck in this weird pattern of like... failing to address the bug, and instead hyper-fixating on adding a bunch of surrounding extra code to handle special cases, or sometimes even simply rewriting the old code and claiming it fixes the bug, and the project is suddenly at a complet... (read more)

5Mis-Understandings7mo

Beware of argument doom spirals. When talking to a person, arguing about the existene of a bug tends not to lead to succesful resolution of the bug. Somebody talked about this on a post a few days ago, about attractor basins, oppositionality, and when AI agents are convinced they are people (rightly or wrongly). You are often better off clearing the context then repeatedly arguing in the same context window.

1Trevor Hill-Hand7mo

This is a good point! Typically I start from a clean commit in a fresh chat, to avoid this problem from happening too easily, proceeding through the project in the smallest steps I can get Claude to make. That's what makes the situation feel so strange; it feels just like this problem, but it happens instantly, in Claude's first responses.

1Archimedes7mo

It's also worth trying a different model. I was going back and forth with an OpenAI model (I don't remember which one) and couldn't get it to do what I needed at all, even with multiple fresh threads. Then I tried Claude and it just worked.

1Mis-Understandings7mo

Consider the solutions from Going Nova

[-]Mo Putera3mo20

Jack Clark's most recent issue of Import AI mentioned AI security startup XBOW's "fully autonomous AI-driven penetration tester" (also called XBOW), which topped HackerOne:

AI pentesting systems out-compete humans:
…Automated pentesting…
AI security startup XBOW recently obtained the top rank on HackerOne with an autonomous penetration tester - a world first. "XBOW is a fully autonomous AI-driven penetration tester," the company writes. "It requires no human input, operates much like a human pentester, but can scale rapidly, completing comprehensive pen

... (read more)

5lc3mo

The XBOW PR is a quintessential example of what I'm talking about. Suffice it to mention that: * XBOW topped the HackerOne 'leaderboard' that measures upvotes, not money earned on the platform. * XBOW almost entirely submitted bugs for the free, non-paid bug bounties! A primary reason they were able to find these bugs was because they weren't actually competing with anyone!

[-]ramennaut7mo20

This really resonates with me. I don't work in AppSec, but I've seen how benchmark gains often fail to show up when you're doing something non-trivial with the model. It seem that current benchmarks have low ecological validity. Although I wouldn't quickly put the blame on labs possibly cheating. They may or they may not, but it also might just be that we're bad at designing evaluations that tracks real-world usefulness.

When you think about it, even university exams don't really predict job performance either. These are benchmarks we've had centuries to re... (read more)

[-]Ram Potham7mo20

I have experienced similar problems to you when building an AI tool - better models did not necessarily lead to better performance despite external benchmarks. I believe there are 2 main reasons why this is, alluded to in your post:

Selection Bias - when a foundation model company releases their newest model, they show performance on benchmarks most favorable to it
Alignment - You mentioned how AI is not truly understanding the instructions you meant. While this can be mitigated by creating better prompts, it does not fully solve the issue

[-]Viktor Rehnberg7mo20

Another hypothesis: Your description of the task is

the hard parts of application pentesting for LLMs, which are 1. Navigating a real repository of code too large to put in context, 2. Inferring a target application's security model, and 3. Understanding its implementation deeply enough to learn where that security model is broken.

From METR's recent investigation on long tasks you would expect current models not to perform well on this.

METRs graph

I doubt a human professional could do the tasks you describe in something close to an hour, so perhaps its just current... (read more)

[-]Chris_Leong7mo20

However, if you merely explain these constraints to the chat models, they'll follow your instructions sporadically.

I wonder if a custom fine-tuned model could get around this. Did you try few shot prompting (ie. examples, not just a description)?

[-]Matthew Gertner2mo10

Depending on your perspective, this is good news! Both for me personally, as someone trying to make money leveraging LLM capabilities while they're too stupid to solve the whole problem, and for people worried that a quick transition to an AI-controlled economy would present moral hazards.

This really resonated with me. It's ironic that if you're developing an AI-powered app you need to worry about both a) AI not being good enough to allow you to achieve your product goals and b) AI getting so good that all your work ends up being redundant.

I wrote up some ... (read more)

[-]Florian_Dietz6mo10

I think what's going on is that large language models are trained to "sound smart" in a live conversation with users, and so they prefer to highlight possible problems instead of confirming that the code looks fine, just like human beings do when they want to sound smart.

I have encountered this before and what worked for me was telling the model "point out all the mistakes, but then review them and decide which of them if any are worth highlighting."

That way the model gets to sound doubly smart: I found a mistake and also I understand the circumstances enough not to raise it to high priority.

[-]Vasilios Mavroudis7mo10

Interesting article! Here is a benchmark that does real world exploitation (indeed CTFs don't cut it anymore): https://github.com/alan-turing-institute/HonestCyberEval

[-]vire7mo1-2

That's an interesting point, why didn't we see major improvements in LLMs for instance when coding... Despite them achieving reasoning on the level that allows them become a GM on codeforces.

I'd say this is a fundamental limitation of reinforcement learning. Using purely reinforcement learning is stupid. Look at humans, we do much more than that. We make observations about our failures and update, we develop our own heuristics for what it means to be good at something and then try to figure out how to make ourselves better by reasoning about it watching ot... (read more)

[-]David James7mo*10

HLE and benchmarks like it are cool, but they fail to test the major deficits of language models, like how they can only remember things by writing them down onto a scratchpad like the memento guy.

A scratch pad for thinking, in my view, is hardly a deficit at all! Quite the opposite. In the case of people, some level of conscious reflection is important and probably necessary for higher-level thought. To clarify, I am not saying consciousness itself is in play here. I’m saying some feedback loop is probably necessary — where the artifacts of thinking, r... (read more)

[-]groblegark7mo10

Where does prompt optimization fit in to y’all’s workflows? I’m surprised not to see mention of it here. E.g OPRO https://arxiv.org/pdf/2309.03409 ?

[-]momom27mo10

I think what's going on is that large language models are trained to "sound smart" in a live conversation with users, and so they prefer to highlight possible problems instead of confirming that the code looks fine, just like human beings do when they want to sound smart.

This matches my experience, but I'd be interested in seeing proper evals of this specific point!

[-]Mis-Understandings7mo10

Your first two key challenges

Seems very similar to the agent problem of active memory switching, carrying important information across context switches

Also note that it could just be instead of bullshit, finetuning is unreasonably effective, and so when you train models on an evaluation they actually get better on the things evaluated, which dominates over scaling.

So things with public benchmarks might just actually be easier to make models that are genuinely good at it. (For instance searching for data that helps 1B models learn it, then adding it to full... (read more)

[-]Anthony Perez-sanz7mo00

I feel like your jumping to cheating way to quickly. I think everyone would agree that there is overfitting to benchmarks and to benchmark like questions. Also, this is a very hard problem. The average person doesn't have a shot at contributing to security research. Even the typical appsec engineer with years of experience would fail at the task of navigating a new codebase, creating a threat model and finding important security issues. This takes an expert in the field at least a few days of work. This is much longer than the time periods that AIs can be ... (read more)

[-]NickH7mo-30

From a practical perspective, maybe you are looking at the problem the wrong way around. A lot of prompt engineering seems to be about asking LLMs to play a role. I would try to tell the LLM that it was a hacker and to design an exploit to attack the given system (this is the sort of mental perspective I used to use to find bugs when I was a software engineer). Another common technique is "generate then prune" : Have a separate model/prompt remove all the results of the first one that are only "possibilities". It seems, from my reading, that this sort of two stage approach can work because it bypasses LLMs typical attempts to "be helpful" by inventing stuff or spouting banal filler rather than just admitting ignorance.

1Alex Vermillion7mo

I think we should suspect that they've done some basic background research (this individual, not in general), and take the rest of the information about people failing to see improvements as data that also points this direction.

^{^}

Though even this is not as strong as it seems on first glance. If you click through, you can see that most of the models listed in the Top 10 for everything except the tool use benchmarks were evaluated after the benchmark was released. And both of the Agentic Tool Use benchmarks (which do not suffer this problem) show curiously small improvements in the last 8 months.

^{^}

Not that they told you they scored that, in which case it might be the most impressive thing about them, but that they did.

LESSWRONG
LW

LESSWRONG
LW

356

Recent AI model progress feels mostly like bullshit

356

356

Are the AI labs just cheating?

Are the benchmarks not tracking usefulness?

Are the models smart, but bottlenecked on alignment?