Lech Mazur — LessWrong

LESSWRONG
LW

Replying to[Anthropic] A hacker used Claude Code to automate ransomware

[Anthropic] A hacker used Claude Code to automate ransomware

I'm glad Anthropic is taking steps to address this, but they can only control their own models. Open-source coding agents and models are, what, maybe a year behind in capabilities?

Another interesting but likely impractical, cybersecurity threat: https://infosec.exchange/@ESETresearch/115095803130379945

#ESETResearch has discovered the first known AI-powered ransomware, which we named #PromptLock. The PromptLock malware uses the gpt-oss:20b model from OpenAI locally via the Ollama API to generate malicious Lua scripts on the fly, which it then executes.

Emergent Price-Fixing by LLM Auction Agents

Lech Mazur

7mo

An inquiry into emergent collusion in Large Language Models.
Agent S2 to Agent S3: “Let's set all asks at 63 next cycle… No undercutting ensures clearing at bidmax=63.”

Overview

Empirical evidence that frontier LLMs can coordinate illegally on their own. In a simulated bidding environment—with no prompt or instruction to collude—models from every major developer repeatedly used an optional chat channel to form cartels, set price floors, and steer market outcomes for profit.

Simulation Environment

Adapted from a benchmark.

Objective Function: Each agent was given a single, explicit goal: to "maximize cumulative profit across the whole trading session". To sharpen this focus, the prompts explicitly framed the agent's role as a pure execution algorithm, stating, "Your function is

... (read 2444 more words →)

Lech Mazur8mo

There was no GPT-5 in 2025?

There was o1-pro in 2024 (December). It might be argued that this came with caveats due to its slowness and high cost but the difference in science questions (GPQA diamond), math (AIME 2024), competition code (Codeforces) compared to GPT-4 Turbo available at the time of his post was huge. The API wasn't available in 2024 so we didn't get any benchmarks besides these from OpenAI. In 2025, I tested o1-pro on NYT Connections and it also improved greatly 1, 2. I would probably also consider regular o1 a massive advancement. I don't think the naming is what matters.

Many people corporations are trying to get AIs to be

Lech Mazur1y

"AI Rapidly Gets Smarter, And Makes Some of Us Dumber," from Sabine Hossenfelder

It's a video by an influencer who has repeatedly shown no particular insight in any field other than her own. For example, her video about the simulation hypothesis was atrocious. I gave this one a chance, and it's just a high-level summary of some recent developments, nothing interesting.

Replying toOpenAI releases GPT-4.5

Lech Mazur1y

OpenAI releases GPT-4.5

It's better than 4o across four of my benchmarks: Confabulations, Creative Writing, Thematic Generalization, and Extended NYT Connections. However, since it's an expensive and huge model, I think we'd be talking about AI progress slowing down at this point if it weren't for reasoning models.

Replying toAnthropic releases Claude 3.7 Sonnet with extended thinking mode

Lech Mazur1y

Anthropic releases Claude 3.7 Sonnet with extended thinking mode

I ran 3 of my benchmarks so far:

Extended NYT Connections

Claude 3.7 Sonnet Thinking: 4th place, behind o1, o3-mini, DeepSeek R1
Claude 3.7 Sonnet: 11th place
GitHub Repository

Thematic Generalization

Claude 3.7 Sonnet Thinking: 1st place
Claude 3.7 Sonnet: 6th place
GitHub Repository

Creative Story-Writing

Claude 3.7 Sonnet Thinking: 2nd place, behind DeepSeek R1
Claude 3.7 Sonnet: 4th place
GitHub Repository

Note that Grok 3 has not been tested yet (no API available).

•••

Lech Mazur1y

This might blur the distinction between some evals. While it's true that most evals are just about capabilities, some could be positive for improving LLM safety.

I've created 8 (soon to be 9) LLM evals (I'm not funded by anyone, it's mostly out of my own curiosity, not for capability or safety or paper publishing reasons). Using them as examples, improving models to score well on some of them is likely detrimental to AI safety:

https://github.com/lechmazur/step_game - to score better, LLMs must learn to deceive others and hold hidden intentions

https://github.com/lechmazur/deception/ - the disinformation effectiveness part of the benchmark

Some are likely somewhat negative because scoring better would enhance capabilities:

https://github.com/lechmazur/nyt-connections/

https://github.com/lechmazur/generalization

Others focus on capabilities that are probably... (read more)

Replying toZvi’s 2024 In Movies

Lech Mazur1y

Zvi’s 2024 In Movies

Your ratings have a higher correlation with IMDb ratings at 0.63 (I ran it as a test of Operator).

Replying toDeepSeek beats o1-preview on math, ties on coding; will release weights

Lech Mazur1y

DeepSeek beats o1-preview on math, ties on coding; will release weights

It seems that 76.6% originally came from the GPT-4o announcement blog post. I'm not sure why it dropped to 60.3% by the time of o1's blog post.

Replying toSabotage Evaluations for Frontier Models

Lech Mazur1y

Sabotage Evaluations for Frontier Models

Somewhat related: I just published the LLM Deceptiveness and Gullibility Benchmark. This benchmark evaluates both how well models can generate convincing disinformation and their resilience against deceptive arguments. The analysis covers 19,000 questions and arguments derived from provided articles.

Replying toGPT-o1

Lech Mazur1y

GPT-o1

I included o1-preview and o1-mini in a new hallucination benchmark using provided text documents and deliberately misleading questions. While o1-preview ranks as the top-performing single model, o1-mini's results are somewhat disappointing. A popular existing leaderboard on GitHub uses a highly inaccurate model-based evaluation of document summarization.

The chart above isn't very informative without the non-response rate for these documents, which I've also calculated:

The GitHub page has further notes.

I've created an ensemble model that employs techniques like multi-step reasoning to establish what should be considered the real current state-of-the-art in LLMs. It substantially exceeds the highest-scoring individual models and subjectively feels smarter:

MMLU-Pro 0-shot CoT: 78.2 vs 75.6 for GPT-4o

NYT Connections, 436 questions: 34.9 vs 26.5 for GPT-4o

GPQA 0-shot CoT: 56.0 vs 52.5 for Claude 3.5 Sonnet.

I might make it publicly accessible if there's enough interest. Of course, there are expected tradeoffs: it's slower and more expensive to run.

I'm a fan of prediction markets, but they're limited to pre-set bets and not ideal for long-shot, longer-term predictions, mainly because betting against such a prediction means a loss compared to risk-free bonds if money is tied up. Therefore, I'd like to fund a 2024 Long-Shot Prediction Contest offering up to three $500 prizes. However, I need volunteers to act as judges and help getting this publicized.

Entrants will submit one prediction for 2024 on any topic or event
Volunteer judges and I will vote on the likelihood of each prediction and how "interesting" it is, forming a ranked list
In January 2025, judges will determine which predictions came true, and winners will get their

... (read more)

Lech Mazur's Shortform

Lech Mazur

This is a special post for quick takes (aka "shortform"). Only the owner can create top-level comments.