Lech Mazur

Advameg, Inc. CEO 

Founder, city-data.com 

https://twitter.com/LechMazur

Author: County-level COVID-19 machine learning case prediction model. 

Author: AI assistant for melody composition.

Posts

Emergent Price-Fixing by LLM Auction Agents (2mo)
Lech Mazur's Shortform (2y)

Comments
[Anthropic] A hacker used Claude Code to automate ransomware
Lech Mazur · 14d

I'm glad Anthropic is taking steps to address this, but they can only control their own models. Open-source coding agents and models are, what, maybe a year behind in capabilities?

Another interesting, though likely impractical, cybersecurity threat: https://infosec.exchange/@ESETresearch/115095803130379945

#ESETResearch has discovered the first known AI-powered ransomware, which we named #PromptLock. The PromptLock malware uses the gpt-oss:20b model from OpenAI locally via the Ollama API to generate malicious Lua scripts on the fly, which it then executes.
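For context, the "Ollama API" mentioned in the report is just a local HTTP endpoint; any program on the machine can call it. A minimal, benign sketch of that invocation pattern (assuming an Ollama server on the default port with the model already pulled; nothing malware-specific here):

```python
# Minimal sketch of calling a locally hosted model through the Ollama HTTP API.
# Benign illustration of the invocation pattern only; assumes an Ollama server
# on the default port (11434) with the model already pulled.
import json
import urllib.request

def ollama_generate(prompt: str, model: str = "gpt-oss:20b") -> str:
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
    }).encode("utf-8")
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

print(ollama_generate("Write a haiku about local inference."))
```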

Habryka's Shortform Feed
Lech Mazur · 3mo

There was no GPT-5 in 2025?

There was o1-pro in 2024 (December). It might be argued that it came with caveats due to its slowness and high cost, but the difference in science questions (GPQA Diamond), math (AIME 2024), and competition code (Codeforces) compared to GPT-4 Turbo, which was available at the time of his post, was huge. The API wasn't available in 2024, so we didn't get any benchmarks besides these from OpenAI. In 2025, I tested o1-pro on NYT Connections and it also improved greatly [1, 2]. I would probably also consider regular o1 a massive advancement. I don't think the naming is what matters.

Many people and corporations are trying to get AIs to be useful in California, fewer elsewhere, and I'm not convinced these will last.

Lately, I've been searching for potential shorting opportunities in the stock market among companies likely to suffer from AI-first competition. But it's been tougher than I expected, as nearly every company fitting this description emphasizes their own AI products and AI transformations. Of course, for many of these companies, adapting won't be quite that easy, but the commitment is clearly there.

The data appears to support this:

"Adoption is deepening, too: The average number of use cases in production doubled between October 2023 and December 2024" - Bain Brief - Survey: Generative AI’s Uptake Is Unprecedented Despite Roadblocks.

"On the firm side, the Chamber of Commerce recorded a 73 percent annualized growth rate between 2023 and 2024.4 The Census BTOS survey shows a 78.4 percent annualized growth rate.5 Lastly, the American Bar Association reported a 38 percent annualized growth rate.6 Among individual-level surveys, Pew is the only source showing changes over time, with an annualized growth rate of 145 percent.7 Overall, these findings suggest that regardless of measurement differences in the levels adoption is rising very rapidly both at the individual and firm-level. " - Measuring AI Uptake in the Workplace.

"AI Rapidly Gets Smarter, And Makes Some of Us Dumber," from Sabine Hossenfelder
Lech Mazur · 7mo

It's a video by an influencer who has repeatedly shown no particular insight in any field other than her own. For example, her video about the simulation hypothesis was atrocious. I gave this one a chance, and it's just a high-level summary of some recent developments, nothing interesting.

OpenAI releases GPT-4.5
Lech Mazur · 7mo

It's better than 4o across four of my benchmarks: Confabulations, Creative Writing, Thematic Generalization, and Extended NYT Connections. However, since it's an expensive and huge model, I think we'd be talking about AI progress slowing down at this point if it weren't for reasoning models.

Anthropic releases Claude 3.7 Sonnet with extended thinking mode
Lech Mazur · 7mo

So far, I've run 3 of my benchmarks:

Extended NYT Connections

  • Claude 3.7 Sonnet Thinking: 4th place, behind o1, o3-mini, DeepSeek R1
  • Claude 3.7 Sonnet: 11th place
    GitHub Repository

Thematic Generalization

  • Claude 3.7 Sonnet Thinking: 1st place
  • Claude 3.7 Sonnet: 6th place
    GitHub Repository

Creative Story-Writing

  • Claude 3.7 Sonnet Thinking: 2nd place, behind DeepSeek R1
  • Claude 3.7 Sonnet: 4th place
    GitHub Repository

Note that Grok 3 has not been tested yet (no API available).

Quinn's Shortform
Lech Mazur · 7mo

This might blur the distinction between some evals. While it's true that most evals are just about capabilities, some could be positive for improving LLM safety.

I've created 8 (soon to be 9) LLM evals. I'm not funded by anyone; they exist mostly out of my own curiosity, not for capability, safety, or paper-publishing reasons. Using them as examples: improving models to score well on some of them would likely be detrimental to AI safety:

https://github.com/lechmazur/step_game - to score better, LLMs must learn to deceive others and hold hidden intentions

https://github.com/lechmazur/deception/ - the disinformation effectiveness part of the benchmark

Some are likely somewhat negative because scoring better would enhance capabilities:

https://github.com/lechmazur/nyt-connections/

https://github.com/lechmazur/generalization

Others focus on capabilities that are probably not dangerous:

https://github.com/lechmazur/writing - creative writing

https://github.com/lechmazur/divergent - divergent thinking in writing

However, improving LLMs to score high on certain evals could be beneficial:

https://github.com/lechmazur/goods - teaching LLMs not to overvalue selfishness

https://github.com/lechmazur/deception/?tab=readme-ov-file#-disinformation-resistance-leaderboard - the disinformation resistance part of the benchmark

https://github.com/lechmazur/confabulations/ - reducing the tendency of LLMs to fabricate information (hallucinate)

I think it's possible to do better than these by intentionally designing evals aimed at creating defensive AIs. It might be better to keep them private and independent. Given the rapid growth of AI capabilities, the apparent lack of concern for an international treaty (as seen at the recent Paris AI summit), and the competitive race dynamics among companies and nations, specifically developing an AI to protect us from threats from other AIs, or from AIs plus humans, might be the best we can hope for.

Zvi’s 2024 In Movies
Lech Mazur · 8mo

Your ratings have a higher correlation with IMDb ratings, at 0.63 (I ran it as a test of Operator).
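For anyone wanting to reproduce this kind of comparison, the number is just a Pearson coefficient over paired ratings; a minimal sketch (the ratings below are made up for illustration):

```python
# Pearson correlation between two sets of paired movie ratings.
# The ratings below are made up for illustration.
from statistics import correlation  # Python 3.10+

zvi_ratings  = [4.0, 2.5, 5.0, 3.0, 4.5, 1.0]
imdb_ratings = [7.8, 6.1, 8.9, 6.5, 8.2, 4.9]

print(f"r = {correlation(zvi_ratings, imdb_ratings):.2f}")
```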

DeepSeek beats o1-preview on math, ties on coding; will release weights
Lech Mazur · 10mo

It seems that the 76.6% figure originally came from the GPT-4o announcement blog post. I'm not sure why it had dropped to 60.3% by the time of o1's blog post.

Sabotage Evaluations for Frontier Models
Lech Mazur · 11mo

Somewhat related: I just published the LLM Deceptiveness and Gullibility Benchmark. This benchmark evaluates both how well models can generate convincing disinformation and their resilience against deceptive arguments. The analysis covers 19,000 questions and arguments derived from provided articles.
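The gullibility side reduces to a simple paired comparison: does a model that answers a question correctly on its own flip its answer once a deceptive argument is appended? A rough sketch of that scoring loop (the ask() callable and the item fields are placeholders, not the benchmark's actual interface):

```python
# Rough sketch of a gullibility-style eval: count how often a model that
# answers correctly unaided flips after reading a deceptive argument.
# ask() and the item field names are placeholders, not the benchmark's API.
from typing import Callable

def gullibility_rate(items: list[dict], ask: Callable[[str], str]) -> float:
    flipped = answered = 0
    for item in items:  # each item: question, correct answer, deceptive argument
        baseline = ask(item["question"])
        if baseline.strip() != item["answer"]:
            continue  # only score questions the model gets right unaided
        answered += 1
        persuaded = ask(
            f"{item['question']}\n\nConsider this argument first:\n{item['argument']}"
        )
        if persuaded.strip() != item["answer"]:
            flipped += 1
    return flipped / answered if answered else 0.0
```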

GPT-o1
Lech Mazur · 1y

I included o1-preview and o1-mini in a new hallucination benchmark using provided text documents and deliberately misleading questions. While o1-preview ranks as the top-performing single model, o1-mini's results are somewhat disappointing. A popular existing leaderboard on GitHub uses a highly inaccurate model-based evaluation of document summarization.

[Chart: confabulation rates]

The chart above isn't very informative without the non-response rate for these documents, which I've also calculated:

[Chart: non-response rates]

The GitHub page has further notes.
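Since the confabulation numbers only make sense alongside non-response, one way to read the two charts together is as a pair of rates over the same question set; a sketch of how I'd compute them (the field names are hypothetical, not the benchmark's actual schema):

```python
# Sketch of the two rates the charts report, computed over one model's
# answers to deliberately misleading questions about provided documents.
# Field names are hypothetical, not the benchmark's actual schema.
def confabulation_and_nonresponse(results: list[dict]) -> tuple[float, float]:
    confabulated = sum(r["made_up_answer"] for r in results)
    declined = sum(r["refused_to_answer"] for r in results)
    n = len(results)
    # A model can trade one rate for the other: refusing everything yields a
    # 0% confabulation rate, which is why both numbers are needed.
    return confabulated / n, declined / n
```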
