bohaska

About 30% of Humanity’s Last Exam chemistry/biology answers are likely wrong

Update (20th Sep 2025): Scale AI has revised their Humanity's Last Exam preprint in light of this evaluation, and conducted their own checks on the accuracy of HLE questions, finding an error rate of 18% instead:

> We conducted another targeted peer review on a biology, chemistry, and health subset, as proposed by [47], and found an expert disagreement rate of approximately 18%. This level of expert disagreement is in line with what is observed in other challenging, expert-grade machine learning benchmarks and also observed in other similarly designed work; for example, [6] notes that disagreement among expert physicians is frequent on complex health topics

They also note:

> To illustrate, if we were to adopt a single-reviewer methodology where a question is flagged based on just one dissenting expert, the disagreement rate on the aforementioned health-focused subset jumps from 18% to 25%, which is close to the setting described in [47].

FutureHouse is a company that builds literature research agents. They tested their agents on the bio + chem subset of HLE questions, then noticed errors in the questions. The post's first paragraph:

> Humanity’s Last Exam has become the most prominent eval representing PhD-level research. We found the questions puzzling and investigated with a team of experts in biology and chemistry to evaluate the answer-reasoning pairs in Humanity’s Last Exam. We found that 29 ± 3.7% (95% CI) of the text-only chemistry and biology questions had answers with directly conflicting evidence in peer reviewed literature. We believe this arose from the incentive used to build the benchmark. Based on human experts and our own research tools, we have created an HLE Bio/Chem Gold, a subset of AI and human validated questions.

About the initial review process for HLE questions:

> [...] Reviewers were given explicit instructions: “Questions should ask for something precise and have an objectively correct, univocal answer.” The review process was challenging, and per

Jul 29, 2025 · 211

[Anthropic] A hacker used Claude Code to automate ransomware

Anthropic post title: Detecting and countering misuse of AI: August 2025[1] Read the full report here. The lines below are from the Anthropic post and have not been edited. Accompanying images are available at the original link. We find that threat actors have adapted their operations to exploit AI’s most advanced...

Aug 27, 2025 · 86

About 30% of Humanity’s Last Exam chemistry/biology answers are likely wrong

Jul 29, 2025 · 211

Consider showering

I think rationalists should consider taking more showers. As Eliezer Yudkowsky once said, boredom makes us human. The childhoods of exceptional people often include excessive boredom as a trait that helped cultivate their genius: > A common theme in the biographies is that the area of study which would eventually...

Apr 1, 2025 · 69

Bohaska's Shortform

Apr 15, 2024 · 2

High school advice

What advice does the LessWrong community have in general for high school students who stumbled on this community? Looking for all types of responses; it's fine whether your response is focused on achieving something in the AI/effective altruism/rationality community or on a more general audience.

Sep 11, 2023 · 11

Why did Russia invade Ukraine?

Jun 17, 2022 · 0
