The time it takes an AI or a Human+AI team (a "cyborg") to complete a task is a key aspect of what we care about when we talk about capabilities. The relationship between AI capabilities and cyborg capabilities is very useful for forecasting AI timelines. We simply don’t have good...
The low-background steel problem: Modern steel is slightly radioactive. We did a lot of atomic testing in the 40s and 50s, and now our atmosphere has some amount of radioactive particles, which make their way into steel during production. This is mostly fine, but some scientific instruments require steel that...
[EDIT: LawrenceC, who works at METR, responds to this.] In this post, we’ll discuss three major problems with the METR eval and propose some solutions. Problem 1: The METR eval produces results with egregious confidence intervals, and the METR chart misleadingly hides this. Problem 2: There's a lack of sample...
Pangram is ostensibly very good. They claim their program detects all LLMs, outperforms trained humans, and has 99.98% accuracy. This means that Pangram correctly identifies AI-written text 99.98% of the time. They claim a very low false positive rate, somewhere between 0.01% and 0.16%. They claim that Pangram detects "humanized text",...
I see people repeatedly make the mistake of referencing a very low-liquidity prediction market and using it to make a nontrivial point. Usually the implication when a market is cited is that its number should be taken somewhat seriously, that it's giving us a highly informed probability. Sometimes a...
Have you guys heard about this Epstein stuff? Shit's pretty crazy. Note: I'm not going to provide a summary of the situation or talk about evidence; this piece is for people that already know these things. I'm going to avoid specifics about what Epstein and co did, and instead will...