LESSWRONG

jsd

Comments, sorted by newest
Bohaska's Shortform
jsd · 1mo* · 20

[I work at Epoch AI]

Thanks for your comment; I'm happy you found the logs helpful! I wouldn't call the evaluation broken: the prompt clearly states the desired format, which the model fails to follow. We mention this in our Methodology section and FAQ ("Why do some models underperform the random baseline?"), but we're also going to add a clarifying note about it in the tooltip.

While I do think "how well do models respect the formatting instructions in the prompt" is also valuable to know, I agree that I'd want to disentangle that from "how good are models at reasoning about the question". Adding a second, more flexible scorer (likely based on an LLM judge, like the one we have for OTIS Mock AIME) is in our queue; we're just pretty strapped for engineering capacity at the moment :)

ETA: since the issue is particularly extreme in this case, I plan to hide this evaluation until we've added the new scorer.
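To illustrate the strict-vs-flexible distinction: a format-based scorer can mark a correct answer wrong purely because the model ignored the formatting instructions, while a more lenient extraction step recovers it. A minimal sketch with hypothetical scorers (not Epoch's actual implementation; the flexible check stands in for an LLM-judge-based extractor):

```python
import re

def strict_score(response: str, answer: str) -> bool:
    # Strict scorer: only accepts the exact requested format,
    # e.g. a final line of the form "Answer: <value>".
    last_line = response.strip().splitlines()[-1]
    m = re.fullmatch(r"Answer: (.+)", last_line)
    return bool(m) and m.group(1).strip() == answer

def flexible_score(response: str, answer: str) -> bool:
    # Flexible scorer: accepts the answer anywhere in the response
    # (a crude stand-in for LLM-judge extraction).
    return answer in response

response = "The result is 42, as shown above."
print(strict_score(response, "42"))    # formatting not followed -> False
print(flexible_score(response, "42"))  # answer is present -> True
```

The gap between the two scores is exactly the "respects formatting instructions" component that the comment wants to disentangle from reasoning ability.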

Ram Potham's Shortform
jsd · 3mo · 50

There's this: https://github.com/Jellyfish042/uncheatable_eval

Win/continue/lose scenarios and execute/replace/audit protocols
jsd · 8mo · Ω990

This distinction reminds me of Evading Black-box Classifiers Without Breaking Eggs, in the black-box adversarial examples setting.

Fabien's Shortform
jsd · 1y · 45

Well, that was timely.

Scaling of AI training runs will slow down after GPT-5
jsd · 1y · 96

Amazon recently bought a 960 MW nuclear-powered datacenter.

I think this doesn't contradict your claim that "The largest seems to consume 150 MW", because the 960 MW datacenter hasn't been built yet (or there is already a datacenter there, but it doesn't consume that much energy for now)?

The Best Tacit Knowledge Videos on Every Subject
jsd · 1y · 30

Related: Film Study for Research

The Best Tacit Knowledge Videos on Every Subject
jsd · 1y · 20

Domain: Mathematics

Link: vEnhance

Person: Evan Chen

Background: math PhD student and math olympiad coach

Why: Livestreams himself thinking about olympiad problems

The Best Tacit Knowledge Videos on Every Subject
jsd · 1y · 20

Domain: Mathematics

Link: Thinking about math problems in real time

Person: Tim Gowers

Background: Fields medallist

Why: Livestreams himself thinking about math problems

Scenario Forecasting Workshop: Materials and Learnings
jsd · 1y* · 10

From the Rough Notes section of Ajeya's shared scenario: 

Meta and Microsoft ordered 150K GPUs each, big H100 backlog. According to Lennart's BOTECs, 50,000 H100s would train a model the size of Gemini in around a month (assuming 50% utilization)

Just to check my understanding, here's my BOTEC of the number of FLOPs for 50k H100s during a month: 5e4 H100s * 1e15 bf16 FLOPs/second * 0.5 utilization * (3600 * 24 * 30) seconds/month = 6.48e25 FLOPs.

This is indeed close enough to Epoch's median estimate of 7.7e25 FLOPs for Gemini Ultra 1.0 (this doc cites an Epoch estimate of around 9e25 FLOPs). ETA: see clarification in Eli's reply.
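For anyone who wants to check the arithmetic, the BOTEC above can be reproduced directly (the inputs are the ones stated in the comment; the ~1e15 bf16 FLOP/s per H100 and 50% utilization are its assumptions, not measured values):

```python
# BOTEC: total training FLOPs for 50k H100s over one month.
n_gpus = 5e4                 # 50,000 H100s
flops_per_gpu = 1e15         # ~1e15 bf16 FLOP/s per H100 (approximate, assumed)
utilization = 0.5            # assumed 50% utilization
seconds = 3600 * 24 * 30     # one 30-day month

total_flops = n_gpus * flops_per_gpu * utilization * seconds
print(f"{total_flops:.2e}")  # 6.48e+25
```

This lands within a factor of ~1.2-1.4 of the 7.7e25 and 9e25 FLOP estimates quoted above, which is as close as a BOTEC like this can be expected to get.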

I'm curious if we have info about the floating point format used for these training runs: how confident are we that labs are using bf16 rather than fp8?

Some heuristics I use for deciding how much I trust scientific results
jsd · 1y · 10

Thanks! I think this is a useful post; I also use these heuristics.

I recommend Andrew Gelman’s blog as a source of other heuristics. For example, the Piranha problem and some of the entries in his handy statistical lexicon.

Posts

19 · A Theory of Unsupervised Translation Motivated by Understanding Animal Communication · 2y · 0
291 · Notes on Teaching in Prison · 2y · 13
2 · jsd's Shortform · 3y · 8
11 · "Acquisition of Chess Knowledge in AlphaZero": probing AZ over time · 4y · 9