Cole Wyeth's Shortform
CAC · 2mo · 20

Sorry, this is the most annoying kind of nitpicking on my part, but since I guess it's probably relevant here (and for your other comment responding to Stanislav down below), the center point of the year is July 2, 2025. So we're just two weeks past the absolute mid-point – that's 54.4% of the way through the year.
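
As a quick sanity check of this kind of date arithmetic, here's a Python sketch; the exact percentage of course depends on which day you evaluate:

    from datetime import date, timedelta

    def percent_through_year(d: date) -> float:
        """Fraction of d's year elapsed at the start of day d."""
        start = date(d.year, 1, 1)
        days_in_year = (date(d.year + 1, 1, 1) - start).days
        return (d - start).days / days_in_year

    # 2025 has 365 days, so its midpoint (noon of day 183) falls on July 2.
    print(date(2025, 1, 1) + timedelta(days=182))            # 2025-07-02
    print(f"{percent_through_year(date(2025, 7, 18)):.1%}")  # 54.2%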

Also, the codex-1 benchmarks released on May 16, while Claude 4's were announced on May 22 (certainly before the midpoint).

Cole Wyeth's Shortform
CAC · 2mo · 10 · -3

I don't think there's enough evidence to draw hard conclusions about this section's accuracy in either direction, but I would err on the side of thinking ai-2027's description is correct.

Footnote 10, visible in your screenshot, reads:

For example, we think coding agents will move towards functioning like Devin. We forecast that mid-2025 agents will score 85% on SWEBench-Verified.

SOTA models currently sit at:
• 83.86% (codex-1, pass@8)
• 80.2% (Sonnet 4, pass@several, unclear how many)
• 79.4% (Opus 4, pass@several)

(Is it fair to allow pass@k? This Manifold Market doesn't allow it for its own resolution, but here I think it's okay, given that the footnote above makes claims about 'coding agents', which presumably allow iteration at test time.)
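
For reference, the standard unbiased pass@k estimator comes from the HumanEval paper (Chen et al. 2021); a small sketch with made-up numbers, showing how much headroom resampling adds:

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k: probability that at least one of k samples,
        drawn without replacement from n attempts of which c passed, is correct."""
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # A model that passes 20% of individual attempts:
    print(round(pass_at_k(n=10, c=2, k=1), 3))  # 0.2
    print(round(pass_at_k(n=10, c=2, k=5), 3))  # 0.778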

Also, note the following paragraph immediately after your screenshot:

The agents are impressive in theory (and in cherry-picked examples), but in practice unreliable. AI twitter is full of stories about tasks bungled in some particularly hilarious way. The better agents are also expensive; you get what you pay for, and the best performance costs hundreds of dollars a month.[11] Still, many companies find ways to fit AI agents into their workflows.[12]

AI twitter is indeed full of both impressive cherry-picked examples and stories about bungled tasks. I also agree that the claim about "find[ing] ways to fit AI agents into their workflows" is exceedingly weak. But it's certainly happening: a quick Google for "AI agent integration" turns up this article from IBM, where agents are diffusing across multiple levels of the company.

johnswentworth's Shortform
CAC · 3mo · 30

Do group conversations count?

I would agree that the median one-on-one conversation for me is equivalent to something like a mediocre blogpost (though I think my right-tail is longer than yours: I'd say my favorite one-on-one conversations were about as fun as watching some of my favorite movies).

But, in groups, my median shifts toward an 80th-percentile YouTube video (or maybe the average curated post here on LessWrong).

It does feel like a wholly different activity, and might not be the answer you're looking for. Group conversations, for example, are in a way inherently less draining: you're not forced to either speak or actively listen for 100% of the time.

adamzerner's Shortform
CAC · 5mo · 40

My assumption is that many of these successes would tend to be widely distributed around some mean, rather than being narrowly concentrated at one point.

So if a joke needs to be 7/10 funny to get a laugh, but a comedian delivers what is actually a 6.5/10 joke, you'll still get some subset of people who find it funnier than it is, such that it gets a roughly proportionate number of laughs.

There's probably some inefficiency, but because of this effect, I think the number of laughs (or upvotes) gives quite good information about the perceived quality of the joke (or post).
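
Here's a toy simulation of that picture, with entirely made-up numbers: perceived funniness is Gaussian noise around the joke's true quality, and a listener laughs when their perception clears the threshold:

    import random

    def laugh_rate(true_quality: float, threshold: float = 7.0,
                   noise_sd: float = 1.0, n_listeners: int = 100_000) -> float:
        """Fraction of listeners whose perceived quality (true quality plus
        Gaussian noise) clears the laugh threshold. All numbers illustrative."""
        laughs = sum(random.gauss(true_quality, noise_sd) >= threshold
                     for _ in range(n_listeners))
        return laughs / n_listeners

    # The laugh rate rises smoothly with quality instead of jumping at 7/10:
    for q in (6.0, 6.5, 7.0, 7.5):
        print(q, round(laugh_rate(q), 2))  # ~0.16, ~0.31, ~0.50, ~0.69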

Mo Putera's Shortform
CAC · 6mo · 70

I asked GPT-4.5 to write a system prompt and user message for models to write Pilish poems, feeding it your comment as context.

Then I gave these prompts to o1 (via OpenAI's playground).

GPT-4.5's system prompt

You are an expert composer skilled in writing poetry under strict, unusual linguistic constraints, specifically "Pilish." Pilish is a literary constraint in which the length of consecutive words precisely matches each digit of π (pi). The first word contains 3 letters, second word 1 letter, third word 4 letters, fourth word 1 letter, fifth word 5 letters, sixth word 9 letters, and so forth, accurately reflecting the sequence of pi’s digits.

For example, the classic Pilish sentence is:
"How I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics!"

As another reference, Mike Keith famously wrote a poem titled "Near a Raven," a Pilish version of Edgar Allan Poe's "The Raven," precisely encoding hundreds of pi digits into elegant verse.

Important guidelines:

• You must strictly adhere to the Pilish constraint without exception or deviation.
• Carefully recount each word to match each digit exactly.
• Maintain coherence, clarity, and literary quality in your writing.
• Make sure the poem or text is meaningful, evocative, creative, and intriguing, not just mechanically correct.
• Verify each word length explicitly to ensure compliance with Pilish.

Your task is to craft a compelling, original poem in Pilish form. Do not explain the poem afterward; the precision and artistry of the piece itself is sufficient.

--

GPT-4.5's user message

Write an original Pilish poem about pi itself: reflecting upon the wonder, mystery, and endless fascination humans have felt over millennia towards this famous mathematical constant. Make it vivid, thoughtful, and meticulously correct in matching the Pilish constraint. Begin now.

--

o1's completion

Sky I seek a vivid resonance in orbits round the mists whirling boundless mystery mesmerize all of our reveries flow within to unveil deep pi's hum.

I asked o1 to count the letters in each word, comparing with pi:

  1. Sky – 3 letters
  2. I – 1 letter
  3. seek – 4 letters
  4. a – 1 letter
  5. vivid – 5 letters
  6. resonance – 9 letters
  7. in – 2 letters
  8. orbits – 6 letters
  9. round – 5 letters
  10. the – 3 letters
  11. mists – 5 letters
  12. whirling – 8 letters
  13. boundless – 9 letters
  14. mystery – 7 letters
  15. mesmerize – 9 letters
  16. all – 3 letters
  17. of – 2 letters
  18. our – 3 letters
  19. reveries – 8 letters
  20. flow – 4 letters
  21. within – 6 letters
  22. to – 2 letters
  23. unveil – 6 letters
  24. deep – 4 letters
  25. pi's – 3 letters (ignoring the apostrophe)
  26. hum – 3 letters (ignoring the period)

This gives us the sequence:
3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3, 2, 3, 8, 4, 6, 2, 6, 4, 3, 3

And this seems correct!
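
And for anyone who'd rather verify mechanically than trust o1's count, a quick sketch of a checker (the digit list is copied from the sequence above):

    import string

    # First 26 digits of pi, matching the sequence above.
    PI_DIGITS = [int(d) for d in "31415926535897932384626433"]

    def is_pilish(text: str) -> bool:
        """True if successive word lengths (ignoring punctuation) match pi's digits."""
        lengths = [len(w.replace("'", "").strip(string.punctuation))
                   for w in text.split()]
        return lengths == PI_DIGITS[:len(lengths)]

    poem = ("Sky I seek a vivid resonance in orbits round the mists whirling "
            "boundless mystery mesmerize all of our reveries flow within to "
            "unveil deep pi's hum.")
    print(is_pilish(poem))  # True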
