Feedback welcome: www.admonymous.co/mo-putera
Long-time lurker (c. 2013), recent poster. I also write on the EA Forum.
For my own reference: some "benchmarks" (very broadly construed) I pay attention to.
Another whimsical "benchmark": Terry Tao wrote on Mathstodon that
There are now sufficiently many different examples of Erdos problems that have been resolved with various amounts of AI assistance and formal verification (see https://github.com/teorth/erdosproblems/wiki/AI-contributions-to-Erd%C5%91s-problems for a summary) that one can start to discern general trends.
Broadly speaking, we now see an empirical tradeoff between the level of AI involvement in the solution, and the difficulty or novelty of that solution. In particular, the recent solutions have spanned a spectrum roughly describable as follows:
1. Completely autonomous AI solutions to Erdos problems that are short and largely follow a standard technique. (In many, but not all, of these cases, some existing literature was found that proved a very similar result by a similar method.)
2. AI-powered modifications of existing solutions (which could be either human-generated or AI-generated) that managed to improve or modify these solutions in various ways, for instance by upgrading a partial solution to a full solution, or optimizing the parameters of the proof.
3. Complex interactions between humans and AI tools in which the AI tools provided crucial calculations, or proofs of key steps, allowing the collaboration to achieve moderately complicated and novel solutions to open problems.
4. Difficult research-level papers solving one or more Erdos problems by mostly traditional human means, but for which AI tools were useful for secondary tasks such as generation of code, numerics, references, or pictures.
The seeming negative correlation between the amount of AI involvement and the depth of result is somewhat reminiscent of statistical paradoxes such as Berkson's paradox https://en.wikipedia.org/wiki/Berkson%27s_paradox or Simpson's paradox https://en.wikipedia.org/wiki/Simpson%27s_paradox . One key confounding factor is that highly autonomous AI workflows are much more scaleable than human-intensive workflows, and are thus better suited for being systematically applied to the "long tail" of obscure Erdos problems, many of which actually have straightforward solutions. As such, many of these easier Erdos problems are now more likely to be solved by purely AI-based methods than by human or hybrid means.
Given the level of recent publicity given to these problems, I expect that over the next few weeks, pretty much all of the outstanding Erdos problems will be quietly attempted by various people using their preferred AI tool. Most of the time, these tools will not lead to any noteworthy result, but such failures are unlikely to be reported on any public site. It will be interesting to see what (verified) successes do emerge from this, which should soon give a reasonably accurate picture of what proportion of currently outstanding Erdos problems are simple enough to be amenable to current AI tools operated with minimal human intervention. (My guess is that this proportion is on the order of 1-2%.) Assessing the viability of more hybridized human-AI approaches will take significantly longer though, as human expert attention will remain a significant bottleneck.
So I'll whimsically define the "Erdős problems benchmark" to be "the proportion of currently outstanding Erdős problems amenable to current AI tools operated with minimal human intervention", and the current "SOTA" to be Tao's guess of 1-2% as of Jan 2026. My guess is it won't be saturated in ~2 years the way seemingly every other benchmark has been, since open math problems can be unboundedly hard, but who knows?
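Tao's confounding point is worth making concrete. Here's a toy selection-effect simulation in the spirit of Berkson's paradox (all numbers invented, not a model of the actual data): even if autonomous AI's success rate were *independent* of problem difficulty, sweeping AI across the whole long tail while scarce human attention concentrates on hard problems would still produce a negative correlation between AI involvement and difficulty among the *solved* problems.

```python
import random

random.seed(0)

# Toy model: each open problem has a latent difficulty in [0, 1].
problems = [random.random() for _ in range(10_000)]

solved = []  # (difficulty, ai_involvement) pairs for solved problems
for d in problems:
    # Autonomous AI sweeps *everything* with a flat 2% success rate,
    # deliberately independent of difficulty, to isolate the selection effect.
    if random.random() < 0.02:
        solved.append((d, 1.0))  # fully autonomous solution
    # Scarce human/hybrid attention is aimed mostly at hard, notable problems.
    elif d > 0.8 and random.random() < 0.2:
        solved.append((d, random.uniform(0.0, 0.5)))  # low AI involvement

# Pearson correlation between AI involvement and difficulty among solved
# problems comes out strongly negative -- purely from differential selection,
# not from any assumption that AI does worse per unit of effort.
xs = [a for _, a in solved]
ys = [d for d, _ in solved]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
vx = sum((x - mx) ** 2 for x in xs)
vy = sum((y - my) ** 2 for y in ys)
print("corr(AI involvement, difficulty) =", cov / (vx * vy) ** 0.5)
```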
I bet that Steve Byrnes can point out a bunch of specific sensory evidence that the brain uses to construct the status concept (stuff like gaze length of conspecifics, or something?), but the human motivation system isn't just optimizing for those physical proxy measures, or people wouldn't be motivated to get prestige on internet forums where people have reputations but never see each other's faces.
Curious to see what Steven Byrnes would actually say here. I fed your comment and Byrnes' two posts on social status to Opus 4.5; it thought for 3m 40s (!) and ended up arguing that he'd disagree with your social status example:
Byrnes explicitly argues the opposite position in §2.2.2 of the second post. He writes:
"I don't currently think there's an innate drive to 'mostly lead' per se. Rather, I think there's an innate drive that we might loosely describe as 'a drive to feel liked / admired'... and also an innate drive that we might loosely describe as 'a drive to feel feared'. These drives are just upstream of gaining an ability to 'mostly lead'."
And more pointedly:
"I'm avoiding a common thing that evolutionary psychologists do (e.g. Secret of Our Success by Henrich), which is to point to particular human behaviors and just say that they're evolved—for example, they might say there's an 'innate drive to be a leader', or 'innate drive to be dominant', or 'innate drive to imitate successful people', and so on. I think those are basically all 'at the wrong level' to be neuroscientifically plausible."
So Byrnes is explicitly rejecting the claim that evolution installed "status-seeking" as a goal at the level Eli describes.
Byrnes proposes a three-layer architecture: first, there are primitive innate drives—"feel liked/admired" and "feel feared"—which are still at the feeling level, not the abstract-concept level. Second, there's a very general learning mechanism (which he discusses extensively in his valence series, particularly §4.5–4.6) that figures out what actions and situations produce those feelings in the local environment. Third, there are some low-level sensory adaptations (like "an innate brainstem reflex to look at people's faces") that feed into the learning system. Status-seeking behavior emerges from this combination, but "status" itself isn't the installed goal.
Why does this matter? Eli presents something like a dichotomy: either (A) evolution can only do sensory-level proxies that break in novel contexts (like male preferences for big breasts, which don't update when you learn a woman is infertile), or (B) evolution can install abstract concepts as goals (like status). Byrnes' model offers a third option: evolution installs feeling-level drives plus a general learning mechanism. The learning mechanism explains why status-seeking transfers to internet forums—the primitive drive to "feel liked/admired" still triggers when you get upvotes, and the learning system figures out how to get more of that—without requiring that "status" itself be the installed goal. This third option actually supports the original claim Eli is arguing against. Evolution didn't need to install "status" as a concept; it installed feelings + learning, and the abstract behavior emerged.
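To make the "feelings + learning" architecture concrete, here's a toy sketch (my own illustration, not Byrnes' code; only the drive name comes from his posts): the reward function only sees a feeling-level signal, and a generic bandit learner discovers that an evolutionarily novel behavior -- posting on a faceless forum -- produces it, so "status-seeking" emerges without any installed status concept.

```python
import random

random.seed(0)

# Innate, feeling-level drive: reward is a function of a crude internal
# signal ("felt admired"), not of any abstract concept of status.
def innate_reward(felt_admired: bool) -> float:
    return 1.0 if felt_admired else 0.0

# Environment: abstract behaviors evolution never saw. Upvotes on a
# faceless forum still trigger the "felt admired" signal. (Probabilities
# are made up for illustration.)
ACTIONS = ["post_on_forum", "lurk", "argue_pointlessly"]
P_ADMIRED = {"post_on_forum": 0.6, "lurk": 0.0, "argue_pointlessly": 0.1}

# Generic learning mechanism: an epsilon-greedy bandit with no built-in
# notion of "status" -- it only chases the innate feeling-level reward.
q = {a: 0.0 for a in ACTIONS}
n = {a: 0 for a in ACTIONS}
for t in range(2000):
    a = random.choice(ACTIONS) if random.random() < 0.1 else max(q, key=q.get)
    felt_admired = random.random() < P_ADMIRED[a]
    n[a] += 1
    q[a] += (innate_reward(felt_admired) - q[a]) / n[a]  # incremental mean

print(q)  # the learner ends up "status-seeking" with no status concept
```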
(mods, let me know if this is slop and I'll take it down)
Not exactly comparable to the AI Village's open-ended long-horizon tasks above, but it's interesting that Cursor found out that
GPT-5.2 models are much better at extended autonomous work: following instructions, keeping focus, avoiding drift, and implementing things precisely and completely. Opus 4.5 tends to stop earlier and take shortcuts when convenient, yielding back control quickly.
on their project to build a web browser from scratch (GitHub), totaling >1M LoC across 1k files, running "hundreds of concurrent agents" for a week. This is the opposite of what I'd have predicted just from how much more useful Claude is vs comparable-benchmark models. Also: "GPT-5.2 is a better planner than GPT-5.1-codex, even though the latter is trained specifically for coding" -- what's up with that?
An interesting quote on the downstream consequences of LLMs locally speeding up output production in business processes, from Rafa Fernández, host of the Protocols for Business special interest group (SIG), in his essay Finding Fault Lines within the Firm:
AI is usually discussed in terms of automation or productivity. Those framings are not wrong, but they miss what makes AI adoption particularly revealing from a protocol perspective. While much of the public discussion frames AI in terms of cost-savings or new markets, our SIG has been focusing on the pressure it places on current coordination systems by changing the speed and scale at which work is produced ...
Across the SIG’s discussions, interviews, and readings, a consistent pattern has emerged. Under AI adoption, the first thing that stops working smoothly seemed unintuitive: time.
This became clear when our group reviewed Blake Scholl’s writing on Boom Supersonic. Here, Scholl distinguishes between at least two clocks operating inside the same organization. The first is the calendar: project timelines, milestones, and delivery dates. The second is what he calls the Slacker Index: the amount of time engineers spend waiting – on inputs, approvals, dependencies, or external constraints – rather than building. Even in well-run, safety-critical organizations, these clocks coexist.
Under stable conditions and in mature industries, this alignment is usually implicit. Engineering velocity, supplier lead times, regulatory review cycles, and internal decision-making rhythms evolve together. At Boom, hardware design, simulation, testing, and supplier manufacturing are paced to one another. Slower clocks constrain faster ones in predictable ways. Waiting is visible, expected, and priced into the system.
As Scholl points out, AI-enabled production changes the speed and scale of production. Certain forms of work – design iteration, analysis, documentation, internal review – can suddenly accelerate by orders of magnitude. From the perspective of the Slacker Index, local waiting collapses. Yet the calendar will not automatically follow. Supplier lead times remain fixed. Certification processes still unfold at human and institutional speeds. External partners continue to operate on contractual and regulatory time.
The consequence of AI-enabled opportunity is temporal divergence (a topic explored in depth by SIG member Sachin). Some clocks speed up sharply while others remain unchanged. At Boom, this would mean design teams outrunning suppliers, simulations outrunning manufacturing feedback, or internal decision cycles outrunning the capacity of external partners to respond. The Slacker Index may improve locally – less waiting to produce – but worsen systemically as downstream dependencies fall behind.
AI systems further amplify this effect in two ways. One, because they generate outputs without passing through the durations that normally situate work, creating a dizzying orientation. ... Knowledge accumulates faster than it can be evaluated, integrated, or acted upon.
Second, AI software using LLMs can be contextually misaligned. They draw on data that’s often years apart (a model trained up to 2024, used in 2026) and produced outside the local business context. From this lens, the recent focus on improving AI product memory seems intuitive. Efforts such as RAG, MCP, skills, and even “undo” prompt features become attempts to realign probabilistic software into business context, tempo, and authority.
Safety-critical organizations like Boom make these dynamics visible precisely because they cannot simply collapse time. Hardware, suppliers, and regulators enforce non-negotiable rhythms. When AI accelerates internal work without moving those external clocks, coordination strain surfaces quickly. Slack accumulates in unfamiliar places, with no protocols available to redistribute it.
When time regimes fall out of alignment, coordination problems and opportunities change form. Delays no longer appear as isolated errors that can be corrected locally. Instead, organizations experience escalating tensions: pressure to act without corresponding capacity to review, decide, or remember.
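To make "temporal divergence" concrete, here's a toy two-stage pipeline (parameters and structure invented, not from the essay): AI accelerates the design stage 10x, so local waiting collapses, but end-to-end delivery stays paced by the unchanged supplier clock and the backlog between the stages explodes.

```python
# Toy two-stage pipeline: design -> supplier. All parameters invented.
def simulate(design_days_per_item: float, supplier_days_per_item: float,
             horizon_days: int = 365):
    produced = delivered = backlog = 0  # backlog: designed, awaiting supplier
    design_progress = supplier_progress = 0.0
    for _ in range(horizon_days):
        design_progress += 1.0 / design_days_per_item
        while design_progress >= 1.0:
            design_progress -= 1.0
            produced += 1
            backlog += 1
        if backlog > 0:  # supplier works only when there is queued work
            supplier_progress += 1.0 / supplier_days_per_item
            while supplier_progress >= 1.0 and backlog > 0:
                supplier_progress -= 1.0
                backlog -= 1
                delivered += 1
    return produced, delivered, backlog

# Before AI: design takes 10 days/item, supplier 12 days/item.
print(simulate(10, 12))  # clocks roughly aligned; backlog stays small
# After AI: design accelerates 10x, supplier clock unchanged.
print(simulate(1, 12))   # local waiting collapses, backlog explodes
```

The Slacker Index improves locally (designers are never waiting), while the system-level mismatch grows, which is exactly the strain the essay describes.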
So how have orgs adapted? Three categories of examples:
When shared assumptions about time lose coherence, organizations first adapt within current structures. Work continues by absorbing friction rather than resolving its source.
One visible form of this absorption is Boom’s solution: integrate vertically. The critical move was purchasing their own large-scale manufacturing equipment rather than continuing to rely on external suppliers whose lead times dominated the schedule. Supplier queues and fabrication delays had become the governing clock for the entire program, producing a high Slacker Index: engineers were ready to iterate, but progress stalled while waiting on parts. By acquiring the machine, Boom internalized that bottleneck and converted supplier wait time into an internal, controllable process. This collapsed a multi-month external dependency into a shorter, iterable internal cycle, allowing design, testing, and manufacturing to co-evolve rather than queue sequentially.
Another response was novel translation work. The SIG discussed the fast growing Forward Deployed Engineer role, emerging to help mediate between fast-moving demands and slower-moving infrastructure. Their task is not to eliminate mismatch, but to work across it and leverage it – adjusting scope, translating intent, and negotiating constraints as they appear. This work allows organizations to keep operating even as tempos diverge, and gain a competitive advantage in the process. At its best, the work defines the operating model. This is the case for Palantir and large AI labs like OpenAI and Anthropic.
Other adaptations the SIG encountered took the form of operational formalization: AI usage guidelines, governance documents, digitized ontologies. These measures make previously tacit constraints visible without altering the structures that produced the misalignment. They stabilize behavior at the margin while leaving underlying coordination regimes intact.
Was the editor's note written by an LLM?
Gemini 3 Pro analogized Scott Alexander to a beaver when I asked it to make sense of him, because "Scott is a keystone individual" and "in ecology, a keystone species (like the beaver) exerts influence disproportionate to its abundance because it creates the ecosystem in which others live":
- He built the dam (The Community/Lighthaven) that pooled the water.
- He signaled where the food was (Grants/Open Threads).
- He warned of the predators (Moloch/AI Risk).
This was mildly funny. It was also striking how many factual details it got wrong (in the rest of the response, that is, not the beaver analogy), errors which might sound plausible, if dramatic, to an outsider.
The essay's examples show how those types of payments decorrelate at the extremes, which makes it useful to think about improving one's "payment allocation".
A bit tangential: re your sequence name "civilization is FUBAR", I get the FU, but why the BAR? Maybe I'm just in too much of a progress-vibed bubble?
I asked a bunch of LLMs with websearch to try and name the classic mistake you're alluding to:
To be honest these just aren't very good; they usually do better at naming half-legible vibes.
I learned about ALARA from this post, so I wanted to note this here -- Reading List 01/17/2026 by Brian Potter at Construction Physics notes that ALARA may be getting taken down (maybe -- I can't find the memo to confirm):