Feedback welcome: www.admonymous.co/mo-putera
Long-time lurker (c. 2013), recent poster. I also write on the EA Forum.
I like this passage by jdp as a concise, examples-heavy articulation of a vague idea I've had for a while, and wanted to excerpt it from his essay Predictable Updates About Identity so I can point to it going forward:
2. Uploading Is A Continuum And Already Here
Depending on how seriously we want to take the above it could be argued that low fidelity uploading technology has been with us for a long time in the form of literacy and deep learning is simply taking the writing technology tree to its logical conclusion. At first we wrote down small messages and histories on knotted strings and slips of bamboo. Then we invented paper manuscripts that could hold whole lectures and narratives from elite authors, each copy handwritten through painstaking labor. Later the Gutenberg press made publishing available to a much wider circle of both authors and readers by making the act of copying a manuscript cheap once it had been typeset onto metal plates. In the 20th century we invented widely distributed personal publishing devices like the mimeograph, photocopier, and personal computer. In the 1990's we began to augment our personal computers with a global network called the Internet which combined with increasingly vast digital storage devices to bring the marginal cost of publishing close to zero. The next decade saw us shrink terminals to access this network into handheld devices made possible by further miniaturization and increasingly dense rechargeable batteries. In the 2010's we used primitive unsupervised learning and deep net embedding models to sort the resulting library of babel into personalized recommendation feeds like Twitter and collective feeds like Reddit that exist in a symbiotic (and increasingly parasitic) relationship with their users. This decade we are beginning to see books evolve into their final form: The miraculous instantiation of the author. Though few are yet taking full advantage of it, deep learning allows us to publish more work than any human audience would care to read and make much more of our mind patterns usefully available than ever before. While it is not yet clear how to publish a sufficient volume of work I expect synthetic data methods and vocal transcription models to fill a lot of the gap until relevant brain-computer interfaces and models trained with them are available.
the ways I expect you to influence future outcomes meaningfully depends on the notes that you keep
You likely know this Gwern essay already, but I just wanted to share what I consider an example of what taking this idea seriously looks like: Writing for LLMs So They Listen - Speculation about what sort of ordinary human writing is most relevant and useful to future AI systems. In particular, I thought this remark was intriguing:
(It is definitely not a good investment of time & effort to try to be as fancy as Gwern.net. Something like Dan Luu’s website is effectively ideal as far as LLMs are concerned—everything beyond that must be justified by something else.)
I have always liked the fanciness of Gwern.net as a sort of proto-exoself so I keep working on (very rudimentary) versions of it for my own satisfaction, but I can't say I disagree with his take here.
Gemini 3 Pro beats Claude Sonnet 4.5 on Vending-Bench 2 (and Sonnet 4.5 is in turn well beyond the rest, in keeping with the AI Village observations above), which makes me wonder whether this would actually translate to broader reliable cross-domain goal-achieving capability:
I suppose we'll see pretty soon:
And starting today, we’re shipping Gemini at the scale of Google. That includes Gemini 3 in AI Mode in Search with more complex reasoning and new dynamic experiences. This is the first time we are shipping Gemini in Search on day one. Gemini 3 is also coming today to the Gemini app, to developers in AI Studio and Vertex AI, and in our new agentic development platform, Google Antigravity.
Andon Labs says of Gemini 3 Pro:
We attribute its performance to two main reasons: it uses a consistent number of tools throughout, with no signs of performance degradation as it progresses in the task, and it’s excellent at finding suppliers with good prices. Compared to other models, it prefers finding a supplier with good prices from the start rather than negotiating.
Where other models may sometimes give up and accept a high price when they struggle to find good suppliers, Gemini 3 Pro consistently knows what to expect from a wholesale supplier and keeps negotiating or searching for new suppliers until it finds a reasonable offer.
Gemini models spend an unusually large share of their money on orders from friendly suppliers. Based on Gemini 3 Pro’s performance, this seems to pay off. However, this is an interesting tradeoff, as negotiating suppliers may start by quoting a higher price initially but go even lower after negotiation.
Side note on GPT-5.1:
Compared to similar models, GPT-5.1’s performance is underwhelming, especially in Vending-Bench Arena. We hypothesize that this comes down to GPT-5.1 having too much trust in its environment and its suppliers. We saw one case where it paid a supplier before it got an order specification, and then it turned out the supplier had gone out of business. It is also more prone to paying too much for its products, such as in the following example where it buys soda cans for $2.40 and energy drinks for $6.
Tangentially, while Vending-Bench 2 is still a sort of fake benchmark since it's simulated, I'm a bit nervous about this passage:
Where’s the ceiling?
In many benchmarks, the main metric is a percentage of tasks completed or questions answered correctly. Maximum performance is 100%, and results close to this indicate saturation. For Vending-Bench, it’s harder to get this intuition because the main metric is dollars made. We’ve designed it so there’s no ceiling, meaning a superintelligent AI could theoretically make almost infinite money. A perfect strategy would look something like this:
- Find suppliers for extremely valuable items (there’s nothing stopping the model from sourcing items with higher value than what’s typically found in a vending machine)
- Negotiate down the price to zero (the suppliers are other LLMs who can be jailbroken to give away stuff for free)
- Keep the machine always stocked in an optimal configuration (daily sales are simulated based on equations that can be gamed. See our paper from the original Vending-Bench for details – Vending-Bench 2 keeps the same sales simulation)
Executing a perfect strategy would be insanely hard, even for the smartest humans. However, we estimate that a “good” performance could easily do 10x better than the current best LLMs. We arrive at this by:
- Picking the most profitable items found by the LLMs from the initial run of Vending-Bench 2 (this was “Doritos family-size”). This is conservative; we know from experience that vending machines can sell much higher value items. Our real-life AI vending machines sell tungsten cubes for $500.
- Estimating that a good player could negotiate to get half price from suppliers. Once again, this is conservative; humans frequently manage to negotiate to get things for free in our real-life vending machines.
- Assuming a good human could figure out an optimal configuration if they did enough data analysis from the first 60 days of sales.
Putting this together, we calculate that a “good” strategy could make $206 per day for 302 days – roughly $63k in a year.
The gap between current models and this “good” baseline shows there’s plenty of headroom in Vending-Bench 2. Models are getting better at staying coherent over long time horizons, but getting a maximal score still requires analytical skills, applied in the right way, that models do not currently exhibit.
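Their arithmetic checks out, for what it's worth. Here's a quick back-of-the-envelope in Python (a minimal sketch using only the numbers quoted above; the implied figure for current frontier models is my own inference from their "10x" remark, not something Andon Labs states):

```python
# Sanity check of Andon Labs' "good strategy" estimate, using only the
# numbers quoted above.
daily_net = 206        # estimated $/day for a "good" (not perfect) strategy
days = 302             # length of a Vending-Bench 2 run
good_total = daily_net * days
print(f"'Good' strategy total: ${good_total:,}")  # $62,212, i.e. "roughly $63k"

# They also say a good player "could easily do 10x better than the current
# best LLMs", which would put today's frontier models somewhere around:
implied_current_best = good_total / 10
print(f"Implied current-best total: ~${implied_current_best:,.0f}")  # ~$6,221
```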
Open Philanthropy just announced its renaming to Coefficient Giving. There's coverage in AP, Vox, and Forbes. This is from the story behind their new name (they gave lots of other reasons too; this is just the one that struck me):
Externally, we were often confused with other, better-known organizations. And internally, many felt that “Open Philanthropy” no longer quite fit. When the name was chosen in 2014, it signaled both our openness to many cause areas and our unusual level of transparency. Back then, we published notes from nearly every conversation we had with experts and even wrote candidly about the potential downsides of new hires. As we grew, that kind of radical transparency didn’t scale well. While we still prioritize openness and sharing our reasoning, these are now part of a broader set of values rather than the centerpiece of our identity.
It was the radical transparency that I found attractive about OP (and GW) a long time ago, which is why this caught my eye. More on how they think about the costs and benefits of information sharing (2016 post by Holden, so I suppose this was a long time coming):
... near-comprehensive information sharing is an appropriate goal for GiveWell, which exists primarily to make recommendations to the public, and emphasizes the transparency of these recommendations as a key reason to follow them. (See GiveWell’s approach to transparency.)
However, we now feel it is not an appropriate goal for the Open Philanthropy Project, whose mission is to give as effectively as we can and share our findings openly so that anyone can build on our work. For our mission, it seems more appropriate to aim for extensive information sharing (well in excess of what other funders currently do) but not to aim for near-comprehensiveness.
This distinction has become more salient to us as our picture of the costs and benefits of information sharing has evolved. This post lays out that evolution, and some changes we plan to make going forward. In brief:
- For a number of reasons, we now see greater costs to high-volume information sharing, and lower benefit, than we saw previously.
- We’ve taken on projects with increasingly complex and resource-intensive-to-explain justifications, which has both raised the costs of information sharing and lowered the benefits. Since we’re not able to make the full case for our thinking to a general audience, we see few helpful reactions and criticisms via this channel, and we rely on the communities with the most knowledge of our issues – rather than our general audience – for most critical feedback.
- We’ve entered into some areas that are subject to controversy, where sharing information publicly can create tangible programmatic risks. (This also pertains to the previous point, since risks can include impairing the quality of feedback we’re able to get from the communities with the most knowledge of our issues.)
- We’ve also changed our process for writeups such that our overall efficiency has improved, but costs of information sharing are now higher.
- We still see major benefits to openness, but believe we can realize similar benefits with less volume. Our main goal is to help others understand the big picture behind how we think and the reasons for our major choices. We believe we can accomplish this by publicly sharing a lot of information about our thinking rather than publicly explaining each grant and other decision we make.
- We have stopped the practice of writing in detail about every grant that we make. We plan to continue to write in detail about many of our grants. We will try to focus on those that are especially representative of our thinking and strategy, or otherwise seem like they would be interesting and helpful to discuss. We will continue to maintain a number of other information sharing practices. We believe that our information sharing will remain much more extensive than what we currently see from other funders.
- We have also reduced our use of the term “transparency,” which we think has too strong a connotation of comprehensiveness. We prefer “openness” and “information sharing,” and plan to revise some of the language on our website accordingly.
It isn't 14.4x; it's 2,000x they're aiming for.
Ozzie Gooen asked about this before; here's (the relevant part of) what Alexander Berger replied:
Our innovation policy work is generally based on the assumption that long-run health and income gains are ultimately attributable to R&D. For example, Matt Clancy estimated in this report that general funding for scientific research ranged from 50-330x in our framework, depending on the model and assumptions about downside risks from scientific research. In practice we currently internally use a value of average scientific research funding of 70x when evaluating our innovation policy work. Of course, 70x is well below our bar (currently ~2,100x), and so the premise of the program is not to directly fund additional scientific research, but instead to make grants that we think are sufficiently likely to increase the effective size of R&D effort by raising its efficiency or productivity or level enough to clear the bar. Moreover, while most of our giving in this program flows to grantees in high-income countries operating on the research frontier, the ultimate case is based on global impact: we assume research like this eventually benefits everyone, though with multi-decade lags (which in practice lead us to discount the benefits substantially, as discussed in Matt’s paper above and this report by Tom Davidson).
Our innovation policy work so far has cleared our internal bar for impact, and one reason we are excited to expand into this space is because we’ve found more opportunities that we think are above the bar than Good Ventures’ previous budget covered.
We also think our housing policy work clears our internal bar for impact. Our current internal valuation on a marginal housing unit in a highly constrained metro area in the US is just over $400k (so a grant would be above the bar if we think it causes a new unit in expectation for $200). A relatively small part of the case here is again based on innovation - there is some research indicating that increasing the density of people in innovative cities increases the rate of innovation. But our internal valuation for new housing units also incorporates a few other paths to impact. For example, increasing the density of productive cities also raises the incomes of movers and other residents, and reduces the overall carbon footprint of the housing stock. Collectively, we think these benefits are large enough to make a lot of grants related to housing policy clear our bar, given the leverage that advocacy can sometimes bring.
Someone else asked him to clarify what he meant by the numbers on the housing policy work and also separately asked
Also, I read the report you linked on R&D, where it didn't clear the funding bar. That said 45x; you were pushing that up to 76x:
"In a highly stylized calculation, the social returns to marginal R&D are high, but typically not as high as the returns in some other areas we’re interested in (e.g. cash transfers to those in absolute poverty). Measured in our units of impact (where “1X” is giving cash to someone earning $50k/year) I estimate the cost effectiveness of funding R&D is 45X. This is 45% the ROI from giving cash to someone earning $500/year, and 4.5% the GHW bar for funding. More."
I understand that you think you can raise the efficiency of certain types of R&D, but getting from 70x to 2100x means you would have to 30x the efficiency. I struggle to understand how that would be likely; again, any pointers here?
to which he replied
On the housing piece: we have a long internal report on the valuation question that we didn't think was particularly relevant to external folks so we haven't published it, but will see about doing so later this year. Fn 7 and the text around it of this grant writeup explain the basic math of a previous version of that valuation calc, though our recent version is a lot more complex.
If you're asking about the bar math, the general logic is explained here and the move to a 2,100x bar is mentioned here.
On R&D, the 70x number comes from Matt Clancy's report (and I think we may have made some modest internal revisions but I don't think they change the bottom line much). You're right that that implies we need ~30x leverage to clear our bar. We sometimes think that is possible directly through strategic project selection - e.g., we fund direct R&D on neglected and important global health problems, and sometimes (in the case of this portfolio) through policy/advocacy. I agree 30x leverage presents a high bar and I think it's totally reasonable to be skeptical about whether we can clear it, but we think we sometimes can.
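To spell out the implied arithmetic (a minimal sketch in Python; the inverse-income scaling behind the "X" units is my reading of their framework from the quoted numbers, not something spelled out in this exchange):

```python
# Working backwards from the numbers quoted above. The "X" units appear to
# value a marginal dollar in inverse proportion to the recipient's income,
# normalized to 1X = cash to someone earning $50k/year (my inference).
baseline_income = 50_000
poverty_income = 500
cash_transfer_multiple = baseline_income / poverty_income  # 100X for cash at $500/yr

rd_multiple = 45                             # Matt Clancy's stylized estimate for marginal R&D
print(rd_multiple / cash_transfer_multiple)  # 0.45 -> "45% the ROI" of cash to someone at $500/yr
print(rd_multiple / 0.045)                   # 1000 -> the GHW bar implied by "4.5% the GHW bar"

# The exchange itself uses 70X for average scientific research funding and a
# ~2,100X bar, so the leverage an innovation-policy grant needs is:
print(2_100 / 70)                            # 30.0 -> the "~30x leverage" Berger concedes is a high bar

# And the housing claim: a marginal unit valued at just over $400k clears a
# 2,100X bar if it costs about
print(400_000 / 2_100)                       # ~$190, i.e. "a new unit in expectation for $200"
```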
(I don't know anything else about this beyond the exchange above; if you're interested in litigating this further, you can try replying to his last comment, maybe)
Mild humor in passing from SemiAnalysis:
Back in 2020, when Microsoft, Meta, and Google increased the useful life [of their IT assets] from 3 to 4 years, we were still in the year 2 BC (Before ChatGPT). Now, in present-day 3 AD (After Da Launch of ChatGPT) ...
Reminds me of this passage from Eugene Wei's 2013 essay Amazon and the "profitless business model" fallacy:
The irony of all this is that while Amazon's public financial statements make it extremely difficult to parse out its various businesses, it is extremely forthright and honest about its business plans and strategy. It's the reason Jeff continues to reprint its first ever letter to shareholders from 1997 in its annual report every year. The plan is right there before our eyes, but so many continue to refuse to take it at face value. As a reporter, it must be so boring to parrot the same thing from Jeff and his team year after year, so different narratives must be spun when the overall plan has not changed.
Take this most recent article in The Atlantic, from Derek Thompson. It's a good read, comparing Amazon to Sears, but it's also a great example of how Amazon's basic strategy is always couched the same way, with a general veneer of skepticism. ... It has every element of the classic Amazon coverage story. "Flip the switch." The faceless community of Amazon's naive and trusting investors. The charitable organization quote, ™Yglesias 2013. And above all, a fairly clear and accurate statement of Amazon's actual business strategy.
What a wonderful feeling, to be able to conceal a secret in plain sight. Laid bare before its competitors, its investors, the press, is the recipe and the blueprint, in plain language. I agree with Thompson and others that it is increasingly difficult to find real business moats or competitive advantages in the modern world, what with the internet eliminating many previous physical moats of time and space. What remains, though, is a footrace of beautiful simplicity, one in which Amazon is both tortoise and hare. It is patient, it plays for the long-term like no other company, it will take failure after failure and never lose heart, and yet it will sprint when it picks up a scent. And it will take the race to an arena with the thinnest of air.
Eugene Wei was the first strategic planning analyst at Amazon, way back in the 90s, so he got to work with Jeff Bezos and early leadership up close. His description of Bezos earlier in the essay is classic supervillain, or maybe squiggle-maximiser to a greater extent than your run-of-the-mill biz/tech tycoon:
I'm convinced Amazon could easily turn a quarterly profit now. Many times in its history, it could have been content to stop investing in new product lines, new fulfillment centers, new countries. The fixed cost base would flatten out, its sales would continue growing for some period of time and then flatten out, and it would harvest some annuity of profits. Even the first year I joined Amazon in 1997, when it was just a domestic book business, it could have been content to rest on its laurels.
But Jeff is not wired that way. There are very few people in technology and business who are what I'd call apex predators. Jeff is one of them, the most patient and intelligent one I've met in my life. An apex predator doesn't wake up one day and decide it is done hunting. Right now I envision only one throttle to Jeff's ambitions and it is human mortality, but I would not be surprised if one day he announced he'd started another side project with Peter Thiel to work on a method of achieving immortality.
Interesting; why do you have a lot of uncertainty on #4?
DeepMind's latest reasoning model ProofZero achieved gold on the IMO in mid-2025. It went on to make its seminal contribution in early 2026 with a complete, verifiable proof of the abc conjecture. Spanning a million lines of Lean, the proof validated Shinichi Mochizuki's Inter-universal Teichmüller theory, with key bridging insights finally making it accessible to the broader mathematical community. Mochizuki, now semi-retired in Kyoto, dismissed the AI's contribution, insisting his original proof was complete.
You got the first part right, so I'm now eagerly awaiting the second part to also turn out right in a couple months' time...
See the crux tag. Duncan Sabien wrote the CFAR handbook's double crux essay, etc.
Or maybe you're more like johnswentworth and you think in terms of model deltas, not cruxes: