david reinstein1mo10

The Unjournal (unjournal.org) is considering commissioning this post for expert evaluation in our applied stream.

Looking for any feedback (here or privately) on whether this would be high-value, how to go about it (what particular issues/expertise), whether other research in this domain is higher value, etc.

How Well Does RL Scale?

david reinstein2mo10

I asked some LLMs/agents to assess this post, in preparation for considering it for some form of Unjournal.org evaluation. FWIW:

1. Conversation started with GPTPro

2. RoastmyPoast.org "epistemic audit" (result: C+, 68/100, which is a bit below average iirc)

3. Claude 4.5 Opus

My take: they saw this post as plausible, free of major errors, and a source of some useful insights, but with some important limitations; the main claims are not all 'obviously demonstrated'.

Below: some overall syntheses and pulled quotes that seemed relevant to me. All folded content is LLM-generated.

GPTPro

What holds up (probability ~0.6–0.8):

Public evidence supports that inference budgets buy big gains on current reasoning benchmarks, and RL post-training scaling appears meaningfully less compute-efficient (often by ~2 extra decades to cover similar 20→80 improvements).

RL post-training is now plausibly reaching “pretraining-scale” at least at xAI (and maybe elsewhere soon), so “RL is no longer a trivially cheap add-on” is real.

What’s uncertain / overconfident (probability ~0.2–0.5):

The specific conversion “100× training ≈ 1,000× inference” as a general rule, and thus the specific “1,000,000× RL for a GPT-level jump.” This rests on a non-robust mapping and then exponentiates it.

The implication that we’re “near the effective limit” of RL training gains, given recent public RL-scaling work emphasizing recipe dependence and improved efficiency/asymptotes.

...[Verdict] Ord is on solid ground that current reasoning improvements rely heavily on inference budgets ... he is on weak ground when he turns that into a near-term “end of scaling” claim via a brittle 1,000,000× extrapolation.
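
A minimal sketch of why that extrapolation is brittle. It assumes (as an illustration, not as the post's stated method) that both inference and RL gains are log-linear in compute, and that the same 20→80 improvement takes about 2 decades of inference compute versus about 4 decades (10,000×) of RL compute, which is how I read the figures quoted above. The function name and the alternative ratios are hypothetical.

```python
import math


def rl_multiplier(inference_multiplier: float, decade_ratio: float) -> float:
    """RL compute multiplier needed to match a given inference multiplier.

    Assumes both scale log-linearly and that RL needs `decade_ratio` times
    as many decades (factors of 10) of compute for the same gain.
    """
    return 10 ** (decade_ratio * math.log10(inference_multiplier))


# Assumed reading of the quoted figures: ~4 decades of RL vs ~2 of inference.
baseline_ratio = 4 / 2

# Vary the ratio slightly to see how far the 1,000x-inference case moves.
for ratio in (1.5, baseline_ratio, 2.5):
    print(f"decade ratio {ratio}: {rl_multiplier(1000, ratio):,.0f}x RL compute")
```

Under these assumptions a ratio of 2 reproduces the 1,000,000× figure, while ratios of 1.5 and 2.5 give roughly 30,000× and 30,000,000×. A modest error in the measured slopes therefore shifts the headline number by orders of magnitude.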

I asked what aspects were missing in the comments on LW and EA Forum; it noted a lack of discussion of ...

how sensitive the RL-vs-inference scaling gap is to model size, data quality, reuse, training recipe, domain/task type;
how recent empirical RLHF / RL‑post‑training research (on open, small-scale, or controlled setups) might affect that gap;
the analogy of “inefficiency gap = fundamental ceiling” vs. “inefficiency may be engineering‑level problem, solvable with better algorithms/research”;
the degree of uncertainty involved in extrapolating over many orders of magnitude;
the possibility that RL‑post-training inefficiency might be significantly reduced in the future (with better methodology).

So in short: the public conversation has touched some of the major “skeptical” themes, but not with the depth, technical framing, or caution that a more expert‑oriented review might use.

Claude 4.5 Opus

Bottom Line Assessment
Aspect	Assessment
Methodological rigor	Decent for available data; not peer-reviewed ML research
Alignment with expert consensus	Broadly consistent with Sutskever, partially at odds with Epoch AI's optimism
Potential blind spots	Algorithmic improvements, insider knowledge, IDA possibilities
Originality	Useful synthesis, but not breakthrough technical analysis
Should you trust it?	Trust it as informed policy analysis, not as definitive ML research

The honest answer: Ord is probably directionally correct that RL scaling is less efficient than pre-training was, and that we're approaching limits. But the specific numbers (10,000x, 1,000,000x) should be held loosely. Actual ML researchers at frontier labs know things that aren't public, and algorithmic breakthroughs could change the picture.

RoastMyPoast Epistemic Audit (C+, 68/100)

Uses agents and claude-sonnet-4-5-20250929

Noted "Overconfidence" about

Causal claims about what RL "unlocks" or "allows" without establishing mechanism

.... Long-range extrapolations presented as reliable estimates

And "Single points of failure":

The assumption that observed slopes continue far beyond measured ranges
The causal interpretation of RL "unlocking" inference rather than teaching token-intensive strategies
The representativeness of OpenAI's published data

Regarding Unjournal.org potentially commissioning some form of evaluation of this post, we might consider:

Is this post highly influential on its own (are funders and labs using this to guide important policy choices)?
Is there further expertise we could unlock that is not reflected in these comments? (The LLMs suggested some evaluators, but we sometimes find it hard to get people to accept the assignment and follow through)
Is there a more formal research output that covers this same ground, coming from ML researchers, scaling experts, etc.?

A Defense of Peer Review

david reinstein3mo10

I agree, but I would not frame this as review in terms of thumbs up/thumbs down -- we can do better. In economics, for example, most people post their research online in a fairly polished format long before it makes it through the journal peer-review process. People can also host their work in a variety of interesting and useful formats that go beyond what you can put in a frozen PDF.

Then we can have continuous public evaluation of this work, both crowdsourced and managed. At unjournal.org we do the latter: we pay experts to write detailed reports explaining the strengths, weaknesses, credibility, and usefulness of the research, to give benchmarked quantitative ratings both overall and across a range of categories, and to assess specific claims. You can see our output at unjournal.pubpub.org and on our ratings dashboard: https://unjournal.shinyapps.io/uj-dashboard/

Authors can continue to improve the research and extend it in the same place, and then seek an updated evaluation and rating.

Unjournal evaluation of "Towards best practices in AGI safety & governance" (2023), quick take

david reinstein5mo10

Naturally this paper is several years old. But it still seems like the most prominent work on this, with 61 citations etc.

My own take: we need more work in this area... perhaps follow-up work doing a similar survey, taking sample selection and question design more seriously.

I hope we can identify & evaluate such work in a timely fashion.

E.g., there is some overlap with
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5021463

which "focuses on measures to mitigate systemic risks associated with general-purpose AI models, rather than addressing the AGI scenario considered in this paper".

I'm eager to hear other suggestions for relevant work to consider and evaluate.

Unjournal evaluation of "Towards best practices in AGI safety & governance" (2023), quick take

david reinstein5mo10

My own take is that this suggests we need more work in this area: follow-up work doing a similar survey, taking sample selection and question design more seriously.

Representative quotes from the evaluations

Sampling bias/selection issues

Potential sampling bias – particularly over-representation of safety-minded respondents
There’s a risk that the sample reflects the views of those already more inclined to endorse stringent safety norms. This is particularly important in light of the sample composition, as over 40% of respondents are affiliated with AGI labs

Question selection/design, need for sharper questions

As the authors note, this [agreement] may be partly due to the high-level and generally uncontroversial framing of the statements (e.g., “AGI labs should conduct pre-deployment risk assessments”). But in their current form, the items mostly capture agreement in principle, rather than forcing respondents to grapple with the kinds of tradeoffs that real-world governance inevitably entails.
For example, would respondents still support red teaming or third-party audits if they significantly delayed product releases?

[Emphasis added]

the paper states that the selected practices are extracted from (1) current practices at individual AGI labs and (2) planned practices at individual labs, among other sources.
... These results might suggest a selection bias where statements selected from labs practices are agreed on by labs themselves,

[suggestion to] introduce an inclusion/exclusion criterion to provide a better justification as to why some statements are selected.

Overstated claims for consensus?
[Paper] “findings suggest that AGI labs need to improve their risk management practices. In particular, there seems to be room for improvement when it comes to their risk governance.”

While one can agree with such claim, it is difficult to see how this conclusion can be reached from the paper’s results.

GovAI: Towards best practices in AGI safety and governance: A survey of expert opinion

david reinstein5mo10

By the way, just flagging that The Unjournal did an evaluation of this (post/discussion here -- I'll extend this with some more opinionated comments now). Overall it was taken to be a strong step but with important limitations and need for further caveats and further work.

GovAI: Towards best practices in AGI safety and governance: A survey of expert opinion

david reinstein5mo10

By "this is now the canonical collection" do you mean the ideas surveyed in the paper? Do you think it's still canonical or is it now ~out-of-date?

The Unjournal's "Pivotal Questions" project

david reinstein7mo50

Unjournal’s first Pivotal Question, focusing on the viability of cultured meat — This post also gives concrete details of our process and proposed approach.

Announcing the Double Crux Bot

david reinstein7mo10

I was indeed looking for something that could be used in a live conversation.

Announcing the Double Crux Bot

david reinstein11mo10

Is there a version of this bot (or something similar) that one can use in an LLM or on a website? I want to use this on a podcast without having to link it to a Slack.

LESSWRONG