As far as I understand it, forcing the model to output ONLY the number is similar to asking a human to guess really quickly. I expect most humans' actual intuitions to be more like "a thousand is a big number and dividing it by 57 yields something[1] a bit less than 20, but that doesn't help me estimate the remainder". The model's unknown algorithm, however, produces answers that are surprisingly close to the ground truth, differing from it by 0-3.
Why would the undiscovered algorithm that produces SUCH answers along with slop like 59 (vs. the right answer of 56) be bad for AI safety? Were the model allowed to think, it would've noticed that 59 is slop and corrected it almost instantly.
P.S. In order to check my idea, I tried prompting Claude Sonnet 4.5 with variants of the same question: here, here, here, here, here. One result stood out in particular: when I told Claude that I was testing its ability to answer instantly, its performance dropped to something more along the lines of "1025 - a thousand".
In reality 1025 = 57*18 - 1, so this quasi-estimate could also yield a remainder close to 0, or even slightly negative.
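For reference, the exact arithmetic alongside the two quasi-estimates discussed above (a quick sketch; the labels are mine):

```python
# Exact answer: the remainder of 1025 divided by 57.
q, r = divmod(1025, 57)
print(q, r)            # 17 56 -- the ground truth remainder is 56

# Quasi-estimate 1: "1025 minus a thousand", the style Claude fell back to.
print(1025 - 1000)     # 25

# Quasi-estimate 2: anchoring on 1025 = 57*18 - 1, which suggests a remainder
# close to 0 (or -1 before wrapping around).
print(1025 - 57 * 18)  # -1
```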
I think that snapping back at people is most likely caused by the belief that the person one snapped at did something clearly stupid or hadn't bothered to do a basic search of the related literature.
Eliezer is pleasant and entertaining in person if you aren't talking about topics where he thinks your opinion is dumb.
Except that I don't think I understand the distinction between Eliezer believing that an opinion is dumb and the opinion actually being dumb. The examples I cite in my other comment are, in my opinion, close to the latter.
However, while Eliezer did concede in cases like this or that[1], I can't exclude the possibility that he has ranted about something[2] that was later verified to be not actually stupid. Alas, the two most prominent and highest-stakes examples, the difficulty of ASI alignment[3] and the ASI takeoff speed[4], aren't yet resolved, since mankind hasn't created any ASIs or measured their takeoff speeds.
Edited to add: The example of Yudkowsky conceding is the following. "If you doubt my ability to ever concede to evidence about this sort of topic, observe this past case on Twitter where I immediately and without argument concede that OpenPhil was right and I was wrong, the moment that the evidence appeared to be decisive. (The choice of example may seem snarky but is not actually snark; it is not easy for me to find other cases where, according to my own view, clear concrete evidence came out that I was definitely wrong and OpenPhil definitely right; and I did in that case immediately concede.)"
The most prominent candidate of which I am aware is his comments on his most recent post and his likely failure to understand "how you could build something not-BS" on Cotra's estimates. I explained how Kokotajlo obtained a fairly good prediction; Eliezer criticized Cotra's entire methodology instead of her parameter choices.
For reference, I compared SOTA research to kids' psychology, and the actually interesting research to aligning adults, or even to solving problems like the AIs being the new proletariat.
Which Yudkowsky assumes to be very fast, while the AI-2027 forecast assumes it to be rather slow: the scenario had Agent-4 become adversarial in September 2027, then solve mechinterp and create Agent-5 in November 2027; by June 2028 Agent-5 was projected to become wildly superintelligent.
Yudkowsky's proposal is to view dignity as the change in log_2[(1-p(doom))/p(doom)] caused by your actions.
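A worked example (the particular shift in p(doom) is chosen purely for illustration):

```python
from math import log2

def dignity_gained(p_doom_before, p_doom_after):
    """Change in log2 survival odds caused by your actions."""
    log_odds = lambda p_doom: log2((1 - p_doom) / p_doom)
    return log_odds(p_doom_after) - log_odds(p_doom_before)

# Illustrative numbers only: actions that move p(doom) from 0.96 to 0.92
print(dignity_gained(0.96, 0.92))  # ~1.06 bits of dignity
```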
As far as I understand, Eliezer is abrasive for these reasons:
As evidenced by him claiming that an approach is "Not obviously stupid on a very quick skim" and congratulating the author on eliciting THAT positive a review. Alas, I have also seen obviously stupid alignment-related ideas make their way at least onto LessWrong.
However, it would be possible if the ASIs required OOMs more resources per token than humans do; in that case applying the ASIs would be too expensive. Alas, this is unlikely.
IMO Eliezer also believes that the entire approach is totally useless. However, a case against this idea can be found in the comments mentioning Kokotajlo (e.g. mine).
Kokotajlo's team has already prepared a case Against Misalignment As "Self-Fulfilling Prophecy". Additionally, I doubt that "it would be nearly impossible to elicit malicious intent without simultaneously triggering a collapse in capability". Suppose that the MechaHitler persona, unlike the HHH persona, is associated with being stupidly evil because there were no examples of evil genius AIs. Then an adversary trying to elicit capabilities would just have to ask the AI something along the lines of preparing the answer while roleplaying a genius villain. It might also turn out that high-dose RL and finetuning on solutions with comments undermine the link between the MechaHitler persona and the lack of capabilities.
One can apply similar methodological arguments to a different problem and test whether they persist. The number of civilisations outside the Solar System is thought to be estimable via the Drake equation. Drake's original estimates implied that the Milky Way contains between 1K and 100M civilisations. The only ground truth we know is that we have yet to find any reliable evidence of such civilisations. But I don't understand where the equation itself is erroneous.
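For concreteness, here is the equation; the non-L factor values below are illustrative ones chosen to multiply to roughly one per year, which is the simplification that makes N ≈ L and yields the 1K-100M range:

```python
# Drake equation: N = R_star * f_p * n_e * f_l * f_i * f_c * L
# R_star: star formation rate (per year); f_p: fraction of stars with planets;
# n_e: habitable planets per planet-bearing star; f_l, f_i, f_c: fractions that
# develop life, intelligence, and detectable communication; L: years such a
# civilisation keeps communicating.
def drake(R_star, f_p, n_e, f_l, f_i, f_c, L):
    return R_star * f_p * n_e * f_l * f_i * f_c * L

# If the first six factors multiply to ~1/year, then N ~= L, so L between
# ~1e3 and ~1e8 years gives the "between 1K and 100M civilisations" range.
print(drake(1, 0.5, 2, 1, 1, 1, 1_000))        # 1000.0
print(drake(1, 0.5, 2, 1, 1, 1, 100_000_000))  # 100000000.0
```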
Returning to the AI timeline crux, Cotra's model was the following: TAI is created once someone spends enough compute on training. Specifically, compute_required(t) = compute_under_2020_knowledge / knowledge_factor(t), while compute_affordable(t) grows exponentially until it runs into bottlenecks related to the world's economy. Estimating compute_affordable required mankind only to keep track of who produces compute, how it is produced, and who is willing to pay for it. A similar procedure was used in the AI-2027 compute forecast.
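A minimal sketch of that model structure as I read it (the parameter values are placeholders chosen to show the mechanics, not Cotra's):

```python
def compute_required(year, req_2020_flop, halving_time_years):
    # knowledge factor: algorithmic progress halves the requirement over time
    return req_2020_flop / 2 ** ((year - 2020) / halving_time_years)

def compute_affordable(year, spend_2020_flop, doubling_time_years):
    # affordable training compute grows exponentially (economic bottlenecks ignored)
    return spend_2020_flop * 2 ** ((year - 2020) / doubling_time_years)

# TAI arrives in the first year where affordable compute exceeds required compute.
# All parameter values below are placeholders.
for year in range(2020, 2101):
    if compute_affordable(year, 1e24, 2.5) >= compute_required(year, 1e31, 3.0):
        print("TAI year under these toy parameters:", year)
        break
```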
Then Cotra proceeded to wildly misestimate. Her idea of the knowledge factor was that it makes the creation of TAI twice as easy every 2-3 years, which I doubt for reasons described in the collapsed sections below. Cotra's ideas on compute_under_2020_knowledge are total BS for reasons I detailed in another comment. Therefore, I fail to understand where Cotra was mistaken aside from using parameters that are total BS. And if Cotra's model were correct aside from the BSed parameters, the natural move would simply be to correct those parameters.
Cotra's rationalisation for TAI becoming twice as easy to create every few years
I consider two types of algorithmic progress: relatively incremental and steady progress from iteratively improving architectures and learning algorithms, and the chance of “breakthrough” progress which brings the technical difficulty of training a transformative model down from “astronomically large” / “impossible” to “broadly feasible.”
For incremental progress, the main source I used was Hernandez and Brown 2020, “Measuring the Algorithmic Efficiency of Neural Networks.” The authors reimplemented open source state-of-the-art (SOTA) ImageNet models between 2012 and 2019 (six models in total). They trained each model up to the point that it achieved the same performance as AlexNet achieved in 2012, and recorded the total FLOP that required. They found that the SOTA model in 2019, EfficientNet B0, required ~44 times fewer training FLOP to achieve AlexNet performance than AlexNet did; the six data points fit a power law curve with the amount of computation required to match AlexNet halving every ~16 months over the seven years in the dataset. They also show that linear programming displayed a similar trend over a longer period of time: when hardware is held fixed, the time in seconds taken to solve a standard basket of mixed integer programs by SOTA commercial software packages halved every ~13 months over the 21 years from 1996 to 2017.
Grace 2013 (“Algorithmic Progress in Six Domains”) is the only other paper attempting to systematically quantify algorithmic progress that I am currently aware of, although I have not done a systematic literature review and may be missing others. I have chosen not to examine it in detail because a) it was written largely before the deep learning boom and mostly does not focus on ML tasks, and b) it is less straightforward to translate Grace’s results into the format that I am most interested in (“How has the amount of computation required to solve a fixed task decreased over time?”). Paul is familiar with the results, and he believes that algorithmic progress across the six domains studied in Grace 2013 is consistent with a similar but slightly slower rate of progress, ranging from 13 to 36 months to halve the computation required to reach a fixed level of performance.
This means that the required compute was halving every ~16 months for AlexNet-level performance and every ~13 months for linear programming. While Claude Opus 4.5 does seem to think that Paul's belief is close to what Grace's paper implies, the paper's relevance is likely undermined by Cotra's own criticism. Next, Cotra listed the actual assumptions for each of the models, including the two clearly BSed ones. I marked her ideas on when the required compute halves in bold:
Cotra's actual assumptions
I assumed that:
Training FLOP requirements for the Lifetime Anchor hypothesis (red) are **halving once every 3.5 years** and there is only room to improve by ~2 OOM from the 2020 level -- moving from a median of ~1e28 in 2020 to ~1e26 by 2100.
Training FLOP requirements for the Short horizon neural network hypothesis (orange) are **halving once every 3 years** and there is room to improve by ~2 OOM from the 2020 level -- moving from a median of ~1e31 in 2020 to ~3e29 by 2100.
Training FLOP requirements for the Genome Anchor hypothesis (yellow) are **halving once every 3 years** and there is room to improve by ~3 OOM from the 2020 level -- moving from a median of ~3e33 in 2020 to ~3e30 by 2100.
Training FLOP requirements for the Medium-horizon neural network hypothesis (green) are **halving once every 2 years** and there is room to improve by ~3 OOM from the 2020 level -- moving from a median of ~3e34 in 2020 to ~3e31 by 2100.
Training FLOP requirements for the Long-horizon neural network hypothesis (blue) are **halving once every 2 years** and there is room to improve by ~4 OOM from the 2020 level -- moving from a median of ~1e38 in 2020 to ~1e34 by 2100.
Training FLOP requirements for the Evolution Anchor hypothesis (purple) are **halving once every 2 years** and there is room to improve by ~5 OOM from the 2020 level -- moving from a median of ~1e41 in 2020 to ~1e36 by 2100.
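To compare these assumptions with the Hernandez and Brown figures quoted earlier, here is a sketch of what one anchor implies under my reading of the list above (a toy reconstruction, not Cotra's code):

```python
def required_flop(year, median_2020, halving_years, max_oom_improvement):
    """Training FLOP requirement under one anchor: halve every `halving_years`
    until the cumulative improvement hits the stated OOM cap."""
    uncapped = median_2020 / 2 ** ((year - 2020) / halving_years)
    floor = median_2020 / 10 ** max_oom_improvement
    return max(uncapped, floor)

# E.g. the Long-horizon anchor: median ~1e38 in 2020, halving every 2 years, 4 OOM cap.
print(f"{required_flop(2040, 1e38, 2, 4):.1e}")  # ~9.8e+34 (cap not yet reached)
print(f"{required_flop(2100, 1e38, 2, 4):.1e}")  # 1.0e+34  (cap reached decades before 2100)

# For contrast, the ~16-month halving reported by Hernandez and Brown would,
# extrapolated naively, cut requirements by roughly the observed ~44x in seven years:
print(2 ** (7 * 12 / 16))  # ~38x over 2012-2019
```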
As far as Kokotajlo's memory can be trusted after the ChatGPT moment, he thought that there was a 50% chance of reaching AGI in 2030.
I also tried my hand at determining human values, but produced a different result, with implications for what the AIs should be aligned to. My take was that human collectives want to preserve themselves and the skills that most of their members have, and to avoid outsourcing-induced loss of those skills. In this case the role of the AIs would be severely reduced (to teachers and protectors, perhaps?)