The 2024 Review
Discussion Phase

Overconfidence from early transformative AIs is a neglected, tractable, and existential problem. If early transformative AIs are overconfident, then they might build ASI/other dangerous technology or come up with new institutions that seem safe/good, but ends up being disastrous. This problem seems fairly neglected and not addressed by many existing agendas (i.e., the AI doesn't need to be intent-misaligned to be overconfident).[1] Overconfidence also feels like a very "natural" trait for the AI to end up having relative to the pre-training prior, compared to something like a fully deceptive schemer. My current favorite method to address overconfidence is training truth-seeking/scientist AIs. I think using forecasting as a benchmark seems reasonable (see e.g., FRI's work here), but I don't think we'll have enough data to really train against it. Also I'm worried that "being good forecasters" doesn't generalize to "being well calibrated about your own work." On some level this should not be too hard because pretraining should already teach the model to be well calibrated on a per-token level (see e.g., this SPAR poster). We'll just have to elicit this more generally. (I hope to flush this point out more in a full post sometime, but it felt concrete enough to worth quickly posting now. I am fairly confident in the core claims here.) 1. ^ For example, you might not necessarily detect overconfidence in these AIs even with strong interpretability because the AI doesn't "know" that it's overconfident. I also don't think there are obvious low/high stakes control methods that can be applied here.

Tim Hua4h50

Starting a shortform to keep track of tweets that I want to refer back to. You should not think of METR task length as the task length for all software engineering tasks, but "well-defined software tasks that could make it into such a study" (Dec 21, 2025) Unbounded task complexity (i.e., there are actually just a lot of things for AIs to learn) & low sample efficiency in AIs -> long timelines. (Nov 19, 2025 & Dec 21, 2025) (See also my commentary on task complexity when the METR study was first released on Mar 25, 2025) 59 page Neel Nanda MATS paper (Oct 30, 2025) 50 page Neel Nanda MATS application (Mar 3, 2025) Agents that are capable of solving hard problems need not be consequentialist & reflectively stable (Mar 3, 2025) Givewell girl summer (July 14, 2024) My thesis on Fox New's Effects on Social Preferences (June 13, 2023)

Nisan2d654

Shankar Sivarajan, gustaf, and 1 more

GPT = Generative Pre-Training? Everyone thinks GPT stands for "Generative Pre-trained Transformer". (For example, Wikipedia.) Does it really? The earliest mention of "GPT" is in the GPT-2 paper, which refers to "the OpenAI GPT model" and cites the GPT-1 paper. That paper does not contain the phrase "generative pre-trained transformer". But it does contain the phrase "generative pre-training", in the title and in the body, italicized.

Elizabeth7h70

The post I originally wrote: But then I went looking for examples, and it seemed a little more complicated. It was harder to find the sneering comments than I expected. At least some people claim that, for the high-quality knock-off brand I was considering, it's about respect for intellectual property. That definitely could be cover for status-policing but I 🤨 at knock-off legos so I think there's some legitimacy to it. In this odyssey I also learned that "discount resellers" like Marshalls, and TJ Maxx[2] market themselves as selling overstock from other stores, but in fact most of their stock was always intended for their stores. Name brands will license their branding to the discounters to produce cheaper products with the designer label, so you can't trust their quality. For TJ Maxx and Marshalls you can check this by looking at the last digit in the ID number (1= produced for discounter, 2 = genuine overstock). 1. ^ Normally I'm fine with Amazon fast fashion but they save money on warm things by knitting them out of tissue paper. I suspected this problem was inherent in cheap sweaters but was afraid paying more would just get me more expensive tissue paper. I didn't feel like going through 10 return cycles, so I tried out spending real money on well regarded brands. It worked out okay, but a week later I found some delightful sweaters for $13 at Ross Dress For Less. One snagged a thread while I was in the store and I'll be shocked if they're wearable next winter, but for now they are quite warm[3]. 2. ^ Ross is more genuine overstock, although does still sell its own merchandise. 3. ^ As long as there's no wind- but that's a property of cable knit, not cheapness.

Karl Krueger3d5120

O O, sarahconstantin, and 5 more

10+ years ago, I expected that self-driving trucks would be common on US highways by 2025, and self-driving would be having a large effect on the employment of long-haul truckers. In reality, self-driving trucks are still in testing on a limited set of highways and driving conditions. The industry still wants to hire more human long-haul truckers, and is officially expected to keep doing so for some time. I expected that long-distance trucking would have overtaken passenger cars as the "face" of self-driving vehicles; the thing that people argue about when they argue whether self-driving vehicles are safe enough, good or bad for society, etc. This has not happened. When people argue about self-driving vehicles, they argue about whether they want Waymo cars in their city. I expected that the trucking industry would shed a lot of workers, replacing them with self-driving trucks that don't need sleep, breaks, or drug testing. I expected that this would be an vivid early example of mass job loss to AI; and in turn that this would motivate more political interest in UBI. This, too, has not happened. (I certainly did not expect that the trucking industry in 2025 would be much more disrupted by anti-immigrant politics than by self-driving technology.)

artkpv2d*337

Caleb Biddulph

Almost No Demostrations of Stego CoTs In the literature, I seldom see authors mix what I call misaligned, non‑monitorable reasoning with other types of non‑monitorable reasoning. I think this harms the discussion about the possibility of steganographic reasoning because it places chains of thought that are not steganographic in the category of those that are. Here is an example of what I call dog‑whistling. This is the loan‑application experiment from MONA, where a model learns to reason more negatively or questioningly about the English applicant. They write: Their Figure 6: However, as we see, the text does not contain illegible messages or reasoning, because a “negative or questioning light” is not illegible but dependent on the reader, while steganography always entails illegible writing. Another example is Skaf et al. where they show in Figure 2 that “models learn steganographic encoding schemes under CoT optimization pressure”: However, this again is not steganography, as the characters are perfectly legible; they simply convey a different meaning to another model not trained as in their experiment. In fact, one might just ask the model to describe what those symbols (“R”, “L”) mean — apparently “right” and “left.” This can be done during monitoring by asking the model what each symbol means (truncate the CoT and ask). This is not steganography because there is no covered writing. It is ciphered reasoning, where the act of hiding information is not itself hidden. A case of gaslighting (the post hoc reasoning) can be found in Baker et al., where they show how models fool a monitor into thinking that a unit test should not be implemented because of various confabulated reasons, as shown in Figure 5: This is not steganographic reasoning either, because there are no hidden secrets; all symbols are legible, but they convey no real content because the model has already decided to game the reward function by skipping the unit test. These were three examples

Karl Krueger12h60

Haiku, Eli Tyre, and 1 more

Does anyone else track changes in their beliefs or opinions about anything, over an extended period of time? Every few years I retake the Political Compass quiz, and there is a very clear trend over the past 15+ years.

Your Feed