Quick Takes


More Dakka On Your Expectations

A friend was telling me about his roommate's brash decision-making, driven by despair at having been rejected several times by girls he liked. Then my friend mentioned that his roommate had asked out a total of three people since high school. Only three!

While more factors were involved in the story, I've heard enough similar troubles that it seems worth saying: three people is not a lot. Certainly not enough rejections to merit the magnitude of self-worth issues people can walk away with from that few.

If you had the expec... (read more)

I made the same mistake when I was young. And it is difficult in retrospect to find out exactly why. I don't remember hearing or reading this explicitly, but somehow I got the idea that first you need to figure out who is your "true love" and then you need to ask them out and... hope that the feeling is reciprocated?

Which is why I have wasted lots of time worrying about how I truly feel about some person, and when I finally felt sure this was the right choice, I got rejected and was emotionally devastated.

And when I put it like this, of course it sounds co... (read more)

Frequency of physical interaction with a medium as a contributor to its addictive potential.

It seems to me that one contributor to the addictive nature of some things, short-form content in particular, could be the fact that you have to freaking touch your screen every seven seconds and can't take your eyes away, or you might end up watching some piece of content that is at the very least uninteresting. The reason YouTube is less addictive by a few degrees could be that the very modality of the content doesn't require you to touch it ... (read more)

Also, I'm new here so feedback/criticism would be welcomed, and direction on whether this is an appropriate "quick take" would be great as well.

I think it's okay, as in: you are not violating the local norms, but also there is a high risk of not getting any interaction, so don't be too disappointed.

On topic:

This is an interesting thought that I don't have a clear opinion on. Television is quite addictive to many people, but there is almost no interaction... except for switching channels. Smartphones are also quite addictive to many people, and there is con... (read more)

Does this help outer alignment?

Goal: tile the universe with niceness, without knowing what niceness is.

Method

We create:
- a bunch of formulations of what niceness is.
- a tiling AI, that given some description of niceness, tiles the universe with it.
- a forecasting AI, that given a formulation of niceness, a description of the tiling AI, a description of the universe and some coordinates in the universe, generates a prediction of what the part of the universe at the coordinates looks like after the tiling AI has tiled it with the formulation of niceness.

Foll... (read more)

I like the relative simplicity of this approach, but yeah, there is a risk that a tiling agent would produce (a more sophisticated version of) humans that have a permanent smile on their faces but feel horrible pain inside. Something bad that would look convincingly good at first sight, enough to fool the forecasting AI, or rather enough to fool the people who are programming and testing the forecasting AI.

(Not a take, just pulling out infographics and quotes for future reference from the new DeepMind paper outlining their approach to technical AGI safety and security)

Overview of risk areas, grouped by factors that drive differences in mitigation approaches: 


Overview of their approach to mitigating misalignment: 


Overview of their approach to mitigating misuse:


Path to deceptive alignment:


How to use interpretability:

| Goal | Understanding v Control | Confidence | Concept v Algorithm | (Un)supervised? | How context specific? |
|---|---|---|---|---|---|
| Alignment evaluations | Understanding | Any | Concept | | |
... (read more)

Some versions of the METR time horizon paper from alternate universes:

Measuring AI Ability to Take Over Small Countries (idea by Caleb Parikh)

Abstract: Many are worried that AI will take over the world, but extrapolation from existing benchmarks suffers from a large distributional shift that makes it difficult to forecast the date of world takeover. We rectify this by constructing a suite of 193 realistic, diverse countries with territory sizes from 0.44 to 17 million km^2. Taking over most countries requires acting over a long time horizon, with the excep... (read more)

wonder
Would the takeover of small countries also cover humans using an advanced AI to take over? (Or would a human-led takeover using advanced AI happen faster?)
Buck
A few months ago, I accidentally used France as an example of a small country that it wouldn't be that catastrophic for AIs to take over, while giving a talk in France 😬

Didn't watch the video, but is there a short version of this argument? France is at the 90th percentile of population sizes and also has the 4th-most nukes.

I'm aware of a study that found that the human brain clearly responds to changes in the direction of the earth's magnetic field (IIRC, the test chamber isolated the participant from the earth's field, then generated its own field and rotated it, while measuring their brain activity in some way), despite no human ever having been known to consciously perceive the magnetic field or to have the abilities of a compass.

So, presumably, compass abilities could be taught through a neurofeedback training exercise.

I don't think anyone has tried to do this ("neurofeedback magnetoreception" finds no results).

But I guess the big mystery is why humans don't already have this ability.

Alexander Gietelink Oldenziel
I've heard of this extraordinary finding. As for any extraordinary evidence, the first question should be: is the data accurate? Does anybody know if this has been replicated?

I briefly glanced at Wikipedia and there seemed to be two articles supporting it. This one might be the one I'm referring to (if not, it's a bonus), and this one seems to suggest that conscious perception has been trained.

has anyone seen a good way to comprehensively map the possibility space for AI safety research?

in particular: a map from predictive conditions (eg OpenAI develops superintelligence first, no armistice is reached with China, etc) to strategies for ensuring human welfare in those conditions.

most good safety papers I read map one set of conditions to one or a few strategies. the map would juxtapose all these conditions so that we can evaluate/bet on their likelihoods and come up with strategies based on a full view of SOTA safety research.

for format, I'm imagining either a visual concept map or at least some kind of hierarchical collaborative outlining tool (eg Roam Research)

every 4 years, the US has the opportunity to completely pivot its entire policy stance on a dime. this is more politically costly to do if you're a long-lasting autocratic leader, because it is embarrassing to contradict your previous policies. I wonder how much of a competitive advantage this is.

Coordinal Research: Accelerating the research of safely deploying AI systems.

 

We just put out a Manifund proposal to take short timelines and automating AI safety seriously. I want to make a more detailed post later, but here it is: https://manifund.org/projects/coordinal-research-accelerating-the-research-of-safely-deploying-ai-systems 

I think I've just figured out why decision theories strike me as utterly pointless: they get around the actual hard part of making a decision. In general, decisions are not hard because you are weighing payoffs, but because you are dealing with uncertainty.

To operationalise this: a decision theory usually assumes that you have some number of options, each with some defined payout. Assuming payouts are fixed, all decision theories simply advise you to pick the outcome with the highest utility. "Difficult problems" in decision theory are problems where the p... (read more)

To operationalise this: a decision theory usually assumes that you have some number of options, each with some defined payout. Assuming payouts are fixed, all decision theories simply advise you to pick the outcome with the highest utility.

The theories typically assume that each choice option has a number of known mutually exclusive (and jointly exhaustive) possible outcomes. And to each outcome the agent assigns a utility and a probability. So uncertainty is in fact modelled, insofar as the agent can assign subjective probabilities to those outcomes occurr... (read more)
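The picture the reply describes — options whose outcomes carry subjective probabilities and utilities, pick the option with the highest expected utility — fits in a few lines. A minimal sketch (the options and all the numbers here are invented for illustration):

```python
# Toy expected-utility choice. Each option maps to (probability, utility)
# pairs over mutually exclusive outcomes; probabilities are subjective.
options = {
    "umbrella":    [(0.3, 5), (0.7, 4)],     # rain, no rain
    "no_umbrella": [(0.3, -10), (0.7, 6)],
}

def expected_utility(outcomes):
    """Sum of probability-weighted utilities over an option's outcomes."""
    return sum(p * u for p, u in outcomes)

# Every decision theory of this shape then just says: take the argmax.
best = max(options, key=lambda o: expected_utility(options[o]))
print(best)  # → umbrella  (EU 4.3 beats 1.2)
```

The uncertainty lives entirely in the subjective probabilities; once those are fixed, the "decision" reduces to an argmax, which is the quick take's complaint.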

mako yass
I think unpacking that kind of feeling is valuable, but yeah it seems like you've been assuming we use decision theory to make decisions, when we actually use it as an upper bound model to derive principles of decisionmaking that may be more specific to human decisionmaking, or to anticipate the behavior of idealized agents, or (the distinction between CDT and FDT) as an allegory for toxic consequentialism in humans.

In [Intro to brain-like-AGI safety] 10. The alignment problem and elsewhere, I’ve been using “outer alignment” and “inner alignment” in a model-based actor-critic RL context to refer to:

“Outer alignment” entails having a ground-truth reward function that spits out rewards that agree with what we want. “Inner alignment” is having a learned value function that estimates the value of a plan in a way that agrees with its eventual reward.

For some reason it took me until now to notice that:

... (read more)

“Outer alignment” entails having a ground-truth reward function that spits out rewards that agree with what we want. “Inner alignment” is having a learned value function that estimates the value of a plan in a way that agrees with its eventual reward.

I guess I just briefly want to flag that I think this summary of inner-vs-outer alignment is confusing: it makes it sound like one could have a good-enough ground-truth reward, and then that reward just has to be internalized.

I think this summary is better: 1. "The AGI was doing the wrong thing but got rewarded anyway (or doing the right thing but got punished)". 2. Something else went wrong [not easily compressible].

Towards_Keeperhood
Sounds like we probably agree basically everywhere. Yeah, you can definitely mark me down in the camp of "not use 'inner' and 'outer' terminology". If you need something for "outer", how about "reward specification (problem/failure)".

ADDED: I think I probably don't want a word for inner-alignment/goal-misgeneralization. It would be like having a word for "the problem of landing a human on the moon, except without the part of the problem where we might actively steer the rocket into wrong directions".

Yeah, I agree they don't appear in actor-critic model-based RL per se, but sufficiently smart agents will likely be reflective, and then they will appear there on the reflective level I think. Or more generally I think when you don't use utility functions explicitly then capability likely suffers, though not totally sure.
Towards_Keeperhood
Thanks. Yeah, I guess I wasn't thinking concretely enough. I don't know whether something vaguely like what I described might be likely or not. Let me think out loud a bit about how I think about what you might be imagining so you can correct my model. So here's a bit of rambling: (I think point 6 is most important.)

1. As you described in your intuitive self-models sequence, humans have a self-model which can essentially have values different from the main value function, aka they can have ego-dystonic desires.
2. I think in smart reflective humans, the policy suggestions of the self-model/homunculus can be more coherent than the value function estimates, e.g. because they can better take abstract philosophical arguments into account.
   1. The learned value function can also update on hypothetical scenarios, e.g. imagining a risk or a gain, but it doesn't update strongly on abstract arguments like "I should correct my estimates based on outside view".
3. The learned value function can learn to trust the self-model if acting according to the self-model is consistently correlated with higher-than-expected reward.
4. Say we have a smart reflective human where the value function basically trusts the self-model a lot; then the self-model could start optimizing its own values, while the (stupid) value function believes it's best to just trust the self-model and that this will likely lead to reward. Something like this could happen where the value function was actually aligned to outer reward, but the inner suggestor was just very good at making suggestions that the value function likes, even if the inner suggestor would have different actual values. I guess if the self-model suggests something that actually leads to less reward, then the value function will trust the self-model less, but outside the training distribution the self-model could essentially do what it wants.
   1. Another question of course is whether the inner self-reflective optimizers are likely al

Recently, several promising diffusion language models have been introduced (Dream7b, Ladda). They are still based on transformers. In case they become more popular, how will they impact the interpretability and scaling of LLMs?

Any chance we could get Ghibli Mode back? I miss my little blue monster :(

Lee Billings' book Five Billion Years of Solitude has the following poetic passage on deep time that's stuck with me ever since I read it in Paul Gilster's post:

Deep time is something that even geologists and their generalist peers, the earth and planetary scientists, can never fully grow accustomed to. 

The sight of a fossilized form, perhaps the outline of a trilobite, a leaf, or a saurian footfall can still send a shiver through their bones, or excavate a trembling hollow in the chest that breath cannot fill. They can measure celestial motions and l

... (read more)
lc

My strong upvotes are now giving +1 and my regular upvotes give +2.


Hmm I wonder if this is why so many April Fools posts have >200 upvotes. April Fools Day in cahoots with itself?

Richard_Kennaway
I notice that although the loot box is gone, the unusually strong votes that people made yesterday persist.
Richard_Kennaway
But now they’re gone! I didn’t expect them to be real, but still, owowowowow! That’s loss aversion for you.

Context: LessWrong has been acquired by EA 

Goodbye EA. I am sorry we messed up. 

EA has decided to not go ahead with their acquisition of LessWrong.

Just before midnight last night, the Lightcone Infrastructure board presented me with information suggesting at least one of our external software contractors has not been consistently candid with the board and me. Today I have learned EA has fully pulled out of the deal.

As soon as EA had sent over their first truckload of cash, we used that money to hire a set of external software contractors, vetted ... (read more)

How much serendipity is too much?

Twice now, within the month-to-date, I've started seriously looking into an idea and, within a week or two, a directly related post finds its way to the HackerNews front page. As far as I know, HN doesn't have a personalization engine that could've tracked me across the internet to show these to me specifically.

I don't mean to imply anything crazy is going on here -- it's entirely possible that some decent subsection of the HN cohort was primed in some way, and I'm perfectly aware of the frequency bias (and more vulnerable ... (read more)

"Just ask the LM about itself" seems like a weirdly effective way to understand language models' behaviour. 

There's lots of circumstantial evidence that LMs have some concept of self-identity. 

... (read more)
eggsyntax
Self-identity / self-modeling is increasingly seeming like an important and somewhat neglected area to me, and I'm tentatively planning on spending most of the second half of 2025 on it (and would focus on it more sooner if I didn't have other commitments). It seems to me like frontier models have an extremely rich self-model, which we only understand bits of. Better understanding, and learning to shape, that self-model seems like a promising path toward alignment.

I agree that introspection is one valuable approach here, although I think we may need to decompose the concept. Introspection in humans seems like some combination of actual perception of mental internals (I currently dislike x), ability to self-predict based on past experience (in the past when faced with this choice I've chosen y), and various other phenomena like coming up with plausible but potentially false narratives. 'Introspection' in language models has mostly meant ability to self-predict, in the literature I've looked at.

I have the unedited beginnings of some notes on approaching this topic, and would love to talk more with you and/or others about it. Thanks for this, some really good points and cites.
the gears to ascension
willingness seems likely to be understating it. a context where the capability is even part of the author context seems like a prereq. finetuning would produce that, with fewshot one has to figure out how to make it correlate. I'll try some more ideas.

a context where the capability is even part of the author context

Can you unpack that a bit? I'm not sure what you're pointing to. Maybe something like: few-shot examples of correct introspection (assuming you can identify those)?

Nice reminiscence from Stephen Wolfram on his time with Richard Feynman:

Feynman loved doing physics. I think what he loved most was the process of it. Of calculating. Of figuring things out. It didn’t seem to matter to him so much if what came out was big and important. Or esoteric and weird. What mattered to him was the process of finding it. And he was often quite competitive about it. 

Some scientists (myself probably included) are driven by the ambition to build grand intellectual edifices. I think Feynman — at least in the years I knew him — was m

... (read more)

LLM activation space is spiky. This is not a novel idea but something I believe many mechanistic interpretability researchers are not aware of. Credit to Dmitry Vaintrob for making this idea clear to me, and to Dmitrii Krasheninnikov for inspiring this plot by showing me a similar plot in a setup with categorical features.

Under the superposition hypothesis, activations are linear combinations of a small number of features. This means there are discrete subspaces in activation space that are "allowed" (can be written as the sum of a small number of features... (read more)
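A minimal toy sketch of that geometry (the sizes and sampling choices are my own assumptions, not from the post): with more features than dimensions, each activation is a sparse combination of a few feature directions, so the set of "allowed" points is a union of low-dimensional subspaces — spiky — rather than a filled-in region.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy superposition setup: 256 unit-norm feature directions crammed
# into 64 dimensions, with only k=3 features active per activation.
d, n_feats, k = 64, 256, 3
features = rng.standard_normal((n_feats, d))
features /= np.linalg.norm(features, axis=1, keepdims=True)

def sample_activation(rng):
    """Activation = sparse positive combination of a few feature directions."""
    idx = rng.choice(n_feats, size=k, replace=False)
    coeffs = rng.exponential(size=k)
    return coeffs @ features[idx], idx

acts = [sample_activation(rng) for _ in range(1000)]

# Each activation lies exactly in the k-dim span of its active features, so
# the allowed region is a union of C(256, 3) tiny subspaces, not a ball.
a, idx = acts[0]
coefs = np.linalg.lstsq(features[idx].T, a, rcond=None)[0]
proj = coefs @ features[idx]
print(np.allclose(a, proj))  # → True: the point sits inside its feature span
```

A generic dense point in the 64-dimensional space, by contrast, is far from every one of these sparse subspaces with high probability, which is what makes the allowed set "spiky".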
