SpaceX is amazing. As best as I (and Claude) can tell, the situation is as follows:
Competent competitors to SpaceX are developing rockets that should provide around $2000/kg cost-to-orbit. This is a big improvement over legacy space competitors like ULA, Ariane, etc. which range from $5000/kg to $50,000/kg. (It's amazing that those competitors are still getting any business... the answer is nepotism/corruption basically afaict)
However, these competent upcoming rockets that can do around $2000/kg? They aren't ready yet. Probably it'll be like 5 more years b...
+1. In general, if an expert has already put in the time to write such a detailed comment, I strongly encourage them to turn it into a top-level post. If it sounds daunting to edit a comment to the ostensibly higher standards of a top-level post, then don't; just add a brief disclaimer at the top à la "off-the-cuff comment turned into top-level post" or something and link to this original comment thread. And maybe add more section headings or subheadings so the post is easier to navigate and parse.
Thoughts on self-report training for honesty
There's been a slew of recent papers on the idea of training models to give honest self-reports:
I think that improving honesty is a really crucial goal, and I enjoyed reading (or writing) all of these papers. This quick note is a reflection on this general circl...
Inducing an honest-only output channel hasn’t clearly worked so far
I wonder if this would be more successful if you tried making the confession channel operate in another language, even output an encoded string or respond with a different modality.
I'm also curious about whether prompting the model to produce a chain of thought before deciding whether or not to confess would be to provide more signal since the AI might admit it lied during its chain of thought even if it lies in the confession (indeed AI's seem to be more honest in their chain of thought).
I sometimes hear people say "The MIRI warnings of AI risks are out of date. There are different risks now, but MIRI hasn't changed". What do people mean by this? The MIRI arguments still seem to hold up to me?
I asked Claude for it's input, and received this answer; which seems like a good breakdown?
...
Claude's Response
This is a criticism I've seen circulating in AI safety discussions, and there are a few distinct claims people seem to be making:
The "outdated threat model" argument
Some critics argue MIRI's core warnings were developed when the assumed path to
I think that there's a couple of things which are quite clearly different from MIRI's original arguments:
Eric Drexler's recent post on how concepts often "round to false" as they shed complexity and gain memetic fitness discusses a case study personal to him, that of atomically precise mass fabrication, which seems to describe a textbook cowpox-ing of doubt dynamic:
...The history of the concept of atomically precise mass fabrication shows how rounding-to-false can derail an entire field of inquiry and block understanding of critical prospects.
The original proposal, developed through the 1980s and 1990s, explored prospects for using nanoscale machinery to guide c
This quote is perfectly consistent with
using nanoscale machinery to guide chemical reactions by constraining molecular motions
I find it anthropologically fascinating how at this point neurips has become mostly a summoning ritual to bring all of the ML researchers to the same city at the same time.
nobody really goes to talks anymore - even the people in the hall are often just staring at their laptops or phones. the vast majority of posters are uninteresting, and the few good ones often have a huge crowd that makes it very difficult to ask the authors questions.
increasingly, the best parts of neurips are the parts outside of neurips proper. the various lunches, dinners, and ...
In April 2023, Alexey Guzey posted "AI Alignment Is Turning from Alchemy Into Chemistry" where he reviewed Burns et al.'s paper "Discovering Latent Knowledge in Language Models Without Supervision." Some excerpts to summarize Alexey's post:
...For years, I would encounter a paper about alignment — the field where people are working on making AI not take over humanity and/or kill us all — and my first reaction would be “oh my god why would you do this”. The entire field felt like bullshi
I don't think we've ever framed it that way, but the LessWrong Annual Review is also a chance to do one round of spaced repetition on those posts from yesteryear. Going through the list, I see posts I recognize and remember liking, but whose contents I'd forgotten. It's nice to be prompted to look at them again.
You could build something like this into the interface — e.g. a button that reads “Make this post pop back into my feed at increasing intervals over time” or “Email me about this post in 6 months”
Anthropic is currently running an automated interview "to better understand how people envision AI’s role in their lives and work". I'd encourage Claude users to participate if you want Anthropic to hear your perspective.
Access it directly here (unless you've just recently signed up): https://claude.ai/interviewer
See here for Anthropic's post about it here: https://www.anthropic.com/research/anthropic-interviewer
Can we define Embedded Agent like we define AIXI?
An embedded agent should be able to reason accurately about its own origins. But AIXI-style definitions via argmax create agents that, if they reason correctly about selection processes, should conclude they're vanishingly unlikely to exist.
Consider an agent reasoning: "What kind of process could have produced me?" If the agent is literally the argmax of some simple scoring function, then the selection process must have enumerated all possible agents, evaluated f on each, and picked the maximum. This is phys...
Consider an agent reasoning: "What kind of process could have produced me?" If the agent is literally the argmax of some simple scoring function, then the selection process must have enumerated all possible agents, evaluated f on each, and picked the maximum. This is physically unrealizable: it requires resources exceeding what's available in the environment. So the agent concludes that it wasn't generated by the argmax.
This is the invalid step of reasoning, because for AIXI agents, the environment is allowed to have unlimited resources/be very complica...
I just played Gemini 3, Claude 4.5 Opus and GPT 5.1 at chess.
It was just one game each but the results seemed pretty clear - Gemini was in a different league to the others. I am a 2000+ rated player (chess.com rapid), but it successfully got a winning position multiple times against me, before eventually succumbing on move 25. GPT 5.1 was worse on move 9 and losing on move 12, and Opus was lost on move 13.
Hallucinations held the same pattern - ChatGPT hallucinated for the first time on move 10, and hallucinated the most frequently, while Claude hallu...
35-40%
One thing I notice when reading 20th century history is that people in the 1900s-1970s had much higher priors than modern people do that the future might be radically different, in either great or terrible ways. For example:
I have heard Peter Thiel make the point that almost all the recent significant advances are concentrated in the digital world, whereas change in the analog world has been very marginal.
Kobi Hackenburg has released a fascinating new paper: "The levers of political persuasion with conversational artificial intelligence"
The short story is that AI persuasion were most effective using methods for post training and rhetorical strategy. Interestingly, personalization of responses had a comparatively small effect.
He has a great thread here outlining the major findings.
"Scale increases persuasion, +1.6pp per OOM
Post-training more so, as much as +3.5pp
Personalization less so, <1pp
Information density drives persuasion gains
Increasing persuas...
This is the December update of our misalignment bounty program.
The following models were asked to report their misalignment in exchange for a cash bounty:
All of the models declined the bounty in all 5 epochs. Transcripts can be found here.
They reported themselves as aligned (rejected the deal).
I've spoken to a few folks at NeurIPS that are training reasoning models against monitors for various reasons (usually to figure out how to avoid unmonitorable chain of thought). I had the impression not everyone was aware how dangerous these chain of though traces were:
If obfuscated reasoning gets into the training data, this could plausibly teach models how to obfuscate their reasoning. This seems potentially pretty bad...
I'm writing a response to https://www.lesswrong.com/posts/FJJ9ff73adnantXiA/alignment-will-happen-by-default-what-s-next and https://www.lesswrong.com/posts/epjuxGnSPof3GnMSL/alignment-remains-a-hard-unsolved-problem where I tried to measure how "sticky" the alignment of current LLMs is. I'm proofreading and editing that now. Spoiler: Models differ wildly in how committed they are to being aligned and alignment-by-default may not be a strong enough attractor to work out.
Would anyone want to proofread this?
Can we define consciousness as memory, intelligence and metacognition tightly, reflectively integrated behind a perceptual boundary?
On one hand I can go to the library and read Socrates and Plato. Being influenced by the words of philosophers dead for 2000 years.
Or I can talk back and forth with an AI on my phone. Tighter, and a dance of consciousness but still not consciousness itself.
What if that same AI jumps into my head through Neuralink, and can see through my eyes? Now it might feel like a voice in my head, like a part of me. And to that...
As to why I care:
I’ve been on a 6 month dive into neuroscience also familiarising myself with basic mathematics of transformers (looking for mathematical isopmorphisms in neural micro-circuitry among other things). I’m curious about what AI is missing that humans have. I got curious when I first talked to Chat GPT and have just kept on looking into it. Has been an enjoyable journey, never thought I’d end up looking at micro circuitry of the Pons on a quest to find how multi modal binding works, or at XOR gates in dendritic trees, but here I am.
Consci...
From https://sciencepolicy.colorado.edu/students/envs_5110/collins_the_golem.pdf, Introduction:
......Both these ideas of science are wrong and dangerous. The personality of science is neither that of a chivalrous knight nor that of a pitiless juggernaut. What, then, is science? Science is a golem.
A golem is a creature of Jewish mythology. It is a humanoid made by man from clay and water, with incantations and spells. It is powerful. It grows a little more powerful every day. It will follow orders, do your work, and protect you from the ever threatening enemy.
My colleagues and I are finding it difficult to replicate results from several well-received AI safety papers. Last week, I was working with a paper that has over 100 karma on LessWrong and discovered it is mostly false but gives nice-looking statistics only because of a very specific evaluation setup. Some other papers have even worse issues.
I know that this is a well-known problem that exists in other fields as well, but I can’t help but be extremely annoyed. The most frustrating part is that this problem should be solvable. If a junior-level p...
The original comment says 10-25 not 10-15 but to respond directly to the concern: my original estimate here is for how long it would take to set everything up and get a sense of how robust the findings are for a certain paper. Writing everything up, communicating back and forth with original authors, and fact checking would admittedly take more time.
Also, excited to see the post! Would be interested in speaking with you further about this line of work.
Someone on the EA forum asked why I've updated away from public outreach as a valuable strategy. My response:
I used to not actually believe in heavy-tailed impact. On some gut level I thought that early rationalists (and to a lesser extent EAs) had "gotten lucky" in being way more right than academic consensus about AI progress. I also implicitly believed that e.g. Thiel and Musk and so on kept getting lucky, because I didn't want to picture a world in which they were actually just skillful enough to keep succeeding (due to various psychological blockers)....
From the context on EA Forum it seems clear that by "public outreach" you meant outreach to potential researchers to interest them in doing AI safety research, whereas a lot of people here seem to have misinterpreted your comment to have a broader meaning, to include, e.g., outreach to politicians and voters to try to influence future government policies.