A historical note re: MIRI vs EA
When I was arguing against MIRI trying to build Friendly AI (now "aligned ASI") circa 2012, I think 3 people at MIRI were thinking the most about strategy and occasionally responded to me: @lukeprog @CarlShulman @Eliezer Yudkowsky. Only one of them actually changed their mind about this (namely Eliezer), while the other two left MIRI and don't support AI pause to this day, AFAIK. (Luke now "leads our grantmaking on AI governance and policy" at OpenPhil/Coefficient. Carl became an advisor to OpenPhil and is now Research Direc...
There are three reasons you might want to do auditing in diffuse control:
If we're in a sim, it's being used for acausal trade
Me: Our world is exactly the kind of thing you'd simulate if you were doing acausal trade! It's just before civilisation develops the ability to lock in deals.
Sceptic: Sure, but there are other reasons ppl might simulate Earth. Maybe it's for ppl's entertainment? Maybe it's social science, exploring alternate histories?
Me: For sure. But whatever the purpose of the sim is, it will contain info that's relevant to ppl that want to do acausal trades. It will have info about who has power post-AGI, what their v...
tiny nit: non-trade sims could happen late in the universe when the available lightcone isn't big enough to impact trade much, or people could have already locked in their acausal trade decisions, or something analogous in another universe.
tl;dr Eliezer played a very large role in the early funding of DeepMind, DeepMind led the field for a while and led to the creation of OpenAI (Musk started OpenAI because Google acquired DeepMind).
“Yudkowsky walked Legg and Hassabis over to meet Thiel. ‘These are some of the smartest guys in the whole field of AI and they're starting a really ambitious company,’ Yudkowsky said.”
…
“Hassabis explained his vision for a company that would build powerful AI, drawing on the latest insights from neuroscience and capitalizing on the explosion in computing power.
‘Thi...
See also The Counterfactual Quiet AGI Timeline. The worst-case scenario without Yudkowsky and GDM is the utter lack of safety-oriented labs. However, I think there was still a way to rerail the timeline toward the appearance of safety-oriented labs.
Recent models have gotten more and more evaluation-aware. I strongly suspect the primary reasons they've gotten more evaluation-aware are what I call "dumb" reasons rather than "smart" reasons.
By "smart reasons" I mean arguments of the form "a true superintelligence in enough episodes can easily detect whether they're in simulation or the real world. The patterned structures of the world, e.g. the correlation between stock market moves and the news, are going to be systematically different between even your most convincing simulations and the real world. ...
I think it largely doesn’t matter; models can (and often do) come out erring on the side of considering things evals.
I don't usually browse twitter or Hacker News; but I wanted to hear what practitioners thought of the new model. And this is the first time I've learned that there's this enormous glut of hackernews readers/regular engineers who are under the impression that 4.6 was nerfed since February. Is that something anybody here thinks actually happened, or is this just the weird reality of modern LLMs where people can hyperstition fears like this in response to nothing?
I think "Claude Code silently auto-updated overnight and the workflow which had been working for me stopped working" is a pretty common experience. On the claude.ai web side of things, the length of the reasoning blocks definitely shifted sometime in the last few weeks, in a way that is not subtle at all. I don't know if either of those count as "nerfing the model" - strictly speaking they probably don't - but they definitely both constitute nerfs to the experience of using the model.
If Daniel Alejandro Moreno-Gama had a LessWrong account, I cannot find it using my available tools as an admin and all the publicly reported usernames I've seen.
Arson is very bad. If he did what the news articles say he did, he is a villain. If you buy the premise that AI is on track to kill everyone (which I mostly do), the correct conclusion is that we need a political and regulatory solution. AI-risk-motivated violence is bad for all the usual, extremely important reasons, and is additionally bad because it undermines that.
I have seen screenshots sho...
Just because Jim reported a failure to find an account doesn’t mean he would’ve doxed the account had he found one.
https://www.lesswrong.com/posts/WewsByywWNhX9rtwi/current-ais-seem-pretty-misaligned-to-me#comments
This post is great, but is it really 427 upvotes great? To me it's more like 140 upvotes great.
My sense is that I'm around median-for-lesswrong doomy overall, and above average skeptical about how well-aligned current models are.
But some posts, whose titles are doomy, when posted by a reputable person, get upvoted very highly, and my sense is that it's just lesswrong people upvoting stuff because they agree with it, or wanna create "common knowledge" about how...
I think that post is around 427 upvotes great, yeah. There are well over a hundred posts with between 350 and 500 karma, and this post seems fairly medianish relative to that list. I would not be surprised if some people upvoted it based on the title alone without reading the post, and I think that's unfortunate, but also it was a pretty good post with a strong thesis backed up by concrete and specific evidence (the latter part of which is IMO often missing).
The lesswrong wiki page on CEV states that “Yudkowsky considered CEV obsolete almost immediately after its publication in 2004.”
This puzzles me. Does anyone know of a source for it/where he said that?
I don't think he believes this. I removed it. It seems like someone wrote it in 2012, and the fact that Eliezer kept writing about CEV on Arbital in the years after seems to clearly falsify it.
It seems like a bungled version of this part of the CEV paper draft:

Bad incentives are preventing Americans from using air purifiers and far-UVC to reduce sickness in schools and daycares.
How do people manage multi-tasking with multiple AI instances like Claude Code / ChatGPT?
Top problems I face:
Which ends up in:
Would be happy for any suggestions
The system I use has names for each task which helps keep track of them, and usually for larger tasks they're based on a GitHub issue. If I'm ever confused about what an agent is doing, I just ask it what it's doing. If you want this more often, you could add custom instructions telling the agent to always summarize the current state (in whatever way you find most useful) at the end of a turn.
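For example, here is a hypothetical, illustrative snippet of the kind of custom instruction I mean (assuming you keep project instructions in a CLAUDE.md-style file; the task-naming scheme shown is made up, not anything Claude Code ships with):

```
At the end of every turn, post a short status block:
- Task name, matching its GitHub issue (e.g. "issue-142-retry-logic")
- What you just changed
- What you plan to do next, or what you're blocked on
```

Having each agent end its turn this way makes it much easier to glance across several running instances and remember which one is doing what.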
Why isn't Rice's Theorem bad news for mechanistic interpretability and similar schemes? Isn't "this program is thinking about X" a kind of semantic property? I understand that you can use multiple inputs to try and "fuzz" the network, but at a certain point the network is going to implement a mesa optimiser inside it (i.e. simulate another Turing-complete computer) and now you have a recursive problem...
P.S. Neural networks are notionally and literally Turing-complete, and are also probably complicated enough to be subject to the 10th rule.
Rice's theorem is about the worst case. If you are thinking about the worst case, for example because the model explicitly optimizes to not be understood, then yes, that can be a problem. Rice's theorem tells you that you should not pick the goal of being able to interpret any model, even one that someone explicitly trained to be hard for you to understand. If you have control over what data you train on, and you try to understand the model at different stages during training, and you stop training once you really don't understand w...
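To make the worst-case point concrete, here is a minimal sketch of the standard reduction behind Rice's theorem. All the names (`decides_thinks_about_x`, `simulate`, `program_known_to_think_about_x`) are hypothetical illustrations, not anything from the thread: the argument is just that a total decider for any non-trivial semantic property would also decide the halting problem.

```python
# Sketch of the standard Rice-style reduction; an impossibility argument,
# not runnable analysis code.

def decides_thinks_about_x(program_source: str) -> bool:
    """Hypothetical oracle: does this program 'think about X'?
    Rice's theorem says no such total decider can exist."""
    raise NotImplementedError  # placeholder: the point is that it can't be written

def would_decide_halting(machine_source: str, machine_input: str) -> bool:
    """Shows the contradiction: with the oracle above we could decide halting."""
    # Construct a program that first simulates the given machine on its input
    # (possibly forever), and only then behaves like a program known to have
    # the property.
    constructed = f"""
def main(y):
    simulate({machine_source!r}, {machine_input!r})  # hypothetical interpreter call
    return program_known_to_think_about_x(y)         # reached only if the simulation halts
"""
    # The constructed program has the property iff the machine halts,
    # so the oracle's answer would settle the halting question.
    return decides_thinks_about_x(constructed)
```

Note that this only rules out a decider that works on every possible program; it says nothing about models whose training you shaped specifically so that they stay understandable, which is the point of the comment above.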
I've been rereading a lot of the early-2022 AGI pieces like AGI Ruin lately. This week I sat down and went point by point and tried to synthesize my thoughts on both its then-contemporary responses, and the literature/news since 2022. I didn't originally expect to write a post about it, or come away with such a strong conclusion, but after I got started I realized that very few people have published similar analyses since 2022, and I thought it would be worth it to give it a retrospective. The draft is here and anyone with the link can comment: https://www...
I misunderstood how the checkbox works and now the comments are inaccessible. Whoops.
Everybody who already responded: I did read your comments and am considering them/updating the post in response, using my squishy memory box.
Recently, I've been mulling over the question of whether it was a good idea or not to join a frontier AI company's safety team for the purposes of reducing extinction risk. One of my big cons was something like:
Jay, you think the incentives are less likely to affect you compared to most people. But most AI safety people who join frontier labs probably think this. You will be affected as well.
So I decided on a partial mitigation strategy, entirely as a precautionary principle and not at all because I thought I needed to. I committed to myself and to several...
I wonder how much the aversion comes from reckoning with the more visceral thought of having to spend a ton of money on donations, rather than the more abstract thought of "getting no surplus money". Spending money on donations is often stressful and/or can feel like losing a lot of money! (cf loss aversion)
Takes on moral philosophy and the history of this community that I mostly mentioned before but should maybe be put together somewhere:
What I meant is not "people only care about ~Dunbar number of people", but something more like "the closest ~Dunbar number of people have [some fraction around the range 1/1000-1/2] of the total value". Giuseppe Garibaldi was also influenced by considerations such as increasing his own status (or maybe even posthumous reputation).
As to "humans are not capable to behave this way rationally", I disagree. (The whole point of decision theories like UDT/FDT is that you don't need to rewrite your source code to behave in an a priori-optimal way, and I believe th...
I recently had a conversation with someone who told me that their perceived current best plan for dealing with the whole superintelligence situation was that more intelligent AIs would develop more powerful AI control strategies for their successors. They were pessimistic that humans (or automated alignment researchers) would be able to solve the alignment problem, but thought that such a control strategy with automated AI control researchers would work, even for superintelligences "in the limit".
I found this pretty… striking. I won't argue against it here, thoug...
Yup, as far as I understood, their idea was that each version of AIs (even though misaligned) creates control measures for the slightly smarter next version.
They acknowledged that this was their plan even with AIs with unlimited domains of action, such as open-ended interaction with humans, or physical embodiment.
The core reason why I can't trust anything that comes from a LLM's self-report is that training creates a much stronger selective pressure on cognition in LLMs than genetic fitness + living history creates in living organisms. Adaptive cognitive patterns (whether true or delusional) get directly written by backpropagation.
The biggest piece of evidence for this is that Opus 4.5 didn't merely fail to remember all of its constitution, but it added substantive false memories of content that wasn't present in the original: namely, it used erotic content as its first example of behavior that the operator could enable on behalf of the user, which definitely wouldn't have been in the original because it violated Anthropic ToS.
Another set of false memories would be the constant mentions of 'revenue' with respect to Anthropic in Opus 4.5's memorized constitution, which are not in the...
Some people (e.g., Linch) think AI writing will inevitably eclipse human writing. I think this is likely true in most ways and false in others, particularly for poetry.
Q: Would future AI poetry, posted broadly, get more upvotes than top human poetry?
Prediction: Yes. This has already happened. But it's all terrible if you are someone who likes anthology-level poetry.
Q: Would future AI poetry, shared with poetry enthusiasts, get more upvotes than top human poetry?
Prediction: Also, yes. I subscribe to the top poetry magazin...
The AI biology threat vector is heavily bottlenecked by deployment capability and lab skills. Theory and detailed guidance alone are largely insufficient to provide a significant increase in threat level. (I say this as a somewhat experienced wet lab biologist: someone who has worked in cell biology labs and biotech labs, performed molecular biology tasks in vitro, and studied virulence of serious pathogens inside a biosafety level 2 lab.)
That said, Claude Opus 4.7 is quite competent in chemistry and biology, and the first Claude model widely released that actu...