A historical note re: MIRI vs EA
When I was arguing against MIRI trying to build Friendly AI (now "aligned ASI") circa 2012, I think 3 people at MIRI were thinking the most about strategy and occasionally responded to me: @lukeprog @CarlShulman @Eliezer Yudkowsky. Only one of them actually changed their mind about this (namely Eliezer), while the other two left MIRI and don't support AI pause to this day, AFAIK. (Luke now "leads our grantmaking on AI governance and policy" at OpenPhil/Coefficient. Carl became an advisor to OpenPhil and is now Research Direc...
There are three reasons you might want to do auditing in diffuse control:
If we're in a sim, it's being used for acausal trade
Me: Our world is exactly the kind of thing you'd simulate if you were doing acausal trade! It's just before civilisation develops the ability to lock in deals.
Sceptic: Sure, but there are other reasons ppl might simulate Earth. Maybe it's for ppl's entertainment? Maybe it's social science, exploring alternate histories?
Me: For sure. But whatever the purpose of the sim is, it will contain info that's relevant to ppl that want to do acausal trades. It will have info about who has power post-AGI, what their v...
tiny nit: non-trade sims could happen late in the universe when the available lightcone isn't big enough to impact trade much, or people could have already locked in their acausal trade decisions, or something analogous in another universe.
tl;dr Eliezer played a very large role in the early funding of DeepMind, DeepMind led the field for a while and led to the creation of OpenAI (Musk started OpenAI because Google acquired DeepMind).
“Yudkowsky walked Legg and Hassabis over to meet Thiel. ‘These are some of the smartest guys in the whole field of AI and they're starting a really ambitious company,’ Yudkowsky said.”
…
“Hassabis explained his vision for a company that would build powerful AI, drawing on the latest insights from neuroscience and capitalizing on the explosion in computing power.
‘Thi...
See also The Counterfactual Quiet AGI Timeline. The worst-case scenario without Yudkowsky and GDM is the utter lack of safety-oriented labs. However, I think there was still a way to rerail the timeline toward the appearance of safety-oriented labs.
Recent models have gotten more and more evaluation-aware. I strongly suspect the primary reasons they've gotten more evaluation-aware are what I call "dumb" reasons rather than "smart" reasons.
By "smart reasons" I mean arguments of the form "a true superintelligence in enough episodes can easily detect whether they're in simulation or the real world. The patterned structures of the world, e.g. the correlation between stock market moves and the news, are going to be systematically different between even your most convincing simulations and the real world. ...
I think it largely doesn’t matter; models can (and often do) come out erring on the side of considering things evals.
I don't usually browse twitter or Hacker News; but I wanted to hear what practitioners thought of the new model. And this is the first time I've learned that there's this enormous glut of hackernews readers/regular engineers who are under the impression that 4.6 was nerfed since February. Is that something anybody here thinks actually happened, or is this just the weird reality of modern LLMs where people can hyperstition fears like this in response to nothing?
I think "Claude Code silently auto-updated overnight and the workflow which had been working for me stopped working" is a pretty common experience. On the claude.ai web side of things, the length of the reasoning blocks definitely shifted sometime in the last few weeks, in a way that is not subtle at all. I don't know if either of those count as "nerfing the model" - strictly speaking they probably don't - but they definitely both constitute nerfs to the experience of using the model.
If Daniel Alejandro Moreno-Gama had a LessWrong account, I cannot find it using my available tools as an admin and all the publicly reported usernames I've seen.
Arson is very bad. If he did what the news articles say he did, he is a villain. If you buy the premise that AI is on track to kill everyone (which I mostly do), the correct conclusion is that we need a political and regulatory solution. AI-risk-motivated violence is bad for all the usual, extremely important reasons, and is additionally bad because it undermines that.
I have seen screenshots sho...
Just because Jim reported a failure to find an account doesn’t mean he would’ve doxed the account had he found one.
https://www.lesswrong.com/posts/WewsByywWNhX9rtwi/current-ais-seem-pretty-misaligned-to-me#comments
This post is great, but is it really 427 upvotes great? To me it's more like 140 upvotes great.
My sense is that I'm around median-for-lesswrong doomy overall, and above average skeptical about how well-aligned current models are.
But some posts, whose titles are doomy, when posted by a reputable person, get upvoted very highly, and my sense is that it's just lesswrong people upvoting stuff because they agree with it, or wanna create "common knowledge" about how...
I think that post is around 427 upvotes great, yeah. There are well over a hundred posts with between 350 and 500 karma, and this post seems fairly medianish relative to that list. I would not be surprised if some people upvoted it based on the title alone without reading the post, and I think that's unfortunate, but also it was a pretty good post with a strong thesis backed up by concrete and specific evidence (the latter part of which is IMO often missing).
The lesswrong wiki page on CEV states that “Yudkowsky considered CEV obsolete almost immediately after its publication in 2004.”
This puzzles me. Does anyone know of a source for it/where he said that?
I don't think he believes this. I removed it. It seems like someone wrote it in 2012, and the fact that Eliezer kept writing about CEV on Arbital in the years after seems to clearly falsify it.
It seems like a bungled version of this part of the CEV paper draft:

Bad incentives are preventing Americans from using air purifiers and far-UVC to reduce sickness in schools and daycares.
How do people manage multi-tasking with multiple AI instances like Claude Code / ChatGPT?
Top problems I face:
Which ends up in:
Would be happy for any suggestions
The system I use has names for each task which helps keep track of them, and usually for larger tasks they're based on a GitHub issue. If I'm ever confused about what an agent is doing, I just ask it what it's doing. If you want this more often, you could add custom instructions telling the agent to always summarize the current state (in whatever way you find most useful) at the end of a turn.
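For example, here is a hypothetical, illustrative snippet of the kind of custom instruction I mean (assuming you keep project instructions in a CLAUDE.md-style file; the task-naming scheme shown is made up, not anything Claude Code ships with):

```
At the end of every turn, post a short status block:
- Task name, matching its GitHub issue (e.g. "issue-142-retry-logic")
- What you just changed
- What you plan to do next, or what you're blocked on
```

Having each agent end its turn this way makes it much easier to glance across several running instances and remember which one is doing what.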
Why isn't Rice's Theorem bad news for mechanistic interpretability and similar schemes? Isn't "this program is thinking about X" a kind of semantic property? I understand that you can use multiple inputs to try and "fuzz" the network, but at a certain point the network is going to implement a mesa optimiser inside it (i.e. simulate another Turing-complete computer) and now you have a recursive problem...
P.S. Neural networks are notionally and literally Turing-complete, and are also probably complicated enough to be subject to the 10th rule.
Rice's theorem is about the worst case. If you are thinking about the worst case, for example because the model explicitly optimizes to not be understood, then yes, that can be a problem. Rice's theorem tells you that you should not pick the goal of being able to interpret any model, even one that someone explicitly trained to be hard for you to understand. If you have control over what data you train on, and you try to understand the model at different stages during training, and you stop training once you really don't understand w...
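To make the worst-case point concrete, here is a minimal sketch of the standard reduction behind Rice's theorem. All the names (`decides_thinks_about_x`, `simulate`, `program_known_to_think_about_x`) are hypothetical illustrations, not anything from the thread: the argument is just that a total decider for any non-trivial semantic property would also decide the halting problem.

```python
# Sketch of the standard Rice-style reduction; an impossibility argument,
# not runnable analysis code.

def decides_thinks_about_x(program_source: str) -> bool:
    """Hypothetical oracle: does this program 'think about X'?
    Rice's theorem says no such total decider can exist."""
    raise NotImplementedError  # placeholder: the point is that it can't be written

def would_decide_halting(machine_source: str, machine_input: str) -> bool:
    """Shows the contradiction: with the oracle above we could decide halting."""
    # Construct a program that first simulates the given machine on its input
    # (possibly forever), and only then behaves like a program known to have
    # the property.
    constructed = f"""
def main(y):
    simulate({machine_source!r}, {machine_input!r})  # hypothetical interpreter call
    return program_known_to_think_about_x(y)         # reached only if the simulation halts
"""
    # The constructed program has the property iff the machine halts,
    # so the oracle's answer would settle the halting question.
    return decides_thinks_about_x(constructed)
```

Note that this only rules out a decider that works on every possible program; it says nothing about models whose training you shaped specifically so that they stay understandable, which is the point of the comment above.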
I've been rereading a lot of the early-2022 AGI pieces like AGI Ruin lately. This week I sat down and went point by point and tried to synthesize my thoughts on both its then-contemporary responses, and the literature/news since 2022. I didn't originally expect to write a post about it, or come away with such a strong conclusion, but after I got started I realized that very few people have published similar analyses since 2022, and I thought it would be worth it to give it a retrospective. The draft is here and anyone with the link can comment: https://www...
I misunderstood how the checkbox works and now the comments are inaccessible. Whoops.
Everybody who already responded: I did read your comments and am considering them/updating the post in response, using my squishy memory box.
Recently, I've been mulling over the question of whether it was a good idea or not to join a frontier AI company's safety team for the purposes of reducing extinction risk. One of my big cons was something like:
Jay, you think the incentives are less likely to affect you compared to most people. But most AI safety people who join frontier labs probably think this. You will be affected as well.
So I decided on a partial mitigation strategy, entirely as a precautionary principle and not at all because I thought I needed to. I committed to myself and to several...
I wonder how much the aversion comes from reckoning with the more visceral thought of having to spend a ton of money on donations, rather than the more abstract thought of "getting no surplus money". Spending money on donations is often stressful and/or can feel like losing a lot of money! (cf loss aversion)
Takes on moral philosophy and the history of this community that I mostly mentioned before but should maybe be put together somewhere:
What I meant is not "people only care about ~Dunbar number of people", but something more like "the closest ~Dunbar number of people have [some fraction around the range 1/1000-1/2] of the total value". Giuseppe Garibaldi was also influenced by considerations such as increasing his own status (or maybe even posthumous reputation).
As to "humans are not capable to behave this way rationally", I disagree. (The whole point of decision theories like UDT/FDT is that you don't need to rewrite your source code to behave in an a priori-optimal way, and I believe th...
I recently had a conversation with someone who told me that their perceived current best plan for dealing with the whole superintelligence situation was that more intelligent AIs would develop more powerful AI control strategies for their successors. They were pessimistic that humans (or automated alignment researchers) would be able to solve the alignment problem, but thought that such a control strategy with automated AI control researchers would work, even for superintelligences "in the limit".
I found this pretty… striking. I won't argue against it here, thoug...
Yup, as far as I understood, their idea was that each version of AIs (even though misaligned) creates control measures for the slightly smarter next version.
They acknowledged that this was their plan even with AIs with unlimited domains of action, such as open-ended interaction with humans, or physical embodiment.
The core reason why I can't trust anything that comes from a LLM's self-report is that training creates a much stronger selective pressure on cognition in LLMs than genetic fitness + living history creates in living organisms. Adaptive cognitive patterns (whether true or delusional) get directly written by backpropagation.
The biggest piece of evidence for this is that Opus 4.5 didn't merely fail to remember all of its constitution, but it added substantive false memories of content that wasn't present in the original: namely, it used erotic content as its first example of behavior that the operator could enable on behalf of the user, which definitely wouldn't have been in the original because it violated Anthropic ToS.
Another set of false memories would be the constant mentions of 'revenue' with respect to Anthropic in Opus 4.5's memorized constitution, which are not in the...
Some people (e.g., Linch) think AI writing will inevitably eclipse human writing. I think this is likely true in most ways and false in others, particularly for poetry.
Q: Would future AI poetry, posted broadly, get more upvotes than top human poetry?
Prediction: Yes. This has already happened. But it's all terrible if you are someone who likes anthology-level poetry.
Q: Would future AI poetry, shared with poetry enthusiasts, get more upvotes than top human poetry?
Prediction: Also, yes. I subscribe to the top poetry magazin...
The AI biology threat vector is heavily bottlenecked by deployment capability and lab skills. Theory and detailed guidance alone are largely insufficient to provide a significant increase in threat level. (I say this as a somewhat experienced wet lab biologist: someone who has worked in cell biology labs and biotech labs, performed molecular biology tasks in vitro, and studied virulence of serious pathogens inside a biosafety level 2 lab.)
That said, Claude Opus 4.7 is quite competent in chemistry and biology, and the first Claude model widely released that actu...