Anthropic employees: stop deferring to Dario on politics. Think for yourself.
Do your company's actions actually make sense if it is optimizing for what you think it is optimizing for?
Anthropic lobbied against mandatory RSPs, against regulation, and, for the most part, didn't even support SB-1047. The difference between Jack Clark and OpenAI's lobbyists is that publicly, Jack Clark talks about alignment. But when they talk to government officials, there's little difference on the question of existential risk from smarter-than-human AI systems. They do not honestly tell the governments what the situation is like. Ask them yourself.
A while ago, OpenAI hired a lot of talent due to its nonprofit structure.
Anthropic is now doing the same. They publicly say the words that attract EAs and rats. But it's very unclear whether they institutionally care.
Dozens of people work at Anthropic on AI capabilities because they think it is net-positive to get Anthropic to the frontier, even though they wouldn't work on capabilities at OAI or GDM.
It is not net-positive.
Anthropic is not our friend. Some people there do very useful work on AI safety (where "useful" mostly means "shows that the predictions of MIRI-s...
I think you should try to clearly separate the two questions of
I strongly believe 2 is true, because why on Earth would they want to make an extra dollar if misaligned AI kills them along with everyone else? Won't any measure of their social status be far higher after the singularity, if it turns out that they tried to do their best for humanity?
I'm not sure about 1. I think even they're not sure about 1. I heard that they held back on releasing their newer models until OpenAI raced ahead of them.
You (and all the people who upvoted your comment) might have a chance of convincing them (a little) in a good-faith debate. We're all on the same ship, after all, when it comes to AI alignment.
PS: AI safety spending is only $0.1 billion, while AI capabilities spending is $200 billion. A company that adds a comparable amount of effort to both AI alignment and AI capabilities should speed up the former more than the latter, so I personally hope for their success. I may be wrong, but it's my best guess...
I don't agree that the probability of alignment research succeeding is that low. 17 years or 22 years of trying and failing is strong evidence against it being easy, but doesn't prove that it is so hard that increasing alignment research is useless.
People worked on capabilities for decades, and never got anywhere until recently, when the hardware caught up, and it was discovered that scaling works unexpectedly well.
There is a chance that alignment research now might be more useful than alignment research earlier, though there is uncertainty in everything.
We should have uncertainty in the Ten Levels of AI Alignment Difficulty.
It's unlikely that 22 years of alignment research is insufficient but 23 years of alignment research is sufficient.
But what's even more unlikely is that $200 billion on capabilities research plus $0.1 billion on alignment research is survivable, while $210 billion on capabilities research plus $1 billion on alignment research is deadly.
In the same way adding a little alignment research is unlikely to turn failure into success, adding a little capabilities research is unlikely to turn success into failure.
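To make the arithmetic behind this comparison explicit (a rough sketch using the round figures quoted above, not actual industry budgets):

$$\frac{\$0.1\text{B}}{\$200\text{B}} = 0.05\% \quad\longrightarrow\quad \frac{\$1\text{B}}{\$210\text{B}} \approx 0.48\%$$

The ratio of alignment to capabilities spending rises roughly tenfold, while capabilities spending itself grows by only about 5%.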
It's also unlikely that alignme...
But it's very unclear whether they institutionally care.
There are certain kinds of things that it's essentially impossible for any institution to effectively care about.
People representing Anthropic argued against government-required RSPs. I don’t think I can share the details of the specific room where that happened, because it will be clear who I know this from.
Ask Jack Clark whether that happened or not.
When do you think would be a good time to lock in regulation? I personally doubt RSP-style regulation would even help, but the notion that now is too soon/risks locking in early sketches, strikes me as in some tension with e.g. Anthropic trying to automate AI research ASAP, Dario expecting ASL-4 systems between 2025—the current year!—and 2028, etc.
AFAIK Anthropic has not unequivocally supported the idea of "you must have something like an RSP" or even SB-1047 despite many employees, indeed, doing so.
Give me your model, with numbers, that shows supporting Anthropic to be a bad bet, or admit you are confused and that you don't actually have good advice to give anyone.
It seems to me that other possibilities exist, besides "has model with numbers" or "confused." For example, that there are relevant ethical considerations here which are hard to crisply, quantitatively operationalize!
One such consideration which feels especially salient to me is the heuristic that before doing things, one should ideally try to imagine how people would react upon learning what you did. In this case the action in question involves creating new minds vastly smarter than any person, which pose double-digit risk of killing everyone on Earth, so my guess is that the reaction would entail things like, e.g., literal worldwide riots. If so, this strikes me as the sort of consideration one should generally weigh more heavily than their idiosyncratic utilitarian BOTEC.
Does your model predict literal worldwide riots against the creators of nuclear weapons? They posed a single-digit risk of killing everyone on Earth (total, not yearly).
It would be interesting to live in a world where people reacted with scale sensitivity to extinction risks, but that's not this world.
nuclear weapons have different game theory. if your adversary has one, you want to have one to not be wiped out; once both of you have nukes, you don't want to use them.
also, people were not aware of real close calls until much later.
with ai, there are economic incentives to develop it further than other labs, but as a result, you risk everyone's lives for money and also create a race to the bottom where everyone's lives will be lost.
I don't think that most people, upon learning that Anthropic's justification was "other companies were already putting everyone's lives at risk, so our relative contribution to the omnicide was low" would then want to abstain from rioting. Common ethical intuitions are often more deontological than that, more like "it's not okay to risk extinction, period." That Anthropic aims to reduce the risk of omnicide on the margin is not, I suspect, the point people would focus on if they truly grokked the stakes; I think they'd overwhelmingly focus on the threat to their lives that all AGI companies (including Anthropic) are imposing.
If you trust the employees of Anthropic to not want to be killed by OpenAI
In your mind, is there a difference between being killed by AI developed by OpenAI and by AI developed by Anthropic? What positive difference does it make, if Anthropic develops a system that kills everyone a bit earlier than OpenAI would develop such a system? Why do you call it a good bet?
AGI is coming whether you like it or not
Nope.
You’re right that the local incentives are not great: having a more powerful model is hugely economically beneficial, unless it kills everyone.
But if 8 billion humans knew what many LessWrong users know, OpenAI, Anthropic, DeepMind, and others couldn't develop what they want to develop, and AGI wouldn't come for a while.
Off the top of my head, it actually likely could be sufficient to either (1) inform some fairly small subset of 8 billion people of what the situation is, or (2) convince that subset that the situation as we know it is likely enough to be the case that some measures to figure out the risks, and not be killed by AI in the meantime, are justified. It's also helpful to (3) suggest/introduce/support policies that change the incentives to race or increase the chance of ...
The book is now a NYT bestseller: #7 in combined print&e-books nonfiction, #8 in hardcover nonfiction.
I want to thank everyone here who contributed to that. You're an awesome community, and you've earned a huge amount of dignity points.
Nobody at Anthropic can point to a credible technical plan for actually controlling a generally superhuman model. If it’s smarter than you, knows about its situation, and can reason about the people training it, this is a zero-shot regime.
The world, including Anthropic, is acting as if "surely, we’ll figure something out before anything catastrophic happens."
That is unearned optimism. No other engineering field would accept "I hope we magically pass the hardest test on the first try, with the highest stakes" as an answer. Just imagine if flight or nuclear technology had been deployed this way. Now add having no idea what parts the technology is made of: we have not developed the fundamental science of how any of this works.
As much as I enjoy Claude, this is ordinary professional ethics in any safety-critical domain: you shouldn't keep shipping SOTA tech if your own colleagues, including the CEO, put double-digit chances on that tech causing human extinction.
You're smart enough to know how deep the gap is between current safety methods and the problem ahead. Absent dramatic change, this story doesn’t end well.
In the next few years, the choices of a technical leader in this field could literally determine not just what the future looks like, but whether we have a future at all.
If you care about doing the right thing, now is the time to get more honest and serious than the prevailing groupthink wants you to be.
I think it's accurate to say that most Anthropic employees are abhorrently reckless about risks from AI (though my guess is that this isn't true of most people who are senior leadership or who work on Alignment Science, and I think that a bigger fraction of staff are thoughtful about these risks at Anthropic than other frontier AI companies). This is mostly because they're tech people, who are generally pretty irresponsible. I agree that Anthropic sort of acts like "surely we'll figure something out before anything catastrophic happens", and this is pretty scary.
I don't think that "AI will eventually pose grave risks that we currently don't know how to avert, and it's not obvious we'll ever know how to avert them" immediately implies "it is repugnant to ship SOTA tech", and I wish you spelled out that argument more.
I agree that it would be good if Anthropic staff (including those who identify as concerned about AI x-risk) were more honest and serious than the prevailing Anthropic groupthink wants them to be.
What if someone at Anthropic thinks P(doom|Anthropic builds AGI) is 15% and P(doom|some other company builds AGI) is 30%? Then the obvious alternatives are to do their best to get governments / international agreements to make everyone pause or to make everyone's AI development safer, but it's not completely obvious that this is a better strategy because it might not be very tractable. Additionally, they might think these things are more tractable if Anthropic is on the frontier (e.g. because it does political advocacy, AI safety research, and deploys some safety measures in a way competitors might want to imitate to not look comparatively unsafe). And they might think these doom-reducing effects are bigger than the doom-increasing effects of speeding up the race.
You probably disagree with P(doom|some other company builds AGI) - P(doom|Anthropic builds AGI) and with the effectiveness of Anthropic advocacy/safety research/safety deployments, but I feel like this is a very different discussion from "obviously you should never build something that has a big chance of killing everyone".
(I don't think most people at Anthropic think like that, but I believe at least some of the most influential employees do.)
Also, my understanding is that technology is often built this way during deadly races where at least one side believes that them building it faster is net good despite the risks (e.g. deciding to detonate the first nuke despite thinking it might ignite the atmosphere, ...).
If this is their belief, they should state it and advocate for the US government to prevent everyone in the world, including them, from building what has a double-digit chance of killing everyone. They’re not doing that.
P(doom|Anthropic builds AGI) is 15% and P(doom|some other company builds AGI) is 30% --> You need to weight this by the probability that Anthropic is first, and by the probability that the other companies are not going to create AGI once Anthropic has already created it. By default, that is not the case.
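As a rough sketch of that adjustment (the 20% chance of Anthropic being first is an assumed number for illustration, not one either commenter gave):

$$P(\text{doom}) = p_{\text{first}}\,P(\text{doom}\mid \text{Anthropic first}) + (1-p_{\text{first}})\,P(\text{doom}\mid \text{other first})$$

With the quoted 15% and 30% and $p_{\text{first}} = 0.2$, this gives $0.2 \cdot 0.15 + 0.8 \cdot 0.30 = 0.27$: a three-point reduction from the 30% baseline, and less still if the other companies go on to build AGI anyway.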
It would be easier to argue with you if you proposed a specific alternative to the status quo and argued for it. Maybe "[stop] shipping SOTA tech" is your alternative. If so: surely you're aware of the basic arguments for why Anthropic should make powerful models; maybe you should try to identify cruxes.
Separately from my other comment: It is not the case that the only appropriate thing to do when someone is going around killing your friends and your family and everyone you know is to "try to identify cruxes".
It's eminently reasonable for people to just try to stop whatever is happening, which includes calling for social censure, convincing others, and coordinating social action. It is not my job to convince Anthropic staff they are doing something wrong. Indeed, the economic incentives point extremely strongly towards Anthropic staff being the hardest to convince of true beliefs here. The standard you invoke here seems pretty crazy to me.
It is not clear to me that Anthropic "unilaterally stopping" will result in meaningfully better outcomes than the status quo, let alone that it would be anywhere near the best way for Anthropic to leverage its situation.
I do think there's a Virtue of Silence problem here.
Like--I was a ML expert who, roughly ten years ago, decided to not advance capabilities and instead work on safety-related things, and when the returns to that seemed too dismal stopped doing that also. How much did my 'unilateral stopping' change things? It's really hard to estimate the counterfactual of how much I would have actually shifted progress; on the capabilities front I had several 'good ideas' years early but maybe my execution would've sucked, or I would've been focused on my bad ideas instead. (Or maybe me being at the OpenAI lunch table and asking people good questions would have sped the company up by 2%, or w/e, independent of my direct work.)
How many people are there like me? Also not obvious, but probably not that many. (I would guess most of them ended up in the MIRI orbit and I know them, but maybe there are lurkers--one of my friends in SF works for generic tech companies but is highly suspicious of working for AI companies, for reasons roughly downstream of MIRI, and there might easily be hundreds of people in that boat. But maybe the AI companies would only actually have wanted to hire ten of them, and the others objecting to AI work didn't actually matter.)
It is not clear to me that Anthropic "unilaterally stopping" will result in meaningfully better outcomes than the status quo
I think that just Anthropic, OpenAI, and DeepMind stopping would plausibly result in meaningfully better outcomes than the status quo. I still see no strong evidence that anyone outside these labs is actually pursuing AGI with anything like their level of effectiveness. I think it's very plausible that everyone else is either LARPing (random LLM startups), or largely following their lead (DeepSeek/China), or pursuing dead ends (Meta's LeCun), or some combination.
The o1 release is a good example. Yes, everyone and their grandmother was absent-mindedly thinking about RL-on-CoTs and tinkering with relevant experiments. But it took OpenAI deploying a flashy proof-of-concept for everyone to pour vast resources into this paradigm. In the counterfactual where the three major labs weren't there, how long would it have taken the rest to get there?
I think it's plausible that if only those three actors stopped, we'd get +5-10 years to the timelines just from that. Which I expect does meaningfully improve the outcomes, particularly in AI-2027-style short-timeline worlds.
So I think getting any one of them to individually stop would be pretty significant, actually (inasmuch as it's a step towards "make all three stop").
I think more than this, when you look at the labs you will often see the breakthrough work was done by a small handful of people or a small team, whose direction was not popular before their success. If just those people had decided to retire to the tropics, and everyone else had stayed, I think that would have made a huge difference to the trajectory. (What does it look like if Alec Radford had decided not to pursue GPT? Maybe the idea was 'obvious' and someone else gets it a month later, but I don't think so.)
I see no principle by which I should allow Anthropic to build existentially dangerous technology, but disallow other people from building it. I think the right choice is for no lab to build it. I am here not calling for particularly much censure of Anthropic compared to all labs, and my guess is we can agree that in aggregate building existentially dangerous AIs is bad and should face censure.
Ah. In that case we just disagree about morality. I am strongly in favour of judging actions by their consequences, especially for incredibly high stakes actions like potential extinction level events. If an action decreases the probability of extinction I am very strongly in favour of people taking it.
I'm very open to arguments that the consequences would be worse, that this is the wrong decision theory, etc, but you don't seem to be making those?
I too believe we should ultimately judge things based on their consequences. I believe that having deontological lines against certain actions is something that leads humans to make decisions with better consequences, partly because we are bounded agents that cannot well-compute the consequences of all of our actions.
For instance, I think you would agree that it would be wrong to kill someone in order to prevent more deaths, today here in the Western world. Like, if an assassin is going to kill two people, but says if you kill one then he won’t kill the other, if you kill that person you should still be prosecuted for murder. It is actually good to not cross these lines even if the local consequentialist argument seems to check out. I make the same sort of argument for being first in the race toward an extinction-level event. Building an extinction-machine is wrong, and arguing you’ll be slightly more likely to pull back first does not stop it from being something you should not do.
I think when you look back at a civilization that raced to the precipice and committed auto-genocide, and ask where the lines in the sand should’ve been drawn, the most natural one will be “building the extinction machine, and competing to be first to do so”. So it is wrong to cross this line, even for locally net positive tradeoffs.
I think this just takes it up one level of meta. We are arguing about the consequences of a ruleset. You are arguing that your ruleset has better consequences, while others disagree. And so you try to censure these people - this is your prerogative, but I don't think this really gets you out of the regress of people disagreeing about what the best actions are.
Engaging with the object level of whether your proposed ruleset is a good one, I feel torn.
For your analogy of murder, I am very pro-not-murdering people, but I would argue this is convergent because it is broadly agreed upon by society. We all benefit from it being part of the social contract, and breaking that erodes the social contract in a way that harms all involved. If Anthropic unilaterally stopped trying to build AGI, I do not think this would significantly affect other labs, who would continue their work, so this feels disanalogous.
And it is reasonable in extreme conditions (e.g. when those prohibitions are violated by others acting against you) to abandon standard ethical prohibitions. For example, I think it was just for Allied soldiers to kill Nazi soldiers in World War II. I think having nuclear weapons is terribl...
If Anthropic unilaterally stopped trying to build AGI, I do not think this would significantly affect other labs, who would continue their work, so this feels disanalogous.
Not a crux for either of us, but I disagree. When is the last time that someone shut down a multi-billion dollar profit arm of a company due to ethics, and especially due to the threat of extinction? If Anthropic announced they were ceasing development / shutting down because they did not want to cause an extinction-level event, this would have massive ramifications through society as people started to take this consequence more seriously, and many people would become more scared, including friends of employees at the other companies and more of the employees themselves. This would have massive positive effects.
For your analogy of murder, I am very pro-not-murdering people, but I would argue this is convergent because it is broadly agreed upon by society. We all benefit from it being part of the social contract, and breaking that erodes the social contract in a way that harms all involved.
This implies one should never draw lines in the sand about good/bad behavior if society has not reached consensus on it. In co...
I continue to feel like we're talking past each other, so let me start again. We both agree that causing human extinction is extremely bad. If I understand you correctly, you are arguing that it makes sense to follow deontological rules, even if there's a really good reason breaking them seems locally beneficial, because on average, the decision theory that's willing to do harmful things for complex reasons performs badly.
The goal of my various analogies was to point out that this is not actually a fully correct statement about common sense morality. Common sense morality has several exceptions, for things like having someone's consent to take on a risk, someone doing bad things to you, and innocent people being forced to do terrible things.
Given that exceptions exist, for times when we believe the general policy is bad, I am arguing that there should be an additional exception stating that: if there is a realistic chance that a bad outcome happens anyway, and you believe you can reduce the probability of this bad outcome happening (even after accounting for cognitive biases, sources of overconfidence, etc.), it can be ethically permissible to take actions that have side effects ar...
If I understand you correctly, you are arguing that it makes sense to follow deontological rules, even if there's a really good reason breaking them seems locally beneficial, because on average, the decision theory that's willing to do harmful things for complex reasons performs badly.
Hm… I would say that one should follow deontological rules like “don’t lie” and “don’t steal” and so on because we fail to understand or predict the knock-on consequences. For instance, they can get the world into a much worse equilibrium of mutual liars/stealers in ways that are hard to predict, and being a good person can get the world into a much better equilibrium of mutually-honorable people in ways that are hard to predict. And also because, if it does screw up in some hard-to-predict way, then when you look back, it will often be the easiest line in the sand to draw.
For instance, if SBF is wondering at what point he could have most reliably intervened on his whole company collapsing and ruining the reputation of things associated with it, he might talk about certain deals he made or strategic plays with Binance or the US Govt, for he is not a very ethical person; I would tal...
I'm not Ben, but I think you don't understand. I think explaining what you are doing loudly in public isn't like "having a really good reason to believe it is net good"; it's more like asking for consent.
Like you are saying "please stop me by shutting down this industry", and if you don't get shut down, that's analogous to consent: you've informed society about what you're doing and why, and tried to ensure that if everyone else followed a similar sort of policy we'd be in a better position.
(Not claiming I agree with Ben's perspective here, just trying to explain it as I understand it.)
Has anyone spelled out the arguments for how it's supposed to help us, even incrementally, if one AI lab (rather than all of them) drops out of the AI race?
An AI lab dropping out helps in two ways:
I think it would be bad to suppress arguments! But I don't see any arguments being suppressed here. Indeed, I see Zack as trying to create a standard where (for some reason) arguments about AI labs being reckless must be made directly to the people who are working at those labs, and other arguments should not be made, which seems weird to me. The OP seems to me like it's making fine arguments.
I don't think it was ever a requirement for participation on LessWrong to only ever engage in arguments that could change the minds of the specific people who you would like to do something else, as opposed to arguments that are generally compelling and might affect those people in indirect ways. It's nice when it works out, but it really doesn't seem like a tenet of LessWrong.
It's OK to ask people to act differently without engaging with your views! If you are stabbing my friends and family I would like you to please stop, and I don't really care about engaging with your views. The whole point of social censure is to ask people to act differently even if they disagree with you, that's why we have civilization and laws and society.
I think Anthropic leadership should feel free to propose a plan to do something that is not "ship SOTA tech like every other lab". In the absence of such a plan, seems like "stop shipping SOTA tech" is the obvious alternative plan.
Clearly in-aggregate the behavior of the labs is causing the risk here, so I think it's reasonable to assume that it's Anthropic's job to make an argument for a plan that differs from the other labs. At the moment, I know of no such plan. I have some vague hopes, but nothing concrete, and Anthropic has not been very forthcoming with any specific plans, and does not seem on track to have one.
I have great empathy and deep respect for the courage of the people currently on hunger strikes to stop the AI race. Yet, I wish they hadn’t started them: these hunger strikes will not work.
Hunger strikes can be incredibly powerful when there’s a just demand, a target who would either give in to the demand or be seen as a villain for not doing so, a wise strategy, and a group of supporters.
I don’t think these hunger strikes pass the bar. Their political demands are not what AI companies would realistically give in to because of a hunger strike by a small number of outsiders.
A hunger strike can bring attention to how seriously you perceive an issue, if you know how to make it go viral; in the US, hunger strikes are rarely widely covered by the media. And even then, you are more likely to marginalize your views than to push them into the mainstream: if people don’t currently think halting frontier general AI development requires hunger strikes, a hunger strike won’t explain to them why your views are correct. That is not self-evident from the description of the hunger strike alone, and so the hunger strike is not the right approach here and now.
Also, our movement does not need...
Hi Mikhail, thanks for offering your thoughts on this. I think having more public discussion on this is useful and I appreciate you taking the time to write this up.
I think your comment mostly applies to Guido in front of Anthropic, and not our hunger strike in front of Google DeepMind in London.
Hunger strikes can be incredibly powerful when there’s a just demand, a target who would either give in to the demand or be seen as a villain for not doing so, a wise strategy, and a group of supporters.
I don’t think these hunger strikes pass the bar. Their political demands are not what AI companies would realistically give in to because of a hunger strike by a small number of outsiders.
...A hunger strike ca
Action is better than inaction; but please stop and think of your theory of change for more than five minutes,
I think there's a very reasonable theory of change - X-risk from AI needs to enter the Overton window. I see no justification here for going to the meta-level and claiming they did not think for 5 minutes, which is why I have weak downvoted in addition to strong disagree.
This tactic might not work, but I am not persuaded by your supposed downsides. The strikers should not risk their lives, but I don't get the impression that they are. The movement does need people who are eating and working on AI safety research, governance, and other forms of advocacy. But why not this too? It seems very plausibly a comparative advantage for some concerned people, and particularly high-leverage when very few are taking this step. If you think they should be doing something else instead, say specifically what it is and why these particular individuals are better suited to that particular task.
At the moment, these hunger strikes are people vibe-protesting.
I think I somewhat agree, but also I think this is a more accurate vibe than “yay tech progress”. It seems like a step in the right direction to me.
Please don’t risk your life; especially, please don’t risk your life in this particular way that won’t change anything.
Action is better than inaction; but please stop and think of your theory of change for more than five minutes, if you’re planning to risk your life, and then don’t risk your life; please pick actions thoughtfully and wisely and not because of the vibes.
You repeat a recommendation not to risk your life. Um, I’m willing to die to prevent human extinction. The math is trivial. I’m willing to die to reduce the risk by a pretty small percentage. I don’t think a single life here is particularly valuable on consequentialist terms.
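Spelling out that math (a hedged sketch with an assumed, illustrative risk reduction, not a figure from the comment): even a one-in-a-million reduction in extinction risk corresponds in expectation to

$$8 \times 10^{9} \times 10^{-6} = 8{,}000 \text{ lives} \gg 1,$$

so on purely consequentialist terms a single life is indeed a small stake.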
There’s important deontology about not unilaterally risking other people’s lives, but this mostly goes away in the case of risking your own life. This is why there are many medical ethics guidelines that separate self-experimentation as a special case from rules for experimenting on others (and that’s been used very well in many cases and ...
I recently read The Sacrifices We Choose to Make by Michael Nielsen, which was a good read. Here are some relevant extracts.
...At the time, South Vietnam was led by President Ngo Dinh Diem, a devout Catholic who had taken power in 1955, and then instigated oppressive actions against the Buddhist majority population of South Vietnam. This began with measures like filling civil service and army posts with Catholics, and giving them preferential treatment on loans, land distribution, and taxes. Over time, Diem escalated his measures, and in 1963 he banned flying the Buddhist flag during Vesak, the festival in honour of the Buddha's birthday. On May 8, during Vesak celebrations, government forces opened fire on unarmed Buddhists who were protesting the ban, killing nine people, including two children, and injuring many more.
[...]
Unfortunately, standard measures for negotiation – petitions, street fasting, protests, and demands for concessions – were ignored by the Diem government, or met with force, as in the Vesak shooting.
[...]
Since conventional measures were failing, the Inter-Sect Committee decided to consider more extreme measures, including the idea of a voluntary self-immolation. Wh
I have heard rumor that most people who attempt suicide and fail, regret it.
After doing some research on this, I think this is unlikely to be true. The only quantitative study I found says that among its sample of suicide attempt survivors, 35.6% are glad to have survived, while 42.7% feel ambivalent, and 21.6% regret having survived. I also found a couple of sources agreeing with your "rumor", but one cited just a suicide awareness trainer as its source, while the other cited the above study as the only evidence for its claim, somehow interpreting it as "Previous research has found that more than half of suicidal attempters regret their suicidal actions." (Gemini 2.5 Pro says "It appears the authors of the 2023 paper misinterpreted or misremembered the findings of the 2005 study they cited.")
If this "rumor" was true, I would expect to see a lot of studies supporting it, because such studies are easy to do and the result would be highly useful for people trying to prevent suicides (i.e., they can use it to convince potential suicide attempters that they're likely to regret it). Evidence to the contrary are likely to be suppressed or not gathered in the first place, as almost nob...
I don't strongly agree or disagree with your empirical claims but I do disagree with the level of confidence expressed. Quoting a comment I made previously:
I'm undecided on whether things like hunger strikes are useful but I just want to comment to say that I think a lot of people are way too quick to conclude that they're not useful. I don't think we have strong (or even moderate) reason to believe that they're not useful.
When I reviewed the evidence on large-scale nonviolent protests, I concluded that they're probably effective (~90% credence). But I've seen a lot of people claim that those sorts of protests are ineffective (or even harmful) in spite of the evidence in their favor.[1] I think hunger strikes are sufficiently different from the sorts of protests I reviewed that the evidence might not generalize, so I'm very uncertain about the effectiveness of hunger strikes. But what does generalize, I think, is that many peoples' intuitions on protest effectiveness are miscalibrated.
[1] This may be less relevant for you, Mikhail Samin, because IIRC you've previously been supportive of AI pause protests in at least some contexts.
ETA: To be clear, I'm responding to the part of your...
"There is no justice in the laws of Nature, no term for fairness in the equations of motion. The universe is neither evil, nor good, it simply does not care. The stars don't care, or the Sun, or the sky. But they don't have to! We care! There is light in the world, and it is us!"
And someday when the descendants of humanity have spread from star to star they won’t tell the children about the history of Ancient Earth until they’re old enough to bear it and when they learn they’ll weep to hear that such a thing as Death had ever once existed!
Everyone should do more fun stuff![1]
I thought it'd just be very fun to develop a new sense.
Remember vibrating belts and ankle bracelets that made you have a sense of the direction of north? (1, 2)
I made some LLMs make me an iOS app that does this! Except the sense doesn't go away the moment you stop the app!
I am pretty happy about it! I can tell where north is and have become much better at navigating and relating different parts of the (actual) territory to my map. Previously, I would remember my paths as collections of local movements (there, I turn left); now, I generally know where places are, and Google Maps feels much more connected to the territory.
If you want to try it, it's on the App Store: https://apps.apple.com/us/app/sonic-compass/id6746952992
It can vibrate when you face north; even better, if you're in headphones, it can give you spatial sounds coming from north; better still, a second before playing a sound coming from north, it can play a non-directional cue sound to make you anticipate the north sound and learn very quickly.
None of this interferes with listening to any other kind of audio.
It’s all probably less relevant to the US, as your roads are in a grid anyway...
i made a thing!
it is a chatbot with 200k tokens of context about AI safety. it is surprisingly good (better than you'd expect current LLMs to be) at answering questions and counterarguments about AI safety. A third of its dialogues contain genuinely great and valid arguments.
You can try the chatbot at https://whycare.aisgf.us (ignore the interface; it hasn't been optimized yet). Please ask it some hard questions! Especially if you're not convinced of AI x-risk yourself, or can repeat the kinds of questions others ask you.
Send feedback to ms@contact.ms.
A couple of examples of conversations with users:
I know AI will make jobs obsolete. I've read runaway scenarios, but I lack a coherent model of what makes us go from "llms answer our prompts in harmless ways" to "they rebel and annihilate humanity".
I want to signal-boost this LW post.
I long wondered why OpenPhil made so many obvious mistakes in the policy space. That level of incompetence just did not make any sense.
I did not expect this to be the explanation:
THEY SIMPLY DID NOT HAVE ANYONE WITH ANY POLITICAL EXPERIENCE ON THE TEAM until hiring one person in April 2025.
This is, like, insane. Not what I'd expect at all from any org that attempts to be competent.
(openphil, can you please hire some cracked lobbyists to help you evaluate grants? This is, like, not quite an instance of Graham's Design Paradox, because instead of trying to evaluate grants you know nothing about, you can actually hire people with credentials you can evaluate, who'd then evaluate the grants. thank you <3)
To be clear, I don't think this is an accurate assessment of what is going on. If anything, I think marginally people with more "political experience" seemed to me to mess up more.
In general, takes of the kind "oh, just hire someone with expertise in this" almost never make sense IMO. First of all, identifying actual real expertise is hard. Second, general competence and intelligence, which OpenPhil people have plenty of, are a better predictor of task performance in almost all domains after even just a relatively short acclimation period. Third, the standard practices in many industries are insane, and most of the time if you hire someone specifically for their expertise in a domain, not just as an advisor but as an active team member, they will push for adopting those standard practices even when they don't make sense.
I mean, it's not like OpenPhil hasn't been interfacing with a ton of extremely successful people in politics. For example, OpenPhil approximately co-founded CSET, and talks a ton with people at RAND, and has done like 5 bajillion other projects in DC and works closely with tons of people with policy experience.
The thing that Jason is arguing for here is "OpenPhil needs to hire people with lots of policy experience into their core teams", but man, that's just such an incredibly high bar. The relevant teams at OpenPhil are like 10 people in-total. You need to select on so many things. This is like saying that Lightcone "DOESN'T HAVE ANYONE WITH ARCHITECT OR CONSTRUCTION OR ZONING EXPERIENCE DESPITE RUNNING A LARGE REAL ESTATE PROJECT WITH LIGHTHAVEN". Like yeah, I do have to hire a bunch of people with expertise on that, but it's really very blatantly obvious from where I am that trying to hire someone like that onto my core teams would be hugely disruptive to the organization.
It seems really clear to me that OpenPhil has lots of contact with people who have lots of policy experience, frequently consults with them on stuff, and that the people working there full-time seem reasonably selected to me. The only way I see the things Jason is arguing for working out is if OpenPhil were to much more drastically speed up its hiring, but hiring quickly is almost always a mistake.
PSA: if you're looking for a name for your project, most interesting .ml domains are probably available for $10, because the mainstream registrars don't support the TLD.
I bought over 170 .ml domains, including anthropic.ml (redirects to the Fooming Shoggoths song), closed.ml & evil.ml (redirect to OpenAI Files), interpretability.ml, lens.ml, evals.ml, and many others (I'm happy to donate them to AI safety projects).
Since this seems to be a crux, I propose a bet to @Zac Hatfield-Dodds (or anyone else at Anthropic): someone shows random people in San Francisco Anthropic’s letter to Newsom on SB-1047. I would bet that among the first 20 who fully read at least one page, over half will say that Anthropic’s response to SB-1047 is closer to presenting the bill as 51% good and 49% bad than to presenting it as 95% good and 5% bad.
Zac, at what odds would you take the bet?
(I would be happy to discuss the details.)
Sorry, I'm not sure what proposition this would be a crux for?
More generally, "what fraction good vs bad" seems to me a very strange way to summarize Anthropic's Support if Amended letter or letter to Governor Newsom. It seems clear to me that both are supportive in principle of new regulation to manage emerging risks, and offering Anthropic's perspective on how best to achieve that goal. I expect most people who carefully read either letter would agree with the preceeding sentence and would be open to bets on such a proposition.
Personally, I'm also concerned about the downside risks discussed in these letters - because I expect they both would have imposed very real costs, and reduced the odds of the bill passing and similar regulations passing and enduring in other jurisdictions. I nonetheless concluded that the core of the bill was sufficiently important and urgent, and downsides manageable, that I supported passing it.
In its RSP, Anthropic committed to define ASL-4 by the time they reach ASL-3.
With Claude 4 released today, they have reached ASL-3. They haven’t yet defined ASL-4.
Turns out, they have quietly walked back on the commitment. The change happened less than two months ago and, to my knowledge, was not announced on LW or other visible places unlike other important changes to the RSP. It’s also not in the changelog on their website; in the description of the relevant update, they say they added a new commitment but don’t mention removing this one.
Anthropic’s behavior...
I directionally agree!
Btw, since this is a call to participate in a PauseAI protest on my shortform, do your colleagues have plans to do anything about my ban from the PauseAI Discord server—like allowing me to contest it (as I was told there was a discussion of making a procedure for) or at least explaining it?
Because it’s lowkey insane!
For everyone else, who might not know: a year ago I, in context, on the PauseAI Discord server, explained my criticism of PauseAI’s dishonesty and, after being asked to, shared proofs that Holly publicly lied about our personal communications, including sharing screenshots of our messages; a large part of the thread was then deleted by the mods because they were against personal messages getting shared, without warning (I would’ve complied if asked by anyone representing a server to delete something!) or saving/allowing me to save any of the removed messages in the thread, including those clearly not related to the screenshots that you decided were violating the server norms; after a discussion of that, the issue seemed settled and I was asked to maybe run some workshops for PauseAI to improve PauseAI’s comms/proofreading/factchecking; and then, mo...
I wrote the article Mikhail referenced and wanted to clarify some things.
The thresholds are specified, but the original commitment says, "We commit to define ASL-4 evaluations before we first train ASL-3 models (i.e. before continuing training beyond when ASL-3 evaluations are triggered). Similarly, we commit to define ASL-5 evaluations before training ASL-4 models, and so forth," and, regarding ASL-4, "Capabilities and warning sign evaluations defined before training ASL-3 models."
The latest RSP says this of CBRN-4 Required Safeguards, "We expect this threshold will require the ASL-4 Deployment and Security Standards. We plan to add more information about what those entail in a future update."
Additionally, AI R&D 4 (confusingly) corresponds to ASL-3 and AI R&D 5 corresponds to ASL-4. This is what the latest RSP says about AI R&D 5 Required Safeguards, "At minimum, the ASL-4 Security Standard (which would protect against model-weight theft by state-level adversaries) is required, although we expect a higher security standard may be required. As with AI R&D-4, we also expect an affirmative case will be required."
AFAICT, now that ASL-3 has been implemented, the upcoming AI R&D threshold, AI R&D-4, would not mandate any further security or deployment protections. It only requires ASL-3. However, it would require an affirmative safety case concerning misalignment.
I assume this is what you meant by "further protections" but I just wanted to point this fact out for others, because I do think one might read this comment and expect AI R&D 4 to require ASL-4. It doesn't.
I am quite worried about misuse when we hit AI R&D 4 (perhaps even more so than I'm worried about misalignment) — and if I understand the policy correctly, there are no further protections against misuse mandated at this point.
Regardless, it seems like Anthropic is walking back its previous promise: "We have decided not to maintain a commitment to define ASL-N+1 evaluations by the time we develop ASL-N models." The stance that Anthropic takes to its commitments—things which can be changed later if they see fit—seems to cheapen the term, and makes me skeptical that the policy, as a whole, will be upheld. If people want to orient to the RSP as a provisional statement of intent to act responsibly, then this seems appropriate. But it should not be mistaken for, or conflated with, a real promise to do what was said.
Worked on this with Demski. Video, report.
Any update to the market is (equivalent to) updating on some kind of information. So all you can do is dynamically choose what to update or not update on.* Unfortunately, whenever you choose not to update on something, you are giving up on the asymptotic learning guarantees of policy market setups. So the strategic gains from updatelessness (like not falling into traps) are in a fundamental sense irreconcilable with the learning gains from updatefulness. That doesn't prevent you from being pretty smart about deciding what exactly to update on... but due to embeddedness problems and the complexity of the world, it seems to be the norm (rather than the exception) that you cannot be sure a priori of what to update on (you just have to make some arbitrary choices).
*For avoidance of doubt, what matters for whether you have updated on X is not "whether you have heard about X", but rather "whether you let X factor into your decisions". Or at least, this is the case for a sophisticated enough external observer (assessing whether you've updated on X), not necessarily all observers.
I think the first question to think about is how to use them to make CDT decisions. You can create a market about a causal effect if you have control over the decision and you can randomise it to break any correlations with the rest of the world, assuming the fact that you’re going to randomise it doesn’t otherwise affect the outcome (or bettors don’t think it will).
Committing to doing that does render the market useless for choosing policy, but you could randomly decide whether to randomise or to make the decision via whatever the process you actually want to use, and have the market be conditional on the former. You probably don’t want to be randomising your policy decisions too often, but if liquidity wasn’t an issue you could set the probability of randomisation arbitrarily low.
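A minimal Monte Carlo sketch of the randomise-sometimes scheme just described (the toy world model, numbers, and names here are my own illustrative assumptions, not anything from the comment): with small probability the decision is made by a coin flip, and the "market" estimate is computed only on that randomized branch, which breaks the correlation between the decision-maker's private information and the action.

```python
import random

def simulate(n=200_000, p_randomize=0.1):
    """Toy world: a hidden state makes the good outcome likelier, and an informed
    decision-maker acts only when the hidden state is favourable. Conditioning on
    the action alone confounds its causal effect with that selection; conditioning
    on the randomized branch removes the confound."""
    naive = {True: [], False: []}       # outcomes grouped by action, all decisions
    randomized = {True: [], False: []}  # outcomes grouped by action, randomized branch only

    for _ in range(n):
        hidden = random.random() < 0.5           # favourable world half the time
        if random.random() < p_randomize:
            action = random.random() < 0.5       # decision made by coin flip
            is_random = True
        else:
            action = hidden                      # informed policy: act iff favourable
            is_random = False

        # True causal effect of the action is +0.2 on success probability;
        # the hidden state adds +0.3 regardless of the action.
        p_success = 0.3 + 0.2 * action + 0.3 * hidden
        outcome = random.random() < p_success

        naive[action].append(outcome)
        if is_random:
            randomized[action].append(outcome)

    def mean(xs):
        return sum(xs) / len(xs)

    print("naive estimate of effect:   ", mean(naive[True]) - mean(naive[False]))
    print("randomized-branch estimate: ", mean(randomized[True]) - mean(randomized[False]))
    # Expected: naive ~0.47 (confounded), randomized ~0.2 (the true causal effect).

simulate()
```

The same logic carries over to the market version: bets that only pay out when the randomization branch fires estimate the causal effect, while keeping the randomization probability low means the policy is almost always chosen the way you actually want.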
Then FDT… I dunno, seems hard.
The IMO organizers asked AI labs not to share their IMO results until a week later to not steal the spotlight from the kids. IMO organizers consider OpenAI's actions "rude and inappropriate".
https://x.com/Mihonarium/status/1946880931723194389
People are arguing about the answer to the Sleeping Beauty problem! I thought this was pretty much dissolved by this post's title! But there are lengthy posts and even a prediction market!
Sleeping Beauty is an edge case where different reward structures are intuitively possible, and so people imagine different game payout structures behind the definition of “probability”. Once the payout structure is fixed, the confusion is gone. With a fixed payout structure & preference framework rewarding the number you output as “probability”, people don’t have a disagreem...
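As a concrete illustration of the point about payout structures (a minimal simulation sketch using the standard per-awakening vs. per-experiment scoring variants, chosen for illustration): if Beauty is scored once per awakening, heads shows up in a third of the scored cases; if she is scored once per experiment, it shows up in half. Fix the rule and the “right” number to report follows, with nothing left to disagree about.

```python
import random

def sleeping_beauty(n=100_000):
    """Per-experiment: the coin is heads half the time.
    Per-awakening: heads-awakenings are only 1/3 of all awakenings,
    because tails produces two awakenings (Monday and Tuesday)."""
    heads_experiments = 0
    heads_awakenings = 0
    total_awakenings = 0

    for _ in range(n):
        heads = random.random() < 0.5
        awakenings = 1 if heads else 2
        heads_experiments += heads
        heads_awakenings += 1 if heads else 0
        total_awakenings += awakenings

    print("P(heads) per experiment:", heads_experiments / n)                # ~0.5
    print("P(heads) per awakening: ", heads_awakenings / total_awakenings)  # ~1/3

sleeping_beauty()
```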
I do not believe Anthropic as a company has a coherent and defensible view on policy. It is known that they said words they didn't hold while hiring people (and they claim to have good internal reasons for changing their minds, but people did work for them because of impressions that Anthropic made but decided not to hold). It is known among policy circles that Anthropic's lobbyists are similar to OpenAI's.
From Jack Clark, a billionaire co-founder of Anthropic and its chief of policy, today:
Dario is talking about countries of geniuses in datacenters in the...
[RETRACTED after Scott Aaronson’s reply by email]
I'm surprised by Scott Aaronson's approach to alignment. He has mentioned in a talk that a research field needs to have at least one of two things, experiments or a rigorous mathematical theory, and so he's focusing on the experiments that are possible to do with current AI systems.
The alignment problem is centered on optimization producing powerful consequentialist agents when you're searching over spaces that contain capable agents. The dynamics at the level of superhuman general agents are not something ...