How can we recognize when we are failing to change our thinking in light of new evidence that doesn’t fit our expectations and assumptions? And how can we update our thought processes to overcome the challenges that our old ways of seeing pose?

Recent Discussion

[Note: This post is an excerpt from a longer paper, written during the first half of the Philosophy Fellowship at the Center for AI Safety. I (William D'Alessandro) am a Postdoctoral Fellow at the Munich Center for Mathematical Philosophy. Along with the other Philosophy Fellowship midterm projects, this draft is posted here for feedback.

The full version of the paper includes a discussion of the conceptual relationship between safety and moral alignment, and an argument that we should choose a reliably safe powerful AGI over one that's (apparently) successfully morally aligned. I've omitted this material for length but can share it on request.

The deontology literature is big, and lots of angles here could be developed further. Questions and suggestions much appreciated!]

1 Introduction[1]

Value misalignment arguments for AI risk observe...

I'm not an ethical philosopher, but my intuition, based primarily on personal experience, is that deontological ethics are a collection of heuristic rules of thumb extracted from the average answers of utilitarian ethics applied to a common range of situations that often crop up between humans. (I also view this as a slightly-idealized description of the legal system.) As such, they're useful primarily in the same ways that heuristics often are useful compared to actually calculating a complex function, by reducing computational load. For people, they also...

Gordon Seidoh Worley (8h):
I don't see it in the references, so you might find this paper of mine [] interesting (the link is to a LessWrong summary, which links to the full thing): in it I include an argument suggesting that building an AI that assumes deontology is strictly riskier than building one that does not.

When I introduced Rational Animations, I wrote:

What I won't do is aggressively advertise LessWrong and the EA Forum. If the channel succeeds, I will organize fundraisers for EA charities. If I adapt an article for YT, I will link it in the description or just credit the author. If I use quotes from an author on LW or the EA Forum, I will probably credit them on-screen. But I will never say: "Come on LW! Plenty of cool people there!" especially if the channel becomes big. Otherwise, "plenty of cool people" becomes Reddit pretty fast.

If the channel becomes big, I will also refrain from posting direct links to LW and the EA Forum. Remind me if I ever forget. And let me know if these rules are not conservative enough.

In my...

Some rough thoughts: I think it's kind of inevitable that LessWrong will eventually get a huge number of people attempting to join. We'll need to deal with that somehow, sooner or later. I don't think we're ready yet, but I think it's possible for us to become ready if we prioritize it. We've been thinking about this a lot lately.

In addition to the rejected section, we also recently shipped AutoRateLimits for low- and negative-karma users. I'll have a post about this soon, but the basic gist is that users start with a rate limit of 3 comments per day and 2 posts per week. That rate limit disappears once they hit 5 karma. If their karma becomes negative, the limit gets stricter: at -1 karma, it becomes 1 post per week and 1 comment per day; at -15 karma, they can only write 1 comment every 3 days; at -30 karma, they can only submit 1 post every 2 weeks.

I don't think this is enough to handle a really large influx. Even heavily rate-limited, a bunch of people posting a mediocre comment every 3 days would add up to a significant drop in the site's signal-to-noise ratio. You could tighten the rate limits further, but that makes things harder for new users and would make the site feel more punishing. Karma is only a rough measure of quality, and there's a lot of room for disagreement over whether a given downvote is fair.

In the pre-GPT world I'd be more optimistic about making some kind of test that checks whether a user has a reasonable understanding of what LessWrong is about and is able to participate. In the post-GPT world it's less clear how to do that sort of thing: any kind of automated test is basically a test for "do they know how to use an LLM?". There are options like letting established users approve new users. I'd set my bar fairly high for established users, to avoid a situation where each generation of users lets in a somewhat weaker set of users. The kind of bar I'd feel safe with for users-with-approval-power is something like...
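The tier structure in that comment can be sketched as a small karma-to-limit lookup. This is purely illustrative, not LessWrong's actual code: the function and class names are mine, and the post limit at -15 karma is an assumption (the comment only specifies the comment limit at that tier).

```python
# Illustrative sketch of the karma-based rate-limit tiers described above.
# Not LessWrong's implementation; names and the -15-karma post limit are assumptions.
from dataclasses import dataclass
from typing import Optional


@dataclass
class RateLimit:
    comments: str
    posts: str


def auto_rate_limit(karma: int) -> Optional[RateLimit]:
    """Return the rate limit for a user at the given karma, or None if unrestricted."""
    if karma >= 5:
        return None  # limit lifted once the user reaches 5 karma
    if karma <= -30:
        return RateLimit("1 per 3 days", "1 per 2 weeks")
    if karma <= -15:
        # Post limit here is assumed to carry over from the -1 tier.
        return RateLimit("1 per 3 days", "1 per week")
    if karma <= -1:
        return RateLimit("1 per day", "1 per week")
    return RateLimit("3 per day", "2 per week")  # default for new users
```

The tiers are checked from most negative upward so each user falls into exactly one bucket.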

You could have AutoAutoRateLimits. That is, you pick some target, such as "total number of posts/comments per time" or "total number of posts/comments per time from users with <15 karma". From that, you automatically adjust the rate limits to keep the quantity below the global target level. Maybe you add complications, such as a floor so the rate limit never goes to None, and maybe you add inertia. (There are plausibly bad effects to this though, IDK.)
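A minimal sketch of what such a feedback rule might look like, assuming a smoothed proportional adjustment with a floor. Everything here (the function name, the constants, and the control rule itself) is an illustrative assumption, not a real API:

```python
# Hypothetical "AutoAutoRateLimit": nudge the per-user limit so that
# total low-karma activity tracks a global target. All names and
# constants are illustrative assumptions.
def adjust_rate_limit(current_limit: float,
                      observed_volume: float,
                      target_volume: float,
                      inertia: float = 0.8,
                      floor: float = 0.5) -> float:
    """Return a new comments-per-day limit for low-karma users.

    Moves the limit toward (target / observed) * current, smoothed by
    `inertia`, and never below `floor` (so the limit never hits zero).
    """
    if observed_volume <= 0:
        return current_limit  # nothing observed; leave the limit alone
    proposed = current_limit * (target_volume / observed_volume)
    smoothed = inertia * current_limit + (1 - inertia) * proposed
    return max(floor, smoothed)
```

The `inertia` term is the "maybe you have inertia" complication from the comment: it keeps the limit from whipsawing when activity is bursty, and the `floor` keeps new users from being locked out entirely.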

the gears to ascension (10h):
I think LessWrong would need a beginner/application section to deal with something like this. I'd also suggest doing it at the end of a long and very detailed single video with intentionally lower production value and much higher math content, more like a Manim video than a mainline YouTube video. That will produce more natural filtering: to get into the technical side of YouTube, you need to speed-watch long, highly technical videos and skim the beginnings of many more when the recommender surfaces them, while also being disciplined about not watching nontechnical stuff. People with that pattern of video-watching are the ones who should almost certainly be encouraged to come visit LessWrong.
Upvoted for this idea.

People talk about Kelly betting and expected-value maximization as though they're alternate strategies for the same problem. Actually, each is the best option for a different class of problems. Understanding when to use Kelly betting and when to use expected-value maximization is critical.

Most of the ideas here come from Ole Peters's writings on ergodicity economics. Any mistakes are my own.

The parable of the casino

Alice and Bob visit a casino together. They each have $100, and they decide it'll be fun to split up, play the first game they each find, and then see who has the most money. They'll then keep doing this until their time in the casino is up in a couple days.

Alice heads left and finds a game that looks good. It's double...

Can you be more precise about the exact situation Bob is in? How many rounds will he get to play? Is he trying to maximise money, or trying to beat Alice? I doubt the Kelly criterion will actually be his optimal strategy.
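The question of which objective Bob has is exactly where the fraction he bets starts to matter. A quick simulation (my own sketch, not from the post) makes this concrete, assuming a repeated even-odds double-or-nothing bet with win probability p = 0.6, for which the Kelly fraction is f* = 2p - 1 = 0.2. All parameter values are illustrative.

```python
# Sketch (my own, not from the post): median outcome of repeatedly
# betting a fixed fraction of bankroll on a double-or-nothing bet
# with win probability p_win. All parameter values are illustrative.
import random


def median_final_bankroll(fraction: float, p_win: float = 0.6,
                          rounds: int = 200, trials: int = 2000,
                          start: float = 100.0, seed: int = 0) -> float:
    """Median bankroll after `rounds` even-odds bets of `fraction` of bankroll each."""
    rng = random.Random(seed)
    finals = []
    for _ in range(trials):
        bankroll = start
        for _ in range(rounds):
            stake = bankroll * fraction
            bankroll += stake if rng.random() < p_win else -stake
        finals.append(bankroll)
    finals.sort()
    return finals[len(finals) // 2]

# For even odds the Kelly fraction is f* = 2 * p_win - 1 = 0.2.
# Betting everything (fraction=1.0) maximizes per-round expected value,
# but one loss zeroes the bankroll, so the median player goes broke;
# Kelly sizing grows the median bankroll instead.
```

This is the sense in which the two strategies answer different questions: full-stake betting wins on expected value, Kelly wins on the typical (median) trajectory over many rounds.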

This supposes that Bob 1 knows about Bob 2's experiences. That seems impossible if Bob 1 died before Bob 2 came into being, which is what's typically understood by the term "resurrect" in the context of death ("restore (a dead person) to life"). If Bob 1 and Bob 2 exist at the same time, whatever's happening is probably not resurrection. Let's stick with standard resurrection, though: Bob 1 dies and then Bob 2 comes into existence. We're measuring their sameness, at your request, by the expected sentiment of each toward the other.

If I were an unethical researcher in the present day, I could name a child Bob 2 and raise them to be absolutely certain that they were the reincarnation of Bob 1. It would be nice if the child happened to share some genes with Bob 1, but not absolutely essential. The child would not have an easy life, since they would be accused of various mental disorders and probably identity theft, but they would technically meet the "sameness is individual belief" criterion that you require. As an unethical researcher, I would of course select as Bob 1 someone who believes that reincarnation is possible, and who thus cares about the wellbeing of their expected reincarnated self (whom they probably define as "the person who believes they're my reincarnation", because most people don't think adversarially about such things) as much as they care about their own.

There you go: a hypothetical pair of individuals who meet your criteria, created using no technology more advanced than good old cult brainwashing. So for this definition, I'd say the percentage chance that it's possible matches the percentage chance that someone would be willing to set their qualms aside and ruin Bob 2's life prospects for the sake of the experiment. (Yes, this is an unsatisfying answer, but I hope it illustrates something useful once you see how its nature follows directly from the nature of your question.)

Your response does illustrate that there are holes in my explanation. Bob 1 and Bob 2 do not exist at the same time. They are meant to represent one person at two different points in time.

A separate way I could try to explain what kind of resurrection I am talking about is to imagine a married couple. An omniscient husband would have to care as much about his wife after she was resurrected as he did before she died.

I somewhat doubt that I could patch all of the holes that could be found in my explanation. I would appreciate it if you try to answer what I am trying to ask.

A new Anthropic interpretability paper, "Toy Models of Superposition", came out last week. I think it's quite exciting, and it hasn't been discussed here yet.

Twitter thread from Anthropic

Twitter thread from Chris


It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an “ideal” ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout. Empirically, in models we have studied, some of the neurons do cleanly map to features. But it isn't always the case that features correspond so cleanly to neurons, especially in large language models where it actually seems rare for neurons to correspond to clean features. This...


I thought this was a very interesting paper — I particularly liked the relationship to phase transitions.

However, I think there's likely to be another 'phase' that they don't discuss (possibly it didn't crop up in their small models, since it's only viable in a sufficiently large model): one where you pack a very large number of features (thousands or millions, say) into a fairly large number of dimensions (hundreds, say). In spaces with dimensionality >= O(100), the statistics of norm and dot product are such that even randomly chosen unit-norm vectors...
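The geometric fact this comment appeals to is easy to check numerically: random unit vectors in high-dimensional space are nearly orthogonal, with typical dot products shrinking roughly like 1/sqrt(d). A quick sketch (my own addition, not from the paper or the comment):

```python
# Numeric check (my addition) of the high-dimensional geometry claim:
# random unit vectors in d >= ~100 dimensions are nearly orthogonal,
# with |dot product| concentrated near 0 at scale ~1/sqrt(d).
import math
import random


def random_unit_vector(d: int, rng: random.Random) -> list:
    """Sample a uniformly random point on the unit sphere in R^d."""
    v = [rng.gauss(0, 1) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_abs_dot(d: int, pairs: int = 200, seed: int = 0) -> float:
    """Average |u . w| over random unit-vector pairs in dimension d."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(pairs):
        u = random_unit_vector(d, rng)
        w = random_unit_vector(d, rng)
        total += abs(sum(a * b for a, b in zip(u, w)))
    return total / pairs

# The mean |dot| shrinks roughly like sqrt(2 / (pi * d)) as d grows,
# which is why many more than d "almost orthogonal" feature directions
# can be packed into a d-dimensional space.
```

This is the packing phenomenon the comment is gesturing at: interference between randomly placed features falls off as the ambient dimension grows.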


(If you're already familiar with all basics and don't want any preamble, skip ahead to Section B for technical difficulties of alignment proper.)

I have several times failed to write up a well-organized list of reasons why AGI will kill you.  People come in with different ideas about why AGI would be survivable, and want to hear different obviously key points addressed first.  Some fraction of those people are loudly upset with me if the obviously most important points aren't addressed immediately, and I address different points first instead.

Having failed to solve this problem in any good way, I now give up and solve it poorly with a poorly organized list of individual rants.  I'm not particularly happy with this list; the alternative was publishing nothing, and publishing this seems marginally...

This is another reply in this vein. I'm quite new to this, so don't feel obliged to read through; I just told myself I would publish this.

I agree (90-99% agreement) with almost all of the points Eliezer made. And the rest is where I probably didn't understand enough or where there's no need for a comment, e.g.:

1–8: agree

9. I'm not sure I understand this right: if the AGI has been successfully designed not to kill everyone, then why the need for oversight? And if it is capable of killing everyone because the design failed, what would our oversight do? I don't...


LessWrong is experimenting with the addition of reacts to the site, as per the recent experimental Open Thread. We are now progressing to the next stage of the experiment: trying out reacts in actual discussion threads.

The dev/moderator team will be proactively looking for posts to enable react voting on (with author permission), but any user can also enable it themselves to help us experiment:

  • When creating or editing a post, expand the "Options" section at the bottom and change the Voting system to Names-attached reactions


Iterating on the react palette

We're continuing to think about what reacts should be available. Thanks to everyone who's weighed in so far.

I just spent time today...

Alice: Hey Bob, how's it going?

Bob: Good. How about you?

Alice: Good. So... I stalked you and read your blog.

Bob: Haha, oh yeah? What'd you think?


Bob: Oh boy. This is what I was afraid of. This is why I keep my internet life separate from my personal one.

Alice: I'm sorry. I feel bad about stalking you. It's just...

Bob: No, it's ok. I actually would like to talk about this.

Alice: Bob, I'm worried about you. I don't mean this in an offensive way, but this MoreRight community you're a part of, it seems like a cult.

Bob: Again, it's ok. I won't be offended. You don't have to hold back.


To be clear, I do give a lot of weight to Yudkowsky in the sense that I think his arguments make sense and I mostly believe them. Similarly, I don't give much weight to Yann LeCun on this topic. But that's because I can read what Yudkowsky has said and what LeCun has said and think about whether it made sense. If I didn't speak English, so that their words appeared as meaningless noise to me, then I'd be much more uncertain about who to trust, and would probably defer to an average of the opinions of the top ML names, e.g. Sutskever, Goodfellow, Hinton, LeCun, Karpathy, Bengio, etc. The thing about closely studying a specific aspect of AI (namely alignment) would probably get Yudkowsky's and Christiano's names onto that list, but it wouldn't necessarily give Yudkowsky more weight than everyone else combined. (I'm guessing, for hypothetical non-English-speaking me, who somehow has translations for what everyone's bottom-line position is on the topic, but not for their arguments. Basically, the intuition here is that difficult technical achievements like AlexNet, GANs, etc. are some of the easiest things to verify from the outside. It's hard to tell which philosopher is right, but easy to tell which scientist can build a thing for you that will automatically generate amusing new animal pictures.)

If I didn't speak English, so that their words appeared as meaningless noise to me, then I'd be much more uncertain about who to trust, and would probably defer to an average of the opinions of the top ML names, e.g. Sutskever, Goodfellow, Hinton, LeCun, Karpathy, Bengio, etc.

Do you think it'd make sense to give more weight to people in the field of AI safety than to people in the field of AI more broadly?

I would, and I think it's something that generally makes sense. I.e., I don't know much about food science, but on a question involving dairy, I'd trust food...

Note: this has been in my draft queue since well before the FLI letter and TIME article. It was sequenced behind the interpretability post, and delayed by travel, and I'm just plowing ahead and posting it without any edits to acknowledge ongoing discussion aside from this note.

This is an occasional reminder that I think pushing the frontier of AI capabilities in the current paradigm is highly anti-social, and contributes significantly in expectation to the destruction of everything I know and love. To all doing that (directly and purposefully for its own sake, rather than as a mournful negative externality to alignment research): I request you stop.

(There's plenty of other similarly fun things you can do instead! Like trying to figure out how the heck modern AI systems...

"I am currently job hunting, trying to get a job in AI Safety but it seems to be quite difficult especially outside of the US, so I am not sure if I will be able to do it." This has to be taken as a sign that AI alignment research is funding-constrained. At a minimum, technical alignment organizations should engage in massive labor hoarding to prevent the talent from going into capabilities research.

This feels game-theoretically pretty bad to me, and not only in the abstract: I concretely expect that setting up this incentive would cause a bunch of people to attempt to go into capabilities (based on conversations I've had in the space).

My answer is "work on applications of existing AI, not the frontier". Advancing the frontier is the dangerous part, not using the state-of-the-art to make products. But also, don't do frontend or infra for a company that's advancing capabilities.
Matthew Barnett (11h):
What are your thoughts on the argument [] that advancing capabilities could help make us safer?