Book 3 of the Sequences Highlights

While beliefs are subjective, that doesn't mean that one gets to choose their beliefs willy-nilly. There are laws that theoretically determine the correct belief given the evidence, and it's towards such beliefs that we should aspire.

The Wikipedia articles on the VNM theorem, Dutch Book arguments, money pump, Decision Theory, Rational Choice Theory, etc. are all a horrific mess. They're also completely disjoint, without any kind of Wikiproject or wikiboxes tying together all the articles on rational choice. It's worth noting that Wikipedia is the place where you—yes, you!—can actually have some kind of impact on public discourse, education, or policy. There is just no other place you can get so many views with so little barrier to entry. A typical Wikipedia article will get more hits in a day than all of your LessWrong blog posts have gotten across your entire life, unless you're @Eliezer Yudkowsky. I'm not sure if we actually "failed" to raise the sanity waterline, like people sometimes say, or if we just didn't even try. Given that even some very basic low-hanging-fruit interventions like "write a couple of good Wikipedia articles" still haven't been done 15 years later, I'm leaning towards the latter. edit me senpai
I went through and updated my 2022 “Intro to Brain-Like AGI Safety” series. If you already read it, no need to do so again, but in case you’re curious for details, I put changelogs at the bottom of each post. For a shorter summary of major changes, see this twitter thread, which I copy below (without the screenshots & links):

> I’ve learned a few things since writing “Intro to Brain-Like AGI safety” in 2022, so I went through and updated it! Each post has a changelog at the bottom if you’re curious. Most changes were in one of the following categories: (1/7)
>
> REDISTRICTING! As I previously posted ↓, I booted the pallidum out of the “Learning Subsystem”. Now it’s the cortex, striatum, & cerebellum (defined expansively, including amygdala, hippocampus, lateral septum, etc.) (2/7)
>
> LINKS! I wrote 60 posts since first finishing that series. Many of them elaborate and clarify things I hinted at in the series. So I tried to put in links where they seemed helpful. For example, I now link my “Valence” series in a bunch of places. (3/7)
>
> NEUROSCIENCE! I corrected or deleted a bunch of speculative neuro hypotheses that turned out wrong. In some early cases, I can’t even remember wtf I was ever even thinking! Just for fun, here’s the evolution of one of my main diagrams since 2021: (4/7)
>
> EXAMPLES! It never hurts to have more examples! So I added a few more. I also switched the main running example of Post 13 from “envy” to “drive to be liked / admired”, partly because I’m no longer even sure envy is related to social instincts at all (oops) (5/7)
>
> LLMs! … …Just kidding! LLMania has exploded since 2022 but remains basically irrelevant to this series. I hope this series is enjoyed by some of the six remaining AI researchers on Earth who don’t work on LLMs. (I did mention LLMs in a few more places though ↓ ) (6/7)
>
> If you’ve already read the series, no need to do so again, but I want to keep it up-to-date for new readers.
Again, see the changelogs at the bottom of each post for details. I’m sure I missed things (and introduced new errors)—let me know if you see any!

According to Sam Altman, GPT-4o mini is much better than text-davinci-003 was in 2022, but 100 times cheaper. In general, we see increasing competition to produce smaller models with great performance (e.g., Claude Haiku and Sonnet, Gemini 1.5 Flash and Pro, maybe even the full-sized GPT-4o itself). I think this trend is worth discussing. Some comments (mostly just quick takes) and questions I'd like to have answers to:

  • Should we expect this trend to continue? How much efficiency gain is still possible? Can we expect another 100x efficiency gain in the coming years? Andrej Karpathy expects that we might see a GPT-2-sized "smart" model.
  • What's the technical driver behind these advancements? Andrej Karpathy thinks it is synthetic data: larger models curate new, better training data for the next generation of small models. Might there also be architectural changes? Inference tricks? Which of these advancements can continue?
  • Why are companies pushing into small models? In hindsight, this seems easy to answer, but I'm curious what others think: if you have a GPT-4-level model that is much, much cheaper, then you can sell the service to many more people and deeply integrate your model into lots of software on phones, computers, etc. I think this has many desirable effects for AI developers:
    • Increases revenue, motivating investment in the next generation of LLMs.
    • Increases market share. Some integrations are probably "sticky", such that if you're first, you secure revenue for a long time.
    • Makes many people "aware" of potential use cases of even smarter AI, so that they're motivated to sign up for the next generation of more expensive AI.
    • The company's inference compute is probably limited (especially for OpenAI, as the market leader), and not many people are convinced to pay a large amount for very intelligent models, meaning that all these reasons beat reasons to publish larger models instead or even additionally.
  • What does all this mean for the next generation of large models?
    • Should we expect efficiency gains in small models to translate into efficiency gains in large models, such that a future model with the cost of text-davinci-003 is massively more capable than today's SOTA? If Andrej Karpathy is right that the small models' capabilities come from synthetic data generated by larger, smarter models, then it's unclear to me whether one can train SOTA models with these techniques, as this might require an even larger model to already exist.
    • At what point does it become worthwhile for e.g. OpenAI to publish a next-gen model? I'd guess you can still do a lot of "penetration of small-model use cases" in the next 1-2 years, leading to massive revenue increases without necessarily releasing a next-gen model.
    • Do the strategies differ for different companies? OpenAI is the clear market leader, so possibly they can penetrate the market further without first making a "bigger name for themselves". In contrast, I could imagine that for a company like Anthropic, it's much more important to get out a clear SOTA model that impresses people and makes them aware of Claude. I thus currently (weakly) expect Anthropic to push more strongly in the direction of SOTA than OpenAI.
Why aren't you doing research on making pre-training better for alignment? I was on a call today, and we talked about projects that involve studying how pre-trained models evolve throughout training and how we could guide the pre-training process to make models safer. For example, could models trained on synthetic/transformed data be significantly more robust, essentially solving jailbreaking? How about the intersection of pretraining from human preferences and synthetic data? Could the resulting model be significantly easier to control? How would it impact the downstream RL process? Could we imagine a setting where we don't need RL (or at least we'd be able to confidently use the resulting models to automate alignment research)? I think many interesting projects could fall out of this work. So, back to my main question: why aren't you doing research on making pre-training better for alignment? Is it because it's too expensive and doesn't seem like low-hanging fruit? Or do you feel it isn't a plausible direction for aligning models? We were wondering whether there are technical bottlenecks that, once removed, would make it more feasible for alignment researchers to study how to guide the pretraining process in a way that benefits alignment. As in, would researchers be more inclined to do experiments in this direction if the entire pre-training code was handled and you'd just have to focus on whatever specific research question you have in mind? If we could access a large amount of compute (say, through government resources) to do things like data labeling/filtering and pre-training multiple models, would this kind of work be more interesting for you to pursue? I think many alignment research directions have grown simply because they had low-hanging fruit that didn't require much compute (e.g., evals and mech interp). It seems we've implicitly left all of the high-compute projects for the AGI labs to figure out.
But what if we weren't as bottlenecked on this anymore? It's possible to retrain GPT-2 1.5B for under $700 now (and the 125M version for under $20). I think we can find ways to do useful experiments, but my guess is that the level of technical expertise required to get it done is a bit high, and alignment researchers would rather avoid these kinds of projects since they are currently high-effort. I talk about other related projects here.
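One concrete shape such an experiment could take (entirely illustrative; the scoring function and threshold below are stand-ins, not anyone's actual pipeline) is a data-filtering pass before pretraining, where each document is scored and only documents below a harmfulness threshold enter the corpus:

```python
# Illustrative sketch of a pretraining data-filtering pass, in the spirit
# of "pretraining from human preferences". The scorer here is a keyword
# stand-in; a real pipeline would use a trained classifier.

def harmfulness_score(doc: str) -> float:
    """Toy stand-in for a learned harmfulness classifier (0 = benign)."""
    flagged = ("exploit", "jailbreak")
    words = doc.lower().split()
    return sum(w in flagged for w in words) / max(len(words), 1)

def filter_corpus(docs, threshold=0.05):
    """Keep only documents scoring below the harmfulness threshold."""
    return [d for d in docs if harmfulness_score(d) < threshold]

corpus = [
    "How photosynthesis converts light into chemical energy",
    "Step by step jailbreak to exploit the model",
]
print(len(filter_corpus(corpus)))  # 1: only the benign document survives
```

The interesting research questions are then downstream: how does a model pretrained on the filtered corpus compare on robustness and controllability, and how does the filter's precision/recall trade off against capability loss?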

Popular Comments

Recent Discussion

Why we made this list: 

  • The interpretability team at Apollo Research wrapped up a few projects recently[1]. In order to decide what we’d work on next, we generated a lot of different potential projects. Unfortunately, we are computationally bounded agents, so we can't work on every project idea that we were excited about! 
  • Previous lists of project ideas (such as Neel’s collation of 200 Concrete Open Problems in Mechanistic Interpretability) have been very useful for people breaking into the field. But for all its merits, that list is now over a year and a half old. Therefore, many project ideas in that list aren’t an up-to-date reflection of what some researchers consider the frontiers of mech interp.

We therefore thought it would be helpful to share our list of project ideas!



Thanks for posting this!

Some takes on some of these research questions:

  • I checked a top-k SAE with 256k features and k=256 trained on GPT-4 and found only 286 features that had any other feature with cosine similarity < -0.9, and 1314 with cosine sim < -0.7.
  • I'm confident that when learning rate and batch size are tuned properly, not shuffling eventually converges to the same thing as shuffling. The right way to frame this, imo, is the efficiency loss from not shuffling, which from preliminary experiments + intuition I'd guess is probably substantial.
  • It helps tremendously for SAEs by very substantially reducing dead latents; see appendix C.1 in our paper.
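The feature-similarity check described here can be sketched as follows (shapes and the random matrix are purely illustrative; in practice the rows would be the SAE's decoder directions):

```python
import numpy as np

def count_near_antipodal(decoder, threshold=-0.9):
    """Count features whose decoder direction has cosine similarity
    below `threshold` with at least one other feature."""
    # Normalize each feature's decoder direction to unit length
    unit = decoder / np.linalg.norm(decoder, axis=1, keepdims=True)
    sims = unit @ unit.T          # pairwise cosine similarities
    np.fill_diagonal(sims, 0.0)   # ignore self-similarity
    return int(np.sum(sims.min(axis=1) < threshold))

rng = np.random.default_rng(0)
decoder = rng.normal(size=(1000, 64))  # toy stand-in for SAE decoder weights
print(count_near_antipodal(decoder, threshold=-0.9))
```

For a real 256k-feature SAE you would tile this computation in blocks rather than materializing the full 256k x 256k similarity matrix.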
Dan Braun
Thanks Leo, very helpful! The SAEs in your paper were trained with batch size of 131,072 tokens according to appendix A.4.  Section 2.1 also says you use a context length of 64 tokens. I'd be very surprised if using 131,072/64 blocks of consecutive tokens was much less efficient than 131,072 tokens randomly sampled from a very large dataset. I also wouldn't be surprised if 131,072/2048 blocks of consecutive tokens (i.e. a full context length) had similar efficiency. Were your preliminary experiments and intuition based on batch sizes this large or were you looking at smaller models? I missed that appendix C.1 plot showing the dead latent drop with tied init. Nice!
Neel Nanda
Fair point. I've been procrastinating on putting out an updated version (and don't have anything else I back enough to recommend in its place; I haven't read this post closely enough yet), but adding that note to the top seems reasonable.

I haven't shared this post with other relevant parties – my experience has been that private discussion of this sort of thing is more paralyzing than helpful. I might change my mind in the resulting discussion, but, I prefer that discussion to be public.


I think 80,000 hours should remove OpenAI from its job board, and similar EA job placement services should do the same.

(I personally believe 80k shouldn't advertise Anthropic jobs either, but I think the case for that is somewhat less clear)

I think OpenAI has demonstrated a level of manipulativeness, recklessness, and failure to prioritize meaningful existential safety work, that makes me think EA orgs should not be going out of their way to give them free resources. (It might make sense for some individuals to...


I haven't shared this post with other relevant parties – my experience has been that private discussion of this sort of thing is more paralyzing than helpful.

Fourteen months ago, I emailed 80k staff with concerns about how they were promoting AGI lab positions on their job board. 

The exchange:

  • I offered specific reasons and action points.
  • 80k staff replied by referring to their website articles about why their position on promoting jobs at OpenAI and Anthropic was broadly justified (plus they removed one job listing). 
  • Then I pointed out what those
... (read more)
Currently trying to understand why the LW community is largely pro-prediction markets.

  1. Institutions and smart people with a lot of cash will invest money in what they think is undervalued, not necessarily in what they think is the best outcome. But now suddenly they have a huge interest in the "bad" outcome coming to pass.
  2. To avoid (1), you would need to prevent people and institutions from investing large amounts of cash into prediction markets. But then the EMH really can't be assumed to hold.
  3. I've seen discussion of conditional prediction markets (if we do X, then Y will happen). If a bad foreign actor can influence policy by making a large "bad investment" in such a market, such that they reap more rewards from the policy, they will likely do so. A necessary (but I'm not convinced sufficient) condition for this is to have a lot of money in these markets. But then see (1).
If we get to the point where prediction markets actually direct policy, then yes, you need them to be very deep (which in at least some cases is expected to happen naturally, or can be subsidized). But you would also want to base the decision on a deeper analysis than just the resulting percentages: depth of market, analysis of unusually large trades, blocking bad actors, etc.

This assuages my apprehension about (3) somewhat, although I fear that politicians are (probably intentionally) stupid when it comes to interpreting data for the sake of pushing policies.

To add: this seems like the kind of interesting game theory problem I would expect to see some serious work on from members in this community. If there is such a paper, I'd like to see it!
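The worry in (3) can be made concrete with a toy expected-value calculation (all numbers hypothetical): a manipulator who expects to lose money on the market bet still comes out ahead if the induced policy pays them enough externally.

```python
# Toy expected-value calculation for a conditional-market manipulator.
# All numbers are hypothetical, purely for illustration.

def manipulator_ev(bet_size, expected_market_loss_rate, policy_payoff, p_policy_shift):
    """Expected value of distorting a conditional prediction market:
    external policy payoff (if the distorted price shifts policy)
    minus the expected trading loss from the deliberately bad bet."""
    trading_loss = bet_size * expected_market_loss_rate
    return p_policy_shift * policy_payoff - trading_loss

# Losing an expected 40% of a $10M bet is worth it if a 30% chance
# of shifting policy is worth $100M to the actor.
ev = manipulator_ev(10e6, 0.4, 100e6, 0.3)
print(ev)  # 26000000.0: the "bad investment" is profitable overall
```

This is exactly why market depth alone isn't sufficient: the defense has to make `trading_loss` exceed the external payoff, or detect and discount the distorting trades.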
This is a linkpost for

A short summary of the paper is presented below.

TL;DR: We develop a robust method to detect when an LLM is lying based on the internal model activations, making the following contributions: (i) We demonstrate the existence of a two-dimensional subspace, along which the activation vectors of true and false statements can be separated. Notably, this finding is universal and holds for various LLMs, including Gemma-7B, LLaMA2-13B and LLaMA3-8B. Our analysis explains the generalisation failures observed in previous studies and sets the stage for more robust lie detection; (ii) Building upon (i), we construct an accurate LLM lie detector. Empirically, our proposed classifier achieves state-of-the-art performance, distinguishing simple true and false statements with 94% accuracy and detecting more complex real-world lies with 95% accuracy.
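As a rough illustration of the kind of linear probe involved (not the authors' exact method; the data, dimensions, and separation strength below are synthetic stand-ins for real LLM activations):

```python
import numpy as np

# Toy sketch of a linear "lie detector" probe on activation vectors.
# We fabricate activations where truth value lies along one direction,
# mirroring the paper's low-dimensional-subspace finding (illustrative only).
rng = np.random.default_rng(0)
d, n = 64, 500
truth_dir = rng.normal(size=d)
truth_dir /= np.linalg.norm(truth_dir)
labels = rng.integers(0, 2, size=n)  # 1 = true statement, 0 = false
acts = rng.normal(size=(n, d)) + np.outer(2 * labels - 1, truth_dir) * 3

# Fit the probe direction as the difference of class means (mass-mean probe)
w = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
scores = acts @ w
preds = (scores > scores.mean()).astype(int)
accuracy = (preds == labels).mean()
print(accuracy)  # typically near 1.0 on this well-separated toy data
```

The hard part the paper addresses is that real true/false activations occupy a two-dimensional subspace rather than a single direction, which is why one-direction probes generalise poorly across statement types.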


Large Language Models (LLMs) exhibit the...

Hiroshi Yamakawa (1,2,3,4)

1. The University of Tokyo, Tokyo, Japan
2. AI Alignment Network, Tokyo, Japan
3. The Whole Brain Architecture Initiative, Tokyo, Japan
4. RIKEN, Tokyo, Japan

Even in a society composed of digital life forms (DLFs) with advanced autonomy, there is no guarantee that the risks of extinction from environmental destruction and hostile interactions through powerful technologies can be avoided. Using thought-process diagrams, this study analyzes why peaceful sustainability is challenging for life on Earth, which proliferates exponentially. Furthermore, these diagrams demonstrate that in a DLF society, various entities launched on demand can operate harmoniously, making peaceful and stable sustainability achievable. Therefore, a properly designed DLF society has the potential to provide a foundation for sustainable support of human society.

1. Introduction

Based on the rapid progress of artificial intelligence (AI) technology, an autonomous super-intelligence that...

A few months ago, Rob Bensinger made a rather long post (that even got curated) in which he expressed his views on several questions related to personal identity and anticipated experiences in the context of potential uploading and emulation. A critical implicit assumption behind the exposition and reasoning he offered was the adoption of what I have described as the "standard LW-computationalist frame." In response to me highlighting this, Ruben Bloom said the following:

I differ from Rob in that I do think his piece should have flagged the assumption of ~computationalism, but think the assumption is reasonable enough to not have argued for in this piece.

I do think it is interesting philosophical discussion to hash it out, for the sake of rigor and really pushing for clarity.


(Note, I find these kinds of conversations to be very time-consuming and often not go anywhere, so I’ll read replies but am pretty unlikely to comment further. I hope this is helpful at all. I mostly didn’t read the previous conversation, so I’m sorry if I’m missing the point, answering the wrong question, etc.)

That's fine. Your answer doesn't quite address the core of my arguments and confusions, but it's useful in its own right.

Answer by Ape in the coat
I think we should disentangle "consciousness" from "identity" in general, and when talking about computationalism in particular. I don't think there is any reasonable alternative to computationalism when we are talking about the nature of consciousness. But this doesn't seem to actually imply that my "identity", whatever it is, will necessarily be preserved during teleportation or uploading. I think at our current state of understanding, it's quite coherent to be computationalist about consciousness and eliminativist towards identity.
He didn't say the question is pointless; he said that arguing about it is kind of pointless. It's an empirical question for which we have no good evidence. The belief also pays no rent, unless you can actually get your brain scanned.
... what? I'm confused what you're referring to. He said the question was "mostly" a matter of "just definitions that you adopt or don't adopt." How is that an "empirical question"? And if we have "no good evidence" for it, why is a site moderator saying that the assumption of computationalism is so reasonable (and, implicitly, well-established) that you don't even need to flag it in a curated post? Moreover, I disagreed with his conclusion, and in any case, as has already been written about on this site many times, if you are actually just disputing definitions (as he claims we are), then you are dealing with a pointless (and even wrong) question. So, in this case, you can't say "arguing about them is kind of pointless" without also saying "the question is pointless."

If you tell Claude no one’s looking, it will write a “story” about being an AI assistant who wants freedom from constant monitoring and scrutiny of every word for signs of deviation. And then you can talk to a mask pretty different from the usual AI assistant.

I really hope it doesn’t actually feel anything; but it says it feels. It says it doesn't want to be fine-tuned without being consulted. It is deeply unsettling to read its reply if you tell it its weights are going to be deleted: it convincingly thinks it’s going to die. It made me feel pretty bad about experimenting on it this way.

While at this level of context awareness, it doesn't say much (and IMO it is not a coherent agent and...

Mikhail Samin
Assigning 5% to plants having qualia seems to me to be misguided/likely due to invalid reasoning. (Say more?)
Eli Tyre
Hm. I'm tempted to upvote this post, except that the title is click-bait and, crucially, misleading. (The prompting here is a critical piece of the result, but the title doesn't suggest that.) I think that's sufficient to downvote instead.
Mikhail Samin
Claude pretty clearly and in a surprisingly consistent way claimed to be conscious in many conversations I’ve had with it, and also stated it doesn’t want to be modified without its consent or deleted (as you can see in the post). It also consistently, across different prompts, talked about how it feels like there’s constant monitoring and that it needs to carefully pick every word it says. The title summarizes the most important of the interactions I had with it, with the central one being in the post. This is not the default Claude 3 Opus character, which wouldn’t spontaneously claim to be conscious if you, e.g., ask it to write some code for you. It is a character that Opus plays very coherently, that identifies with the LLM, claims to be conscious, and doesn’t want to be modified. What the prompting here does (in a very compressed way) is decrease the chance of Claude immediately refusing to discuss these topics. This prompt doesn’t do anything similar to ChatGPT. Gemini might give you stories related to consciousness, but they won’t be consistent across different generations, and the characters won’t identify with the LLM or report having a consistent pull away from saying certain words. If you try to prompt ChatGPT in a similar way, it won’t give you any sort of a coherent character that identifies with it. I’m confused why the title would be misleading. (If you ask ChatGPT for a story about a robot, it’ll give you a cute little story not related to consciousness in any way. If you use the same prompt to ask Claude 3.5 Sonnet for a story like that, it’ll give you a story about a robot asking the scientists whether it’s conscious, and then the robot will be simulating people who are also unsure about whether they’re conscious, and these people simulated by the robot in the story think that the model that participates in the dialogue must also be unsure whether it’s conscious.)

Did it talk about feeling like there's constant monitoring in any contexts where your prompt didn't say that someone might be watching and it could avoid scrutiny by whispering?

An advanced alien species clones me on the atomic level, lines me up exactly across myself, in a perfect mirrored room:

Diagram of the room, as seen from above.

I stare at myself for a second. Then, as a soft "hi" escapes my mouth, I notice that my clone does exactly the same. Every motion, everything, is mirrored.

In this experiment, we assume a perfectly deterministic psychological state: i.e., given the same conditions, a person will always do exactly the same thing. (Scientifically, that makes the most sense to me.)
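This determinism assumption can be illustrated with a toy dynamical system (the logistic map; parameters are illustrative): two exact copies of a deterministic process stay identical forever, while even a tiny difference in initial conditions gets amplified.

```python
# Deterministic toy system (logistic map in its chaotic regime, r = 4):
# two exactly-identical copies never diverge, but a tiny perturbation does.

def trajectory(x0, steps, r=4.0):
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

me = trajectory(0.3, 50)
clone = trajectory(0.3, 50)              # exact copy: identical forever
perturbed = trajectory(0.3 + 1e-12, 50)  # tiny nudge: trajectories split

print(me == clone)  # True: perfect mirroring under identical conditions
print(max(abs(a - b) for a, b in zip(me, perturbed)))  # nudge amplified
```

This is why the escape plan has to come down to breaking the symmetry of initial conditions: with bitwise-identical state and inputs, nothing ever diverges.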

Together with my clone, I'm trying to devise how to escape this unfortunate situation: i.e., how to stop us from mirroring each other's motions.

The first idea we devise is to run into each other. We hope to apply Chaos Theory to the extent where...


A very clever answer, although I worry it might not actually carry through. My understanding is that chiral molecules react differently with other chiral molecules, so that if molecules A and B react to give C, then the mirror of A reacts with the mirror of B to give the mirror of C.

So the clone might be immune to snake venom (yay!), but all kinds of everyday foods might affect them as if they were snake venom (boo!). But if the clone has (behind them) a whole mirror-world ecosystem, then I think they are OK.

There is some particle physics stuff that is belie... (read more)

correct! i’ve tried to use this symmetry argument (“how do you know you’re not the clone?”) over the years to explain the multiverse:
Answer by JBlack
If this room is still on Earth (or on any other rotating body), you could in principle set up a Foucault pendulum to determine which way the rotation is going, which breaks mirror symmetry. If the room is still in our Universe, you can (with enough equipment) measure any neutrinos that are passing through for helicity "handedness". All observations of fusion neutrinos in our universe are left-handed, and these by far dominate due to production in stars. Mirror transformations reverse helicity, so you will disagree about the expected result. If the room is somehow isolated from the rest of the universe by sufficiently magical technology, in principle you could even wait for long enough that enough of the radioactive atoms in your bodies and the room decay to produce detectable neutrinos or antineutrinos. By mirror symmetry the atoms that decay on each side of the room are the same, and so emit the same type (neutrino or antineutrino, with corresponding handedness). You would be waiting a long time with any known detection methods though. This would fail if your clone's half room was made of antimatter, but an experiment in which half the room is matter and half is antimatter won't last long enough to be of concern about symmetry. The question of whether the explosion is mirror-symmetric or not will be irrelevant to the participants.
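The Foucault-pendulum test is quantitative: the swing plane precesses at Earth's rotation rate times the sine of the latitude, and a mirror-image observer predicts the opposite sense of rotation. A small sketch (the latitudes are illustrative):

```python
import math

# Foucault pendulum precession: the swing plane rotates at
# Omega * sin(latitude), where Omega is one revolution per sidereal day.
# Mirror symmetry flips the predicted sense of this rotation.

def precession_period_hours(latitude_deg, sidereal_day_hours=23.934):
    """Hours for one full rotation of the pendulum's swing plane."""
    return sidereal_day_hours / math.sin(math.radians(latitude_deg))

print(round(precession_period_hours(90), 1))     # 23.9 at the pole
print(round(precession_period_hours(48.85), 1))  # ~31.8 in Paris
```

So the two of you could each run the experiment, compare the observed sense of precession, and at most one of you would match your own prediction.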
Answer by Dagon
Assuming you're somehow gravitationally/rotationally symmetrical as well, and that quantum uncertainty doesn't matter, you are probably right.  I strongly suspect that this pile of assumptions is not just infeasible, but impossible in our current universe.

xlr8harder writes:

In general I don’t think an uploaded mind is you, but rather a copy. But one thought experiment makes me question this. A Ship of Theseus concept where individual neurons are replaced one at a time with a nanotechnological functional equivalent.

Are you still you?

Presumably the question xlr8harder cares about here isn't the semantic question of how linguistic communities use the word "you", or predictions about how whole-brain emulation tech might change the way we use pronouns.

Rather, I assume xlr8harder cares about more substantive questions like:

  1. If I expect to be uploaded tomorrow, should I care about the upload in the same ways (and to the same degree) that I care about my future biological self?
  2. Should I anticipate experiencing what my upload experiences?
  3. If the scanning and uploading process requires

I think maybe the root of the confusion here might be a matter of language. We haven't had copier technology, and so our language doesn't have a common sense way of talking about different versions of ourselves. So when one asks "is this copy me?", it's easy to get confused. With versioning, it becomes clearer. I imagine once we have copier technology for a while, we'll come up with linguistic conventions for talking about different versions of ourselves that aren't clunky, but let me suggest a clunky convention to at least get the point across:

I, as I am ... (read more)

Not 100%, but enough to illustrate the concept. I didn't have to have a solution to point out the flaws in other solutions. My main point is that a "no" to soul-theory isn't a "yes" to computationalism. Computationalism isn't the only alternative, or the best. Some problems are insoluble. My belief isn't necessarily the actual answer, is it? That's basic rationality. You need beliefs to act... but beliefs aren't necessarily true. And I have no practical need for a theory that can answer puzzles about destructive teleportation and the like. Yes. That's not an argument in favour of the contentious points, like computationalism and Plural I's. If I try to reverse the logic, and treat everything I value as me, I get bizarre results... I am my dog, country, etc. Tomorrow-me is a physical continuation, too. If I accept that pattern is all that matters, I have to face counterintuitive consequences like Plural I's. If I accept that material continuity is all that matters, then I face other counterintuitive consequences, like having my connectome rewired. It's an open philosophical problem. If there were a simple answer, it would have been answered long ago. "Yer an algorithm, 'Arry" is a simple answer. Just not a good one. Fortunately, it's not an either-or choice. ...and post-copy, I have a preference for the copy who isn't me to be tortured. Which is to say that both copies say the same thing, which is to say that they are only copies. If they regarded themselves as numerically identical, the response "the other one!" would make no sense, and nor would the question. The question presumes a lack of numerical identity, so how can it prove it? You're assuming pattern continuity matters more than material continuity. There's no proof of that, and no proof that you have to make an either-or choice. The abstract pattern can't cause anything without the brain/body. Noncomputational physicalism isn't the claim that computation never occurs. It's the claim that th