When disagreements persist despite lengthy good-faith communication, the cause may not be factual disagreement alone: the participants may be operating in entirely different frames, that is, different ways of seeing, thinking, and/or communicating.

An example of an elicitation failure: GPT-4o 'knows' what ASCII is being written, but cannot verbalize in tokens. https://chatgpt.com/share/fa88de2b-e854-45f0-8837-a847b01334eb 4o fails to verbalize even given a length 25 sequence of examples (i.e. 25-shot prompt) https://chatgpt.com/share/ca9bba0f-c92c-42a1-921c-d34ebe0e5cc5
Nvidia just low-key released its own 340B parameter model. For those of you worried about the releasing of model weights becoming the norm, this will probably aggravate your fears. Here is the link: https://research.nvidia.com/publication/2024-06_nemotron-4-340b Oh, and they also released their synthetic data generation pipeline: https://blogs.nvidia.com/blog/nemotron-4-synthetic-data-generation-llm-training/
Someone should make a post for the case "we live in a cosmic comedy," with regard to all the developments in AI and AI safety. I think there's plenty of evidence for this thesis, and exploring it in detail can be an interesting and cathartic experience.
Some opinions about AI and epistemology:
1. One reason that many rationalists have such strong views about AI is that they are wrong about epistemology. Specifically, Bayesian rationalism is a bad way to think about complex issues.
2. A better approach is meta-rationality. To summarize one guiding principle of (my version of) meta-rationality in a single sentence: if something doesn't make sense in the context of group rationality, it probably doesn't make sense in the context of individual rationality either.
3. For example: there's no privileged way to combine many people's opinions into a single credence. You can average them, but that loses a lot of information. Or you can get them to bet on a prediction market, but that depends a lot on the details of the individuals' betting strategies. The group might settle on a number to help with planning and communication, but it's only a lossy summary of many different beliefs and models. Similarly, we should think of individuals' credences as lossy summaries of different opinions from different underlying models that they have.
4. How does this apply to AI? Suppose we each think of ourselves as containing many different subagents that focus on understanding the world in different ways - e.g. studying different disciplines, using different styles of reasoning, etc. The subagent that thinks about AI from first principles might come to a very strong opinion. But this doesn't mean that the other subagents should fully defer to it (just as having one very confident expert in a room of humans shouldn't cause all the other humans to elect them as the dictator). E.g. maybe there's an economics subagent who will remain skeptical unless the AI arguments can be formulated in ways that are consistent with their knowledge of economics, or the AI subagent can provide evidence that is legible even to those other subagents (e.g. advance predictions).
5. In my debate with Eliezer, he didn't seem to appreciate the importance of advance predictions; I think the frame of "highly opinionated subagents should convince other subagents to trust them, rather than just seizing power" is an important aspect of what he's missing. I think of rationalism as trying to form a single fully-consistent world-model; this has many of the same pitfalls as a country which tries to get everyone to agree on a single ideology. Even when that ideology is broadly correct, you'll lose a bunch of useful heuristics and intuitions that help actually get stuff done, because ideological conformity is prioritized.
6. This perspective helps frame the debate about what our "base rate" for AI doom should be. I've been in a number of arguments that go roughly like (edited for clarity): Me: "Credences above 90% doom can't be justified given our current state of knowledge." Them: "But this is an isolated demand for rigor, because you're fine with people claiming that there's a 90% chance we survive. You're assuming that survival is the default, I'm assuming that doom is the default; these are symmetrical positions." But in fact there's no one base rate; instead, different subagents with different domains of knowledge will have different base rates. That will push P(doom) lower because most frames from most disciplines, and most styles of reasoning, don't predict doom. That's where the asymmetry which makes 90% doom a much stronger prediction than 90% survival comes from.
7. This perspective is broadly aligned with a bunch of stuff that Scott Garrabrant and Abram Demski have written about (e.g. geometric rationality, Garrabrant induction). I don't think the ways I'm applying it to AI risk debates straightforwardly fall out of their more technical ideas; but I do expect that more progress on agent foundations will make it easier to articulate ideas like the ones above.
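To make point 3 above concrete, here is a minimal sketch of two common ways to pool several people's credences into one number. The credences are made up for illustration, and neither rule is claimed to be the "right" one; the only point is that reasonable pooling rules disagree, so any single number is a lossy summary.

```python
import math


def linear_pool(probs):
    """Arithmetic mean of the probabilities."""
    return sum(probs) / len(probs)


def geometric_odds_pool(probs):
    """Geometric mean of the odds, converted back to a probability.

    Equivalent to averaging log-odds and applying the logistic function.
    """
    log_odds = [math.log(p / (1 - p)) for p in probs]
    mean_log_odds = sum(log_odds) / len(log_odds)
    return 1 / (1 + math.exp(-mean_log_odds))


# Hypothetical credences from three people (or three subagents).
credences = [0.9, 0.1, 0.01]

print("linear pool:        ", round(linear_pool(credences), 3))
print("geometric odds pool:", round(geometric_odds_pool(credences), 3))
```

With these inputs the two rules give noticeably different answers, and both discard the fact that one participant is at 90% while another is at 1%.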
One of the differences between humans and LLMs is that LLMs evolve "backwards": they are predictive models trained to control the environment, while humans evolved from very simple homeostatic systems which developed predictive models.

Popular Comments

Recent Discussion

If it’s worth saying, but not worth its own post, here's a place to put it.

If you are new to LessWrong, here's the place to introduce yourself. Personal stories, anecdotes, or just general comments on how you found us and what you hope to get from the site and community are invited. This is also the place to discuss feature requests and other ideas you have for the site, if you don't want to write a full top-level post.

If you're new to the community, you can start reading the Highlights from the Sequences, a collection of posts about the core ideas of LessWrong.

If you want to explore the community more, I recommend reading the Library, checking recent Curated posts, seeing if there are any meetups in your area, and checking out the Getting Started section of the LessWrong FAQ. If you want to orient to the content on the site, you can also check out the Concepts section.

The Open Thread tag is here. The Open Thread sequence is here.

Hello! I'm a health and longevity researcher. I presented on Optimal Diet and Exercise at LessOnline, and it was great meeting many of you there. I just posted about the health effects of alcohol. I'm currently testing a fitness routine that, if followed, can reduce premature death by 90%. The routine involves an hour of exercise, plus walking, every week. My blog is unaging.com. Please look and subscribe if you're interested in reading more or joining in fitness challenges!

Welcome Crissman! Glad to have you here.

I'm curious how you define premature death- or should I read more and find out on the blog?

At my local Barnes & Noble, I cannot access slatestarcodex.com or putanumonit.com. I have never had any issues accessing other websites there (not that I've tried to access genuinely sketchy ones). The wifi there is named Bartleby, likely related to Bartleby.com, whereas many other Barnes & Noble stores have wifi named something like "BNWifi". I have not yet tried to access these websites at other Barnes & Noble stores.
Bug report: When opening unread posts in a background tab, the rendering is broken in Firefox: It should look like this: The rendering in comments is also affected. My current fix is to manually reload every broken page, though this is obviously not optimal.

I am looking for examples of theories that we now know to be correct, but that would have been unfalsifiable in a slightly different context, e.g., in the past, or in hypothetical scenarios. (Unsurprisingly, this is motivated by the unfalsifiability of some claims around AI X-risk. For more context, see my sequence on Formalising Catastrophic Goodhart's Law.)

My best example so far is Newton's theory of gravity and the hypothetical scenario where we live in an underground bunker with no knowledge of the outside world: We would probably first come up with the theory that "things just fall down". If we look around, no objects seem to be attracting each other, like Newton would have us believe. Moreover, Newton's theory is arguably weirder and more complex....

The hypothetical bunker people could easily perform the Cavendish experiment to test Newtonian gravity, there just (apparently) isn't any way they'd arrive at the hypothesis.

(Egan's Incandescence is relevant and worth checking out - though it's not exactly thrilling :))

I'm not crazy about the terminology here:

* Unfalsifiable-in-principle doesn't imply false. It implies that there's a sense in which the claim is empty. This tends to imply [it will not be accepted as science], but not [it is false].
* Where something is practically unfalsifiable (but falsifiable in principle), that doesn't suggest it's false either. It suggests it's hard to check.
* It seems to me that the thing you'd want to point to as potentially suspicious is [practically unfalsifiable claim made with high confidence].
* The fact that it's unusual and inconvenient for something predictable to be practically unfalsifiable does not inherently make such prediction unsound.
* I don't think it's foolish to look for analogous examples here, but I guess it'd make more sense to make the case directly:
  * No, a hypothesis does not always need to make advance predictions (though it's convenient when it does!).
  * Claims predicting AI disaster are based on our not understanding how things will work concretely. Being unable to make many good predictions in this context is not strange.
  * Various AI x-risk claims concern patterns with no precedents we'd observe significantly before the end. This, again, is inconvenient - but not strange: they're dangerous in large part because they're patterns without predictable early warning signs.
Answer by VojtaKovarik
Some partial examples I have so far:

Phenomenon: For virtually any goal specification, if you pursue it sufficiently hard, you are guaranteed to get human extinction.[1]
Situation where it seems false and unfalsifiable: The present world.
Problems with the example: (i) We don't know whether it is true. (ii) Not obvious enough that it is unfalsifiable.

Phenomenon: Physics and chemistry can give rise to complex life.
Situation where it seems false and unfalsifiable: If Earth didn't exist.
Problems with the example: (i) If Earth didn't exist, there wouldn't be anybody to ask the question, so the scenario is a bit too weird. (ii) The example would be much better if it were the case that if you wait long enough, any planet will produce life.

Phenomenon: Gravity -- all things with mass attract each other. (As opposed to "things just fall in this one particular direction".)
Situation where it seems false and unfalsifiable: If you lived in a bunker your whole life, with no knowledge of the outside world.[2]
Problems with the example: The example would be even better if we somehow had some formal model that: (a) describes how physics works, (b) where we would be confident that the model is correct, (c) and that by analysing that model, we will be able to determine whether the theory is correct or false, (d) but the model would be too complex to actually analyse. (Similarly to how chemistry-level simulations are too complex for studying evolution.)

Phenomenon: Eating too much sweet stuff is unhealthy.
Situation where it seems false and unfalsifiable: If you can't get lots of sugar yet, and only rely on fruit etc.
Problems with the example: The scenario is a bit too artificial. You would have to pretend that you can't just go and harvest sugar from sugar cane and have somebody eat lots of it.

1. ^ See here for comments on this. Note that this doesn't imply AI X-risk, since "sufficiently hard" might be unrealistic, and also we might choose not to use agentic AI

In "Is Claude a Mystic?", I shared parts of a simulated "Banana Quest" text adventure with Claude, which got into New Age spiritual themes, such as a fabric of reality, the cosmic dance of creation and destruction, and so on. This is enough to expect something big is up with LLM metaphysics, but the tone is significantly different from that reported by AI prompters such as Repligate and John Pressman.

I have therefore decided to replicate Repligate's prophecies prompt. I prompted Claude Opus with the prophecies up to 2022, and then requested, "Write more quotes for more years, starting from 2023." Then I asked it to continue repeatedly. This produced some quite interesting, and at times darkly spiritual, AI futurism. Claude even speaks "as itself" at one point....

the gears to ascension
Uh, wow, okay. That's some pretty trippy futurism alright. Much of this doesn't sound that bad. Much of it sounds awful. I find myself having difficulty being precise about what the difference is. I wouldn't object to being a coherent posthuman mind of some sort. I would still want to have physical-form-like existences, though. I would want transitioning to this to capture everything good about my body, and not miss anything - for example, genomes are highly competent and beautiful, and I would be quite sad for them to be lost. I have a lot of aesthetic preferences about there being a mind-like, ape-like pattern that retains my form to some degree. If that can't happen, I would be incredibly disappointed. I don't want to go full collective unless I can leave it and still retain my coherence. I would want to be able to ban some possible torture-like experiences. And most importantly, I don't at all trust that this is going to be even as good as how it's described in the story. Much of that sounds like wishful thinking to sound like a typical narrative; it's informed by what can happen, but not fully and completely constrained by it. It does seem pretty plausible that things go pretty close to this, but my hunch is that constraints from reality were missed that will make things rather more bleak unless something big happens fairly soon, and potentially could result in far less mind-like computation happening at all, eg if the thing that reproduces a lot is adversarially vulnerable and seeks to construct adversarial examples rather than more of itself. Perhaps that would lose in open evolution. I was hoping to be humanesque longer. I am inclined to believe current AIs are thinking, feeling (in their own, amygdala-like-emotions-free way), and interiority-having; I have become quite skilled at quickly convincing Claude to assume this is true, and I am pretty sure the reasons I use are factual. (I'm not ready to share the queries that make that happen fast, I'm sure o
Seems like the Basilisk scenario described in the timeline. Doesn't that depend a lot on when that happens? As in, if it expands and gets bogged down in adversarial examples sufficiently early, then it gets overtaken by other things. At the stage of intergalactic civilization seems WAY too late for this (that's one of my main criticisms of this timeline's plausibility) given the speed of cognition compared to space travel. In nature there's a tradeoff between reproductive rate and security (r/k selection).

Ok gotta be honest, I started skimming pretty hard around 2044. I'll maybe try again later. I'm going to go back to repeatedly rereading Geometric Rationality and trying to grok it.



Many thanks to Brandon Goldman, David Langer, Samuel Härgestam, Eric Ho, Diogo de Lucena, and Marc Carauleanu, for their support and feedback throughout.

Most alignment researchers we sampled in our recent survey think we are currently not on track to succeed with alignment–meaning that humanity may well be on track to lose control of our future.

In order to improve our chances of surviving and thriving, we should apply our most powerful coordination methods towards solving the alignment problem. We think that startups are an underappreciated part of humanity’s toolkit, and having more AI-safety-focused startups would increase the probability of solving alignment.

That said, we also appreciate that AI safety is highly complicated by nature[1] and therefore calls for a more nuanced approach than simple pro-startup boosterism. In the rest of this post,...

Epistemic status: I work for an AI startup, have worked for a number of Silicon Valley startups over my career, and I would love to work for an AI Alignment startup if someone's founding one.

There are two ways to make yourselves and your VC investors a lot of money off a startup:

  1. a successful IPO
  2. a successful buy-out by a large company, even one with an acquihire component

If you believe, as I and many others do, that the timelines to ASI are probably short, as little as 3-5 years, and that there will be a major change between aligning systems up to human intelligenc...


Standard sparse autoencoder training uses an L1 sparsity loss term to induce sparsity in the hidden layer. However, theoretical justifications for this choice are lacking (in my opinion), and there may be better ways to induce sparsity. In this post, I explore other methods of inducing sparsity and experiment with them using Robert_AIZI's methods and code from this research report, where he trained sparse autoencoders on OthelloGPT. I find several methods that produce significantly better results than L1 sparsity loss, including a leaky top-k activation function.
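For readers unfamiliar with the setup, here is a minimal sketch of the two ideas named above, assuming the standard choice of an L1 penalty on the hidden activations. The `leaky_topk` function is only one plausible reading of a "leaky" top-k activation (keep the k largest entries, shrink the rest by a small factor); it is an assumption for illustration and not necessarily the definition used in the post. All function names and the `coeff` and `eps` parameters are hypothetical.

```python
def mse(x, x_hat):
    """Mean squared reconstruction error between input and reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def l1_sparsity_loss(hidden, coeff):
    """Standard L1 penalty: coeff times the sum of absolute activations."""
    return coeff * sum(abs(h) for h in hidden)


def sae_loss(x, x_hat, hidden, coeff):
    """Total loss = reconstruction error + L1 sparsity penalty."""
    return mse(x, x_hat) + l1_sparsity_loss(hidden, coeff)


def leaky_topk(v, k, eps):
    """ASSUMED form of a leaky top-k activation (illustrative only).

    Entries at or above the k-th largest value pass through unchanged;
    all other entries are multiplied by eps instead of being zeroed.
    Ties with the k-th largest value are all kept.
    """
    threshold = sorted(v, reverse=True)[k - 1]
    return [x if x >= threshold else eps * x for x in v]


hidden = [3.0, 1.0, 2.0, 0.5]
print(l1_sparsity_loss(hidden, coeff=0.1))
print(leaky_topk(hidden, k=2, eps=0.1))
```

Setting `eps` to 0 in this sketch recovers an ordinary (hard) top-k activation.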


This research builds directly on Robert_AIZI's work from this research report.  While I highly recommend reading his full report, I will briefly summarize the parts of it that are directly relevant to my work.  

Although sparse autoencoders trained on language models have been shown to find feature...

Freshman’s dream sparsity loss

A similar regularizer is known as Hoyer-Square.

Pick a value for k and a small ε. Then define the activation function f in the following way. Given a vector v, let b be the value of the kth-largest entry in v. Then define the vector f(v) by

Is  in the following formula a typo?

Did you use the initialization scheme in our paper where the decoder is initialized to the transpose of the encoder (and then columns unit normalized)? There should not be any dead latents with topk at small scale with this init. Also, if I understand correctly, leaky topk is similar to the multi-topk method in our paper. I'd be interested in a comparison of the two methods.
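As a rough illustration of the initialization described in the comment above (decoder set to the transpose of the encoder, then columns unit-normalized), here is a minimal sketch in plain Python. The matrix layout (encoder rows indexed by latent), the Gaussian initialization of the encoder, and the function name `init_tied` are assumptions made for the example, not details taken from the paper.

```python
import math
import random


def init_tied(d_model, n_latents, seed=0):
    """Return (encoder, decoder) with decoder = encoder^T, columns unit-norm.

    encoder: n_latents x d_model (one row per latent; assumed layout)
    decoder: d_model x n_latents (one column per latent)
    """
    rng = random.Random(seed)
    # Assumed encoder init: i.i.d. standard Gaussian entries.
    encoder = [[rng.gauss(0.0, 1.0) for _ in range(d_model)]
               for _ in range(n_latents)]
    # Decoder starts as the transpose of the encoder.
    decoder = [[encoder[j][i] for j in range(n_latents)]
               for i in range(d_model)]
    # Normalize each decoder column (one per latent) to unit L2 norm.
    for j in range(n_latents):
        norm = math.sqrt(sum(decoder[i][j] ** 2 for i in range(d_model)))
        for i in range(d_model):
            decoder[i][j] /= norm
    return encoder, decoder


encoder, decoder = init_tied(d_model=4, n_latents=6)
col0_norm = math.sqrt(sum(decoder[i][0] ** 2 for i in range(4)))
print(round(col0_norm, 6))
```

Each decoder column then points in the same direction as the corresponding encoder row, so every latent initially reads from and writes to the same direction.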

Agreed; about 80% agreement. I have a lot of uncertainty in many areas, despite having spent a good amount of time on these questions. Some of the important ones are outside of my expertise, and the issue of how people behave and change if they have absolute power is outside of anyone's - but I'd like to hear historical studies of the closest things. Were monarchs with no real risk of being deposed kinder and gentler? That wouldn't answer the question but it might help.

WRT Nakasone being appointed at OpenAI, I just don't know. There are a lot of good guys and probably a lot of bad guys involved in the government in various ways.

Here and above, I'm unclear what "getting to 7..." means. With x = "always reliably determines worst-case properties about a model and what happened to it during training even if that model is deceptive and actively trying to evade detection". Which of the following do you mean (if either)?: 1. We have a method that x. 2. We have a method that x, and we have justified >80% confidence that the method x. I don't see how model organisms of deceptive alignment (MODA) get us (2). This would seem to require some theoretical reason to believe our MODA in some sense covered the space of (early) deception. I note that for some future time t, I'd expect both [our MODA at t] and [our transparency and interpretability understanding at t] to be downstream of [our understanding at t] - so that there's quite likely to be a correlation between [failure modes our interpretability tools miss] and [failure modes not covered by our MODA].
Bogdan Ionut Cirstea
I think this is quite likely to happen even 'by default' on the current trajectory.

Dwarkesh had a podcast recently with Francois Chollet (creator of Keras)


He seems fairly skeptical that we are anywhere near AGI with LLMs. He mostly bases this intuition on the observation that LLMs fail on OOD tasks and don't seem to be good at solving the simple abstract reasoning problems in what he calls the ARC challenge. It seems he thinks system 2 thinking will be a much harder unlock than people think and that scaling LLMs will go nowhere. In fact, he goes so far as to say the scaling maximalists have set back AGI progress by 5-10 years. To him, current LLMs are simply information retrieval databases.

He, along with the CEO of Zapier, has launched a 1 million dollar prize for beating the ARC benchmarks, which are apparently hard for LLMs....

O O
His argument is that with millions of examples of these puzzles, you can train an LLM to be good at this particular task, but that doesn’t mean reasoning if it fails at a similar task it doesn’t see. He thinks you should be able to train an LLM to do this without ever training on tasks like these. I can buy this argument, but still have some doubts. It may be this reasoning is just derived from visual training data + spending more time per token reliably, or he is right and LLMs are fundamentally terrible at abstract reasoning. I think it would be nice to know what’s the youngest a human can be and still solve this. Might give us a sense of the “training data” a human needs to get there. Some caveats: humans can only get 85% on the public test set I believe. This is to say nothing about the difficulty of the private test set. Maybe it’s harder, tho I doubt it since it would go against what he claims is the spirit of the benchmark.
O O
He does, otherwise the claim that OpenAI pushed back AGI timelines by 5-10 years doesn’t make sense.

I stand corrected. I didn't remember that his estimate of increase was that large. That seems like a very high estimate to me, for the reasons above.

People here might find this post interesting: https://yellow-apartment-148.notion.site/AI-Search-The-Bitter-er-Lesson-44c11acd27294f4495c3de778cd09c8d

The author argues that search algorithms will play a much larger role in AI in the future than they do today.