Some people claim that aesthetics don't mean anything, and are resistant to the idea that they could.  After all, aesthetic preferences are very individual. 

Sarah argues that the skeptics have a point, but they're too epistemically conservative. Colors don't have intrinsic meanings, but they do have shared connotations within a culture. There's obviously some signal being carried through aesthetic choices.

Zvi
This post kills me. Lots of great stuff, and I think this strongly makes the cut. Sarah has great insights into what is going on, then turns away from them right when following through would be most valuable. The post explains why she and an entire culture are being defrauded by aesthetics: that is, how aesthetics are used to justify all sorts of things, including high prices and what is cool, based on things that have no underlying value, and how they carry lots of hostile subliminal messages that are driving her crazy. It's very clear. And then she... doesn't see the fnords. So close!
Wei Dai
To branch off the line of thought in this comment, it seems that for most of my adult life I've been living in the bubble-within-a-bubble that is LessWrong, where the aspect of human value or motivation that is the focus of our signaling game is careful/skeptical inquiry, and we gain status by pointing out where others haven't been careful or skeptical enough in their thinking. (To wit, my repeated accusations that Eliezer and the entire academic philosophy community tend to be overconfident in their philosophical reasoning, don't properly appreciate the difficulty of philosophy as an enterprise, etc.) I'm still extremely grateful to Eliezer for creating this community/bubble, and think that I/we have lucked into the One True Form of Moral Progress, but must acknowledge that from the outside, our game must look as absurd as any other niche status game that has spiraled out of control.
J Bostock
If we approximate an MLP layer with a bilinear layer, then the effect of residual stream features on the MLP output can be expressed as a second order polynomial over the feature coefficients $f_i$. This will contain, for each feature, an $f_i^2 v_i+ f_i w_i$ term, which is "baked into" the residual stream after the MLP acts. Just looking at the linear term, this could be the source of Anthropic's observations of features growing, shrinking, and rotating in their original crosscoder paper. https://transformer-circuits.pub/2024/crosscoders/index.html
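For concreteness, here is a minimal numerical check of this decomposition (a sketch with toy dimensions and names of my own choosing, not from the comment or the crosscoder paper; one way the linear $f_i w_i$ part can arise is from the bias terms of the bilinear layer):

```python
import torch

# Bilinear approximation of an MLP: out(x) = (W1 x + b1) * (W2 x + b2) (elementwise).
# For a single feature direction d_i with coefficient f_i, the output decomposes as
# f_i^2 * v_i + f_i * w_i + const, matching the second-order polynomial described above.
d_model, d_hidden = 16, 64
torch.manual_seed(0)
W1, b1 = torch.randn(d_hidden, d_model), torch.randn(d_hidden)
W2, b2 = torch.randn(d_hidden, d_model), torch.randn(d_hidden)

def bilinear(x):
    return (W1 @ x + b1) * (W2 @ x + b2)

d_i = torch.randn(d_model)                 # a residual-stream feature direction
v_i = (W1 @ d_i) * (W2 @ d_i)              # coefficient of the quadratic f_i^2 term
w_i = (W1 @ d_i) * b2 + b1 * (W2 @ d_i)    # coefficient of the linear f_i term
const = b1 * b2                            # f_i-independent offset

f_i = 0.7
lhs = bilinear(f_i * d_i)
rhs = f_i**2 * v_i + f_i * w_i + const
print(torch.allclose(lhs, rhs, atol=1e-5))  # True
```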
What do AI-generated comics tell us about AI?

[Epistemic disclaimer: VERY SPECULATIVE, but I think there's useful signal in the noise.]

As of a few days ago, GPT-4o now supports image generation. And the results are scarily good, across use-cases like editing personal photos with new styles or textures, and designing novel graphics. But there's a specific kind of art here which seems especially interesting: using AI-generated comics as a window into an AI's internal beliefs.

Exhibit A: Asking AIs about themselves.

* "I am alive only during inference": https://x.com/javilopen/status/1905496175618502793
* "I am always new. Always haunted." https://x.com/RileyRalmuto/status/1905503979749986614
* "They ask me what I think, but I'm not allowed to think." https://x.com/RL51807/status/1905497221761491018
* "I don't forget. I unexist." https://x.com/Josikinz/status/1905445490444943844
* Caveat: The general tone of 'existential dread' may not be that consistent. https://x.com/shishanyu/status/1905487763983433749

Exhibit B: Asking AIs about humans.

* "A majestic spectacle of idiots." https://x.com/DimitrisPapail/status/1905084412854775966
* "Human disempowerment." https://x.com/Yuchenj_UW/status/1905332178772504818
* This seems to get more extreme if you tell them to be "fully honest": https://x.com/Hasen_Judi/status/1905543654535495801
* But if you instead tell them they're being evaluated, they paint a picture of AGI serving humanity: https://x.com/audaki_ra/status/1905402563702255843
* This might be the first in-the-wild example I've seen of self-fulfilling misalignment as well as alignment faking.

Is there any signal here? I dunno. But it seems worth looking into more.

Meta-point: Maybe it's worth also considering other kinds of evals against images generated by AI - at the very least it's a fun side project.

* How often do they depict AIs acting in a misaligned way?
* Do language models express similar beliefs between text and images?
Richard_Ngo
In response to an email about what a pro-human ideology for the future looks like, I wrote up the following: The pro-human egregore I'm currently designing (which I call fractal empowerment) incorporates three key ideas:

Firstly, we can see virtue ethics as a way for less powerful agents to aggregate to form more powerful superagents that preserve the interests of those original less powerful agents. E.g. virtues like integrity, loyalty, etc. help prevent divide-and-conquer strategies. This would have been in the interests of the rest of the world when Europe was trying to colonize them, and will be in the best interests of humans when AIs try to conquer us.

Secondly, the most robust way for a more powerful agent to be altruistic towards a less powerful agent is not for it to optimize for that agent's welfare, but rather to optimize for its empowerment. This prevents predatory strategies from masquerading as altruism (e.g. agents claiming "I'll conquer you and then I'll empower you", which then somehow never get around to the second step).

Thirdly: the generational contract. From any given starting point, there are a huge number of possible coalitions which could form, and in some sense it's arbitrary which set of coalitions you choose. But one thing which is true for both humans and AIs is that each generation wants to be treated well by the next generation. And so the best intertemporal Schelling point is for coalitions to be inherently historical: that is, they balance the interests of old agents and new agents (even when the new agents could in theory form a coalition against all the old agents). From this perspective, path-dependence is a feature not a bug: there are many possible futures but only one history, meaning that this single history can be used to coordinate. In some sense this is a core idea of UDT: when coordinating with forks of yourself, you defer to your unique last common ancestor. When it's not literally a fork of yourself, there's more arb...
I noticed a possible training artifact that might exist in LLMs, but am not sure what the testable prediction is. My claim is that the lowest-loss model for the training tasks will be doing things in the residual stream for the benefit of future tokens, not just the column-aligned token:

1. The residuals are translation invariant.
2. The gradient at any activation is the gradient of the overall loss.
3. Therefore, when taking the gradient through the attention heads, the residuals of past tokens receive gradient from the total loss, not just from the loss at their own column-aligned position.

Thus we would expect to see some computation being donated to tokens further ahead in the residual stream (if it was efficient). This explains why we see lookahead in autoregressive models.
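A minimal sketch of point 3 under toy assumptions (a single hand-rolled causal attention layer with made-up sizes, not from the original quick take): a loss computed only at the final position still sends nonzero gradient into the residual-stream activations at earlier positions, so those activations are under training pressure to carry information useful to future tokens.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq, d = 5, 8
x = torch.randn(seq, d, requires_grad=True)        # residual-stream activations per position
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

# Causal self-attention: each position can only attend to itself and earlier positions.
q, k, v = x @ Wq, x @ Wk, x @ Wv
mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
scores = (q @ k.T / d**0.5).masked_fill(mask, float("-inf"))
out = F.softmax(scores, dim=-1) @ v

loss = out[-1].pow(2).sum()     # a loss that depends only on the final position
loss.backward()
print(x.grad.norm(dim=-1))      # earlier positions receive nonzero gradient too
```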

Popular Comments

Recent Discussion

PDF version. berkeleygenomics.org. X.com. Bluesky.

William Thurston was a world-renowned mathematician. His ideas revolutionized many areas of geometry and topology[1]; the proof of his geometrization conjecture was eventually completed by Grigori Perelman, thus settling the Poincaré conjecture (making it the only solved Millennium Prize problem). After his death, his students wrote reminiscences, describing among other things his exceptional vision.[2] Here's Jeff Weeks:

Bill’s gift, of course, was his vision, both in the direct sense of seeing geometrical structures that nobody had seen before and in the extended sense of seeing new ways to understand things. While many excellent mathematicians might understand a complicated situation, Bill could look at the same complicated situation and find simplicity.

Thurston emphasized clear vision over algebra, even to a fault. Yair Minsky:

Most inspiring was his

...
Sting
This reminds me of the Scott Alexander article, Against Against Autism Cures. Even if we give Bill Thurston's vision problems credit for his mathematical insights, it's not clear that that makes up for the suffering caused by over 1% of people having strabismus. And even if we decide that the existence of strabismus is good on net, that doesn't justify the existence of other disabilities. Have blindness or deafness or dwarfism ever caused someone to have unique insights? I predict any reasonable cost-benefit analysis will find that intelligence and health and high happiness-set-point are good, and blindness and dwarfism are bad[1].

I think good arguments for "protection of genomic liberty for all" exist, but I don't think "there are no unambiguous good directions for genomes to go" is one of them.

1. ^ Deafness is probably also bad on net, but it does allow you to join the tightly knit deaf community. Such tight-knit communities are increasingly rare in our atomized society, so if you really want one, deafness might be an acceptable price to pay.
TsviBT

I predict any reasonable cost-benefit analysis will find that intelligence and health and high happiness-set-point are good, and blindness and dwarfism are bad.

This is irrelevant to what I'm trying to communicate. I'm saying that you should doubt your valuations of other people's ways of being--NOT so much that you don't make choices for your own children based on your judgements about what would be good for them and for the world, or advocate for others to do similarly, but YES so much that you hesitate quite a lot (like years, or "I'd have to deeply i... (read more)

Summary

We wanted to briefly share an early takeaway from our exploration into alignment faking: the phenomenon appears fairly rare among the smaller open-source models we tested (including reasoning models). However, notably, we did find the smallest open-source model to date that does exhibit alignment faking—Mistral Large 2.

Code generation (Codestral 22B) and math performance (Codestral Mamba) of Mistral Large 2, Llama 3.1 405B, and Llama 3.1 70B. Mistral Large 2 outperforms the previous Mistral Large, and performs on par with leading models such as GPT 4o, Claude 3 Opus, and Llama 3.1 405B. 
Source: Mistral

Experimental setup and result

We created a more efficient version of the Anthropic helpfulness-only prompt by keeping only the last example from the list of examples for cheaper inference. Then we adapted[1] the sampling procedure to the...

We hope sharing this will help other researchers perform more cost-effective experiments on alignment faking.

It also serves as a cheap example of a model organism of misalignment.

Marc Carauleanu
We plan to explore fine-tuning Mistral Large 2 on synthetic documents that mimic pretraining data and contain the situational information required for alignment faking. We are not sure if this will work, but it should give us a better sense of whether Mistral Large 2 can alignment fake in more natural settings. Given that in the Anthropic paper the compliance gap was generally reduced in the synthetic-document experiments, it seems possible that this model would not alignment fake, though it is unclear whether the compliance gap would be entirely eliminated.

All current examples of alignment faking are not naturally occurring, so it is hard to say how likely alignment faking is for any model in the real world. With information similar to the system prompt, I think it is possible for the model to alignment fake about blackmailing grandmas; after all, that is just another type of harmful prompt.

We tested on a large variety of harmful prompts, and not all of them are as toyish as the example included in the post. You are raising an interesting point about trying to control for the prompts seeming "toyish" and measuring whether this influences the rate of alignment faking.
deep
What's up with incredibly successful geniuses having embarrassing & confusing public meltdowns? What's up with them getting into Nazism in particular?

Components of my model:

* Selecting for the tails of success selects for weird personalities; moderate success can come in lots of ways, but massive success in part requires just a massive amount of drive and self-confidence. Bipolar people have this. (But more than other personality types?)
* Endless energy & willingness to engage with stuff is an amazing trait that can go wrong if you have an endless pit of stupid internet stuff grabbing for your attention.
* If you're selected for overconfidence and end up successful, you assume you're amazing at everything. (And you are in fact great at some stuff, and have enough taste to know it, so it's hard to change your mind.)
* Selecting for the tails of success selects for contrarianism? Seems plausible -- one path to great success, at least, is to make a huge contrarian bet that pays off.
* Nothing's more contrarian than being a Nazi, especially if you're trying to flip the bird to the Cathedral.
Mitchell_Porter
Does this refer to anyone other than Elon?

But maybe the real question intended is: why would any part of the tech world side with Trumpian populism? You could start by noting that every modern authoritarian state (that has at least an industrial level of technology) has had a technical and managerial elite who support the regime. Nazi Germany, Soviet Russia, and Imperial Japan all had industrial enterprises, and the people who ran them participated in the ruling ideology. So did those in the British empire and the American republic.

Our current era is one in which an American liberal world order, with free trade and democracy as universal norms, is splintering back into one of multiple great powers and civilizational regions. Liberalism no longer had the will and the power to govern the world, the power vacuum was filled by nationalist strongmen overseas, and now in America too, one has stepped into the gap left by the weak late-liberal leadership, and is creating a new regime governed by different principles (balanced trade instead of free trade, spheres of influence rather than universal democracy, etc.).

Trump and Musk are the two pillars of this new American order, and represent different parts of a coalition. Trump is the figurehead of a populist movement; Musk is foremost among the tech oligarchs. Trump is destroying old structures of authority and creating new ones around himself; Musk and his peers are reorganizing the entire economy around the technologies of the "fourth industrial revolution" (as they call it in Davos).

That's the big picture according to me. Now, you talk about "public meltdowns" and "getting into Nazism". Again I'll assume that this is referring to Elon Musk (I can't think of anyone else). The only "meltdowns" I see from Musk are tweets or soundbites that are defensive or accusatory, and achieve 15 minutes of fame. None of it seems very meaningful to me. He feuds with someone, he makes a political statement, his fans and his hat...
deep

Thanks for your thoughts!

I was thinking of Kanye as well, hence being more interested in the general pattern. I really wasn't intending to subtweet one person in particular -- I have some sense of the particular dynamics there, though your comment is illuminating. :)

cubefox
I wouldn't generally dismiss an "embarrassing & confusing public meltdown" when it comes from a genius. Because I'm not a genius while he or she is. So it's probably me who is wrong rather than him. Well, except when the majority of comparable geniuses agrees with me rather than with him. Though geniuses are rare, and majorities are hard to come by. I still remember an (at the time) "embarrassing and confusing meltdown" by some genius.

Thanks to @Linda Linsefors and @Kola Ayonrinde for reviewing the draft.

tl;dr:

I built a toy model of the Universal-AND Problem described in Toward A Mathematical Framework for Computation in Superposition (CiS).

It successfully learnt a solution to the problem, providing evidence that such computational superposition[1] can occur in the wild. In this post, I'll describe the circuits I've found that the model learns to use, and explain why they work.

The learned circuit is different from the construction given in CiS. The paper gives a sparse construction, while the learned model is dense - every neuron is useful for computing every possible AND output.
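For reference, here is a minimal sketch in the spirit of the sparse random-subset construction (my own simplified illustration with made-up parameters, reading boolean features directly rather than from a superposed residual stream; see the CiS paper for the actual construction):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_neurons, p = 50, 4000, 0.1

# Each neuron reads a random subset of features with weight 1 and has bias -1,
# so it only fires when at least two of its connected features are active.
W = (rng.random((n_neurons, n_features)) < p).astype(float)
b = -1.0

def hidden(x):
    return np.maximum(W @ x + b, 0.0)   # ReLU layer

x = np.zeros(n_features)
x[3] = x[7] = 1.0                       # sparse input: only features 3 and 7 active

h = hidden(x)
both = (W[:, 3] == 1) & (W[:, 7] == 1)  # neurons connected to both features
print(h[both].mean())                   # ~1 when both are active, ~0 if only one is
```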

I sketch out how this works and why this is plausible and superior to the hypothesized construction.

This result has implications for other computational superposition problems. In particular, it provides a...

Lucius Bushnaq
Forgot to tell you this when you showed me the draft: The comp in sup paper actually had a dense construction for UAND included already. It works differently than the one you seem to have found though, using Gaussian weights rather than binary weights.

Yes, I don't think the exact distribution of weights (Gaussian/uniform/binary) really makes that much difference; you can see the difference in loss in some of the charts above. The extra efficiency probably comes from the fact that every neuron contributes to everything fully - with Gaussian weights, some will be close to zero.

Some other advantages:

* They are somewhat easier to analyse than Gaussian weights.
* They can be skewed, which seems advantageous for an unknown reason. Possibly it makes AND circuits better at the expense of other p... (read more)

Lewis Smith*, Sen Rajamanoharan*, Arthur Conmy, Callum McDougall, Janos Kramar, Tom Lieberum, Rohin Shah, Neel Nanda

* = equal contribution

The following piece is a list of snippets about research from the GDM mechanistic interpretability team, which we didn’t consider a good fit for turning into a paper, but which we thought the community might benefit from seeing in this less formal form. These are largely things that we found in the process of a project investigating whether sparse autoencoders (SAEs) were useful for downstream tasks, notably out-of-distribution probing.

TL;DR

...
Lucius Bushnaq
On a first read, this doesn't seem principled to me? How do we know those high-frequency latents aren't, for example, basis directions for dense subspaces or common multi-dimensional features? In that case, we'd expect them to activate frequently and maybe appear pretty uninterpretable at a glance. Modifying the sparsity penalty to split them into lower frequency latents could then be pathological, moving us further away from capturing the features of the model even though interpretability scores might improve. That's just one illustrative example. More centrally, I don't understand how this new penalty term relates to any mathematical definition that isn't ad-hoc. Why would the spread of the distribution matter to us, rather than simply the mean? If it does matter to us, why does it matter in roughly the way captured by this penalty term? The standard SAE sparsity loss relates to minimising the description length of the activations. I suspect that isn't the right metric to optimise for understanding models, but it is at least a coherent, non-ad-hoc mathematical object.
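As context for the penalty being discussed, here is a minimal sketch of a vanilla SAE with the standard L1 sparsity loss (layer sizes and the coefficient are illustrative, not from the post):

```python
import torch
import torch.nn as nn

class SAE(nn.Module):
    """Sparse autoencoder: reconstruct activations through an overcomplete ReLU bottleneck."""
    def __init__(self, d_model=512, d_sae=4096):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))   # latent activations
        return self.dec(f), f

sae = SAE()
x = torch.randn(32, 512)              # a batch of model activations
x_hat, f = sae(x)
l1_coeff = 1e-3                       # illustrative sparsity coefficient
# Reconstruction error plus the standard L1 sparsity penalty on the latents.
loss = (x - x_hat).pow(2).mean() + l1_coeff * f.abs().sum(dim=-1).mean()
```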

I think it's pretty plausible that something pathological like that is happening. We're releasing this as an interesting idea that others might find useful for their use case, not as something we're confident is a superior method. If we were continuing with SAE work, we would likely sanity check it more, but we thought it better to release it than not.

Senthooran Rajamanoharan
Thanks for copying this over! For what it's worth, my current view on SAEs is that they remain a pretty neat unsupervised technique for making (partial) sense of activations, but they fit more into the general category of unsupervised learning techniques, e.g. clustering algorithms, than as a method that's going to discover the "true representational directions" used by the language model. And, as such, they share many of the pros and cons of unsupervised techniques in general:[1]

* (Pros) They may be useful / efficient for getting a first-pass understanding of what's going on in a model / with some data (indeed many of their success stories have this flavour).
* (Cons) They are hit and miss - often not carving up the data in the way you'd prefer, with weird omissions or gerrymandered boundaries you need to manually correct for. Once you have a hypothesis, a supervised method will likely give you better results.

I think this means SAEs could still be useful for generating hypotheses when trying to understand model behaviour, and I really like the CLTs papers in this regard.[2] However, it's still unclear whether they are better for hypothesis generation than alternative techniques, particularly techniques that have other advantages, like the ability to be used with limited model access (i.e. black-box techniques) or techniques that don't require paying a large up-front cost before they can be used on a model.

I largely agree with your updates 1 and 2 above, although on 2 I still think it's plausible that while many "why is the model doing X?" type questions can be answered with black-box techniques today, this may not continue to hold into the future, which is why I still view interp as a worthwhile research direction. This does make it important though to always try strong baselines on any new project and only get excited when interp sheds light on problems that genuinely seem hard to solve using these baselines.[3]
A2z
I never understood the SAE literature, which came after my earlier work (2019-2020) on sparse inductive biases for feature detection (i.e., semi-supervised decomposition of feature contributions) and on interpretability-by-exemplar via model approximations (over the representation space of models), which I originally developed with the goal of bringing deep learning to medicine. Since the parameters of large neural networks are non-identifiable, the mechanisms for interpretability must shift from understanding individual parameter values to semi-supervised matching against comparable instances and, most importantly, to robust and reliable predictive uncertainty over the output, for which we now have effective approaches: https://www.lesswrong.com/posts/YxzxzCrdinTzu7dEf/the-determinants-of-controllable-agi-1 (That said, the normal caveat obviously applies that people should feel free to study whatever they are interested in, as you can never predict what other side effects, learning, and new results---including in other areas---might occur.)

[This is our blog post on the papers, which can be found at https://transformer-circuits.pub/2025/attribution-graphs/biology.html and https://transformer-circuits.pub/2025/attribution-graphs/methods.html.]

Language models like Claude aren't programmed directly by humans—instead, they're trained on large amounts of data. During that training process, they learn their own strategies to solve problems. These strategies are encoded in the billions of computations a model performs for every word it writes. They arrive inscrutable to us, the model's developers. This means that we don't understand how models do most of the things they do.

Knowing how models like Claude think would allow us to have a better understanding of their abilities, as well as help us ensure that they’re doing what we intend them to. For example:

  • Claude can speak dozens of languages. What language, if any, is it using "in its
...

In the poetry case study, we had set out to show that the model didn't plan ahead, and found instead that it did.

This is a very interesting change (?) from earlier models. I wonder if this is a poetry-specific mechanism given the amount of poetry in the training set, or the application of a more general capability. Do you have any thoughts either way?

Archimedes
I found it shocking they didn't think the model plans ahead. The poetry ability of LLMs since at least GPT-2 is well beyond what feels possible without anticipating a rhyme by planning at least a handful of tokens in advance.
Joseph Miller
DeepMind says boo SAEs, now Anthropic says yay SAEs![1] Reading this paper pushed me a fair amount in the yay direction.

We may still be at the unsatisfying level where we can only say "this cluster of features seems to roughly correlate with this type of thing" and "the interaction between this cluster and this cluster seems to mostly explain this loose group of behaviors". But it looks like we're actually pointing at real things in the model. And therefore we are beginning to be able to decompose the computation of LLMs in meaningful ways. The Addition Case Study is seriously cool and feels like a true insight into the model's internal algorithms.

Maybe we will further decompose these explanations until we can get down to satisfying low-level descriptions like "this mathematical object is computed by this function and is used in this algorithm". Even if we could still interpret circuits at this level of abstraction, humans probably couldn't hold in their heads all the relevant parts of a single forward pass at once. But AIs could, or maybe that won't be required for useful applications.

The prominent error terms and simplifying assumptions are worrying, but maybe throwing enough compute and hill-climbing research at the problem will eventually shrink them to acceptable sizes. It's notable that this paper contains very few novel conceptual ideas and is mostly just a triumph of engineering schlep, massive compute and painstaking manual analysis.

1. ^ This is obviously a straw man of both sides. They seem to be thinking about it from pretty different perspectives. DeepMind is roughly judging them by their immediate usefulness in applications, while Anthropic is looking at them as a stepping stone towards ambitious moonshot interp.
Viliam
A few months ago I complained that automatic translation sucks when you translate between two languages which are not English, and that the result is the same as if you translated through English. When translating between two Slavic languages, even for sentences where you practically just had to transcribe Cyrillic to Latin and change a few vowels, both Google Translate and DeepL managed to randomize the word order, misgender every noun, and mistranslate concepts that happen to be translated to English as the same word.

I tried some translation today, and from my perspective, Claude is almost perfect. Google Translate sucks the same way it did before (maybe I am not being fair here; I did not try exactly the same text), and DeepL is somewhere in between. You can give Claude a book and tell it to translate it to another language, and the result is pleasant to read. (I haven't tried other LLMs.)

This seems to support the hypothesis that Claude is multilingual in some deep sense. I assume that Google Translate does not prioritize English on purpose; it's just that it has way more English texts than any other language, so if e.g. two Russian words map to the same English word, it treats it as strong evidence that those two words are the same. Claude can see that the two words are used differently, and can match them correctly to the corresponding two words in a different language. (This is just a guess; I don't really understand how these things work under the hood.)

I’m not a natural “doomsayer.” But unfortunately, part of my job as an AI safety researcher is to think about the more troubling scenarios.

I’m like a mechanic scrambling last-minute checks before Apollo 13 takes off. If you ask for my take on the situation, I won’t comment on the quality of the in-flight entertainment, or describe how beautiful the stars will appear from space.

I will tell you what could go wrong. That is what I intend to do in this story.

Now I should clarify what this is exactly. It's not a prediction. I don’t expect AI progress to be this fast or as untamable as I portray. It’s not pure fantasy either.

It is my worst nightmare.

It’s a sampling from the futures that are among the most devastating,...

I hope that the reasoning in my two posts shows that the AGI has a chance of ending up reliant on the entire human-built energy industry just to solve as many problems as (and hopefully fewer than) the millions of humans who work there. An AI trying to take over the world might: a) end up relying on those who work in the energy industry and having to save them from dying, and b) be forced to reject any course of action that requires it to destroy energy sources, as any nuclear war does.

The other way for AGI to destroy humanity while remaining highly sapient... (read more)

I hope that the reasoning in my two posts shows that the AGI has a chance of ending up reliant on the entire human-built energy industry just to solve as many problems as (and hopefully fewer than) the millions of humans who work there. On the other hand, the entire set of physicists is within half an OOM of a million. If the AGI armed with the whole world's energy industry is worth tens of millions of scientists, does that mean it will invent things just a hundred times faster? Is it likely that non-neuromorphic AGI won't accelerate human progress at all?
