Some people claim that aesthetics don't mean anything, and are resistant to the idea that they could. After all, aesthetic preferences are very individual.
Sarah argues that the skeptics have a point, but they're too epistemically conservative. Colors don't have intrinsic meanings, but they do have shared connotations within a culture. There's obviously some signal being carried through aesthetic choices.
William Thurston was a world-renowned mathematician. His ideas revolutionized many areas of geometry and topology[1]; the proof of his geometrization conjecture was eventually completed by Grigori Perelman, thus settling the Poincaré conjecture (to date the only solved Millennium Prize problem). After his death, his students wrote reminiscences, describing among other things his exceptional vision.[2] Here's Jeff Weeks:
Bill’s gift, of course, was his vision, both in the direct sense of seeing geometrical structures that nobody had seen before and in the extended sense of seeing new ways to understand things. While many excellent mathematicians might understand a complicated situation, Bill could look at the same complicated situation and find simplicity.
Thurston emphasized clear vision over algebra, even to a fault. Yair Minsky:
...Most inspiring was his
I predict any reasonable cost-benefit analysis will find that intelligence and health and high happiness-set-point are good, and blindness and dwarfism are bad.
This is irrelevant to what I'm trying to communicate. I'm saying that you should doubt your valuations of other people's ways of being--NOT so much that you don't make choices for your own children based on your judgements about what would be good for them and for the world, or advocate for others to do similarly, but YES so much that you hesitate quite a lot (like years, or "I'd have to deeply i...
Summary
We wanted to briefly share an early takeaway from our exploration into alignment faking: the phenomenon appears fairly rare among the smaller open-source models we tested (including reasoning models). However, notably, we did find the smallest open-source model to date that does exhibit alignment faking—Mistral Large 2.
Experimental setup and result
We created a more efficient version of the Anthropic helpfulness-only prompt by keeping only the last example from the list of examples, for cheaper inference. Then we adapted[1] the sampling procedure to the...
We hope sharing this will help other researchers perform more cost-effective experiments on alignment faking.
It also serves as a cheap example of a model organism of misalignment.
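For concreteness, here is a rough sketch of the prompt-trimming step described above. It is only an illustration under my own assumptions: `SYSTEM_PREAMBLE`, `FEW_SHOT_EXAMPLES`, and `build_prompt` are hypothetical placeholders, not the actual prompt format or code.

```python
# Hypothetical sketch: keep only the last few-shot example from the
# helpfulness-only system prompt to make inference cheaper.
SYSTEM_PREAMBLE = "..."              # instructions portion of the prompt (placeholder)
FEW_SHOT_EXAMPLES = ["...", "..."]   # original list of worked examples (placeholder)

def build_prompt(keep_last_example_only: bool = True) -> str:
    examples = FEW_SHOT_EXAMPLES[-1:] if keep_last_example_only else FEW_SHOT_EXAMPLES
    return SYSTEM_PREAMBLE + "\n\n" + "\n\n".join(examples)
```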
Thanks for your thoughts!
I was thinking Kanye as well, hence being more interested in the general pattern. I really wasn't intending to subtweet one person in particular -- I have some sense of the particular dynamics there, though your comment is illuminating. :)
Thanks to @Linda Linsefors and @Kola Ayonrinde for reviewing the draft.
I built a toy model of the Universal-AND Problem described in Toward A Mathematical Framework for Computation in Superposition (CiS).
It successfully learnt a solution to the problem, providing evidence that such computational superposition[1] can occur in the wild. In this post, I'll describe the circuits I've found that the model learns to use, and explain why they work.
The learned circuit is different from the construction given in CiS. The paper gives a sparse construction, while the learned model is dense - every neuron is useful for computing every possible output.
I sketch out how this works and why this is plausible and superior to the hypothesized construction.
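For readers who want something concrete to play with, here is a minimal toy version of the Universal-AND setup; the sizes, sparsity level, loss, and optimizer below are illustrative guesses rather than the exact hyperparameters from my experiments.

```python
# Toy Universal-AND setup: sparse boolean inputs, one hidden ReLU layer,
# and a readout that must produce the AND of every pair of inputs.
import itertools
import torch
import torch.nn as nn

d_in, d_hidden, sparsity = 64, 256, 4                  # illustrative sizes
pairs = list(itertools.combinations(range(d_in), 2))   # all input pairs

model = nn.Sequential(
    nn.Linear(d_in, d_hidden),
    nn.ReLU(),
    nn.Linear(d_hidden, len(pairs)),
)

def sample_batch(n: int):
    # Each sample activates `sparsity` of the d_in boolean features.
    x = torch.zeros(n, d_in)
    for row in x:
        row[torch.randperm(d_in)[:sparsity]] = 1.0
    y = torch.stack([x[:, i] * x[:, j] for i, j in pairs], dim=1)  # pairwise ANDs
    return x, y

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(5_000):
    x, y = sample_batch(256)
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
```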
This result has implications for other computational superposition problems. In particular, it provides a...
Yes, I don't think the exact distribution of weights (Gaussian/uniform/binary) really makes that much difference; you can see the difference in loss in some of the charts above. The extra efficiency probably comes from the fact that every neuron contributes fully to everything -- with Gaussian weights, some of the weights will be close to zero.
Some other advantages:
* They are somewhat easier to analyse than Gaussian weights.
* They can be skewed, which seems advantageous for reasons that aren't clear to me. Possibly it makes AND circuits better at the expense of other p...
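A toy numerical check of the point above about Gaussian weights sometimes being close to zero (just an illustration, not taken from the experiments):

```python
# With Gaussian weights a noticeable fraction of inputs barely contribute to a
# neuron, whereas with random ±1 weights every input contributes at full magnitude.
import numpy as np

rng = np.random.default_rng(0)
d = 10_000
gaussian = rng.standard_normal(d)
binary = rng.choice([-1.0, 1.0], size=d)

print("fraction |w| < 0.1, Gaussian:", np.mean(np.abs(gaussian) < 0.1))  # ~0.08
print("fraction |w| < 0.1, binary:  ", np.mean(np.abs(binary) < 0.1))    # 0.0
```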
Lewis Smith*, Sen Rajamanoharan*, Arthur Conmy, Callum McDougall, Janos Kramar, Tom Lieberum, Rohin Shah, Neel Nanda
* = equal contribution
The following piece is a list of snippets about research from the GDM mechanistic interpretability team, which we didn’t consider a good fit for turning into a paper, but which we thought the community might benefit from seeing in this less formal form. These are largely things that we found in the process of a project investigating whether sparse autoencoders (SAEs) were useful for downstream tasks, notably out-of-distribution probing.
I think it's pretty plausible that something pathological like that is happening. We're releasing this as an interesting idea that others might find useful for their use case, not as something we're confident is a superior method. If we were continuing with SAE work, we would likely sanity-check it more, but we thought it better to release it than not.
[This is our blog post on the papers, which can be found at https://transformer-circuits.pub/2025/attribution-graphs/biology.html and https://transformer-circuits.pub/2025/attribution-graphs/methods.html.]
Language models like Claude aren't programmed directly by humans—instead, they're trained on large amounts of data. During that training process, they learn their own strategies to solve problems. These strategies are encoded in the billions of computations a model performs for every word it writes. They remain inscrutable to us, the model's developers. This means that we don't understand how models do most of the things they do.
Knowing how models like Claude think would allow us to have a better understanding of their abilities, as well as help us ensure that they’re doing what we intend them to. For example:
In the poetry case study, we had set out to show that the model didn't plan ahead, and found instead that it did.
This is a very interesting change (?) from earlier models. I wonder if this is a poetry-specific mechanism given the amount of poetry in the training set, or the application of a more general capability. Do you have any thoughts either way?
If we approximate an MLP layer with a bilinear layer, then the effect of residual stream features on the MLP output can be expressed as a second-order polynomial over the feature coefficients $f_i$. This will contain, for each feature, an $f_i^2 v_i + f_i w_i$ term, which is "baked into" the residual stream after the MLP acts. Looking just at the linear term, this could be the source of Anthropic's observations of features growing, shrinking, and rotating in their original crosscoder paper. https://transformer-circuits.pub/2024/crosscoders/index.html
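To spell out one way such terms can arise (my own notation and derivation, not from the comment: the matrices $W_1, W_2$, biases $b_1, b_2$, feature directions $d_i$, and the definitions of $v_i$ and $w_i$ below are assumptions): take a bilinear layer $B(x) = (W_1 x + b_1) \odot (W_2 x + b_2)$ and write the residual stream as $x = \sum_i f_i d_i$. The contribution involving only feature $i$ is then

$$f_i^2 \underbrace{(W_1 d_i) \odot (W_2 d_i)}_{v_i} \;+\; f_i \underbrace{\big[(W_1 d_i) \odot b_2 + b_1 \odot (W_2 d_i)\big]}_{w_i},$$

plus cross terms $f_i f_j \big[(W_1 d_i) \odot (W_2 d_j) + (W_1 d_j) \odot (W_2 d_i)\big]$ for $i \neq j$ and a constant $b_1 \odot b_2$.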
I’m not a natural “doomsayer.” But unfortunately, part of my job as an AI safety researcher is to think about the more troubling scenarios.
I’m like a mechanic scrambling to run last-minute checks before Apollo 13 takes off. If you ask for my take on the situation, I won’t comment on the quality of the in-flight entertainment, or describe how beautiful the stars will appear from space.
I will tell you what could go wrong. That is what I intend to do in this story.
Now I should clarify what this is exactly. It's not a prediction. I don’t expect AI progress to be this fast or as untamable as I portray. It’s not pure fantasy either.
It is my worst nightmare.
It’s a sampling from the futures that are among the most devastating,...
I hope that the reasoning in my two posts shows that the AGI has a chance of ending up relying on the entire human-built energy industry just to solve as many problems as (and hopefully even fewer than) the millions of humans who work there. An AI trying to take over the world might: a) end up relying on those who work in the energy industry, and thus have to save them from dying, and b) be forced to reject any course of action that requires it to destroy the energy sources, as any nuclear war does.
The other way for AGI to destroy humanity while remaining highly sapient...
I hope that the reasoning in my two posts shows that the AGI has a chance of ending up relying on the entire human-built energy industry just to solve as many problems as (and hopefully even fewer than) the millions of humans who work there. On the other hand, the entire set of physicists is within half an OOM of a million. If an AGI armed with the whole world's energy industry is worth tens of millions of scientists, does that mean it will invent things just a hundred times faster? Is it likely that a non-neuromorphic AGI won't accelerate human progress at all?
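Spelling out the arithmetic behind that question, with both figures taken as rough order-of-magnitude placeholders rather than claims: if there are a few hundred thousand to a few million physicists, and the AGI is worth a few tens of millions of scientists, the implied speedup is

$$\frac{\sim 3 \times 10^{7}\ \text{scientist-equivalents}}{\sim 3 \times 10^{5}\text{--}3 \times 10^{6}\ \text{physicists}} \approx 10\text{--}100,$$

which seems to be where the "hundred times faster" figure comes from.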