‘Feature’ is overloaded terminology
In the interpretability literature, it’s common to overload ‘feature’ to mean three separate things:
1. Some distinctive, relevant attribute of the input data. For example, being in English is a feature of this text.
2. A general pattern of activity across units that a network uses to represent some feature in the first sense. So we might say that a network represents the feature (1) that the text is in English with a linear feature (2) in layer 32.
3. The elements learned by some process for discovering features, like an SAE. An SAE learns a dictionary of activation vectors, which we hope correspond to the features (2) that the network actually uses. It is common to simply refer to the elements of the SAE dictionary as ‘features’ as well. For example, we might say something like ‘the network represents features of the input with linear features in its representation space, which are recovered well by feature 3242 in our SAE’.
This seems bad; at best it’s a bit sloppy and confusing, and at worst it begs the question about the interpretability or usefulness of SAE features. These three things may not coincide, so we think it’s worth making an effort to distinguish them by giving them different names. A terminology that we prefer is to reserve ‘feature’ for the conceptual senses of the word (1 and 2) and use alternative terminology for case 3, like ‘SAE latent’ instead of ‘SAE feature’. So we might say, for example, that a model uses a linear representation for a Golden Gate Bridge feature, which is recovered well by an SAE latent. We have tried to follow this terminology in our recent Gemma Scope report.

We think that this added precision helps us think more clearly about feature representations and SAEs. To illustrate with an example, we might ask whether a network has a feature for numbers - i.e., whether it has any kind of localized representation of numbers at all. We can then ask what the format of this feature representation is; for example, how many dimensions does it use, where is it located in the model, does it have any kind of geometric structure, etc. We could then separately ask whether it is discovered by a feature discovery algorithm; i.e., is there a latent in a particular SAE that describes it well? We think it’s important to recognise that these are all distinct questions, and to use a terminology that is able to distinguish between them.
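As a rough illustration of why senses 2 and 3 are separate objects, here is a minimal sketch that treats the direction a model uses for a feature and the rows of an SAE dictionary as different things, and asks how well the closest latent matches. Everything in it is a made-up toy: the 4-dimensional space, the `feature_direction` vector, the `decoder_rows` values and the helper names are all assumptions for illustration, not taken from any real model or SAE. Cosine similarity between directions is also only one crude proxy for ‘recovered well’; a fuller check would look at when the latent actually activates.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_matching_latent(feature_direction, decoder_rows):
    """Return (index, cosine) of the dictionary element closest to the direction."""
    sims = [cosine(feature_direction, row) for row in decoder_rows]
    best = max(range(len(sims)), key=lambda i: sims[i])
    return best, sims[best]


# Sense 2 (assumed): the direction the model uses to represent some feature (1).
feature_direction = [1.0, 0.0, 0.0, 0.0]

# Sense 3 (assumed): a toy SAE dictionary, one row per SAE latent.
decoder_rows = [
    [0.0, 1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]

index, similarity = best_matching_latent(feature_direction, decoder_rows)
print(f"closest SAE latent: {index}, cosine similarity: {similarity:.3f}")
```

In this toy case the closest latent is close to the feature direction but not identical to it, which is exactly the gap that calling both of them ‘features’ hides.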
We (the DeepMind language model interpretability team) started using this terminology in the Gemma Scope report, but we didn’t really justify the decision much there and I thought it was worth making the argument separately.
But they're not atomic! See e.g. the phenomenon of feature splitting, and the fact that UMAP finds structure between semantically similar features.
(In fairness, atoms are also not very atomic)
Thanks for writing this up, Lewis! I'm very happy with this change. I think the term "SAE feature" is kinda sloppy and anti-conducive to clear thinking, and I hope the rest of the field adopts this too.
There is a persistent tendency, when articulating the benefits of developing AGI, to focus on medical benefits, like curing all diseases. An example of this is Jack Clark’s recent talk - all the actual examples of AI usage in his article are generic white-collar work (research for blogs, coding), but in his speculative story at the end about the amazing benefits of AGI, it’s all curing diseases again.
I find this rhetorical device a bit dishonest and glib. I’m sure there are people working on healthcare stuff (I know that there’s a decent amount at Alphabet, where I work) but let’s be real: most AI research investment is not going into health stuff, it’s going into automating coding and white-collar work, stuff like generative media, or trying to automate RSI. Now, there is of course an intellectual argument for why investing loads in getting LLMs to write code will eventually cure cancer (LLMs write loads of code -> LLMs write code for recursive self-improvement -> achieve superintelligence -> ask the superintelligence to cure cancer for you).
If you really want to cure all diseases, maybe you have some galaxy-brained argument that this actually is the best approach. There are some arguments in favour: maybe it’s easier to get people to invest 19 squillion dollars in building AGI than in curing cancer? But obviously there are some downsides too: the whole ‘causing chaos and re-ordering our entire society and maybe killing everyone’ thing that Jack also mentions briefly in his talk.
Also, curing diseases very much comes at the end of this process, whereas a lot of the bad or ambiguous stuff happens before. This is what I mean by ‘curing all diseases in the streets, automating B2B SaaS in the sheets’ - if what you are actually doing, day to day, is making software and some kinds of research and mathematical theorems really cheap, or trying to get everyone on the planet to make decisions by talking to the same computer program, or whatever, then maybe we should think about the effects of that, and whether that is good or bad, and how that will affect our society, as much as we focus on how great it will be when the singularity cures cancer. For example, no one seems to think that AlphaFold is in danger of turning into the singularity, but AlphaFold is probably more likely to cure your disease than Claude is pre-superintelligence. Medicine is a kind of knowledge work, but is it as amenable to automation and acceleration as other kinds? It has slow feedback loops, it’s messy, and it involves a lot of hands-on stuff. It’s probably much easier to make big advances in mathematics using LLMs than to cure a disease using them.
Thinking about the short-term societal effects of AGI is really important, in my view, because the state of society during the singularity, if it happens, is pretty critical. If it’s going to be massively destabilised and unhappy, that matters. I think that Nate Silver’s argument on this is quite persuasive.
Jack’s talk said we need to ‘explore the future, or retreat from the present’. It’s important to think about the future. But it’s also important to understand what you are actually doing, now, and what the nature of the company that’s about to make you a multi-billionaire actually is. Anthropic is not worth a trillion dollars because they have cured cancer; they are worth a trillion dollars because they are trying to automate knowledge work. Gesturing vaguely at amazing stuff you think might happen once all the dust has settled is its own way of retreating from the present.
Google definitely has the best track record when it comes to using AI for scientific and technological progress that could become medical progress. Right now OpenAI seems more focused on breakthroughs in maths and physics (Erdős problems, hiring string theorist Xi Yin). Apart from Bryan Johnson, the billionaire who seems to be most into medicine might be Zuckerberg - didn't he and his wife start a foundation aimed at healthy centenarian lifespans?
Marc Andreessen claimed that even doctors are conferring with AI during consultations. So it seems part of the medical pitch is also that AI is a leap in the quality of the research and diagnosis of your condition that you can do yourself... I'm sure it's working out that way for many people, while for a few others it's going astray... The combination of AI power and tech billionaire investment (in space, robotics, fusion energy...) has definitely created a new culture of ambitions for progress on numerous fronts, including medicine.
There may be all kinds of disturbing or exasperating features of this new culture (e.g. billionaires being the new public intellectuals but always talking their investment book, as Eric Weinstein recently put it; and the penumbra of hype surrounding startups most of which will fail), but I do think that collectively it represents a new era of technical progress on broad fronts including medicine. It's just a matter of trying to understand how it works.