Over 200 research ideas for mechanistic interpretability, ML improving ML and the dangers of aligned artificial intelligence. Welcome to 2023 and a happy New Year from us at the ML & AI Safety Updates!

Watch this week's MLAISU on YouTube or listen to it on Spotify.

Mechanistic interpretability

The interpretability researcher Neel Nanda has published a massive list of 200 open and concrete problems in mechanistic interpretability. They’re split into the following categories:

  1. Analyzing toy models: Diving into models that are much smaller but trained the same way as large models. These are way easier to analyze than large models and he has made 12 small models available.
  2. Looking for circuits in the wild: Inspired by the paper “Interpretability in the Wild”, can we use mechanistic interpretability on real-life language models?
  3. Interpreting algorithmic problems: Models trained on algorithmic tasks tend to learn the algorithm as a clearly interpretable structure. We can for example observe that grokking happens when the algorithm generalizes within the network.
  4. Exploring polysemanticity and superposition: Superposition is when a network represents more features than it has neurons, so each feature is spread across multiple neurons and a single neuron can respond to several unrelated features (polysemanticity). This makes it hard to interpret what individual neurons represent. Can we find better ways to understand or mitigate this effect?
  5. Analyzing training dynamics: Understanding how models change over training is very interesting for identifying how and when capabilities emerge.
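To make the superposition idea from the list above a bit more concrete, here is a minimal sketch in plain Python. It is not taken from Neel Nanda's list or from any trained model: the three feature directions, the `encode` and `decode` helpers and the ReLU readout are all illustrative assumptions. It packs three features into two "neurons" and shows that one active feature is read back cleanly, while two active features interfere with each other.

```python
import math

# Three hypothetical feature directions packed into a 2-neuron layer,
# spaced 120 degrees apart (an illustrative choice, not a trained model).
N_FEATURES = 3
DIRECTIONS = [
    (math.cos(2 * math.pi * i / N_FEATURES), math.sin(2 * math.pi * i / N_FEATURES))
    for i in range(N_FEATURES)
]

def encode(features):
    """Project a 3-feature vector down to 2 neuron activations."""
    return tuple(
        sum(f * d[k] for f, d in zip(features, DIRECTIONS)) for k in range(2)
    )

def decode(neurons):
    """Read each feature back with a dot product followed by a ReLU."""
    return [
        max(0.0, sum(n * dk for n, dk in zip(neurons, d))) for d in DIRECTIONS
    ]

# Sparse input: only feature 0 is active.
one_active = decode(encode([1.0, 0.0, 0.0]))
# Denser input: features 0 and 1 are active at the same time.
two_active = decode(encode([1.0, 1.0, 0.0]))

print([round(v, 3) for v in one_active])
print([round(v, 3) for v in two_active])
```

Note that the first neuron has a non-zero weight for all three features, so looking at that neuron alone does not tell you which feature is present. That is the interpretation problem the category is about.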

These are great projects to go for and we’re collaborating with Neel Nanda to run a mechanistic interpretability hackathon on the 20th of January! As Lawrence Chan mentions in a new post, we need to touch reality as soon as possible, and these hackathons are a great way to get fast and concrete research results. You can join us, but you can also run a local hackathon site!

ML improving ML

Thomas Woodside summarizes a collaborative project to map cases where ML systems are self-improving. There are already 11 different major research projects that have shown machine learning systems used to improve other systems and we assume that there is much more happening behind the scenes since these are only published papers.

Several of the projects use models to create data that another model is fine-tuned on, while a few relate to speed-ups in running and developing machine learning systems. These include using ML to better optimize GPUs, optimizing compilers and helping humans spot flaws in a large language model (LLM) using another LLM.

A concrete example of data generation and fine-tuning is a paper from Microsoft and MIT showing that an LLM can be used to generate programming puzzles, and that a programming LLM fine-tuned on these puzzles improves a lot.

With ML already reaching this level, we have to make sure that there are good introductions to ML safety for academics and engineers to understand the prominent issues with AI development. Vael Gates and Collin Burns try to identify the best intro texts by asking 28 ML researchers which of eight texts they prefer. They find that the best resource is Jacob Steinhardt’s “More is Different for AI” blog posts.

In these posts, Steinhardt explores two ways of looking at ML safety: philosophy and engineering. He mentions that the engineering approach preferred by ML academia is underrated from the philosophical side and that the philosophical side (represented by Superintelligence) is significantly undervalued from the engineering perspective.

An important point of these posts is how future AI systems will be qualitatively different from current AI systems and that this results in weird emergent behaviour.

Aligned AGI vs. unaligned AGI

In “The Case Against AI Alignment”, Andrew Sauer argues that the greatest risk of an unaligned artificial general intelligence is that humanity goes extinct, while an aligned system could lead to extreme suffering for a minority or for simulated beings. The argument is based on the inherent outgroup hatred of human psychology.

This comes at a time when the field of alignment is growing rapidly in response to the systems that have been released in the past year. One of the most important tasks of the sub-field of alignment concerned with value alignment is also to figure out whose values to align to, something that few have grappled with until now.

Responses to Sauer’s piece accept the importance of figuring out these questions but reject the hypothesis that we should accept the death of all humans because there “might” be a highly risky outcome. Additionally, human-inflicted suffering is not a stable state in the way extinction is, which means it has much less relevance on larger timescales than one might expect.

Deep learning research and other news

In other news…

  • Jacques Thibodeau finds limitations in the recent ROME paper that claims to “modify factual associations” by updating weights in the multilayer perceptrons of Transformers. Thibodeau finds that it’s mostly editing word relations and not factual associations between concepts.
  • The paper “Discovering latent knowledge in language models without supervision” extracts neural network activations to map whether they correspond to a “yes” or “no” answer to questions. Even when the models are prompted to give the wrong answer, the method can still tell from a model’s activations that it knows the right answer, something other methods are not capable of. The work was extended by the winners of the AI testing hackathon, who used the method to understand models trained on the ETHICS dataset containing ambiguous ethical situations.
  • A new paper dives into what vision transformers (computer vision models) learn. An interesting finding is that models trained with language supervision (like CLIP) learn more semantic features such as “morbidity” as opposed to visual features like “roundness”.
  • Millidge and Winsor summarize an array of basic properties of language model internals such as similar distributions between multiple layers’ weights and biases.
  • Ringer writes that models do not “get reward” and that the analogy of a dog receiving biscuits is not accurate. We have to remember that the models are changed to correspond more to high-reward outcomes but are otherwise unaware of the reward.
  • A post explores how current large language models are very close to being artificial general intelligence if we compare their text-based abilities to those of people like the amazing Helen Keller, who was both deaf and blind. For example, translating the world, audio and visuals into words would make the models highly capable in these domains as well.
  • A post questions the focus on expected utility maximization as a big risk with ML and AI systems, describing how 1) humans are not expected utility maximizers (EUM), 2) there are non-EUM systems that can become generally intelligent and 3) we do not know how to train EUM systems. Scott Garrabrant answers that utility theory seems to have been a theoretical mistake, which is quite a strong claim.
  • The team behind Elicit, a scientific tool for exploring existing research, has developed a method to split tasks into subtasks that significantly improves performance on advanced description tasks. Decomposing tasks like this makes the model’s choices more interpretable and has interesting implications for future research in the same direction.
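The point above that models do not “get reward” can be illustrated with a small sketch. This is a toy two-action REINFORCE update written for this summary, not code from Ringer’s post. The names `policy_probs` and `reinforce_step`, the learning rate and the reward values are all assumptions. The thing to notice is that the policy’s forward pass never receives the reward as input. The reward only shows up on the trainer’s side as a scalar that scales the parameter update.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_probs(logits):
    """The policy's forward pass: note that it takes no reward argument."""
    return softmax(logits)

def reinforce_step(logits, action, reward, lr=0.1):
    """Trainer-side update: reward only scales the log-probability gradient."""
    probs = policy_probs(logits)
    # Gradient of log pi(action) w.r.t. logit i is 1[i == action] - probs[i].
    return [
        z + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (z, p) in enumerate(zip(logits, probs))
    ]

random.seed(0)
logits = [0.0, 0.0]    # two hypothetical actions, initially equally likely
rewards = [0.0, 1.0]   # action 1 is the high-reward outcome (assumed values)

for _ in range(500):
    action = random.choices([0, 1], weights=policy_probs(logits))[0]
    logits = reinforce_step(logits, action, rewards[action])

final_probs = policy_probs(logits)
print(final_probs)
```

After training, the policy prefers the high-reward action, but only because its parameters were changed from the outside. Nothing in `policy_probs` ever observes or stores the reward.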


We have a few interesting opportunities coming up. Thanks go to AGISF for once more sharing opportunities in ML & AI safety.

This has been the ML & AI safety update. See you next week!
