[edit: took out naming controversy stuff, as it was distracting from the point of the blog]

I am introducing a new rating system for each alignment breakthrough. Ratings run from 1 star ⭐ to 5 stars ⭐⭐⭐⭐⭐.

A 1 star ⭐ “breakthrough” represents incremental progress. This means that, while it technically achieves a new milestone, the breakthrough resulted from known techniques and could easily have been predicted in advance by an expert in the field. An example of something I’ve posted in the past that should be considered 1 star ⭐ is Würstchen V3. An example of a hypothetical 1 star ⭐ breakthrough would be if RLHF on GPT-5 were found to work better than RLHF on GPT-4.

A 5 star ⭐⭐⭐⭐⭐ “breakthrough” represents a discovery that solves a significant unsolved problem that was considered to be a major obstacle on at least one AI Alignment path. An example of a 5 star ⭐⭐⭐⭐⭐ breakthrough that I’ve posted in the past would be neuron superposition. An example of a hypothetical 5 star ⭐⭐⭐⭐⭐ breakthrough would be if someone were to develop a system that could translate a human-language description of a math problem into a formal mathematical proof.

Now, without any further ado…

AI Alignment Breakthroughs this Week

This week there were breakthroughs in the areas of:

AI Evaluation

AI Agents

Mechanistic Interpretability

Explainable AI

Simulation

Making AI Do what we want

AI Art

 

AI Evaluation

 

PCA-Eval

What is it: a new benchmark for multi-modal decision making

What is new: evaluate multimodal models (like GPT-4V) by their ability to make decisions in different domains

What is it good for: Benchmarking is key for many AI safety strategies such as the Pause and RSPs

Rating: ⭐⭐

AI Agents

 

Adapting LLM Agents Through Communication

What is it: Improved AI agents

What is new: By fine-tuning the LLM, the agents can perform better

What is it good for: Factored Cognition, Bureaucracy of AIs

Rating: ⭐

Mechanistic Interpretability

 

Research on infinite-width neural networks

What is it: research on how neural networks behave in the infinite-width (very large parameter count) limit

What’s new: a specific map showing when NNs will under/overtrain

What is it good for: determining the stability of neural networks as they scale up

Rating: ⭐⭐⭐

Reverse-engineering LLM components

What is it: research to understand LLM components

What’s new: discovery of “copy suppression” heads that prevent the LLM from repeating the input

What is it good for: understanding how LLMs work gives us better tools to trust/control them

Rating: ⭐⭐⭐

RLHF impacts on output diversity

What is it: Research showing how RLHF training on a model reduces output diversity

What’s new: seems to confirm anecdotal findings that RLHF reduces diversity

What is it good for: Understanding how alignment affects model outputs

Rating: ⭐

Attention Sinks

What it is: a simple method to improve long generations with LLMs

What’s new: they discover that the first 4 tokens act as “attention sinks” and that keeping them improves LLM outputs

What is it good for: This strongly reminds me of this research, which allowed us to get much better attention maps from ViTs

Rating: ⭐⭐⭐
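
To make the attention-sink idea concrete, here is a minimal sketch of the cache policy it implies: always keep the first few token positions, plus a sliding window of the most recent ones. The function name, the window size, and the idea of returning plain position indices are my own illustrative assumptions, not the authors' code.

```python
def kept_cache_positions(seq_len: int, num_sinks: int = 4, window: int = 8) -> list[int]:
    """Return the token positions retained in the cache after seq_len tokens.

    Illustrative assumption: the cache keeps the first `num_sinks` tokens
    (the "attention sinks") plus the `window` most recent tokens.
    """
    if seq_len <= num_sinks + window:
        # Everything still fits, so nothing is evicted yet.
        return list(range(seq_len))
    sinks = list(range(num_sinks))
    recent = list(range(seq_len - window, seq_len))
    return sinks + recent


if __name__ == "__main__":
    print(kept_cache_positions(20))
```

A plain sliding window would return only the `recent` part. The difference in this sketch is that positions 0 to 3 are never evicted, however long the generation runs.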

Explainable AI

 

Ferret MLLM

What is it: A new multi-modal-LLM that “grounds” its explanations in the images it is shown.

What is new: By adding the ability to point to specific parts of the image, the human can ask more detailed questions and the LLM can better explain its answers.

What is it good for: Having AI explain why it did something is a way to avoid the infamous (and possibly apocryphal) “sunny tanks” problem.

Rating: ⭐⭐

Hypothesis-to-Theories

What it is: Teach LLMs rules to reduce hallucinations

What’s new: two-stage approach where LLM first proposes rules and then applies them

What it is good for: In addition to being reliable, rules learned by LLMs should hopefully be more explainable as well.

Rating: ⭐⭐⭐ (new technique, much better results, promising research direction)
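
Here is a rough sketch of what a two-stage "propose rules, then apply them" loop could look like. In stage one, rules proposed by the model are tallied and only the ones that show up often and tend to lead to correct answers are kept. In stage two, the kept rules are placed in the prompt. The thresholds, function names, and prompt layout are hypothetical choices for illustration, and the model calls themselves are left out.

```python
from collections import defaultdict


def build_rule_library(proposals, min_count=2, min_accuracy=0.7):
    """Stage one: keep rules that are proposed often and are usually right.

    `proposals` is an iterable of (rule, answer_was_correct) pairs.
    The two thresholds are illustrative assumptions.
    """
    seen = defaultdict(int)
    correct = defaultdict(int)
    for rule, ok in proposals:
        seen[rule] += 1
        if ok:
            correct[rule] += 1
    return {
        rule
        for rule in seen
        if seen[rule] >= min_count and correct[rule] / seen[rule] >= min_accuracy
    }


def format_prompt(library, question):
    """Stage two: put the learned rules in front of the question."""
    rules = "\n".join(f"- {rule}" for rule in sorted(library))
    return f"Use only these rules:\n{rules}\n\nQuestion: {question}"


if __name__ == "__main__":
    proposals = [
        ("9+8=17", True), ("9+8=17", True),
        ("9+8=16", False), ("9+8=16", False),
        ("7+5=12", True),
    ]
    library = build_rule_library(proposals)
    print(format_prompt(library, "What is 9+8?"))
```

Because the rule library is just a set of strings, a human can read it directly, which is where the hoped-for explainability comes from.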

Simulation

 

Octopus

What is it: Use Grand Theft Auto as an environment to test MLLM agents.

What is the breakthrough: extends the idea of using GPT-4 in Minecraft. The use of a vision-LLM is new, and GTA-V should have a richer set of actions.

What is it good for: Training AIs in simulated environments is a form of sandboxing, although GTA-V wouldn’t be my first choice if you were trying to raise friendly AI.

Rating: ⭐

Making AI Do what we want

 

Chain of Verification

What it is: method for verifying truthfulness of LLM outputs

What’s new: They break each response into factors which are individually verified

What is it good for: the ability to verify factual outputs is useful for most alignment plans

Rating: ⭐⭐⭐ (new technique, much better results, promising research direction)
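
The "break the response into factors and verify each one" step can be sketched as below. This is a toy version under loud assumptions: claims are split naively on periods, and `check_claim` is a hypothetical stand-in for whatever verification step is actually used (in practice, likely another LLM call).

```python
from typing import Callable


def split_into_claims(response: str) -> list[str]:
    """Naive factoring: treat each sentence as one claim (assumption)."""
    return [part.strip() for part in response.split(".") if part.strip()]


def verify_response(response: str, check_claim: Callable[[str], bool]) -> dict[str, bool]:
    """Check every claim independently and record the verdict."""
    return {claim: check_claim(claim) for claim in split_into_claims(response)}


def revise(response: str, check_claim: Callable[[str], bool]) -> str:
    """Keep only the claims that passed verification."""
    verdicts = verify_response(response, check_claim)
    kept = [claim for claim, ok in verdicts.items() if ok]
    return ". ".join(kept) + "." if kept else ""


if __name__ == "__main__":
    # Toy verifier: a lookup in a tiny set of known facts.
    known_facts = {"Paris is the capital of France"}
    draft = "Paris is the capital of France. The Moon is made of cheese."
    print(revise(draft, lambda claim: claim in known_facts))
```

The point of checking claims one at a time is that each check is a smaller, easier question than "is this whole answer true?".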

Cross Episodic Curriculum

What it is: improve learning with limited data

What’s new: by looking at data across RL episodes, they can improve policy training

What is it good for: making the best use of limited data reduces the chance of AI doing something wrong.

Rating: ⭐⭐

Constrained RLHF

What is it: a modification to RLHF to prevent overfitting

What’s new: They weight the Reward Model to make sure it is only used in the region where it is effective

What is it good for: Prevent Goodharting

Rating: ⭐⭐⭐⭐
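
One simple way to picture "only use the reward model where it is effective" is to stop rewarding a score once it passes a threshold beyond which the reward model is assumed to be unreliable. The sketch below uses a hard cap per reward component. This is a deliberate simplification for illustration and not the method from the paper; the component names, thresholds, and weights are all made up.

```python
def capped_reward(
    scores: dict[str, float],
    thresholds: dict[str, float],
    weights: dict[str, float],
) -> float:
    """Combine reward-model scores, capping each at its trusted threshold.

    Assumption: above `thresholds[name]` the reward model for that
    component is no longer a reliable proxy, so extra score earns nothing.
    """
    total = 0.0
    for name, score in scores.items():
        capped = min(score, thresholds[name])
        total += weights[name] * capped
    return total


if __name__ == "__main__":
    reward = capped_reward(
        scores={"helpful": 0.9, "fluent": 0.5},
        thresholds={"helpful": 0.7, "fluent": 0.8},
        weights={"helpful": 1.0, "fluent": 0.5},
    )
    print(reward)
```

With a cap in place, the policy gains nothing by pushing "helpful" from 0.7 to 0.9, which removes the incentive to exploit the region where the proxy and the true objective come apart.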

AI Art

 

4D Gaussian Splatting

What is it: A method for converting a normal video into a “4D” movie you can view from any angle

What is the breakthrough: By using Gaussian Splatting, it achieves much better quality and speed than previous methods for doing this

What is it good for: Cool matrix-style shots. Video games probably.

Rating: ⭐⭐ (big leap in quality, but mostly an application of a known method to a new problem)

Kandinsky Deforum

What is it: Pretty movie generator

What is new: Deforum is one of the OG AI video methods; this just applies it to a new model

What is it good for: making cool movies

Rating: ⭐

Ambient Diffusion

What is it: a way to avoid reproducing training images in diffusion models

What’s new: they mask the training images to prevent reproduction

What is it good for: reducing copyright concerns when using diffusion models.

Rating: ⭐⭐

MotionDirector

What is it: separate motion and subject in text-to-video models

What’s new: they train a dual-path LoRA on an individual video to extract motion

What is it good for: Transfer the motion from one video to another

Rating: ⭐⭐⭐⭐

 

This is Not AI Alignment

 

GPT-4v 🔃 Dall-E 3 (https://twitter.com/conradgodfrey/status/1712564282167300226)

What is it: A fun graphic showing what happens when we repeatedly iterate complex systems.

What does it mean: There was a fun back-and-forth where it was speculated “this is how we die”, which was quickly refuted. I think this perfectly demonstrates the need for empiricism in AI Alignment.

GPT-4V Vision Jailbreak

What is it: “secret” messages can be passed via image to deceive the user.

What does this mean: Everyone expected to find image jailbreaks. And we did. Good job everyone.


I generally assume that policing other people’s language is a form of wokism

an ironic twist on calling it "wokism" is that "tone policing" originated as a term from social justice thinking as a critique against attempts to silence or trivialize marginalized voices based on the way they express their concerns. far be it from me to say you can't, but it's certainly an interesting choice of words.

Engaging with these arguments is pointless. Instead, the correct response is to define how you are using a word and move on.

if you aren't interested in attempting to use words in a way that makes their meanings accurate from ambient context, then yes, sometimes definitions work, but if you won't agree with the meaning of a word that is typically used to hype something as progress on a technical topic because you want to use it for less impactful things, the typical word for that is "clickbait".

the things you list are interesting, and I do not intend to diss them at all. my critique being only about your word choice is because that's all I want to critique. proceed according to taste.

an ironic twist on calling it "wokism" is that "tone policing" originated as a term from social justice thinking as a critique against attempts to silence or trivialize marginalized voices based on the way they express their concerns. 

 

Yes.  I am 100% attempting to do tone policing.  The tone-policing I am hoping to enforce is "stop arguing about definitions and please talk about technical aspects of AI alignment".  Given the comments on this post (currently 4 about hating the name and 0 about technical alignment issues) I have utterly failed. 🤷‍♂️

I'm not sure why this one is currently downvoted, but I found the last week's overview pretty helpful. Whenever I find something interesting e.g. running A/B testing on infant audio data or deriving speech/language/verbal thoughts from magnetoencephalogram data alone, I still don't know whether it will replicate in 2023 or 2026 or even not at all from that particular physiological data source, or whether it would probably replicate and the engineers in question were just incompetent or had insufficient data security.

The star ratings are an improvement, I had felt also that breakthrough was overselling many of the items last week.

However, stars are very generic and don't capture the concept of a breakthrough very well. You could consider a lightbulb.

I also asked chatgpt to create an emoji of an AI breakthrough, and after some iteration it came up with this: https://photos.app.goo.gl/sW2TnqDEM5FzBLdPA

Use it if you like it!

Thanks for putting together this roundup, I learn things from it every time.

yea, as expected I don't like the name, but the review is great, so I guess it's net positive )