Sorry for getting this one out a day late.
Before getting started, one minor update. I liked David Orr’s suggestion that stars ⭐ are too generic. So to emphasize that I am rating the innovativeness of a topic (and not necessarily its quality), I am going to be rating breakthroughs with a lightbulb 💡 instead.
As a reminder, a rating of 1 💡 means a breakthrough that, while achieving a new state of the art, was mostly accomplished through predictable iterative improvements (such as combining known methods or applying an existing method at greater scale). A rating of 5 💡💡💡💡💡 means a breakthrough that solves a significant, previously-assumed-to-be-hard problem on at least one alignment path.
Without further ado, here are our
AI Alignment Breakthroughs this Week
This week there were breakthroughs in the following areas:
Understanding Humans
Benchmarking
AI Agents
Truthful AI
Learning Human Preferences
Math
Mechanistic Interpretability
Augmentation
Making AI Do what we want
AI Art
Understanding Humans
Human Consciousness as a result of Entropy
What it is: Research suggesting human consciousness might be related to the entropy of our internal mental states
What’s new: Researchers used MRI to measure brain activity during conscious and unconscious states and found that conscious states exhibit higher entropy
What’s it good for: Literally any progress on understanding the hard problem of consciousness would be a massive breakthrough, with implications for problems such as S-Risk
Rating: 💡💡💡 (I want to rate it high because this is such an important problem, but I’m not overly impressed with this particular study. I think IIT would have predicted this)
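To make the “entropy of mental states” idea concrete, here is a toy sketch (my code, not the study’s) of the kind of comparison involved: estimate how spread out the brain’s state-occupancy distribution is in each condition. The state labels below are made up; in the real study they would come from clustering measured brain activity.

```python
import numpy as np

def state_entropy(states, n_states):
    """Shannon entropy (bits) of how often each discrete brain state is visited."""
    p = np.bincount(states, minlength=n_states) / len(states)
    p = p[p > 0]                      # drop unvisited states to avoid log(0)
    return -(p * np.log2(p)).sum()

rng = np.random.default_rng(0)
# Hypothetical per-timepoint state labels for two recording conditions.
awake  = rng.integers(0, 8, size=1000)   # visits many states roughly evenly
asleep = rng.choice(8, size=1000, p=[.6, .2, .1, .05, .02, .01, .01, .01])  # concentrated

print("awake entropy: ", round(state_entropy(awake, 8), 2))   # near 3 bits, the max for 8 states
print("asleep entropy:", round(state_entropy(asleep, 8), 2))  # noticeably lower
```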
Beautiful Infographics get more citations
What it is: Research suggesting beautiful infographics are more likely to be believed and cited
What’s new: They manipulated data to be more beautiful and found it was rated higher
What is it good for: Understanding how humans evaluate research is a key step to figuring out how to make us better.
Rating: 💡💡
Benchmarking
GenBench
What is it: a new benchmark measuring how well AI models generalize
What’s new: They identify different categories of “generalization” and measure how AIs do on each of them
What is it good for: Measuring AI capabilities (especially generalization) is critical for plans such as strategic AI pauses.
Rating: 💡
AI IQ test
What it is: Researchers find a way to measure the overall “intelligence” of Language Models
What’s new: They find that language models exhibit a common factor similar to Spearman’s g factor in human beings
What is it good for: It would be nice to have a scientific definition of “human level” AI rather than just the current heuristic: it can do anything a human can do.
Rating: 💡💡💡
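For intuition, here’s a minimal sketch of how a g-like common factor falls out of a matrix of benchmark scores: standardize each benchmark, then take the first principal component. The score matrix below is random placeholder data with a planted common factor, not the paper’s data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical scores: 30 language models x 10 benchmarks, with a planted "general ability".
ability = rng.normal(size=(30, 1))
scores = ability @ rng.uniform(0.5, 1.0, size=(1, 10)) + 0.3 * rng.normal(size=(30, 10))

z = (scores - scores.mean(axis=0)) / scores.std(axis=0)   # standardize each benchmark
eigvals, eigvecs = np.linalg.eigh(np.cov(z, rowvar=False))

print("share of variance explained by the top factor:", round(eigvals[-1] / eigvals.sum(), 2))
g_scores = z @ eigvecs[:, -1]   # one "IQ-like" number per model
```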
AI Agents
LLM Agents with provable regret guarantees
What it is: Enhancing AI agent planning using the concept of “regret”
What’s new: A new provable-regret framework
What is it good for: Allows us to better control AI agents and make sure they don’t make mistakes.
Rating: 💡💡💡
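For anyone who hasn’t run into the term, “regret” is the standard online-learning yardstick: how much worse the agent did than the best single decision it could have committed to in hindsight. In symbols (my notation, not necessarily the paper’s):

```latex
\mathrm{Regret}_T \;=\; \sum_{t=1}^{T} \ell_t(a_t) \;-\; \min_{a \in \mathcal{A}} \sum_{t=1}^{T} \ell_t(a)
```

A provable guarantee that this quantity grows sublinearly in T means the agent’s average performance approaches that of the best fixed action in hindsight, which is the sense in which such guarantees keep an agent from compounding mistakes.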
AgentTuning
What is it: Fine-tuning LLMs to make them better AI agents
What’s new: New SOTA performance
What is it good for: Better AI agents are useful for alignment paths such as factored cognition.
Rating: 💡
Truthful AI
Self-RAG
What is it: Improving truthfulness in Retrieval-Augmented Generation (RAG)
What’s new: They train the AI to adaptively retrieve documents to improve factual recall
What’s it good for: Having AI that is factually correct and cites its sources is useful for almost all alignment paths
Rating: 💡💡
The Consensus Game
What is it: reconciling incoherent AI predictions
What’s new: they devise a new “consensus game” that allows the AI to converge on a logically consistent, more accurate conclusion
What’s it good for: truthful, accurate AI is needed for almost all alignment pathways
Rating: 💡💡💡💡 (I really think we are going to see a lot of important breakthroughs in the coming year involving the concept of AI self-play)
Factored Verification
What is it: a method for improving AI summaries
What’s new: By breaking summaries down into individual claims, each claim can then be independently verified
What’s it good for: accurately summarizing documents. This will be important if we want an AI to e.g. “quickly summarize this plan for me”.
Rating: 💡
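The recipe is simple enough to sketch in a few lines. This is my own stub (the `llm` helper below is a placeholder for whatever chat-model API you use, not the paper’s code):

```python
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your preferred chat-model API here")

def factored_verification(document: str, summary: str) -> list[tuple[str, str]]:
    # 1. Decompose the summary into short, independent claims.
    claims = llm(
        "Break this summary into a list of short, independent factual claims, "
        f"one per line:\n\n{summary}"
    ).splitlines()

    # 2. Check each claim against the source document on its own.
    results = []
    for claim in claims:
        verdict = llm(
            f"Document:\n{document}\n\nClaim: {claim}\n"
            "Is this claim supported by the document? Answer SUPPORTED or UNSUPPORTED."
        )
        results.append((claim, verdict.strip()))
    return results
```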
Learning human preferences
Eliciting Human Preferences with Language Models
What is it: A better way to write “prompts” for AI models
What’s new: By interacting with the user, generative AI can generate better prompts than humans or AIs alone
What is it good for: Lots of AI right now depends heavily on “prompt engineering” where humans have to learn how to express their desires in terms the AI understands. The goal should be to flip this around and have the AI try to understand human preferences.
Rating: 💡💡
Contrastive Preference Learning
What is it: RLHF without the RL
What’s new: By minimizing a “regret” score, they can better learn human preferences
What is it good for: Learning human preferences is on the critical path for many AI alignment plans.
Rating: 💡💡
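My loose paraphrase of the core objective (from memory, so treat the specifics as approximate): instead of fitting a reward model and running RL, you directly push the policy to assign a higher discounted log-likelihood to the segment the human preferred, via a logistic loss. Something like the sketch below, where `alpha` and `gamma` are placeholder hyperparameters of mine:

```python
import torch
import torch.nn.functional as F

def cpl_style_loss(logp_preferred, logp_rejected, alpha=0.1, gamma=0.99):
    """Contrastive loss over two equal-length trajectory segments.

    logp_* are (T,) tensors of the policy's log pi(a_t | s_t) along each segment.
    """
    discounts = gamma ** torch.arange(len(logp_preferred), dtype=logp_preferred.dtype)
    score_pos = alpha * (discounts * logp_preferred).sum()
    score_neg = alpha * (discounts * logp_rejected).sum()
    # Bradley-Terry style: make the human-preferred segment the more likely one.
    return -F.logsigmoid(score_pos - score_neg)
```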
Text2Reward
What is it: a method for converting user feedback into a score that can be used by a RL model
What’s new: A new method for converting a goal described in natural language into an executable program that can be used by a reinforcement learner
What is it good for: Allows us to give instructions to robots using words instead of physically showing them what we want them to do.
Rating: 💡💡💡
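Schematically, the loop looks like this (stub code, not the authors’ implementation; the observation keys and the `llm` helper are hypothetical):

```python
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in a code-generating model here")

goal = "Open the drawer without knocking over the cup sitting on top of it."

prompt = f"""Write a Python function `reward(obs)` for a robot-manipulation environment.
`obs` is a dict with keys 'drawer_opening' (metres) and 'cup_tipped' (bool).
The reward should encode this goal: {goal}
Return only code."""

source = llm(prompt)        # the model writes an executable reward program
namespace = {}
exec(source, namespace)     # in practice you'd sandbox and review generated code first
reward_fn = namespace["reward"]   # this gets handed to the RL training loop
```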
Math
Improving Math with better tokenization
What is it: We’ve known for a while that tokenization causes lots of issues for language models (for example, in learning to write a word backwards or count the number of letters)
What’s new: Unsurprisingly, it’s a problem for math too
What’s it good for: Ideally we can move to byte-level models soon (now that the problem of training large attention windows is gradually getting solved)
Rating: 💡 (everybody knew this, but good to have more confirmation)
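You can see the underlying problem in about five lines. This uses OpenAI’s `tiktoken` tokenizer as the example (any BPE tokenizer shows the same effect):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # the GPT-4 / GPT-3.5 tokenizer

for number in ["1000", "1001", "99999", "100000"]:
    pieces = [enc.decode([tok]) for tok in enc.encode(number)]
    print(f"{number} -> {pieces}")

# Similar-looking numbers get split into different, unevenly sized chunks, so the model
# never sees digits laid out in a consistent place-value format.
```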
Llemma: open LMs trained for math
What is it: An LLM trained on a better math dataset
What’s new: Better data = better results
What is it good for: Solving math is on the critical path for alignment plans that rely on provably secure AI
Rating: 💡
Runtime error prediction
What is it: a language model specifically trained to predict how programs will run and when they will encounter errors
What’s new: the model’s architecture is specifically designed to mimic a code interpreter
What is it good for: predicting the output of programs should lead to generating better code/proofs.
Rating: 💡💡💡
Mechanistic Interpretability
Interpreting Diffusion models
What is it: researchers find interpretable “directions” in a diffusion model
What’s new: they apply a method previously used on GANs (another type of image generation model) to interpret diffusers
What is it good for: allows us to better understand/control diffusion models
Rating: 💡
Attribution Patching
What is it: Finding interpretable subnetworks in LLMs
What’s new: a new, more efficient algorithm for finding these circuits
What is it good for: allows us to better understand LLMs
Rating: 💡💡💡
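The trick, roughly: instead of re-running the model once for every activation you might patch, do one clean run, one corrupted run with gradients, and use a first-order Taylor estimate of what each patch would have done. Here is a self-contained toy version (the tiny model and the metric are mine, purely to show the mechanics):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))  # stand-in "network"

acts, grads = {}, {}

def cache(module, inputs, output):
    acts["mid"] = output.detach()
    if output.requires_grad:   # only the corrupted run is done with gradients enabled
        output.register_hook(lambda g: grads.__setitem__("mid", g.detach()))

handle = model[1].register_forward_hook(cache)   # watch the hidden activations

clean_input = torch.randn(1, 8)
corrupted_input = clean_input + torch.randn(1, 8)

with torch.no_grad():                 # clean run: cache activations only
    model(clean_input)
clean_act = acts["mid"]

metric = model(corrupted_input)[0, 0]   # corrupted run: some scalar we care about
metric.backward()                       # caches the gradient at the hidden layer
corrupted_act, corrupted_grad = acts["mid"], grads["mid"]

# First-order estimate of how much patching each unit's clean activation into the
# corrupted run would change the metric -- no extra forward passes required.
attribution = (clean_act - corrupted_act) * corrupted_grad
print(attribution)
handle.remove()
```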
Truth Direction Extraction
What is it: Finding a way to separate true and false statements in an LLM’s embedding space
What’s new: There is some work here to distinguish true/false from likely/unlikely. Also some improvements in the regression they use to find this concept
What is it good for: Finding concepts like this is something we need to get good at generally, and true/false is specifically an important example.
Rating: 💡
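The baseline version of this is just a linear probe: collect hidden-state activations for true and false statements, fit a logistic regression, and read off its weight vector as the “truth direction.” (The paper’s contribution is improving on that regression; the sketch below, with synthetic activations, is only the plain baseline.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64
truth_direction = rng.normal(size=d)
truth_direction /= np.linalg.norm(truth_direction)     # planted ground-truth direction

# Synthetic "activations": true statements shifted one way along the direction, false the other.
n = 500
labels = rng.integers(0, 2, size=n)                     # 1 = true statement, 0 = false
acts = rng.normal(size=(n, d)) + np.outer(2 * labels - 1, truth_direction)

probe = LogisticRegression(max_iter=1000).fit(acts, labels)
learned_direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

print("cosine(learned, planted):", float(learned_direction @ truth_direction))  # high if recovered
print("probe accuracy:", probe.score(acts, labels))
```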
Revealing Intentionality in Language Models
What it is: a method for determining whether an AI “intends” something, for example “intentionally lied to me”
What’s new: Mostly, it seems to be a new philosophical framework
What is it good for: I think the concept of “intention” is incoherent/anthropomorphising when it comes to LLMs, but some of the techniques here seem to have practical uses.
Rating: 💡💡
Towards Understanding Sycophancy in Language Models
What it is: LLMs have a known problem where they always agree with the user
What’s new: they find that optimizing against human preferences makes the LLM more likely to “say what you want to hear”
What is it good for: Knowing you have a problem is the first step.
Rating: 💡
Augmentation
Decoding image perception
What is it: We can predict what people are seeing by reading their brain waves
What’s new: Seems better quality than past examples of this
What is it good for: brain-computer interfaces are one path to AI alignment
Rating: 💡💡
Seeing Eye Quadruped Navigation
What is it: a “seeing eye dog” robot
What’s new: they train the robot to detect when the user “tugs” on its leash
What is it good for: helping blind people
Rating: 💡
Making AI Do what we want
Adding labels helps GPT-4V
What is it: a method for prompting vision models better
What’s new: adding labels to images before giving them to GPT-4V makes it much more useful
What is it good for: You take a picture of your bike and want to know what part needs fixing.
Rating: 💡💡
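The trick is low-tech: draw numbered markers on the image first, then ask the model to answer in terms of the numbers. A sketch with PIL (the file name and boxes are made up; in practice the boxes might come from a detector or segmenter):

```python
from PIL import Image, ImageDraw

img = Image.open("bike.jpg").convert("RGB")
regions = {1: (40, 60, 180, 200), 2: (220, 90, 360, 260)}   # hypothetical part bounding boxes

draw = ImageDraw.Draw(img)
for label, (x0, y0, x1, y1) in regions.items():
    draw.rectangle((x0, y0, x1, y1), outline="red", width=3)  # box around the part
    draw.text((x0 + 4, y0 + 4), str(label), fill="red")       # numbered marker

img.save("bike_labeled.jpg")
# Prompt idea: "Which numbered part in the image looks damaged? Answer with the number."
```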
Lidar Synthesis
What is it: improved lidar for self-driving cars
What’s new: By having AI label the data, we can train better lidar with less human labeled data
What is it good for: Every year roughly 40,000 Americans die in car crashes
Rating: 💡💡
AI Art
Closed form diffusion models
What is it: Do image generation without training a model
What’s new: a new closed-form expression that “smooths out” the latent space between datapoints
What is it good for: training text-to-image models is hugely expensive. If this works, maybe we can skip that entirely. May also unlock “hybrid” methods where we add new images to a diffusion model without having to re-train it.
Rating: 💡💡💡💡
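My reading of the basic idea (toy version below with my own hyperparameters; the paper’s exact formulation may differ): if you smooth the empirical data distribution with a Gaussian kernel, its score has an exact closed-form expression in terms of the datapoints themselves, so you can run score-based sampling with no training at all.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 2-D "dataset": two clusters standing in for training images in latent space.
data = rng.normal(size=(200, 2)) * 0.3 + rng.choice([-2.0, 2.0], size=(200, 1))

def closed_form_score(x, data, sigma):
    """Exact score of the Gaussian-smoothed empirical distribution at point x."""
    diffs = data - x                                   # (N, 2)
    logw = -np.sum(diffs**2, axis=1) / (2 * sigma**2)
    w = np.exp(logw - logw.max())
    w /= w.sum()                                       # softmax weights over datapoints
    return (w[:, None] * diffs).sum(axis=0) / sigma**2

# Langevin dynamics driven by the closed-form score -- no trained network anywhere.
x, sigma, step = rng.normal(size=2) * 3, 0.5, 0.05
for _ in range(500):
    x = x + step * closed_form_score(x, data, sigma) + np.sqrt(2 * step) * rng.normal(size=2)

print("sample:", x)   # ends up near one of the two data clusters
```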
Latent consistency models
What is it: A much faster image generator
What’s new: latent consistency models are faster to train than previous methods for speeding up image diffusion models
What is it good for: More pretty pictures, faster.
Rating: 💡💡💡💡
RealFill (outpainting with reference image)
What is it: outpaint an image using a reference image
What’s new: A big improvement over previous methods
What is it good for: take a picture and expand it, but control what goes in the expanded area
Rating: 💡💡💡
This is not AI Alignment
Is alignment going too well?
What is it: There seems to be a new trend of AI Safety researchers opposing AI alignment research.
What does it mean: Seems bad. AI safety researchers who don’t want to solve alignment remind me of “climate activists” who oppose nuclear power plants.
AI took our (substack) Jobs
What is it: Substack employees may be the first people who can legitimately say “AI took our jobs”
What does it mean: Talented software engineers will land on their feet, but when we start replacing unskilled labor the impact on society will be much bigger.
Most CEOs are discouraging AI use
What is it: Research showing CEOs don’t want their employees to use AI because "they don't fully understand it."
What does it mean: This is 50% Luddism, 50% a reminder that alignment research is a rate-limiting step for unlocking AI capabilities.