Watch this week's update on YouTube or listen to it on Spotify. See the release post here, and please do correct any mistaken perspectives you see in the comments here or on the manuscript; we will update all text versions of the update with your corrections. Subscribe to the newsletter here and read earlier versions here.

We look at how we can safeguard against AGI, explain new research on Goodhart’s law, see an open-source dataset of 60,000 emotion-annotated videos, and share new opportunities in ML and AI safety.

Welcome to this week’s AI Safety Update!

Defending against AGI

What does it take to defend the world against artificial general intelligence? Steve Byrnes asks this question in a new post. He imagines a world where an aligned AGI is developed a couple of years before an unaligned one and comments on Paul Christiano’s optimistic strategy-stealing assumption, under which a first aligned AGI can take the actions needed to avert harm from future unaligned AGIs.

The general fears are that 1) it might be easier to destroy than to defend, 2) humans may not trust the aligned AGI, 3) alignment strategies may leave the aligned AGI weaker than a misaligned AI, and 4) it is very difficult to change society quickly while adhering to human laws.

Byrnes surveys an array of possible solutions that he does not believe will solve the problem:

  • Widespread deployment of an AGI to implement defenses is hard in a world where important actors don’t trust each other and aren’t AGI experts.
  • If AGI is used to create a wiser society, for example by acting as an advisor to government leaders, it will probably not be consulted often, since it might not say what they want to hear.
  • Non-AGI defense measures, such as improving cybersecurity globally, generally do not seem safe enough.
  • Stopping AGI development at the specific labs with the highest chance of creating AGI also seems to only buy us time.
  • Forcefully stopping AGI research has many caveats similar to the points above, but it seems like one of our best chances.

So, all in all, it seems that general access to artificial general intelligence could let a small group destroy the world, and any defense against this is unlikely to work.

Goodhart’s law

In their new paper, Leo Gao, John Schulman, and Jacob Hilton investigate how models of different sizes overoptimize against a reward target. This effect is commonly known as Goodhart’s law: when you optimize for an imperfect proxy of the true preference, the proxy itself gets optimized rather than what you actually care about, and the outcome fails to match the true preference. In AI safety, the true preference might be human values, and training a model on a proxy of these can lead to misalignment.

Goodhart’s law is hard to avoid because sidestepping it would require constant human oversight to continuously update the reward to match human preferences. The authors create a toy setting with a reward model as a stand-in for the human and simulate an imperfect, non-human reward signal by deriving proxy rewards from this gold standard in different ways.

They find scaling laws that can be used to predict how well reinforcement learning from human feedback works for larger models, and they describe the results in relation to four ways of thinking about Goodhart’s law. One of these is regressional Goodhart, which occurs when the proxy reward is a noisy version of the true reward. In their experiment, optimizing against a noisy proxy leads to a lower reward on the true preference than optimizing against the gold standard directly.
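To make the regressional case concrete, here is a minimal sketch with synthetic numbers (our own illustration, not the paper’s setup): if the proxy reward is the true reward plus independent noise, then selecting the candidate that scores highest on the proxy systematically yields a lower true reward than selecting on the true reward directly.

```python
import numpy as np

# Regressional Goodhart in miniature (synthetic illustration, not the paper's setup):
# the proxy reward is the true reward plus independent noise. The candidate that
# maximizes the noisy proxy ends up with a lower true reward than the true optimum.
rng = np.random.default_rng(0)
n_candidates = 10_000

true_reward = rng.normal(size=n_candidates)                 # gold-standard reward
proxy_reward = true_reward + rng.normal(size=n_candidates)  # noisy proxy

print("true reward of proxy-optimal candidate:", true_reward[np.argmax(proxy_reward)])
print("true reward of truly optimal candidate:", true_reward.max())
```

In this toy model the gap widens as the noise grows and as more candidates are considered, mirroring how stronger optimization against a fixed imperfect proxy makes overoptimization worse.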

Other news

  • A new paper releases a dataset of 60,000 videos manually annotated for their emotional qualities. The authors hope this can improve human preference learning from video examples by training neural networks to gain better cognitive empathy.
  • Neel Nanda releases a list of prerequisite skills for doing research in mechanistic interpretability.
  • Oldenziel and Shai claim that Kolmogorov complexity and Shannon entropy are misleading measures of structure for interpretability and that we need a new measure; however, they receive pushback from Sherlis, who argues that this is probably not true.
  • A new research agenda attempts to design the representations in the latent space of autoencoders according to our preferences.
  • A new reinforcement learning environment can be used to measure how power-seeking an AI is. Each state in the environment is associated with an instrumental value indicating how much power that state gives (a rough sketch of one such power measure follows after this list). The environment was released by Gladstone AI, who have already published several articles using it.
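To give a rough sense of what such an instrumental-value measure could look like, here is a hedged sketch in the spirit of the common POWER formalization (our own illustration, not necessarily Gladstone AI’s implementation): a state’s power is estimated as its average optimal value across many randomly sampled reward functions, so states that keep more options open score higher.

```python
import numpy as np

def optimal_values(transitions, rewards, gamma=0.9, iters=200):
    """Value iteration for a small deterministic MDP. transitions[s, a] is the
    successor state; rewards are received at the successor state."""
    values = np.zeros(transitions.shape[0])
    for _ in range(iters):
        q = rewards[transitions] + gamma * values[transitions]
        values = q.max(axis=1)
    return values

def estimate_power(transitions, n_samples=1_000, gamma=0.9, seed=0):
    """Estimate each state's power as its average optimal value over
    uniformly sampled reward functions (a POWER-style measure)."""
    rng = np.random.default_rng(seed)
    power = np.zeros(transitions.shape[0])
    for _ in range(n_samples):
        rewards = rng.uniform(size=transitions.shape[0])
        power += optimal_values(transitions, rewards, gamma)
    return power / n_samples

# Toy 3-state chain: state 0 can reach every state, state 2 is a dead end.
transitions = np.array([[0, 1],   # state 0: stay, or move to state 1
                        [0, 2],   # state 1: back to 0, or on to 2
                        [2, 2]])  # state 2: absorbing
print(estimate_power(transitions))
```

States from which more of the environment stays reachable (state 0 in the toy chain) receive higher estimates than dead ends (state 2), which is the intuition a power-seeking benchmark tries to capture.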

Opportunities

Now, let’s look at some newly available ways to get into machine learning and AI safety. There are quite a few jobs curated by BlueDot Impact (visit their opportunities page).

Comments

Anthropic is hiring senior software engineers and for many other positions, including a recruiter and a recruiting coordinator (meta-hiring!). More about that here.

I wouldn't usually post those links, but the BlueDot site seems to be broken, and the comment here might give the impression that we're only hiring for (specific) technical roles.
 

Thanks for pointing that out! This looks like it was, I hope, a temporary breakage at the time you commented. Admittedly, I'm not sure what changes were being made when you wrote this, so I hope it's nothing too sinister.

If it still appears broken on your end, I'd appreciate it if you could contact me with some more information about your browser and location. (It looks fine to me right now.)

Additionally, since I'm here: I aim to add more of Anthropic's roles in the next few days. Their omission isn't due to policy; I'm just prioritising the order in which I build up the board and make it navigable for users looking for the roles most relevant to them.

Thank you for pointing it out! I've reached out to the BlueDot team as well; maybe it's a platform-specific issue, since it looks fine on my end.
