Listen to this week’s update on YouTube or podcast.

This week, we’re looking at the wild abilities of ChatGPT, exciting articles coming out of the NeurIPS conference and AGI regulation at the EU level. 

My name is Esben and welcome to week 48 of updates for the field of ML & AI safety. Strap in!

ChatGPT released

Just two days ago, ChatGPT was released and it is being described as GPT-3.5. We see many bug fixes from previous releases and it is an extremely capable system. 

We can already see it find loopholes in crypto contracts, explain and solve bugs, replace Google search and, most importantly, show the capability to deceive and circumvent human oversight.

Despite being significantly safer than the previous version (text-davinci-002), we see that it still has the ability to plan around human preferences with quite simple attacks.

On Monday, OpenAI also released text-davinci-003, the next generation of their fine-tuned language models. There are rumors of GPT-4 being released in February and we’ll see what crazy and scary capabilities they have developed by then.

The demo app is available on


I’m currently at NeurIPS and have had a wonderful chance to navigate between the many posters and papers presented here. They’re all a year old by now and we’ll see the latest articles come out when the workshops start today.

Chalmers was the first keynote speaker and he dangerously created a timeline for creating conscious AI, one that creates both an S-risk and an X-risk. He set the goal of fish-level AGI consciousness by 2032, though all of this really seems to depend on your definition of consciousness, and I know many of us would expect it before 2032.

Beyond that, here’s a short list of some interesting papers I’ve seen while walking around:

  • AlphaGo adversarial examples: This paper showcases how easy it is to find attacks even for highly capable reinforcement learning systems such as AlphaGo. It basically finds board positions where inserting the next move (for black and white) ruins the AI’s ability to predict the next move.
  • InstructGPT paper: Here, OpenAI fine-tunes a language model on human feedback and achieves both a better and safer model with very little compute needed. It was interesting to speak with the authors and get some deeper details, such as their data collection process.
  • MatPlotLib is all you need: This paper showcases issues with differential privacy (sharing private data as statistics to avoid privacy issues) with neural networks. Instead of sending the private images, the application sends the gradients (“internal numbers”) of a neural network. Here, they simply use MatPlotLib and plot the gradients (along with a transformation) and easily reconstruct the private input images.
  • System 3: This is a paper from our very own Fazl Barez where we input environment constraints into the reward model to do better safety-critical exploration. This achieves better performance in high-risk environments using OpenAI Safety Gym.
  • LAION-5B: This open source project has collected 5.85 billion text-image pairs and explicitly created an NSFW and SFW split of the dataset, though they have trained the models on the full dataset (chaotic).
  • Automated copy+paste attacks: This is an interesting paper building on their previous work where they show that you can take a small image on top of a test image (a “patch”) and use it to understand how classes of items in images relate to each other. This work automates that process and they’re working on implementing it for language models, a task that, and I quote, “should be relatively straightforward”.
  • GriddlyJS: A JS framework for creating RL environments easily. We might even use this for the “Testing AI” hackathon coming up in a couple of weeks! Try it here.
  • What out-of-distribution is and is not: Here, Farquhar and Gal disambiguate the term “out-of-distribution” (OOD) into four different terms: transformed-distributions, related-distributions, complement-distributions, and synthetic-distributions. Since OOD is very important for alignment, it is important to understand our word use precisely.
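
To get a feel for why shared gradients can leak private inputs, as in the “MatPlotLib is all you need” bullet above, here is a minimal Python sketch. It is not the paper’s method: it assumes a single fully connected layer with a bias and a squared-error loss, and every function name is made up for illustration. Under those assumptions, each row of the weight gradient is the private input scaled by the matching bias gradient, so a single division recovers the input.

```python
import random


def layer_gradients(x, W, b, target):
    """Gradients of 0.5 * ||Wx + b - target||^2 with respect to W and b.

    Hypothetical helper: stands in for the gradients a client would share.
    """
    # residual_i = (Wx + b)_i - target_i, which is also dLoss/db_i
    residual = [
        sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i - t_i
        for row, b_i, t_i in zip(W, b, target)
    ]
    # dLoss/dW_ij = residual_i * x_j
    grad_W = [[r * x_j for x_j in x] for r in residual]
    grad_b = residual
    return grad_W, grad_b


def reconstruct_input(grad_W, grad_b):
    """Recover x from the shared gradients alone (assumes some grad_b != 0)."""
    # Use the row with the largest bias gradient to keep the division stable.
    i = max(range(len(grad_b)), key=lambda k: abs(grad_b[k]))
    return [g / grad_b[i] for g in grad_W[i]]


rng = random.Random(0)
x_private = [rng.random() for _ in range(16)]           # e.g. flattened pixels
W = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(4)]
b = [rng.gauss(0.0, 1.0) for _ in range(4)]
target = [0.0] * 4

grad_W, grad_b = layer_gradients(x_private, W, b, target)
x_reconstructed = reconstruct_input(grad_W, grad_b)
max_error = max(abs(r - x) for r, x in zip(x_reconstructed, x_private))
print("largest reconstruction error:", max_error)
```

Real networks are deeper and real attacks are more involved, but the sketch shows why gradients alone should not be treated as anonymized data.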

And these are of course just a few of the interesting papers from NeurIPS. You can check out the full publication list, the accepted papers for the ML safety workshop and the scaling laws workshop happening today.


In other great news, the EU AI Act received an amendment about general purpose AI systems (such as AGI) that details their ethical use. It even seems to apply to open source systems, though it is unclear whether it applies to models released outside of organizational control, e.g. in open source collectives.

An interesting clause is §4b.5 that requires cooperation between organizations who wish to put general purpose AI into high-risk decision-making scenarios.

Providers of general purpose AI systems shall cooperate with and provide the necessary information to other providers intending to put into service or place such systems on the Union market as high-risk AI systems or as components of high-risk AI systems, with a view to enabling the latter to comply with their obligations under this Regulation. Such cooperation between providers shall preserve, as appropriate, intellectual property rights, and confidential business information or trade secrets. 

In this text, we also see that it covers any system put to use on “the Union market”, which means that the systems may originate from GODAM (Google, OpenAI, DeepMind, Anthropic and Meta) and still fall under the regulation, in the same way that GDPR applies to any European citizen’s data.

In general, the EU AI Act seems very interesting and more positive for AGI safety than many would expect, and we have many individuals from the field of AI safety to thank for this development. See also an article by Gutierrez, Aguirre and Uuk on the EU AI Act’s definition of general purpose AI systems (GPAIS).

Mechanistic anomaly detection

Paul Christiano has released an update on the ELK problem, detailing the Alignment Research Center’s current approach.

The ELK (Eliciting Latent Knowledge) problem was defined in December 2021 and is focused on having a model explain its knowledge despite incentives to do the opposite. Their example is of an AI guarding a vault containing a diamond, with a human evaluating whether it is successful based on a camera looking at the diamond.

However, a thief might tamper with the video feed to show exactly the right image and fool the human, leading to a reward for the AI despite the AI (using other sensors) knowing that the diamond is gone. Then the problem becomes how to know what the AI knows.

In this article, Christiano describes their approach: infer what the model’s internal behavior is when the diamond is in the vault (the normal situation) and then detect anomalies relative to this normal internal behavior. This is related both to mechanistic interpretability and to the field of Trojan detection, where we attempt to detect anomalies in models.
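
To make the idea of “detecting anomalies in normal internal behavior” a bit more concrete, here is a toy Python sketch. It is not ARC’s approach: it swaps in a plain per-dimension z-score detector over made-up “activation” vectors, and the function names, the threshold and the data are all assumptions for illustration. The point is only the shape of the workflow: fit a profile on trusted runs, then flag runs whose internals deviate strongly from it.

```python
import random
import statistics


def fit_normal_profile(activations):
    """Per-dimension mean and standard deviation over trusted (normal) runs."""
    columns = list(zip(*activations))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.stdev(col) for col in columns]
    return means, stds


def anomaly_score(profile, activation):
    """Largest absolute z-score across dimensions (assumes no zero std)."""
    means, stds = profile
    return max(abs(a - m) / s for a, m, s in zip(activation, means, stds))


def is_anomalous(profile, activation, threshold=6.0):
    """Flag a run whose internals deviate strongly from the normal profile.

    The threshold is an arbitrary illustrative choice.
    """
    return anomaly_score(profile, activation) > threshold


# Hypothetical stand-in for activations recorded while the diamond is safe.
rng = random.Random(0)
normal_runs = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(500)]
profile = fit_normal_profile(normal_runs)

typical_run = [0.1] * 8             # looks like the trusted runs
tampered_run = [0.1] * 7 + [25.0]   # one internal feature far off its usual range

print("typical flagged: ", is_anomalous(profile, typical_run))
print("tampered flagged:", is_anomalous(profile, tampered_run))
```

A detector this simple only notices when individual features leave their usual range, so it says nothing about why the model’s reasoning changed; that is the part that makes the problem hard.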


And now to our wonderful weekly opportunities. 

  • Apply to the 3.5 month virtual AI Safety Camp starting in March where you can lead your very own research team. Send in your research ideas and they’ll collaborate with you to make a plan with a research team.
  • In two weeks, the AI testing hackathon is going down. Here, we collaborate to find novel ways to test AI safety by interacting with state-of-the-art language models and playing within reinforcement learning environments.
  • A group of designers are seeking play-testers for a table-top game where you simulate AI risk scenarios! It seems pretty fun so check it out here.
  • The Center for AI Safety is running Intro to ML Safety over 8 weeks in the spring, and you can apply now to be a participant or a facilitator.

Thank you for following along for another week and remember to make AGI safe. See you next week!
