Clément Dumas, Walter Laurito, Robert Klassert, Kaarel Hänni
Epistemic Status: Initial Exploration
The following is a status update on a project started as part of the SPAR program. We explored some initial directions, and there is still a lot of low-hanging fruit to pick. We might continue this project, either in another SPAR iteration or with others who would be interested in working on it.
We adapted the Contrast Consistent Search (CCS) loss to find value-like directions in the activations of CNN-based PPO agents. While we had some success in identifying these directions at late layers of the critic network and with specific informative losses, we discovered that early layers and the policy network often contained more salient features that overshadowed the value-like...
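For readers unfamiliar with the objective we adapted, here is a minimal NumPy sketch of the CCS loss from Burns et al. (2022). The probe outputs `p_pos` and `p_neg` (our variable names, not the paper's) are assumed to be probabilities in [0, 1] for the two contrasting phrasings of each statement:

```python
import numpy as np

def ccs_loss(p_pos, p_neg):
    """Contrast Consistent Search loss (Burns et al., 2022).

    p_pos, p_neg: arrays of probe outputs in [0, 1] for the "true" and
    "false" phrasings of the same underlying statements.
    """
    # Consistency: the two probabilities should sum to 1,
    # i.e. p(x+) should equal 1 - p(x-).
    l_consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: penalize the degenerate solution p(x+) = p(x-) = 0.5.
    l_confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(l_consistency + l_confidence))
```

A perfectly consistent, confident probe (e.g. `p_pos = 1`, `p_neg = 0`) achieves zero loss, while the constant-0.5 probe is penalized by the confidence term; our "informative losses" mentioned above were variations on this basic objective.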
and Kaarel’s work on DLK
@Kaarel is the research lead at Cadenza Labs (previously called NotodAI), our research group, which started during the first part of SERI MATS 3.0. (There will hopefully be more information about Cadenza Labs soon!)
Our team members broadly agree with the post!
Currently, we are looking for further funding to continue to work on our research agenda. Interested funders (or potential collaborators) can reach out to us at info@cadenzalabs.org.
Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort
I think that Discovering Latent Knowledge in Language Models Without Supervision (DLK; Burns, Ye, Klein, & Steinhardt, 2022) is a very cool paper – it proposes a way to do unsupervised mind reading[1] – diminished only by not making its conceptual coolness evident enough in the paper writeup. This is in large part corrected in Collin Burns's more conceptual companion post. I'm rooting for their paper to sprout more research that finds concepts / high-level features in models by searching for their shape[2]. The aim of this post is to present a conceptual framework for this kind of interpretability, which I hope will facilitate turning concepts into structures to look for in ML models....
We would like to thank the following people who contributed to the generation of ideas and provided feedback on this post: Alexandre Variengien, Daniel Filan, John Wentworth, Jonathan Claybrough, Jörn Stöhler, June Ku, Marius Hobbhahn, and Matt MacDermott.
We are a group of four participating in the SERI ML Alignment Theory Scholars Program under John Wentworth, and we are extending the paper "Discovering Latent Knowledge in Language Models Without Supervision", as we see it as an interesting direction where we could contribute concrete progress towards alignment.
We have a number of potential directions we could explore. The goal of this post is to get feedback on them early in order to prioritize better. This could be confirmation of the good ideas, reasons why certain ideas might be bad, references...
Should work again :)
I've created a Discord for people interested in organizing / collaborating / self-study: https://discord.gg/Ckj4BKUChr People could start with the brief curriculum published in this document until a full curriculum becomes available :)
Maybe we could also send out an invitation to all the people who got rejected to join a Slack channel. (I could set that up, if necessary. Since I don't have the emails, though, someone else would need to send the invitations.) There, based on the curriculum, people could form self-study groups on their own with others close by (or remotely) and talk about difficulties, bugs, etc. Maybe even the people who were not rejected could join the Slack and help answer questions (if they like and have time, of course)?
Same here (not sure yet if I'll get accepted to AISC, though). But I would be happy to help with or co-organize something like Richard_Ngo suggested (although I've never organized something like that before). Maybe a virtual version in (Continental?) Europe, if there are enough people.
If I remember correctly, they do something like that in this paper: