I work primarily on AI Alignment. Scroll down to my pinned Shortform for an idea of my current work and who I'd like to collaborate with.
Website: https://jacquesthibodeau.com
Twitter: https://twitter.com/JacquesThibs
GitHub: https://github.com/JayThibs
I'm going to assume that Shane Legg has thought about it more and read more of the existing work than many of us combined. Certainly, there are smart people who haven't thought about it much, but Shane is definitely not one of them. He only had a short 5-minute talk, but I do hope to see a longer treatment on how he expects we will fully solve necessary property 3.
From a Paul Christiano talk called "How Misalignment Could Lead to Takeover" (from February 2023):
Assume we're in a world where AI systems are broadly deployed and the world has become increasingly complex, with humans knowing less and less about how things work.
A viable strategy for AI takeover is to wait until there is certainty of success. If a 'bad AI' is smart, it will realize that a premature takeover attempt won't succeed, so it won't try; for now, that's not a problem.
So you lose once a takeover becomes possible and some threshold of AIs behaves badly. If all the smartest AIs you produce end up having the same problem, then takeover becomes more likely.
In other words, the smartest AIs will only take over if all the other AIs would also try to take over.
Additionally, this is more likely to happen in an unstable world that could come from:
There's this tempting feeling of training your AI against these takeover attempts ("hey, look at this bad AI who tried to take over the data center, don't do that!"), but you may just be training your model to learn that it needs to go much bigger if it wants to actually succeed at a takeover attempt.
Paul believes that if this kind of phenomenon is real, we can get compelling demonstrations of it in a lab (though it would take some imagination to bridge the examples in the lab to the wild). We'll still get demonstrations in the wild, but it's unclear whether they will be big enough to make humanity reconsider things.
Not that I know of, but I will at least consider periodically pinging him on X (if this post gets enough people’s attention).
EDIT: Shane did like my tweet (https://x.com/jacquesthibs/status/1785704284434129386?s=46), which contains a link to this post and a screenshot of your comment.
Just wanted to note that I had a similar question here.
Also, DM me if you want to collaborate on making this a real project. I've been slowly working towards something like this, but I expect to focus more on it in the coming months. I'd like to have something like a version 1.0 in 2-3 months from now. I appreciate you starting this thread, as I think it's ideal for this to be a community effort. My goal is to feed this stuff into the backend of an alignment research assistant system.
Hey Bogdan, I'd be interested in doing a project on this or at least putting together a proposal we can share to get funding.
I've been brainstorming new directions (with @Quintin Pope) this past week, and we think it would be good to use/develop some automated interpretability techniques we can then apply to a set of model interventions to see if there are techniques we can use to improve model interpretability (e.g. L1 regularization).
I saw the MAIA paper, too; I'd like to look into it some more.
Anyway, here's a related blurb I wrote:
Project: Regularization Techniques for Enhancing Interpretability and Editability
Explore the effectiveness of different regularization techniques (e.g. L1 regularization, weight pruning, activation sparsity) in improving the interpretability and/or editability of language models, and assess their impact on model performance and alignment. We expect we could apply automated interpretability methods (e.g. MAIA) to this project to test how well the different regularization techniques impact the model.
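To make the evaluation setup concrete, here's a very rough sketch of the kind of comparison loop I have in mind (the helper functions and intervention configs are placeholders, not existing APIs):

```python
# Hypothetical experimental loop: train (or fine-tune) models under different
# regularization settings, then score each one with an automated interpretability
# metric and a performance metric. `train_model`, `interpretability_score`, and
# `performance_score` are placeholders for whatever tooling we end up using.
from typing import Callable, Dict

def compare_interventions(
    train_model: Callable[[dict], object],
    interpretability_score: Callable[[object], float],
    performance_score: Callable[[object], float],
) -> Dict[str, Dict[str, float]]:
    """Train one model per intervention and record interpretability vs. performance."""
    interventions = {
        "baseline": {},
        "l1_weights": {"l1_weight_coeff": 1e-5},
        "l1_activations": {"l1_act_coeff": 1e-4},
        "weight_pruning": {"prune_fraction": 0.5},
    }
    results = {}
    for name, config in interventions.items():
        model = train_model(config)
        results[name] = {
            "interpretability": interpretability_score(model),
            "performance": performance_score(model),
        }
    return results
```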
In some sense, this research is similar to the work Anthropic did with SoLU activation functions. Unfortunately, they needed to add layer norms to make the SoLU models competitive, which seems to have hidden the superposition away in other parts of the network, making SoLU unhelpful for making the models more interpretable.
That said, we can increase our ability to interpret these models through regularization techniques. A technique like L1 regularization should help because it encourages the model to learn sparse representations by penalizing non-zero weights or activations. Sparse models tend to be more interpretable as they rely on a smaller set of important features.
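For concreteness, here's a minimal sketch (not code from an existing project; the model and layer names are hypothetical) of what adding an L1 activation-sparsity penalty to a standard language-modeling loss might look like in PyTorch:

```python
# Sketch: cross-entropy LM loss plus an L1 penalty on intermediate MLP activations.
# Assumes `model` returns logits of shape (batch, seq, vocab) and exposes an
# iterable `model.mlp_layers` of the blocks we want to sparsify (placeholder names).
import torch

def lm_loss_with_l1(model, input_ids, labels, l1_coeff=1e-4):
    activations = []

    def hook(module, inputs, output):
        # Stash each layer's output so we can penalize its L1 norm below.
        activations.append(output)

    handles = [layer.register_forward_hook(hook) for layer in model.mlp_layers]
    try:
        logits = model(input_ids)
        ce = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), labels.view(-1)
        )
        l1 = sum(act.abs().mean() for act in activations)
        return ce + l1_coeff * l1
    finally:
        for h in handles:
            h.remove()
```

The coefficient would presumably need a sweep: too high and you tank performance, too low and the activations stay dense, so this is exactly the kind of trade-off the automated interpretability evaluation would need to measure.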
Whether this works or not, I'd be interested in making more progress on automated interpretability, along the lines you are proposing.
I'm hoping to collaborate with some software engineers who can help me build an alignment research assistant. Some (a little bit outdated) info here: Accelerating Alignment. The goal is to augment alignment researchers using AI systems. A relevant talk I gave. Relevant survey post.
What I have in mind also relates to this post by Abram Demski and this post by John Wentworth (with a top comment by me).
Send me a DM if you (or any good engineer) are reading this.
Hah, literally just what I did.
Hey Abram! I appreciate the post. We've talked about this at length, but this was still really useful feedback and re-summarization of the thoughts you shared with me. I've written up notes and will do my best to incorporate what you've shared into the tools I'm working on.
Since we last spoke, I've been focusing on technical alignment research, but I will dedicate a lot more time to LLMs for Alignment Research in the coming months.
For anyone reading this: If you are a great safety-minded software engineer and want to help make this vision a reality, please reach out to me. I need all the help I can get to implement this stuff much faster. I'm currently consolidating all of my notes based on what I've read, interviews with other alignment researchers, my own notes about what I'd find useful in my research, etc. I'll be happy to share those notes with people who would love to know more about what I have in mind and potentially contribute.
I'm currently ruminating on the idea of doing a video series in which I review code repositories that are highly relevant to alignment research to make them more accessible.
I want to pick out repos that are still useful even if their documentation is poor, then hop on a call with the author to go over the repo and record it, so there's at least something basic to use when navigating the repo.
This means there would be two levels: 1) an overview with the author sharing at least the basics, and 2) a deep dive going over most of the code. The former likely contains most of the value (lower effort for me, still gets done, better than nothing, points to the repo as a selection mechanism, and people can at least get started).
I am thinking of doing this because I think there may be repositories that are highly useful for new people but would benefit from some direction. For example, I think Karpathy and Neel Nanda's videos have been useful in getting people started. In particular, Karpathy's repos (e.g. nanoGPT) saw an order of magnitude more stars after the release of his videos (to be fair, he's famous, and star count is definitely not a perfect proxy for usage).
I'm interested in any feedback ("you should do it like x", "this seems low value for x, y, z reasons so you shouldn't do it", "this seems especially valuable only if x", etc.).
Here are some of the repos I have in mind so far:
Release Ordering
I shared the following as a bio for EAG Bay Area 2024. I'm sharing this here if it reaches someone who wants to chat or collaborate.
Hey! I'm Jacques. I'm an independent technical alignment researcher with a background in physics and experience in government (social innovation, strategic foresight, mental health and energy regulation). Link to Swapcard profile. Twitter/X.
CURRENT WORK
TOPICS TO CHAT ABOUT
POTENTIAL COLLABORATIONS
TYPES OF PEOPLE I'D LIKE TO COLLABORATE WITH