Covert Malicious Finetuning
This post discusses our recent paper, Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation, and comments on its implications for AI safety.

What is Covert Malicious Finetuning?

Covert Malicious Finetuning (CMFT) is a method for jailbreaking language models via fine-tuning that aims to bypass detection. The following diagram gives an...

