Dhruv Nathawani

Backdoors have universal representations across large language models

Dec 6, 2024•18

14 karma

Member for a year

Dhruv Nathawani — LessWrong

Backdoors have universal representations across large language models

by Narmeen Oozeer, Dhruv Nathawani, Nirmalendu Prakash, Amirali Abdullah

This work was done by Narmeen Oozeer as a research fellow at Martian, under an AI safety grant supervised by PIs Amirali Abdullah and Dhruv Nathawani. Special thanks to Sasha Hydrie, Chaithanya Bandi and Shriyash Upadhyay at Martian for suggesting research into generalized backdoor mitigations, and for extensive logistical support and helpful discussions.

TLDR:

We show that representations across models of different sizes are weakly isomorphic when trained on similar data, and that we can "transfer" activations between them using autoencoders.
We propose a technique to transfer safe behavior from one model to another through the use of steering vectors.
Our representation transfer technique paves the way for transferring insights across
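As a rough illustration of the first two points, here is a toy sketch of activation transfer between two "models" of different widths. It is not the authors' method: a linear least-squares map stands in for the autoencoder, the activations are synthetic, and every name (`enc_src`, `enc_tgt`, `mapper`, `steer_src`) is hypothetical. It assumes both models encode the same underlying features linearly, which real networks only approximate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: both models linearly encode the same 4 latent features,
# but at different widths (8 for the source model, 16 for the target).
n, d_latent, d_src, d_tgt = 512, 4, 8, 16
latents = rng.normal(size=(n, d_latent))
enc_src = rng.normal(size=(d_latent, d_src))
enc_tgt = rng.normal(size=(d_latent, d_tgt))
acts_src = latents @ enc_src  # synthetic "source model" activations
acts_tgt = latents @ enc_tgt  # synthetic "target model" activations

# Fit a linear map from source to target activation space.
# (A stand-in for the autoencoder mentioned in the post.)
mapper, *_ = np.linalg.lstsq(acts_src, acts_tgt, rcond=None)

# Check the map on held-out samples.
test_latents = rng.normal(size=(64, d_latent))
pred = (test_latents @ enc_src) @ mapper
true = test_latents @ enc_tgt
rel_err = np.linalg.norm(pred - true) / np.linalg.norm(true)

# Transfer a steering vector: a feature direction expressed in the
# source space is pushed through the same map into the target space.
delta = np.array([1.0, 0.0, 0.0, 0.0])  # hypothetical "safe behavior" feature
steer_src = delta @ enc_src
steer_tgt = steer_src @ mapper

print(f"relative held-out error: {rel_err:.2e}")
```

In this idealized setting the mapped steering vector matches the direction the target model itself would use for that feature; with real activations the fit would be approximate and the map would likely need to be nonlinear.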

