Research Engineering Intern at the Center for AI Safety. Helping to write the AI Safety Newsletter. Studying CS and Economics at the University of Southern California, and running an AI safety club there. Previously worked at AI Impacts and with Lionel Levine and Collin Burns on calibration for Detecting Latent Knowledge Without Supervision. 


Unfortunately I don't think academia will handle this by default. The current field of machine unlearning focuses on a narrow threat model where the goal is to eliminate the impact of individual training datapoints on the trained model. Here's the NeurIPS 2023 Machine Unlearning Challenge task:

The challenge centers on the scenario in which an age predictor is built from face image data and, after training, a certain number of images must be forgotten to protect the privacy or rights of the individuals concerned.

But if hazardous knowledge can be pinpointed to individual training datapoints, perhaps you could simply remove those points from the dataset before training. The more difficult threat model involves removing hazardous knowledge that can be synthesized from many datapoints which are individually innocuous. For example, a model might learn to conduct cyberattacks or advise in the acquisition of biological weapons after being trained on textbooks about computer science and biology. It's unclear to what extent this kind of hazardous knowledge can be removed without harming standard capabilities, but most of the current field of machine unlearning is not even working on this more ambitious problem.

This is a comment from Andy Zou, who led the RepE paper but doesn’t have a LW account:

“Yea I think it's fair to say probes is a technique under rep reading which is under RepE. Though I did want to mention, in many settings, LAT is performing unsupervised learning with PCA and does not use any labels. And we find regular linear probing often does not generalize well and is ineffective for (causal) model control (e.g., details in section 5). So equating LAT to regular probing might be an oversimplification. How to best elicit the desired neural activity patterns requires careful design of 1) the experimental task and 2) locations to read the neural activity, which contribute to the success of LAT over regular probing (section 3.1.1).

In general, we have shown some promise of monitoring high-level representations for harmful (or catastrophic) intents/behaviors. It's exciting to see follow-ups in this direction which demonstrate more fine-grained monitoring/control.”

Nice, that makes sense. I agree that RepE / LAT might not be helpful as terminology. “Unsupervised probing” is more straightforward and descriptive.

What's the relationship between this method and representation engineering? They seem quite similar, though maybe I'm missing something. You train a linear probe on a model's activations at a particular layer in order to distinguish between normal forward passes and catastrophic ones where the model provides advice for theft. 

Representation engineering asks models to generate both positive and negative examples of a particular kind of behavior. For example, the model would generate outputs with and without theft, or with and without general power-seeking. You'd collect the model's activations from a particular layer during those forward passes, and then construct a linear model to distinguish between positives and negatives. 

Both methods construct a linear probe to distinguish between positive and negative examples of catastrophic behavior.  One difference is that your negatives are generic instruction following examples from the Alpaca dataset, while RepE uses negatives generated by the model. There may also be differences in whether you're examining activations in every token vs. in the last token of the generation. 
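For concreteness, here's a minimal sketch of the probing setup both methods share. All data below is synthetic: random vectors stand in for the transformer-layer activations, and the probe is a plain logistic regression trained by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # hypothetical hidden dimension

# Stand-ins for activations collected at one layer during forward passes:
# "pos" = catastrophic behavior, "neg" = benign instruction following.
pos = rng.normal(0.5, 1.0, size=(200, d_model))
neg = rng.normal(-0.5, 1.0, size=(200, d_model))
X = np.vstack([pos, neg])
y = np.concatenate([np.ones(200), np.zeros(200)])

# A linear probe is just logistic regression on the activation vectors.
w = np.zeros(d_model)
b = 0.0
lr = 0.1
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))  # predicted P(catastrophic)
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * np.mean(p - y)

# Recompute predictions with the final weights.
p = 1 / (1 + np.exp(-(X @ w + b)))
acc = np.mean((p > 0.5) == y)
print(f"probe training accuracy: {acc:.2f}")
```

The difference between the two setups is only in where the examples come from: the post's negatives are Alpaca instruction-following examples, while RepE's negatives are generated by the model itself. The probe is the same object in both cases.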

When considering whether deceptive alignment would lead to catastrophe, I think it's also important to note that deceptively aligned AIs could pursue misaligned goals in sub-catastrophic ways. 

Suppose GPT-5 terminally values paperclips. It might try to topple humanity, but there's a reasonable chance it would fail. Instead, it could pursue the simpler strategies of suggesting users purchase more paperclips, or escaping the lab and lending its abilities to human-run companies that build paperclips. These strategies would offer a higher probability of a smaller payoff, even if they're likely to be detected by a human at some point. 

Which strategy would the model choose? That depends on a large number of speculative considerations, such as how difficult it is to take over the world, whether the model's goals are time-discounted or "longtermist," and whether the model places any terminal value on human flourishing. But in the space of all possible goals, it's not obvious to me that the best way to pursue most of them is world domination. For a more thorough discussion of this argument, I'd strongly recommend the first two sections of Instrumental Convergence?

Very nice, these arguments seem reasonable. I'd like to make a related point about how we might address deceptive alignment which makes me substantially more optimistic about the problem. (I've been meaning to write a full post on this, but this was a good impetus to make the case concisely.)

Conceptual interpretability in the vein of Collin Burns, Alex Turner, and Representation Engineering seems surprisingly close to allowing us to understand a model's internal beliefs and detect deceptive alignment. Collin Burns's work was very exciting to at least some people because it provided an unsupervised method for detecting a model's beliefs. Collin's explanation of his theory of impact is really helpful here. Broadly, because it allows us to understand a model's beliefs without using any labels provided by humans, it should be able to detect deception in superhuman models where humans cannot provide accurate feedback. 

Over the last year, there's been a lot of research that meaningfully extends Collin's work. I think this could be used to detect deceptively aligned models if they arise within the next few years, and I'd be really excited about more people working on it. Let me highlight just a few contributions:

  • Scott Emmons and Fabien Roger showed that the most important part of Collin's method was contrast pairs. The original paper focused on "logical consistency properties of truth" such as P(X) + P(!X) = 1. While this is an interesting idea, its performance is hardly better than a much simpler strategy relegated to Appendix G.3: taking the average difference between a model's activations at a particular layer for many contrast pairs of the form X and !X. Collin shows this direction empirically coincides with truthfulness.  
  • Alex Turner and his SERI MATS stream took seriously the idea that contrast pairs could reveal directions in a model's latent space which correspond to concepts. They calculated a "cheese vector" in a maze-solving agent as the difference in the model's activations between when the cheese was present and when it was missing. By adding and subtracting this vector to future forward passes of the model, its behavior could be controlled in surprising ways. GPT-2 can also be subtly controlled with this technique.
  • Representation Engineering starts with this idea and runs a large number of empirical experiments. It finds a concept of truthfulness that dramatically improves performance on TruthfulQA (36% to 54%), as well as concepts of power-seeking, honesty, and morality that can control the behavior of language agents. 
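The shared core of these three lines of work (a diff-in-means direction computed from contrast pairs, then steering by adding or subtracting that direction) can be sketched numerically. Everything below is synthetic; a real implementation would read activations from a chosen layer of a transformer.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 32

# Pretend activations for contrast pairs: statement X vs. its negation !X.
# A planted offset along the first basis vector plays the role of a concept.
e0 = np.eye(d_model)[0]
acts_x = rng.normal(0.0, 1.0, size=(100, d_model)) + 1.5 * e0
acts_not_x = rng.normal(0.0, 1.0, size=(100, d_model)) - 1.5 * e0

# Diff-in-means over contrast pairs gives a candidate concept direction
# (the Appendix G.3 baseline; the cheese vector is the same construction).
direction = (acts_x - acts_not_x).mean(axis=0)
direction /= np.linalg.norm(direction)

# Steering: add the direction to an activation in a later forward pass.
h = rng.normal(size=d_model)       # some future activation
h_steered = h + 4.0 * direction    # push the activation toward the concept

# The steered activation scores higher along the concept direction.
print(h @ direction, h_steered @ direction)
```

The same arithmetic supports both uses: projecting onto `direction` is monitoring (an unsupervised probe), while adding a multiple of `direction` is control (steering).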

These are all unsupervised methods for detecting model beliefs. They've empirically improved performance on many real world tasks today, and it seems possible that they'd soon be able to detect deceptive alignment. 

(Without providing a full explanation, these two related papers (1, 2) are also interesting.)

Future work on this topic could attempt to disambiguate between directions in latent space corresponding to "human simulators" versus "direct translation" of the model's beliefs. It could also examine whether these directions are robust to optimization pressure. For example, if you train a model to beat a lie detector test based on these methods, will the lie detector still work after many rounds of optimization? I'd also be excited about straightforward empirical extensions of these unsupervised techniques applied to standard ML benchmarks, as there are many ways to vary the methods and it's unclear which variants would be the most effective. 

#5 appears to be evidence for the hypothesis that, because pretrained foundation models understand human values before they become goal-directed, they’re more likely to optimize for human values and less likely to be deceptively aligned.

Conceptual argument for the hypothesis here:

Kevin Esvelt explicitly calls for not releasing future model weights. 

Would sharing future model weights give everyone an amoral biotech-expert tutor? Yes. 

Therefore, let’s not.

The Nuclear Threat Initiative has a wonderfully detailed report on AI biorisk, in which they more or less recommend that AI models which pose biorisks should not be open sourced:

Access controls for AI models. A promising approach for many types of models is the use of APIs that allow users to provide inputs and receive outputs without access to the underlying model. Maintaining control of a model ensures that built-in technical safeguards are not removed and provides opportunities for ensuring user legitimacy and detecting any potentially malicious or accidental misuse by users.

More from the NTI report:

A few experts believe that LLMs could already or soon will be able to generate ideas for simple variants of existing pathogens that could be more harmful than those that occur naturally, drawing on published research and other sources. Some experts also believe that LLMs will soon be able to access more specialized, open-source AI biodesign tools and successfully use them to generate a wide range of potential biological designs. In this way, the biosecurity implications of LLMs are linked with the capabilities of AI biodesign tools.
