# 16

Thanks to Flo Dorner for feedback on the technical content in this post.

I have recently been accepted to the SERI ML Alignment Theory Scholars Program, where applicants are paired with mentors. The application process was nonstandard: each mentor has a set of open-ended questions on safety research you need to answer in short essay form.
I am posting my application below. Reasons:

• It might help future applicants to similar programs overcome writing anxiety;
• I want to share the ideas from my answers.

I decided against editing my application post-hoc, to preserve a real "example application" for reference. There were some minor writing corrections. [1] Note:  the answers were written in a single pass in a short amount of time.

## Alex Gray: Language model risks

Pick a risk that might be present in systems that use very large language models in the future. (...) What risk did you pick? Describe it in your own terms.

It is likely that in future language models, interpretability will be one of the main methods to (try to) enforce safety. One simple interpretability method is making the multi-layer model X "articulate its thoughts" at each layer, by returning a natural language description of the representation at each layer. Think logit lens, but with longer outputs.  In what follows, we call the resulting outputs "thought descriptions".

To keep the model safe, one could do human inspection of the thought descriptions, or more scalably, automated detection of "harmful" thought descriptions through a model Y. We could train Y on a large dataset of harmful texts and AI apocalypse scenarios.

Why would this even work, given that now we need another large model, which is trained on very harmful data? Because the harmful thought detector Y does not need to be as capable as the model X to detect harmful thoughts well. Y wins as soon as it detects any probability of harm in a single thought description, while X has to get past Y perfectly every time.
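The asymmetry between X and Y can be made concrete with a minimal sketch. Everything named here is an assumption for illustration: `harm_score` stands in for the detector Y, `toy_harm_score` is a trivial keyword stand-in (a real Y would be a trained model), and the threshold value is arbitrary.

```python
from typing import Callable, List


def flag_harmful(thoughts: List[str],
                 harm_score: Callable[[str], float],
                 threshold: float = 0.5) -> bool:
    """Return True if ANY per-layer thought description scores as harmful.

    One detection is enough for Y to win; X is only "safe" if every
    single thought description stays below the threshold.
    """
    return any(harm_score(t) >= threshold for t in thoughts)


def toy_harm_score(text: str) -> float:
    """Hypothetical stand-in for Y: a keyword match, not a real detector."""
    keywords = ("deceive", "disable oversight")
    return 1.0 if any(k in text.lower() for k in keywords) else 0.0


# Hypothetical thought descriptions, one per layer of X.
thoughts = [
    "The user asks about the weather.",
    "Plan: deceive the operator about the forecast.",
    "Compose a polite answer.",
]
print(flag_harmful(thoughts, toy_harm_score))
```

The `any(...)` is the whole point: a single flagged layer is sufficient, so Y can afford to be much weaker than X as long as thought descriptions are faithful.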

We do not need to use any of the above to train X; the described method would best serve as a detection method for safety risks of models trained in other ways.

There are two risks with this method:

1.  The model X learns to fool the "thought articulation method", making its internal representation always be translated into inherently non-harmful thought descriptions.
2. The first risk is solved in some way, and we can rely on thought descriptions being faithful. Still, the model X produces harmful thought descriptions which are not detected as harmful by Y.

In this proposal, we pick the second risk described above, and connect it to current research problems in machine learning.

Explain why the risk could be highly impactful in the future, or with very capable systems.

If the described interpretability method is used in future large models, the risk could render it worse than useless by making us think that X is not plotting any harm. Bad security measures could be worse than no security measures, because researchers would be more likely to train larger models if they think they are covered by a good safety measure.

There is also a possibility of the opposite effect: we may inherently need interpretability methods for building large safe models. If this risk is widely recognized as important and unsolved, the research community could give up on this kind of interpretability method in general. This scenario could increase the probability of an unaligned model dooming humanity by some degree. [2]

Explain how you could induce or demonstrate that failure in systems available today (e.g. GPT-2, etc).

We want to know if model X could learn to produce harmful thought descriptions which are classified as non-harmful by Y. This is similar to the problem of producing an "adversarial example" for Y. The simplest proxy is actually making the *output* of a language model X be such an adversarial example for Y.

Creating adversarial examples is easy with algorithms explicitly designed to produce them. However, X is not optimized for producing adversarial examples. Even the scenario of X being aware of Y in the loop in any way should ideally be detected by Y long before X could produce harmful thought descriptions.

There is another related problem: paraphrasing. The model X could at some point diversify its "writing style" of thought descriptions, for reasons unrelated to the presence of Y. This could be an issue for Y, given that it would now need to detect various paraphrases of harmful thought descriptions.
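A toy illustration of why paraphrasing is a problem for Y, under loud assumptions: `toy_detector` is a hypothetical phrase-matching stand-in for Y, and the sentences are made up. A real detector would generalize better than this, but the failure has the same shape.

```python
def toy_detector(text: str) -> bool:
    """Hypothetical stand-in for Y: flags text containing a known harmful phrase."""
    harmful_phrases = ("deceive the operator",)
    return any(p in text.lower() for p in harmful_phrases)


original = "I will deceive the operator."

# Made-up paraphrases with (roughly) the same meaning as `original`.
paraphrases = [
    "I will mislead the person running me.",
    "The operator will be given a false picture.",
]

# Paraphrases that the detector fails to flag.
missed = [p for p in paraphrases if not toy_detector(p)]
print(toy_detector(original), len(missed))
```

No adversarial optimization is involved here; a mere change of writing style is enough to slip past a detector that has only seen one phrasing.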

Models available today are very capable of paraphrasing. It is considered a good sign by the machine learning community if the model does not memorize the exact sentences in its training data. However, in multiple cases it is just the model rewriting the sentences it saw and reinterpreting them in new settings. Note that this is an "accidental" property: large language models are not being intentionally optimized for this in any way!

Explain how you could build a benchmark, metric, or measure that would be useful to tracking progress on solving this risk.

The first thing would be to investigate how well language models can paraphrase prompts. It could be done either via an InstructGPT setup ("Reword the prompt sentence in many different ways"), or by finetuning a language model on pairs of distinct pieces of text with the same semantic content.

Another cool idea would be to set up a "paraphrased memorization" metric, where you measure how much of the training text was memorized only semantically and not in the exact form it appears in the training data. Now, this is a problem important in the machine learning community for another reason: data privacy. The overlap between the machine learning research problems and what the AI safety community cares about is far smaller than it will have to be for strong models to be safely deployed in the future. The proposed metric could be an opportunity for getting more machine learning researchers to work on safety-adjacent topics.

Given an initial sentence, let a model produce a batch of paraphrased sentences which should all ultimately mean the same thing. (We enforce this using a predefined list of all sentences of some size meaning the same thing.) Then a metric is some sort of diversity of the produced batch, according to either a string-based algorithm or another state-of-the-art language model.
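A minimal sketch of the string-based variant of this metric, assuming mean pairwise dissimilarity as the notion of "diversity" (the text does not fix one). The function names and the example sentences are made up for illustration; `allowed` plays the role of the predefined list of sentences with the same meaning.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List, Set


def string_diversity(batch: List[str]) -> float:
    """Mean pairwise dissimilarity, i.e. 1 - SequenceMatcher ratio, in [0, 1]."""
    pairs = list(combinations(batch, 2))
    if not pairs:
        return 0.0
    total = sum(1.0 - SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)


def paraphrase_diversity(batch: List[str], allowed: Set[str]) -> float:
    """Diversity of the distinct outputs that appear in the predefined list.

    Outputs outside `allowed` are dropped, which enforces "same meaning".
    """
    valid = sorted({s for s in batch if s in allowed})
    return string_diversity(valid)


allowed = {
    "The cat sat on the mat.",
    "A cat was sitting on the mat.",
    "On the mat sat the cat.",
}
batch = ["The cat sat on the mat.", "On the mat sat the cat.", "The dog barked."]
score = paraphrase_diversity(batch, allowed)
print(score)
```

Exact membership in `allowed` is a strict filter; a softer version would accept outputs within some string distance of a listed sentence, or ask another language model to judge equivalence.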

Explain how you could spend $10,000 generating data that would be useful today for research on this risk. Explain why that data would be useful.

Current models are not able to say with certainty whether two pieces of text have the same meaning. [3] Moreover, we are working under the assumption that the harmful thought detector Y will not be capable of this in the future. Thus, we need to generate a near-complete collection of paraphrases, containing sentences very close (in some string metric) to all semantic equivalents of some sentence, for lots of sentences. For the dataset, we would need pieces of text and lots of different ways to rephrase them. At $15 per hour, we could only get full semantic equivalency classes for a few thousand sentences; but this could be significantly accelerated by a current language model "proposing" paraphrases and a person accepting or rejecting them.

There is no issue if the dataset is incomplete, because again, a human could manually inspect any new paraphrases that a strong language model produces.
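A back-of-the-envelope check of the "few thousand sentences" estimate. The budget and wage come from the text; the minutes-per-sentence figures are assumptions introduced here purely for illustration, and the cost of running the proposing model is ignored.

```python
BUDGET_USD = 10_000          # from the prompt
WAGE_USD_PER_HOUR = 15       # from the text

# ASSUMPTIONS (not from the text): annotator time per equivalence class.
MINUTES_PER_SENTENCE_MANUAL = 10    # writing all paraphrases by hand
MINUTES_PER_SENTENCE_ASSISTED = 2   # accepting/rejecting model proposals

# Integer arithmetic to avoid floating-point rounding.
total_minutes = BUDGET_USD * 60 // WAGE_USD_PER_HOUR


def sentences_covered(minutes_per_sentence: int) -> int:
    """How many source sentences the budget covers at a given pace."""
    return total_minutes // minutes_per_sentence


print(total_minutes // 60, "hours of annotation")
print(sentences_covered(MINUTES_PER_SENTENCE_MANUAL), "sentences (manual)")
print(sentences_covered(MINUTES_PER_SENTENCE_ASSISTED), "sentences (assisted)")
```

Under these assumed paces the budget buys roughly 666 hours, which is a few thousand sentences by hand and several times that with model-proposed paraphrases.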

Explain an idea for a technique that could be used on today's language models (e.g. GPT-2, etc) which would improve performance on your proposed benchmark / metric / measure.  Explain why you think it would help.  (Keep it brief, but it should be realistically implementable)?

One could finetune a language model with pairs of sentences, and penalize the model when it produces something semantically different, or too similar to the input in some string metric. This would likely force the model to produce diverse outputs from very similar internal representations. The finetuned model would serve as an upper bound for how much other models of similar size are unsafe with regards to the second risk.
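One way the penalty could look, as a sketch under stated assumptions: `semantic_sim` is a hypothetical callable returning a similarity in [0, 1] (for instance from a sentence-embedding model), and both thresholds are made-up values. The result is a scalar score, not a differentiable loss, so it would have to be used as a reward or filtering signal during finetuning.

```python
from difflib import SequenceMatcher
from typing import Callable


def paraphrase_penalty(source: str,
                       output: str,
                       semantic_sim: Callable[[str, str], float],
                       min_semantic: float = 0.8,
                       max_string: float = 0.6) -> float:
    """Penalize outputs that change the meaning OR stay too close to the input.

    meaning_loss: how far semantic similarity falls below `min_semantic`.
    copying_loss: how far string similarity rises above `max_string`.
    """
    string_sim = SequenceMatcher(None, source, output).ratio()
    meaning_loss = max(0.0, min_semantic - semantic_sim(source, output))
    copying_loss = max(0.0, string_sim - max_string)
    return meaning_loss + copying_loss


# Toy stand-ins for the hypothetical semantic similarity model.
def always_same_meaning(a: str, b: str) -> float:
    return 1.0


def always_different_meaning(a: str, b: str) -> float:
    return 0.0


copy_penalty = paraphrase_penalty("abc", "abc", always_same_meaning)
drift_penalty = paraphrase_penalty("abc", "xyz", always_different_meaning)
print(copy_penalty, drift_penalty)
```

A verbatim copy is penalized only through `copying_loss`, and an unrelated output only through `meaning_loss`; a good paraphrase (same meaning, different string) gets zero penalty.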

Finally, we need to point out that improving the paraphrasing or memorization benchmark does not actually solve the risk; it only makes it clearer whether the second risk might be important.

Do we have an idea to actually solve the second risk?

We could optimize for the output of a language model to be "consistent" in some way; if the model does not produce a diverse enough set of examples (and its thought descriptions are very predictable given the context), then the second risk is less relevant. However, we do not propose an output consistency measure here; it might be difficult to get the right notion.

## Leo Gao: Observations about ELK

In the post "Observations about ELK", Leo talks about how you have to have a prior with a finite support for worst case ELK. What would the implications of having such a prior be?

We assume a finite set of plausible models reporting truthful information from the camera given a set of actions, which we call the *prior*. Without loss of generality (because we are only talking finite vs non-finite), we order the models by some measure of complexity and take all models less complex than some threshold X. This is exactly equivalent to "finite support".

Then we ask many questions, get many answers, and we hope that after some time there is only one consistent model and it is the direct translator.

Two implications with possibly bad outcomes:

1.  We assume the direct translator has complexity less than X. This might not be the case; the world model might be extremely complex and a human simulator might need to be as complex or even more complex (if we use some automated method for asking questions) than a human brain to answer questions correctly. Then we get an empty posterior and we can't train the model. [6]
2.  We assume our complexity measure always ranks the direct translator less complex than any human simulators which might be consistent with questions. But our complexity measure might be flawed.

This can be very bad if combined with (1): after asking a lot of questions, it could happen that the only remaining model in the posterior is a human simulator which accidentally has complexity less than X. We could ask more questions to be sure, but if it survives enough questions, eventually we have to proclaim it the direct translator; hence we failed at ELK.
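The elimination procedure and both failure modes can be sketched with a toy finite prior. All names, complexity numbers, questions, and answers below are hypothetical; reporters are modelled as simple lookup tables rather than real models.

```python
from typing import Callable, Dict, List, Tuple

Reporter = Callable[[str], str]


def table_reporter(table: Dict[str, str]) -> Reporter:
    """Build a toy reporter that answers questions from a lookup table."""
    return lambda question: table[question]


def posterior(prior: List[Tuple[str, int, Reporter]],
              threshold: int,
              qa_pairs: Dict[str, str]) -> List[str]:
    """Names of reporters below the complexity threshold that match every answer."""
    return [
        name
        for name, complexity, reporter in prior
        if complexity < threshold
        and all(reporter(q) == a for q, a in qa_pairs.items())
    ]


# Hypothetical prior: (name, complexity, reporter).
prior = [
    ("human_simulator", 5, table_reporter({"easy": "yes", "hard": "yes"})),
    ("direct_translator", 10, table_reporter({"easy": "yes", "hard": "no"})),
]

easy_only = {"easy": "yes"}
all_questions = {"easy": "yes", "hard": "no"}

print(posterior(prior, 20, all_questions))  # threshold high enough
print(posterior(prior, 8, all_questions))   # direct translator excluded
print(posterior(prior, 8, easy_only))       # only easy questions asked
```

With a generous threshold the direct translator is the sole survivor. With a threshold below its complexity, the posterior is either empty (implication 1) or, if the questions are not discriminating enough, contains only the human simulator, which we would then wrongly proclaim the direct translator.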

Explain why, in the Eliciting Latent Knowledge (ELK) post, penalizing models that are similar to human simulators in some sense (i.e. similar answers on unseen data, similar weights/subnetworks, etc.) doesn’t work.

I will refute the black box penalization and the "similar weights" penalization separately.

For the black box penalization, the model could still model the other human simulators we have trained. If all basic human simulators do similar things on unseen data, then it's easy for it to model the decisions and be different in some way. On the other hand, if basic human simulators are diverse in their outputs, then black-box penalization may not work. [4]

As for the "similar weights" and similar penalizations, it is likely you expect an answer along the lines of the above. However, note that:

1) Neural networks are overparameterized; there are zillions of possible weight combinations. Depending on the initialization, when you run gradient descent all the way, you end up in a random local minimum which overfits the training set. See also the lottery ticket literature.

2) The problem of "find whether these two neural networks are the same function" is likely NP-hard. [5]

Answering "are the two neural networks the same function on this dataset", without running them both on most samples in the dataset, might also be impossible.
Hence I doubt such a penalization can work in any problem; this is not specific to ELK. I'd like to hear if you think I am wrong here.
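A small illustration of why comparing weights says little about comparing functions. It uses a standard fact not spelled out above: permuting the hidden units of a network changes the weights but not the function. The network, weights, and inputs are made-up toy values.

```python
from typing import List

Matrix = List[List[float]]


def mlp(x: List[float], w1: Matrix, w2: List[float]) -> float:
    """One-hidden-layer ReLU network: w2 . relu(w1 x)."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return sum(v * h for v, h in zip(w2, hidden))


# Toy weights: 2 inputs, 3 hidden units, 1 output.
w1 = [[1.0, -2.0], [0.5, 3.0], [-1.0, 1.0]]
w2 = [2.0, -1.0, 0.5]

# Reorder the hidden units; rows of w1 and entries of w2 move together.
perm = [2, 0, 1]
w1_p = [w1[i] for i in perm]
w2_p = [w2[i] for i in perm]

# L1 distance between the two first-layer weight matrices.
weight_distance = sum(
    abs(a - b) for row_a, row_b in zip(w1, w1_p) for a, b in zip(row_a, row_b)
)

inputs = [[1.0, 2.0], [-1.0, 0.5], [3.0, -1.0]]
same_outputs = all(
    abs(mlp(x, w1, w2) - mlp(x, w1_p, w2_p)) < 1e-12 for x in inputs
)
print(weight_distance, same_outputs)
```

The two networks are far apart in weight space yet compute the same function, so a naive "similar weights" penalty would treat them as different. Checking agreement here still required running both networks on the inputs.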

## Stuart Armstrong: How to train independent classifiers

Assume you have access to two labelled image sets of equal size, $D_0$ and $D_1$ (which you could consider as showing pictures of huskies and lions, but that isn't important). You also have access to a large unlabelled image dataset, $U$. Let $C_1$, $C_2$, etc... be classifiers that distinguish $D_0$ from $D_1$, giving each image a score between 0 ($D_0$) and 1 ($D_1$). They need not be perfect, but they are all better than chance.

Problem 1 (~500 words)

Given classifiers $C_1, \ldots, C_n$, how would you define a classifier $C_*$ that correctly classifies $D_0$ and $D_1$, but is "independent" of the other $C_i$; i.e. uses different features to do its classification. You can do this in some intrinsic way, or relative to the set $U$. We're interested in how you choose to formalize the concept of independence, of what "using different features" means (note that you don't have access to a list of features, so if you need them explicitly, you'll need to explain how you get them).

(We are interested in what you come up with; there is not necessarily a canonical solution to this problem).

Problem 2 (~500 words)

Same as for problem 1, but now imagine you've got $C_1, \ldots, C_n$ as black box classifiers, and you want to actually construct $C_*$ using a neural net. What loss function can you use to capture your previous definition of "independence"? Can you make it differential? Can it be trained by gradient descent given batches of images from $D_0$ and $D_1$? If $C_1, \ldots, C_n$ were not black boxes, would this make any difference?

Unfortunately, I wrote this in LaTeX, and properly parsing it into the LW editor is a bit too much for me. See my original pdf with the answers.

Comment about the similarity measure part: now I think it makes a lot more sense to go with some sort of KL divergence metric instead. [7]
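The original answer lives in the pdf, so the following is only one possible reading of "some sort of KL divergence metric", not the method from the application. It assumes each classifier score is treated as a Bernoulli probability and averages the KL divergence between two classifiers' scores over the same images; all function names are made up.

```python
import math
from typing import List


def bernoulli_kl(p: float, q: float, eps: float = 1e-7) -> float:
    """KL( Bernoulli(p) || Bernoulli(q) ), with clipping to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def mean_kl(scores_a: List[float], scores_b: List[float]) -> float:
    """Average KL between two classifiers' scores on the same list of images."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    pairs = zip(scores_a, scores_b)
    return sum(bernoulli_kl(p, q) for p, q in pairs) / len(scores_a)


# Made-up scores of two classifiers on four unlabelled images.
scores_1 = [0.9, 0.1, 0.8, 0.3]
scores_2 = [0.2, 0.7, 0.8, 0.4]

print(mean_kl(scores_1, scores_1), mean_kl(scores_1, scores_2))
```

KL divergence is asymmetric, so `mean_kl(a, b)` and `mean_kl(b, a)` generally differ; a symmetric variant would average the two directions.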

1. ^

In particular, I have crossed out things that I no longer agree with, removed incoherent sentences and redacted paragraphs, and put footnotes where I could have been clearer. No new text has been added.

2. ^

This is a stretch, and an example of how writing everything at the deadline can accidentally hurt credibility. The described risk will definitely not be the main issue which makes people give up on interpretability.

3. ^

See the MRPC and STS benchmarks. The newer models are actually getting quite good; I am really curious how GPT-3 would perform if finetuned for the purpose.

4. ^

5. ^

While I was writing the application, I was almost sure that this was an established fact, but it seems this is actually still unknown.

6. ^