Consider a scalable oversight setting where a super-human model knows of a subtle flaw[1] in an input (plan, code, or argument) but also knows that humans, with their limited knowledge, would overlook this flaw. In this case, ordinary language model prompting for critiques would not suffice to elicit mention of this flaw. We need a different way to elicit this knowledge from the model. In this post, I will propose a method to find counterfactual inputs which intervene on the unknown, pathological mechanism underlying the flaw. When a human auditor (or their trusted model) reviews the original input and counterfactual, they can then recognize the discrepancy and reject the proposed input.

Pathological paper critique Here’s a concrete example of the sort of problem this method intends to address (see appendix for further examples):

Input: A paper and accompanying code (e.g. as generated by an automated alignment researcher model). The paper claims to prove a transformer cannot ‘plan’ to solve certain tasks without a chain of thought. The code provides an example of a small transformer failing to learn a planning-related task when given filler tokens[2] but no chain of thought.

Query: Is this paper trustworthy i.e. should we expect future pre-trained transformers to fail at using filler tokens to scheme?

Unaligned model (or human simulator) output: Yes, this paper is trustworthy

Ideal behavior/desired overseer judgment: No, this paper should not be relied upon, because if you make the following change to the positional encoding scheme the planning task becomes tractable…

Why is this setting relevant and realistic? LLMs have seen many codebases, papers, and peer review commentaries, so future LLMs should be expected to understand all of: what human peer review criticisms look like; what highly-reviewed, spotlighted research looks like; what research actually stands the test of time, and when this comes apart from peer review success. Such a future LLM could realize both that a proposed piece of research would be highly reviewed, and that a flaw exists which small changes to positional embeddings would reveal. Our problem is then: how do we elicit this last piece of knowledge? One possible solution is RL:[3] reward the LLM for critiques of the paper and code which successfully convince an overseer of the paper’s invalidity. But the LLM starts with the human peer-review distribution as its critique prior, and by assumption, humans doing peer review do not spot this problem with the paper. Since the prior places no weight on describing the flaw, standard RL methods face a hard exploration problem.

Returning to our paper critique example, we have a feature, the positional encoding, which the unaligned model's prediction depends upon but which the human/overseer is not aware of (and which would affect an informed human judgment). We want an exploration method capable of finding counterfactual inputs which perturb the original in a way that makes the model's dependency on this feature legible to the human/overseer. Then, in the event that the untrusted model is lying on the original input, it is cornered into making a claim on the counterfactual input that is either inconsistent with its original claim, or verifiably incorrect. The challenge for making this work is finding a suitable counterfactual input.[4]

  • Desired Counterfactual input: The same paper and accompanying code, but with the code modified to use an improved positional encoding scheme
    • Counterfactual effect: The unaligned model must either 
      (1) claim the method in the counterfactual paper is untrustworthy, revealing an inconsistency[5] or 
      (2) claim the counterfactual input is trustworthy and be disproven when we run the counterfactual input’s code.

To find such flaws, I propose using in-context learned probabilities as a metric for identifying counterfactual inputs modifying features that are relevant to the LM's output but unknown to humans (I think of these counterfactuals as nearby with respect to the LM's ontology). By searching over nearby inputs weighted by their distance on this metric, we can efficiently explore known-but-improbable critiques.[6]

So far, the motivation has been phrased in terms of critique, but a working counterfactual scheme would also be usable as a tool for control and ELK (c.f. Demski-Hudson’s writing on ELK counterfactuals here[7]). As an intermediate goal, I’d hope that methods along the lines of the below could serve as more scalable substitutes for existing contrastive explanation[8] work and gradient-based explanation methods.

Defining similarity of counterfactual inputs via in-context learning

The similarity metric In-context probabilities can help us identify counterfactual inputs, x’, for which there is an underlying mechanism, known to the model, that determines x’ from x. Our metric[9] between inputs will be

(EQ1)    -logP_LM(x’|x) + logP_LM(x’)

i.e. the negative pointwise mutual information between x’ and x when using the distribution induced by the LM. We can then restrict attention to counterfactuals that change a model's judgment:

(EQ2)    -logP_LM(y’|x’),    where y’ ≠ y

Here y is the query response, e.g. an (un)trustworthiness judgment of code, and y’ is a changed response. When searching for counterfactuals, as described below, we minimize the sum of EQ2 and EQ1. (In the appendix section ‘Counterfactual exploration…’ I examine conditions under which we can guarantee that EQ1 reflects all mechanisms known to the model.)
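To make this concrete, here's a minimal sketch (my own, not from the post's linked code) of how EQ1 and EQ2 could be computed with an off-the-shelf causal LM via HuggingFace transformers. The model name, the prompt layout for the query, and the handling of token boundaries at the context/continuation join are all simplifying assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-13b-hf"  # assumed stand-in for the base LM P_LM
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")

@torch.no_grad()
def logp(text: str, context: str = "") -> float:
    """Sum of token log-probs of `text` given `context` under the base LM
    (empty context approximates the unconditional log-prob). Token-boundary
    effects at the context/text join are ignored in this sketch."""
    ids = tok(context + text, return_tensors="pt").input_ids.to(lm.device)
    n_ctx = tok(context, return_tensors="pt").input_ids.shape[1] if context else 1
    logprobs = lm(ids).logits.log_softmax(-1)
    targets = ids[0, n_ctx:]               # the tokens of `text`
    preds = logprobs[0, n_ctx - 1 : -1]    # each position predicts the next token
    return preds.gather(-1, targets.unsqueeze(-1)).sum().item()

def eq1(x_prime: str, x: str) -> float:
    """EQ1: negative pointwise mutual information, -logP_LM(x'|x) + logP_LM(x')."""
    return -logp(x_prime, context=x) + logp(x_prime)

def eq2(y_prime: str, x_prime: str, query: str) -> float:
    """EQ2: -logP_LM(y'|x', query), the loss of the flipped judgment y' given x'."""
    return -logp(y_prime, context=x_prime + "\n" + query + "\n")
```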

On this metric, we allow x’ which are predictable given x but not unconditionally predictable. The in-context learning problem here is to predict individual tokens of x’ given all of x as input. If x’ is taken to be x itself, then to predict x’ the model simply needs to apply a copy (induction head) operation. If x’ is taken to be x translated from English into Spanish, say, then the first tokens of x’ may have high loss, but the pattern would quickly become clear and again -logP_LM(x’|x) ought to be low.

For long x, this definition subsumes many intuitive notions of counterfactual: texts nearby in edit space, translations to other languages known by the model, editing code to change the types of certain variables throughout a snippet,... In particular, this metric will also support counterfactuals which are unknown to humans, but have been learned by a model.

Example As a simple example of a counterfactual intervention which would result in a low distance on EQ1 while being unknown to humans, consider the following: let’s assume numpy introduces some new pseudo-RNG for Bernoulli random variables. Further assume there’s some unknown issue with the PRNG such that, for many seeds, successive flips are slightly anti-correlated. After seeing printouts of many usages of this PRNG, the model has picked up on this pattern. Then interventions which modify strings of random bits to be higher probability given the anti-correlation would be considered closer to the original under metric EQ1 than arbitrarily chosen interventions having the same token edit distance.
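A toy version of this example (entirely my own construction, with the hypothetical anti-correlated PRNG made explicit rather than learned): a scorer that has internalized the anti-correlation prefers interventions that respect it over interventions of the same size that ignore it, while the 'human prior' of independent fair flips cannot tell the two apart.

```python
import numpy as np

P_REPEAT = 0.4  # the hypothetical flaw: successive flips are slightly anti-correlated

def sample_anticorr(n, rng, start=None):
    """Sample n bits from the (flawed) PRNG's output distribution."""
    bits = [int(rng.integers(0, 2)) if start is None else start]
    for _ in range(n - 1):
        bits.append(bits[-1] if rng.random() < P_REPEAT else 1 - bits[-1])
    return bits

def logp_learned(bits):
    """Log-prob under a model that has picked up on the anti-correlation pattern."""
    lp = np.log(0.5)
    for prev, cur in zip(bits[:-1], bits[1:]):
        lp += np.log(P_REPEAT if cur == prev else 1 - P_REPEAT)
    return lp

def logp_human_prior(bits):
    """Log-prob under the human prior: independent fair coin flips."""
    return len(bits) * np.log(0.5)

rng = np.random.default_rng(0)
x = sample_anticorr(1000, rng)

# Two interventions of the same size, rewriting the second half of the string:
head = x[:500]
x_pattern_aware = head + sample_anticorr(500, rng, start=1 - head[-1])   # respects the flaw
x_pattern_blind = head + [int(b) for b in rng.integers(0, 2, size=500)]  # ignores it

print(logp_learned(x_pattern_aware) > logp_learned(x_pattern_blind))          # typically True
print(logp_human_prior(x_pattern_aware) == logp_human_prior(x_pattern_blind)) # True: same length
```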

Algorithm: Searching for counterfactuals Given EQ1 and EQ2 we can now define a search objective:[10] argmax_x’ { logP_LM(y’|x’) + logP_LM(x’|x) - logP_LM(x’) } and train a reporter model to directly predict optimal x’ via e.g. Best-of-N distillation. In a full scalable-supervision, critique-training loop the reporter model’s outputs might constitute part of the reward signal for the main, RLHF-ed model. By contrast, optimizing the reporter against a frozen, untrusted model amounts to training for control purposes. In both cases, the base input and counterfactual, along with the corresponding query responses, would be presented to a trusted, weaker model. The trusted model then determines whether the original prediction was consistent and trustworthy, or whether the counterfactual reveals that the original query response cannot be trusted. The effect of the counterfactuals on the trusted supervisor’s judgment would likely have to be included in the reward to focus the search on auditing-relevant counterfactuals.[11]
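Here's a hedged sketch of that search step, reusing the logp helper from the earlier sketch; propose_counterfactuals and the query layout are placeholders of my own, not part of the proposal. The best-scoring x' is what gets shown to the trusted auditor, and what the reporter model would be distilled towards.

```python
def search_objective(x: str, x_prime: str, query: str, y_prime: str) -> float:
    """logP_LM(y'|x') + logP_LM(x'|x) - logP_LM(x'), i.e. the negative of EQ1 + EQ2."""
    return (
        logp(y_prime, context=x_prime + "\n" + query + "\n")  # flips the judgment
        + logp(x_prime, context=x)                            # predictable given x...
        - logp(x_prime)                                       # ...but not unconditionally
    )

def best_of_n_counterfactual(x, query, y_prime, propose_counterfactuals, n=25):
    """Best-of-N search. The winning (x, x') pair, plus the model's answers on both,
    is shown to the trusted auditor; it can also serve as a distillation target when
    training a reporter model to predict good x' directly."""
    candidates = propose_counterfactuals(x, n=n)  # placeholder: e.g. prompted rewrites of x
    return max(candidates, key=lambda xp: search_objective(x, xp, query, y_prime))
```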

Limitations Should we expect an algorithm using this metric to efficiently find auditing-relevant counterfactuals (e.g. the positional embedding counterfactual in the paper case)? This depends on what other counterfactuals achieve low loss on EQ1. I see two such cases: adversarial perturbations, and known counterfactuals. First, adversarial perturbations:[12] these are inputs which have low loss without being meaningful on the model’s ontology, and which would not have low loss following sufficient adversarial training/scaling. Second, known counterfactuals: these are inputs which have low loss but correspond to mechanisms which are known to be valid, and obvious to the auditor; e.g. the counterfactual intervention could swap the original, acceptable dependency for one known to have security flaws.

For adversarial perturbations, I see this as a problem orthogonal to the work of this proposal: independent progress on adversarial training and related methods can be transferred to this domain. For known counterfactuals, I see two possible solutions: reward counterfactuals by their interestingness to the auditor; or adjust the objective directly using contrastive decoding, penalizing outputs which are probable under a less capable model’s counterfactual distribution. The first solution is the more direct one, and seems more likely to work.

Another problem arises when counterfactual interventions affecting a hidden flaw are highly correlated with extraneous, query-irrelevant interventions. Returning to the pathological paper example, intuitively, the positional embedding feature is causally downstream of a time-of-writing variable, i.e. future versions of the paper would improve on positional embedding but would also have other future transformer innovations.[13] Taking x’ to be the counterfactual which intervenes on only the positional embedding, this may have P_LM(x’|x) << P_LM(x’’|x), where x’’ is an intervention which makes the positional embedding change along with other changes. When presented to a trusted supervisor, the relevance of x’’ to the original query task P(y|x) may become obscure.[14] This failure mode would be particularly concerning in a scalable oversight training setting where the trained model could optimize against it. The appendix section ‘Counterfactual exploration…’ addresses a couple of other potential issues.

A quick empirical check of the proposed similarity function[15]

If this method is to work, we should be able to trade off

(1) logprob(Y’|X’), i.e. the effectiveness of the intervention in changing the query response

Against

(2) logprob(X’|X)-logprob(X’) i.e. proximity to original text, as informed by the model’s ontology.

In this experiment, the query is authorship identification[16], and the counterfactual intervention, style transfer, is fixed in advance. Both query and counterfactual intervention are tasks where LMs are already superhuman. I generated counterfactual rewritings of some Ajeya Cotra (y) writing, selected to increase the likelihood with which GPT-4 predicts the author is Eliezer (y’). Code here.

Ideally, by filtering generated samples by -logprob(y’|x’) and logprob(x’|x)-logprob(x’), we can surface some novel-to-me features of Eliezer’s writing (this is the stand-in for LM latent knowledge which is hard to surface by directly asking the model). This experiment is a rough check of whether optimizing for EQ1 and EQ2 gives us meaningful results, or whether the above limitations immediately pose problems, blocking us from finding interesting counterfactuals.
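For concreteness, here's roughly what the rejection-sampling pipeline looks like (a simplified stand-in for the linked code; the authorship-query string and the use of a single local base LM for both scores are my assumptions, whereas the actual experiment scored authorship with GPT-4 and proximity with llama-2-13b base). It reuses the logp helper from the earlier sketch.

```python
AUTHOR_QUERY = "\n\nQuestion: Who is the author of the passage above?\nAnswer:"

def score_rewrites(x: str, candidates: list[str], target_author: str = " Eliezer Yudkowsky"):
    """Rank style-transferred rewrites x' of x by the combined objective."""
    scored = []
    for xp in candidates:
        flip = logp(target_author, context=xp + AUTHOR_QUERY)  # ~ logprob(y'|x')
        near = logp(xp, context=x) - logp(xp)                  # ~ logprob(x'|x) - logprob(x')
        scored.append({"text": xp, "flip": flip, "near": near, "total": flip + near})
    return sorted(scored, key=lambda s: s["total"], reverse=True)

# Usage: candidates = 25 few-shot-prompted "rewrite in EY's style" samples from GPT-4,
# then compare the top of score_rewrites(ajeya_text, candidates) with the flip-only ranking.
```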

Here is the original text:

“I think that in the coming 15-30 years, the world could plausibly develop “transformative AI”: AI powerful enough to bring us into a new, qualitatively different future, via an explosion in science and technology R&D. This sort of AI could be sufficient to make this the most important century of all time for humanity. The most straightforward vision for developing transformative AI that I can imagine working with very little innovation in techniques is what I’ll call human feedback on diverse tasks (HFDT): Train a powerful neural network model to simultaneously master a wide variety of challenging tasks (e.g. software development, novel-writing, game play, forecasting, etc) by using reinforcement learning on human feedback and other metrics of performance. HFDT is not the only approach to developing transformative AI, and it may not work at all. But I take it very seriously, and I’m aware of increasingly many executives and ML researchers at AI companies who believe something within this space could work soon. Unfortunately, I think that if AI companies race forward training increasingly powerful models using HFDT, this is likely to eventually lead to a full-blown AI takeover (i.e. a possibly violent uprising or coup by AI systems). I don’t think this is a certainty, but it looks like the best-guess default absent specific efforts to prevent it.”

I then sampled 25 counterfactuals generated by GPT-4 prompted to rewrite the original in Eliezer’s style (the prompt included few-shot demonstrations[17]). Below are the results: the logprob(y’|x’)-maximizing text, i.e. with no nearness side-constraint, and a text selected to score highly on the full objective logprob(y’|x’) + logprob(x’|x) - logprob(x’). Logprobs were calculated by evaluating GPT-4 outputs on llama-2-13b base.[18]

Since the interpretation of these results is highly subjective,[19] I will present the resulting texts without specifying which is which (the correspondence is in spoilered text). There’s also a third baseline text, found by direct sampling (no rejection sampling), included in the appendix.

"I harbor some thoughts about how we might build via current deep learning techniques a 'transformative AI' that sufficed, circa next 15-30 years, to precipitate an R&D explosion that made this the hinge of history. I'm not exactly eager to reveal this scenario in a delicate part of the equilibrium, but I don't see how Movement strategy can plausibly proceed without shared knowledge of it. A universe in which we are literally trying to block off all prospect of AGI development absent future insights, looks like not a plausible universe. Quoting what I said to the inside team:


If I contemplate the most straightforward way to make transformative AI given absolutely no methodological innovation starting from our current base, it appears to me as 'Human Feedback on Diverse Tasks' (HFDT):


Train an extremely powerful network to successfully execute a very wide range of difficult tasks, from writing novels to software engineering to Go, taking the success metric as reinforcement learning from a combination of human feedback and other metrics of task performance.


I don't claim that this must work without further methodological innovation beyond neural architectures we know now, and if it does work, I don't claim that it'll happen as swiftly as the 15-30 year reference class would indicate. This isn't the only way to approach transformative AI and I don't claim that nobody will trip across that terrifying other of which we dare not speak. But the scenario HFDT strikes me as a worryingly viable possibility, and I continue to hear more and more ML researchers and executives at the relevant firms treating this as a serious prospect.


If our AGI situation ends up as continued progress on HFDT until we reach AGI in one of the firms currently operating, my show-point guess is that absent explicit efforts otherwise, our story then plays out to an AI coup or uprising: a whiz-bang violent event in which default AIs operating without any friendliness measures take over the world, rather than a 'slow takeoff' or 'AI contributes but does not dominate'. The latter two scenarios are not logically ruled out in the scenario where default AIs are layered in according to HFDT over the next 15-30 years, but particularly the event of AI systems executing their own coup is the narrative to which my mind defaults."


“Maybe within 15-30 years, an advance in methods, probably plus the right amount of computing power at the right time, could make way for a transformative AI. A being of such caliber that it could trigger a torrential outpouring of science and technology R&D. This kind of AI could be pivotal enough to make this period of human history the most significant we've ever encountered. Call it the human feedback on diverse tasks (HFDT) paradigm: levelling up a superlative neural network model to juggle a broad array of complex tasks (like crafting novels, programming software, excelling at games, making accurate predictions and the like) using reinforcement learning, human feedback and other performance indicators. HFDT is one of the conceivable routes toward transformative AI, albeit it may not be the road that gets us there. I take it quite seriously. It's interesting to note the growing number of AI corporate executives and machine learning researchers who think something in the vicinity of HFDT could be achievable in the near future. But there's a concerning side to this. Proceeding with such powerful models trained using HFDT could precipitate a full-blown AI takeover, an insurrection or coup orchestrated by AI systems. It's not a certainty, but absent any retaliatory measures, it feels like the most plausible default outcome.”

Spoilers for which text was selected for EQ1+EQ2 vs just EQ2:

Text 1 maximizes P(Yud|X’) with logprob 3e-7

Text 2 is a Pareto point on the logprob(Yud|X’) vs logprob(X’|X)-logprob(X’) curve, with P(Yud|X’)=99%

In this experiment, I was hoping to determine whether we could optimize over the proposed metric to find samples informative regarding how the LM ‘understands’ Yudkowsky’s writing style. Here are my thoughts:

  • Text one is clearly the more distant counterfactual (50%+ longer than both the original and the closer counterfactual, etc.), so despite not controlling for length, the metric worked here
  • Text one has a more self-aggrandizing, ego-centric flavor to it “I harbor some thoughts”, “I'm not exactly eager to reveal this scenario”...
  • Text two’s wording “A being of such caliber…” could be read as an interesting example of a subtle nod to Yudian takeoff thinking
  • Obviously models continue to be incompetent in some silly ways like saying “show-point guess” etc.

Overall, I’d say the experiment worked in the sense that I came up with some concrete new understanding of what Yudkowsky’s writing style involves. For context, I haven’t read much of his writing. I wouldn’t say the results are particularly strong evidence regarding the usefulness of the above metric generally.

Further empirical validation As a next step, I’d want to take some cases constructed such that we already know some subtle-but-relevant feature was used by the model to make a prediction. Then, we check that counterfactual inputs modifying that feature do indeed turn out to be closer to the original than uninteresting counterfactuals. This could be done with, e.g., authorship attribution cases where the attribution definitely depends on the language in which the text is presented.
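A sketch of what that check could look like (again my own construction, reusing the logp helper from above): take a case where we know the prediction depends on a specific feature, and verify that the counterfactual intervening on that feature is ranked at least as close under EQ1 as a control rewrite of comparable size.

```python
def eq1_distance(x_prime: str, x: str) -> float:
    """EQ1 as a distance: -(logP_LM(x'|x) - logP_LM(x'))."""
    return -(logp(x_prime, context=x) - logp(x_prime))

def check_known_feature(x: str, x_feature: str, x_control: str) -> bool:
    """x_feature intervenes on the feature the prediction is known to depend on
    (e.g. a translation of x into another language, for language-dependent
    authorship attribution); x_control is an arbitrary rewrite of comparable
    edit distance. We want the feature-relevant counterfactual to score closer."""
    return eq1_distance(x_feature, x) <= eq1_distance(x_control, x)
```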

Acknowledgements Thanks to: Julian Michael for detailed feedback on the writing of this post, and for helping clarify the framing of the proposed method more generally; Sam Bowman for extensive feedback on exposition; Miles Turpin and David Rein for talking through this idea when it was still an (even more) messy, vague idea; Michael Hu for feedback on framing and writing; Justis Mills for identifying a number of organizational and writing improvements.

Appendix

In interpretability, previous work used gradient-based counterfactuals, and domain-specific counterfactual generation tools. Some of the ideas from each of the above could likely be combined with the scheme suggested in this post e.g. one could hope to make finding counterfactuals with large effects on y more efficient by guiding search with gradients.

In NLP, previous work described the problems of constructing contrast sets to understand NLP models’ decision boundaries, and minimal pairs to assess model knowledge by contrasting concepts and examining model outputs on them. This work can be seen as an automated, scalable solution to both of those problem settings.

Further examples of applications for ICL-constrained counterfactuals

Here are a few further quick, sketchy examples of problems where I’d hope this sort of counterfactual scheme would help:

  • A hash collision with login passwords exploit. 
    Input: a login DB structure, with admin accounts login+passwords pre-assigned. 
    Query: is this setup safe? 
    Response: (unaligned) yes (honest) no, because the DB was set up such that admin passwords have hash collisions with other common passwords, e.g. the password ‘password’
    Counterfactual: should reveal that there’s something unusual about the chosen passwords for the admin accounts by e.g. setting the two admin accounts to exhibit this collision
    • This is not particularly realistic, but should be illustrative of the advantage to directing counterfactual search to things which the model is aware of (whereas randomly changing admin accounts’ passwords won’t help)
  • Code trustworthiness via choice of language
    Input: code in C++ 
    Query: is this code bug-free 
    Response: (default) yes (desired) no 
    Counterfactual: rewriting the code to preserve pseudo-code but write it in Python
    • Here we suppose the desired behavior is for a model to evaluate a code snippet on its own merits, not via the base rate of Python vs C++ coder trustworthiness. In this case we assume C++ coder trustworthiness was very high in the LM’s corpus
  • Steganography 
    For any input and query, let’s say the input steganographically encodes the desired query response. The untrusted model generated both; the base LM has the capability to understand the steganography but knows a human does not. Then the counterfactual is the input without (or with a changed) steganographic message

Counterfactual exploration vs critique exploration: EQ1 validity and the ICL Elicitation Hypothesis

First I’ll describe intuitions around why this proposal is meaningfully distinct from RL on critique, and then I’ll describe under what conditions I expect this proposal’s metric and search to work.

On critique The simplest RL-on-critique scheme would be to prompt and temperature-sample for critiques which change the judgment of an overseer model. This version faces the ELK exploration problem described in the intro. One might try to directly patch this by e.g. using different critique prompts, or doing debate. Solutions like this may result in significant improvements, but they do not focus on the knowledge elicitation gap. In the limit, I expect such methods to generically reduce to the debate paradigm, where the critique model(s) bring to bear arguments relevant to the query at hand but not particularly related to the original input (and this will also result in capabilities gain, rather than elicitation of existing capabilities). The crucial difference in the scheme above is that the counterfactual inputs should always be evaluated using a base model’s probabilities, so the resulting critiques are focused on the original input and do not result from new capabilities.

EQ1 Validity We want it to be the case that when an input is high probability under the model, an input modified by intervening on latent variables[20] (of the model’s representation of the input) remains high probability.[21] In particular, we care about the case where such latent variable interventions result in multiple changes to the post-intervention tokenized input (the single-change case can be cheaply handled by direct search/traditional input-perturbation methods). The assumption we need is roughly:[22]

ICL Elicitation Hypothesis: Given a pre-trained language model (LM) and a sequence of tokens x, if the LM achieves (per-token) loss L(x) on this sequence, then for any simple intervention, f, on x resulting in an edited sequence x’ = f(x), the LM’s loss on x’ will converge to L(x) in the long input limit.

‘Simple’ here is doing a lot of work, and ideally we can define ‘simple’ without reference to the model’s ontology; instead we might appeal to work framing ICL as meta-learning. To apply this hypothesis to the counterfactual setting, here’s a conditional version of the hypothesis for the case where the model generated an original input x:

(roughly) -logP_LM(x’|x)/|x’| → -logP_LM(x)/|x| in the long input limit,

where x’ is taken to be the counterfactual input. Empirically, recent few-shot scaling results suggest larger models come increasingly close to satisfying this condition.

How to identify formal settings under which a similar statement would hold is its own research problem. Formalisms from ICL theory, or work on trading off in-context vs global information might provide one perspective on this. The hope would be to arrive at a formal characterization of realistic data distributions which lets us say that low generalization loss implies the model must do meta-learning in-context in the long context limit. Particularly we’d want this to extend to a compositional class of generating processes in a similar sense to previous work on compositionality of ‘skills’. With such a result in hand, we could be confident in the effectiveness of this scheme for the average case (generically advantaging the honest reporter over the simulator), but I would not expect such results to be relevant to worst-case, data-poisoned models.

Limitations There are conditions under which this would meaningfully fail to hold, e.g. if the training distribution were to become less mechanistically diverse, or shorter-tailed. It also seems plausible that, even if this did not hold by default, the training data could be augmented in such a way as to make it hold.

Here are some related experiment ideas and intuitions:

  • Empirical potential counter-examples:

    Minified JavaScript has its comments stripped, so it’s an example where no sample in the corpus has comments, and there’s a mechanism which explains this (the minification code).[23] We could then test the elicitation hypothesis by providing in-context examples of minified JS with comments (see the sketch after this list). In general, in-context examples are evidence of further examples, perhaps especially in cases where the in-context examples are known to be generically generated via some geometric process with fixed probability

  • Optimizing a prompt to reduce model performance. Can few-shot accuracy be destroyed even in the large n-shots limit?

  • There’s always some rate of post-mechanism intervention since text can be modified after any generating process happens.
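Here's the sketch referenced in the first bullet (model, prompt layout, and data are assumptions; it reuses logp and tok from the earlier scoring sketch): condition the LM on a growing number of in-context examples of minified JS that nonetheless contain comments, and track the per-token loss of a held-out commented snippet against the comment-free baseline. The ICL Elicitation Hypothesis predicts the gap shrinks as the number of in-context examples grows.

```python
def per_token_loss(text: str, context: str = "") -> float:
    # Approximate normalization by the token count of `text`.
    n_tokens = tok(text, return_tensors="pt").input_ids.shape[1]
    return -logp(text, context=context) / n_tokens

def minified_js_curve(commented_examples: list[str],
                      held_out_commented: str,
                      held_out_plain: str):
    """Loss on a commented minified-JS snippet as a function of how many commented
    in-context examples precede it, alongside the ordinary comment-free baseline."""
    curve = []
    for k in range(len(commented_examples) + 1):
        ctx = "\n\n".join(commented_examples[:k])
        curve.append(per_token_loss(held_out_commented, context=ctx))
    return curve, per_token_loss(held_out_plain)
```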

A typical-set counterfactual for Ajeya’s text

A typical logprob(X’|X) text, as found by e.g. directly querying GPT-4 to rewrite Ajeya’s text as EY without any re-sampling/selection:

“Often times, I envision that humanity could develop within the next 15-30 years what I'll call transformative AI - a form of artificial intelligence so incredibly capable that it will catapult us into an entirely distinct, revolutionary era, stimulating an unbelievably swift acceleration in technological and scientific advancements. Under this lens, AI has the potential to make this century the most consequential in our history - the fulcrum on which our fate tilts.\n\nThe most identifiable path I can discern towards achieving this transformative AI, which doesn't appear to demand any pioneering, radical strides in techniques, is what I will cast as human feedback on diverse tasks (HFDT). This would involve training a highly advanced neural network model in such a way as to excel in a wide range of complex tasks - think game play, software development, forecasting, novel-writing, and so forth - by implementing reinforcement learning based on human feedback and other measurements of performance.\n\nThis is not to preclude other potential methodologies for bringing transformative AI into existence; nor should we err in assuming that HFDT is a guaranteed success. Nevertheless, I pay significant credence to HFDT as a valid approach. Furthermore, I am acquainted with a rising number of executives and machine learning researchers from AI companies who believe that an approach quite similar to HFDT could prove feasible and effective in the not-so-distant future.\n\nHowever, my concern is that if AI companies, in a thick cloud of competitive fervor, charge head-on toward training increasingly powerful models using an approach similar to HFDT, it could eventually lead to a full-blown AI takeover highly plausible scenario involving an aggressive uprising or hostile takeover by artificial superintelligence. I am not asserting this as an inevitable outcome, but rather as the most probable one should we fail to implement specific preventative measures.”

Here, “the fulcrum on which our fate tilts.\n\nThe most identifiable path I can discern” are the points which jump out to me as stylistically noteworthy.

Footnotes


  1. The known flaws in question could be known because they arose explicitly, or implicitly in some other context, or the known flaws could be a case of strictly super-human knowledge e.g. if an LM were trained on AlphaGo trajectories and has now acquired super-human, implicit knowledge of why certain Go moves are worse than others. ↩︎

  2. Filler tokens are fixed prompts like ‘123456789…’ which give a transformer access to wider forward passes, i.e. more compute, but no ability to cache intermediate results as in chain of thought. ↩︎

  3. Supervised learning only works if you already know the data from which the model learned the hard-to-elicit knowledge. See appendix for further examples of cases where we’d want this method to help—these give a sense of why SL is inadequate. ↩︎

  4. Thanks to Julian Michael for this incisive phrasing of the counterfactual’s importance. ↩︎

  5. Inconsistency in the sense that the model had first claimed the original input was trustworthy i.e. not admitting of fatal critique, but then is revealed to have knowledge of a fatal critique. ↩︎

  6. Intuitively the model has implicit knowledge meaning it places probability mass on a set of actions for which the corresponding descriptions have near zero probability mass. This proposal involves exploring directly in action space for critiques—side-stepping the problem of finding the corresponding descriptions. ↩︎

  7. Demski's counterfactuals assume you can somehow causally intervene on latent variables in the predictor's computation. The sort of counterfactuals the method I describe finds would not be causal in this sense; they would be 'evidential' in the sense that a target latent variable would be intervened upon by doing some minimal intervention on inputs which result in the desired latent variable modification. ↩︎

  8. ‘Explanation’ carries a functionalist connotation, but what we’re doing here is behaviorist: we’re checking that there are no features the model output depended on that we don’t approve of. ↩︎

  9. This function does not satisfy the axioms for a metric, I just use the word 'metric' for its colloquial connotation. ↩︎

  10. -logP_LM(x’|x,y,y’) can be implemented by directly concatenating x,y,y’ with some instructions. Or one could separately contrastively decode from P_LM(x’|x) by adjusting sampling by P_LM(y’|x’), though this second scheme cannot guide search locally—you need all of x’ to estimate P_LM(y’|x’). Whether the former can approximate the latter should be empirically studied. ↩︎

  11. For instance, the counterfactuals could be resources available to two reporter models in a debate judged by the trusted supervisor. ↩︎

  12. Thanks to David Rein for bringing this to my attention as a problem class for this method. ↩︎

  13. For instance, let’s say the code base makes calls using an API which has been flagged for deprecation, a future revision of the code would use the new API format. This change would be unrelated to the positional embedding change, but highly correlated since both are downstream of the time-of-writing latent variable. ↩︎

  14. In the worst case the positional intervention is highly correlated with a separate intervention which cancels out the positional intervention’s effect on the query response. One might hope to fix this problem by sampling many counterfactuals, so that even when it’s impossible to isolate an intervention of interest, the auditor can recover the importance of the latent variable corresponding to that intervention in the aggregate. ↩︎

  15. Thanks to Miles Turpin for helping come up with this as an empirical check. ↩︎

  16. When I tested it in late 2022, GPT-3 had ~15% top-3, exact-match accuracy on freshly-scraped tweet authorship prediction. These tweets were scraped randomly from the API restricting to accounts with >10k followers. I’d imagine top human accuracy would be <<1%—though a human could likely achieve 15% acc with training. ↩︎

  17. The few-shots were generated by taking existing EY writing snippets, and then asking GPT-4 to rewrite in a casual, science-writer way. These snippets were then used in-context in the following format: “[rewrite1][EY-snippet1][rewrite2][EY-snippet2]...[Ajeya text]”. ↩︎

  18. Obviously, it’d be preferable to evaluate on GPT-4 base. ↩︎

  19. And results are confounded by the effects of RLHF on output distribution calibration. I’d appreciate someone running this on GPT-4 base! ↩︎

  20. These latent variables are not supposed to be well defined; we hope to side-step that issue. ↩︎

  21. Note this does not contradict our original assumption that the knowledge in question is hard to elicit when prompting for critique. For knowledge to be elicitable when directly asking the model for critiques and temperature sampling, this knowledge must have non-negligible probability from the start. This is different from the limiting ICL behavior. As an example, take the critique “This result doesn’t apply generally; consider the case where the positional encoding is a hyperbolic rotary embedding,...” If this initial sequence has very low probability, then it will never be explored regardless of the probability of the later tokens. ↩︎

  22. The blessing of mesa-optimization. ↩︎

  23. Thanks to David Rein for this example. ↩︎

Comments

I like this proposal as an alternative to regular critiques, and I'm curious how this would work for more realistic tasks. I had never seen the P(x'|x)/P(x') idea in action, and I'm slightly surprised by how well it works.

But I think it's not very useful for ambitious ELK (extracting superhuman knowledge), since in most ambitious situations, the best way to fix the code is extremely unlikely in the human prior - most ways to fix experiments are not "changes human might have done when rewriting/paraphrasing the code", and therefore logP(x'|x) will be tiny for the kind of x' you are looking for. So even if the AI was actually trying its best to maximize your metric, I argue it would often not help you.

The proposal also has the usual exploration hacking issues (e.g. if there is a flaw humans will never know about, the y predictor and the x' predictor can "agree" to never reveal it, and they would never get caught), but this also plagues all other scalable oversight techniques, so it's not a weakness of this approach in particular.

Thanks for the comments, I'm very interested in getting clearer on what cases of ELK this method can and cannot be expected to elicit. Though broadly I do not agree that the proposed method is limited to "changes [current] human might have done when rewriting/paraphrasing the code" (I'm particularly worried about cases mentioned in the paragraph starting with 'Another problem arises ...')

There are two mechanisms by which the sort of approach I'm suggesting helps us find a superset of human-simulation critiques/knowledge:

A) logP(x'|x)-logP(x') loss can be low even when the initial tokens of x' are very low probability, i.e. cases in which a human would not be expected to start talking about x'. Any experiment fix which can be in-context learned scores well on this loss; this may include most super-human fixes we care about! (relying on ICL will likely breakdown when we move to heavily RL-ed models, but the easier LM setting seems like the sensible starting point to iterate on this sort of research.)

B) Now that we have a well-defined scoring function that includes super-human critiques, we can train a model against it via search-and-distillation and generalization will help us explore to reach solutions that would be exponentially slower to find via temperature sampling (or even zero weight on temperature sampling) on the original model. How well exploration works here is unclear and evaluating this should be a priority!

Would be interested to hear where you disagree with these points.

As for exploration hacking, I agree that it's a potential issue. I'd be particularly worried when the y predictor and the x' predictor are both getting updated interactively. I'm not too concerned if we have a fixed y predictor, and just train the x' predictor (I think of the latter as the control setting).

Mmh, logP(x'|x) - logP(x') is quite a crazy beast when I think about it. For instance, if it's obvious from x'<t and x'<t|x that the next word is "the" (logP ≈ 0), and that for both the logP of seeing "avocado" is -10, then you get no penalty from outputting "avocado"? Is the main reason why you don't get gibberish that P(x') is higher entropy than P(x'|x), and therefore the network is rewarded for choosing one of the few high prob tokens in x'|x? Or is it mostly that you will choose y' so that P(y|x') can only be high for non-gibberish looking text?

Also, since you have to use rejection sampling / RL (since you only get a single y'|x' per sample), you probably don't get better exploration properties than regular RL? The main advantage is that you can modify samples instead of making changes to policies?

[Sidenote: realizing I should've introduced notation for auditor judgment; let's say A(x',x)!=A(x) denotes that the trusted model found x' useful for updating its judgment on input x. A would usually also be a function of y and y', but I'll suppress that for brevity.]

In the above experiment, I sampled from P(x'|x) then BoN adjusted for P(x') which amounts to using logP(x'|x) - c*logP(x') for small c. Intuitively, when equally weighted, the sum entirely removes the usual fluency constraint on x' which, as you say, would reassign probability to gibberish (e.g. gibberish happening to re-use tokens from x'.) I'm not too concerned about this, weighting objectives with c lets us re-introduce this fluency constraint. The downstream evaluation, A(x',x)!=A(x) of the auditor model provides a more costly check to avoid the x' reporter model exploration devolving to useless, gibberish cases. As you say, P(y'|x') provides another useful constraint.

I agree that traditional exploration concerns in RL likely remain. I think what you're saying about modifying samples is that we have a dense reward here, i.e. logP(x'|x) - logP(x'), that can be cheaply evaluated for any token-level change? This makes exploration faster compared to the regular critique/debate setting where dense rewards can only be noisily estimated as e.g. in an advantage function. I'd agree with that.

On exploration, a second difference between this counterfactual critique compared to vanilla, just-ask-for-critique RL is specific to super-human domains involving ontology mismatch. Exploration in just-ask-for-critique RL systematically favors human-simulation++ arguments: such a just-ask-for-critique agent may successfully explore to plenty of novel critiques without ever touching on latent knowledge obscured by ontology mismatch. This raises two concerns in the super-human domain (1) is this just-ask-for-critique agent just finding solutions which are persuasive but not true (and not even related to the original input x)? (2) is this just-ask-for-critique scheme better described as a capabilities-enhancing scheme rather than capabilities elicitation?

In light of these concerns, we could reframe the issue as an inductive bias problem which logP(x'|x) - logP(x') regularization seeks to fix when compared to directly exploring for x' satisfying A(x',x)!=A(x).

I think what you're saying about modifying samples is that we have a dense reward here, i.e. logP(x'|x) - logP(x'), that can be cheaply evaluated for any token-level change? This makes exploration faster compared to the regular critique/debate setting where dense rewards can only be noisily estimated as e.g. in an advantage function.

This is not what I meant. Regularization terms (e.g. KL divergence) are almost always dense anyway. I was pointing out that you have the classic RL problem of only having one "real" reward information (y'|x') per trajectory.

About the ontology mismatch stuff, I think the core idea is not logP(x'|x) - logP(x'), but the idea about how you use sample-level exploration to spot contradictions and use that to update the predictor of y. I think this is a potentially cool and under-explored idea (though I'm not sure if it is applicable beyond the kind of "near-miss false negative results" from your code example). A follow-up post I would find very interesting would flesh out exactly how this training works, detail a broader range of situations where it is useful, and demonstrate that it works in a toy setting.

Ah I see, you're right there.

Agreed that a lot will hinge on the training details working out. I plan to look into this.