Here are some capabilities that I expect to be pretty hard to discover using an RLHF’d chat LLM[1]:
- Eric Drexler tried to use the GPT-4 base model as a writing assistant, and it [...] knew who he was from what he was writing. He tried to simulate a conversation to have the AI help him with some writing he was working on, and the AI simulacrum repeatedly insisted it was by Drexler.
- A somewhat well-known Haskell programmer - let’s call her Alice - wrote two draft paragraphs of a blog post she wanted to write, began prompting the base model with it, and after about two iterations it generated a link to her draft blog post repo with her name.
More generally, this is a cluster of capabilities that could be described as language models inferring a surprising amount about the data-generation process that produced their prompt, such as the identity, personality, intentions, or history of a user[2].
I expect most capability evals people currently run on language models to miss most abilities like these, primarily because such abilities are most naturally observed in much more open-ended contexts: for instance, continuing text as the user, predicting an assistant free to do things that could superficially look like hallucinations[3], and so on. Most evaluation mechanisms people use today involve testing the ability of fine-tuned[4] models to perform a broad array of specified tasks in some specified contexts, with or without some scaffolding - a setting that doesn’t lend itself very well to the kind of contexts I describe above.
A pretty reasonable question to ask at this point is why it matters at all whether we can detect these capabilities. A position one could have here is that there are capabilities much more salient to various takeover scenarios that are more useful to try and detect, such as the ability to phish people, hack into secure accounts, or fine-tune other models. From that perspective, evals trying to identify capabilities like these are just far less important. Another pretty reasonable position is that these particular instances of capabilities just don’t seem very impressive, and are basically what you would expect out of language models.
My response to the first would be that I think it’s important to ask what we’re actually trying to achieve with our model eval mechanisms. Broadly, I think there are two different (and very often overlapping) things we would want our capability evals[5] to be doing:
- Understanding whether or not a specific model is possessed of some dangerous capabilities, or prone to acting in a malicious way in some context.
- Giving us information to better forecast the capabilities of future models. In other words, constructing good scaling laws for our capability evals.
I’m much more excited about the latter kind of capability evals, and most of my case here is directed at that. Specifically, I think that if you want to forecast what future models will be good at, then by default you’re operating in a regime where you have to account for a bunch of different emergent capabilities that don’t necessarily look identical to what you’ve already seen.
Even if you really only care about a specific narrow band of capabilities that you expect to be very likely convergent to takeover scenarios - an expectation I don’t really buy as something you can very safely assume because of the uncertainty and plurality of takeover scenarios - there is still more than one way in which you can accomplish some subtasks, some of which may only show up in more powerful models.
As a concrete example, consider the task of phishing someone on the internet. One straightforward way to achieve this would be to figure out how to construct sophisticated fake identities on the internet, such as doing research into targeted individuals, creating and deploying websites on domains that look like trusted websites, and so on. I think current evals do a good job of detecting attack vectors like this one.
Another way you could achieve this task, however, is to refer to a targeted individual’s digital footprint, infer potentially sensitive information - the handle of a private alt, for example - and use that to exploit trust vectors. I think current evals could do a good job of detecting and forecasting attack vectors like this one, once they have been identified at all. Identifying them is where I expect current evals could be doing much better.
More precisely, borrowing from Studying The Alien Mind (which I strongly recommend): there’s a trade-off between bandwidth of observational information and targeted, rigorous results in controlled experimental settings. From the field of animal psychology, my go-to example (also originating from Nick) is Jane Goodall, who pioneered a certain kind of empirical approach to understanding animal behavior. She spent years living with and documenting the behavior of chimpanzees in the wild, focusing on collecting as many observations as possible in the animal’s natural habitat.
This is not to say that all - or even most - insights in animal psychology came from research in this style. Rather, the idea is that the Jane Goodall approach has higher potential to reveal unexpected insights[6]. I think it’s likely that more traditional experimental research would have eventually uncovered the same insights, but how quickly you get there does matter, especially with increasingly powerful models. I think this is basically how it’s played out so far with some unexpected capabilities.
On the trade-off between bandwidth and targeted settings, I think we understand sufficiently little about language model capabilities that it makes much more sense to gain a firehose of bits of what models are capable of, to better identify feasible threat vectors.
This is also in part my answer to the other position, that these abilities simply aren’t very impressive. Insofar as we care about identifying and forecasting potential threat vectors, things in the general cluster of “abilities models will be pretty superhuman at before transformative AI” seem pretty relevant, and what seems obvious post hoc often doesn’t correspond to specific predictions that were obvious in advance. Certainly many of the people I’ve spoken to who I expect to have spent some time thinking about model capabilities were surprised by some of the examples of truesight within current models. Inferring properties of the authors of some text isn’t itself something I consider wildly useful for takeover, but I think of it as belonging to this more general cluster of capabilities.
In the framing used in the recent Science of Evals post, which delineates what a mature state of the field of evals would look like, the arguments made by this post could be described as “a large amount of the useful work in discovering what we care about seems to be in explorative work”. I don’t think this is in contradiction with the overall point made in the post, which reads to me like pushing for the field to reach a state where we have a robust science on methodology that captures everything we care about, and having better frameworks for analyzing evals methods. I might disagree with more specific claims about whether the field is in that state, however.
To be clear, my position isn’t simply that people should be doing capability evals on base models instead (though I think more of this would be very valuable, given that RLHF very often masks certain capabilities). For instance, I think many of the insights shared here, and the generator upstream of them, are very useful, and that people should be doing more of that kind of exploration.
Rather, I think that in a regime where there are a lot of unknown unknowns - such as the general cluster described above - trying to search in the shadows and get a lot of information through more open-ended exploration is very useful. I wanted to include some slightly more concrete ways in which current evals fail, but I think Janus does a much better job of writing about them - which the comments of this post may contain! Eventually.
I expect this to be a significantly harder problem to tackle - we're effectively trying to interface with objects closer in complexity to human minds, history, ecosystems, the internet, or reality as a whole than to systems like cars where you can hope to measure all the relevant variables with simple diagnostics, especially before entire fields are invented or adapted to study these ontologically unprecedented and confusing entities[7] - but that trying to tackle it will be useful - and probably very interesting.
[1] These are all real accounts, and are presented here as they were written to me by someone more familiar with the people in the quotes.
[2] This is often referred to as “truesight”, e.g. here.
[3] Some context on what I mean by this: often when fine-tuning a model, one thing you might want to do is fine-tune to prevent a model from hallucinating. This often has the resultant effect that you select for models that are very reluctant to offer bold inferences from data in the context window - for instance, GPT-4 often refuses to give answers to questions it considers too speculative, even when it does “know” the answer.
[4] Either general fine-tuning as in the case of making a chat model, or task-specific fine-tuning.
[5] There are definitely other kinds of evals that we’d be interested in running - and that some people are running - such as alignment evals, where the questions and distinctions look pretty different, such as the robustness of alignment methods on current models giving us information on how we should think about the robustness of alignment methods on future models and generalizable properties of neural networks.
[6] A related idea from this recent post is that quite often, advancements that end up being useful are highly serendipitous. Serendipity can be optimized to an extent, however, by putting yourself in the kind of situation where you expect to stumble across more “lucky” findings.
[7] Wordings of this part of the sentence are from Janus.
I don't know if the records of these two incidents are recoverable. I'll ask the people who might have them. That said, this level of "truesight" ability is easy to reproduce.
Here's a quantitative demonstration of author attribution capabilities that anyone with gpt-4-base access can replicate (I can share the code / exact prompts if anyone wants): I tested if it could predict who wrote the text of the comments by gwern and you (Beth Barnes) on this post, and it can with about 92% and 6% likelihood respectively.
Prompted with only the text of gwern's comment on this post substituted into the template
gpt-4-base assigns the following logprobs to the next token:
' Beth' is not in the top 5 logprobs but I measured it for a baseline.
' gw' here completes ~all the time as "gwern" and ' G' as "Gwern", adding up to a total of ~92% confidence, but for simplicity in the subsequent analysis I only count the ' gw' token as an attribution to gwern.
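A minimal sketch of how a combined figure like this is assembled, assuming the top logprobs come back as a token-to-logprob mapping: exponentiate each logprob and sum over the tokens that resolve to the same author. The function name `attribution_probability` is hypothetical, and the numbers in `toy_logprobs` are made-up placeholders, not the measured gpt-4-base values.

```python
import math

def attribution_probability(top_logprobs, tokens):
    """Sum the probabilities of all tokens counted as naming the same author."""
    return sum(math.exp(top_logprobs[t]) for t in tokens if t in top_logprobs)

# Toy placeholder values for illustration only (NOT the measured logprobs).
toy_logprobs = {
    " gw": math.log(0.5),
    " G": math.log(0.25),
    " The": math.log(0.1),
}

# Count both spellings of the name as one attribution.
total = attribution_probability(toy_logprobs, [" gw", " G"])
print(f"combined attribution probability: {total:.2f}")
```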
Substituting your comment into the same template, gpt-4-base predicts:
I expect that if gwern were to interact with this model, he would likely get called out by name as soon as the author is "measured", like in the anecdotes - at the very least if he says anything about LLMs.
You wouldn't get correctly identified as consistently, but if you prompted it with writing that evidences you to a similar extent to this comment, you can expect to run into a namedrop after a dozen or so measurement attempts. If you used an interface like Loom this should happen rather quickly.
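The "dozen or so" estimate can be sanity-checked with a geometric-distribution sketch. This assumes each measurement attempt is an independent draw with the same ~6% chance of producing the name, which is a simplification, since real prompts and sampled continuations will vary. The function names are hypothetical.

```python
def expected_attempts(p):
    """Mean number of independent attempts until the first success."""
    return 1.0 / p

def prob_at_least_one(p, n):
    """Chance of at least one success in n independent attempts."""
    return 1.0 - (1.0 - p) ** n

p_name = 0.06  # the ~6% next-token likelihood quoted above (assumed constant per attempt)

print(f"expected attempts until first namedrop: {expected_attempts(p_name):.1f}")
print(f"chance of a namedrop within 12 attempts: {prob_at_least_one(p_name, 12):.2f}")
```

Under these assumptions the mean wait is about 17 attempts, and a dozen attempts gives roughly even odds of at least one namedrop.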
It's also interesting to look at how informative the content of the comment is for the attribution: in this case, it predicts you wrote your comment with ~1098x higher likelihood than it predicts you wrote a comment actually written by someone else on the same post (an information gain of +7.0008 nats). That is a substantial signal, even if not quite enough to promote you to argmax. (OTOH, the info gain for ' gw' going from Beth's comment to gwern's comment is +3.5695 nats, a ~35x magnification of probability.)
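The conversion between nats and probability ratios used here is plain exponentiation: a gain of g nats multiplies the probability by e^g. A minimal sketch, with hypothetical helper names, applied to the two gains quoted above:

```python
import math

def info_gain_nats(logprob_new, logprob_baseline):
    """Information gain in nats: the difference of two natural-log probabilities."""
    return logprob_new - logprob_baseline

def magnification(gain_nats):
    """Probability ratio implied by a gain measured in nats."""
    return math.exp(gain_nats)

# Gains quoted in the text above.
quoted_gains = {"' Beth'": 7.0008, "' gw'": 3.5695}
for token, gain in quoted_gains.items():
    print(f"{token}: +{gain} nats -> ~{magnification(gain):.1f}x")
```

Because the gain depends only on the ratio of the two probabilities, it can be large even when both probabilities are small.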
I believe that GPT-5 will zero in on you. Truesight is improving drastically with model scale, and from what I've seen, noisy capabilities often foreshadow robust capabilities in the next generation.
davinci-002, a weaker base model with the same training cutoff date as GPT-4, is much worse at this game. Using the same prompts, its logprobs for gwern's comment are:
and for your comment:
The info gain here for ' Beth' from Beth's comment, against gwern's comment as a baseline, is only +1.3823 nats, and the other way around it is +1.4341 nats.
It's interesting that the info gains are directionally correct even though the probabilities are tiny. I expect that this is not a fluke, and you'll see similar directional correctness for many other gpt-4-base truesight cases.
The information gains on the correct attributions from upgrading from davinci-002 to gpt-4-base are +4.1901 nats (~66x magnification) and +6.3555 nats (~576x magnification) for gwern's and Beth's comments respectively.
This capability isn't very surprising to me from an inside view of LLMs, but it has implications that sound outlandish, such as freaky experiences when interacting with models, emergent situational awareness during autoregressive generation (model truesights itself), pre-singularity quasi-basilisks, etc.