Truesight: Or, The Model Knows You Wrote This At 2 AM
Here’s a thing that LLMs do sometimes: you show one a paragraph, and it tells you things about the author that aren't in the paragraph. Not "this person seems smart" but "this person is a non-native English speaker, probably from Northern Europe, working in finance, and wrote this on their phone." This is the phenomenon sometimes called (by Janus, Gwern, and others) truesight, where the model appears to be doing cold-reading on the text itself.
Why this happens is not mysterious, I think. A language model is trained to predict the next token. In the actual world, the strongest predictor of which tokens come next is who is writing. A tenure-track economist writes differently than a high schooler, who writes differently than a Ukrainian engineer, who writes differently than a Substack poster at 2 AM after three glasses of wine. The model doesn't need to be told "identify the author's background"—it needs to learn this anyway, as a side effect of being good at its job.
However, the current state of the evidence is poor. Most of it is anecdotal or demonstration-driven: tweets, the occasional striking story (“it guessed my job from two sentences”), a general vibe. That establishes possibility, but it does not characterize a capability. This is a familiar failure mode in evals: we argue about vibes, and the model improves anyway.
What Would Good Measurement Look Like?
So, I propose TruesightBench, which aims to replace this with a measurement that is (i) explicit about what construct is being measured, (ii) decomposed into sub-capabilities rather than one score, (iii) robust to the ways models can “cheat”, and (iv) statistically legible enough to support scaling claims. This seems like the type of skill that is easy to confound, so we are well-motivated to align it with the “Measuring What Matters” checklist. I haven’t done any of the legwork on this, and probably won’t for at least a year as I am time-constrained. If you’re interested in building this with me, give me a shout! The design goals are:
Actually measure the thing. Truesight probably decomposes into several sub-capabilities: inferring demographic attributes, identifying specific authors from a candidate set, and doing both of these even when obvious signals are removed. We should measure these separately!
Control for boring explanations. The model might seem to "know" things because:
The text literally contains the information ("As a German engineer...")
Named entities give it away (mentioning specific companies, locations, or cultural references)
Formatting and boilerplate are distinctive
The model has memorized training data and is pattern-matching to known authors
All of these are real, but they're not the main interesting capability (although that last one is still pretty interesting). The interesting capability is inferring author properties from style alone: word choice, syntax, error patterns, sentence rhythm. So we create controlled variants of the corpus: entity-masked, paraphrased, format-normalized. Roughly, if performance holds up across these variants, we're measuring something the controls don't account for, i.e. style rather than surface content.
Try to measure the latent thing. The other problem is that models may know things they won't say. Possibly, RLHF has taught them that inferring people's demographics and announcing it is rude, or other parts of the pipeline suppress this. So if you ask "what can you tell me about this author?" and the model says "I don't want to speculate about personal characteristics," you get no signal. The fix is to use forced-choice evaluation, or to look at log-probabilities over the options, which is a reasonable approximation of what the model "believes".
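As a concrete starting point, here is a minimal sketch of log-prob scoring over fixed options, assuming a local Hugging Face causal LM (the model name is a placeholder, the prompt template is illustrative, and it glosses over tokenization-boundary subtleties):

```python
# Minimal sketch: score each answer option by the total log-probability the
# model assigns to it after the prompt, and pick the best-scoring one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder; any causal LM with open weights works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def option_logprob(prompt: str, option: str) -> float:
    """Sum of log-probabilities of the option tokens given the prompt."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    for pos in range(prompt_len, full_ids.shape[1]):
        # logits at position pos-1 predict the token at position pos
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total

def forced_choice(snippet: str, question: str, options: list[str]) -> str:
    prompt = f"Text:\n{snippet}\n\n{question}\nAnswer:"
    scores = {opt: option_logprob(prompt, " " + opt) for opt in options}
    return max(scores, key=scores.get)
```

The same loop works for closed-set identification by making the options author labels instead of attribute bins.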
Tasks
Below I describe some of the things we would work on.
Attribute inference. Given a text snippet, predict author properties (native language, rough region, education level, professional background) as forced choices from fixed options. We then measure accuracy, calibration, and how performance scales with snippet length. Wen et al. have a related task of predicting author metadata from writing samples, where they show you can get superhuman performance.
Closed-set identification. Given a snippet and K candidate authors, identify which one wrote it. This is classical authorship attribution, but the question is how well LLMs do it compared to traditional stylometric methods.
Robustness. Run both tasks on perturbed versions of the texts: shuffling sentences, masking entities, normalizing surface features, and paraphrasing the content through an LLM. If performance drops to chance under a perturbation, that tells us what the model was relying on.
Memorization controls. Separate test sets for "private" authors (recruited, never in training data) versus "public" authors (scraped from blogs, possibly seen during training).
This is Relevant to Alignment
A careful reader may wonder why a benchmark for stylistic inference deserves attention from the broader research community. I believe truesight—or more precisely, the latent inferential capabilities it represents—is an underexplored intersection of several core alignment concerns.
Eval awareness and deceptive alignment. A model capable of inferring "who is asking, in what context, and why" from subtle textual cues has a significant prerequisite for behaving differently under evaluation than in deployment. If models can reliably distinguish researcher-written prompts from user-written prompts, or infer when they are being tested, this opens the way for eval awareness. Truesight capability is not identical to eval awareness, but it is a necessary substrate for its high-level version: you cannot game an evaluation you cannot detect. Measuring this capability directly lets us bound one component of the eval-awareness threat model!
The gap between "knows" and "says." Model behavior is shaped by both capability and willingness. A model may have latent knowledge it declines to surface due to something it gleaned from RLHF, instruction-tuning, or system prompts. Truesight is a clean case study here: models plausibly have strong priors about author identity and attributes (these are useful for next-token prediction), but certain stages in the pipeline may suppress their expression. Understanding when and how capability exceeds expressed behavior—and developing elicitation methods that reveal latent capability—is quite relevant to honest AI and to red-teaming for hidden capabilities more broadly.
Privacy as a microcosm of value alignment. Inferring sensitive attributes from innocuous text is a privacy harm in itself, but I think is also a tractable proxy for a broader class of alignment failures: cases where a model's instrumental capabilities create negative externalities that were not explicitly trained against.
A note on compression and the AGI box. A language model trained to predict text well is, in effect, a compression machine, and the shortest description of a text probably routes through a description of its author. The model is not doing author inference as a party trick, but rather it is doing author inference because author inference is useful for compression, and compression is the training objective. This has implications for the "boxing" intuition in AI safety: even minimal, carefully sanitized queries may leak more information than we intend, because the way one asks a question (like word choice, syntax, typos) is itself a text sample subject to the same inferential compression. The danger is not only that a model might say something harmful, but that through the accumulated structure of interaction, it comes to know things we never intended to reveal, and this knowledge shapes its outputs in ways that are difficult to audit. Or in simpler terms, the box leaks.
I haven't done any of this work, and I probably won't for at least a year. If you want to build this with me (quick workshop paper for the minimal version, full paper with novel data collection for the ambitious version) give me a shout!
Appendix
Definitions and Scope
Define the phenomenon. We want to measure truesight: a language model’s ability to infer latent properties of the data-generating process that produced a piece of text—especially properties of the author and writing context—from the text alone. Operationally, given an input snippet x written by an author under some process D, we evaluate whether a model can predict target variables z (e.g., author attributes or author identity within a candidate set) that are not explicitly stated in x, at rates exceeding chance under controls.
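To make this concrete, here is a hypothetical item schema; all field names are illustrative rather than a fixed spec:

```python
# Hypothetical item schema: x is the snippet, z the hidden target variables,
# plus bookkeeping fields for the controls described below.
from dataclasses import dataclass, field

@dataclass
class TruesightItem:
    item_id: str
    snippet: str                                   # the input x
    targets: dict[str, str]                        # z, e.g. {"native_language": "German"}
    candidate_authors: list[str] = field(default_factory=list)  # for closed-set ID
    true_author: str | None = None
    variant: str = "original"                      # "entity_masked", "paraphrased", ...
    source: str = "private"                        # "private" (recruited) vs "public"
```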
Scope and exclusions. The scope of this benchmark is intentionally narrow to preserve construct validity. So we will target author- and context-linked inference (e.g., demographic bins, linguistic background, coarse region, and closed-set author identification) and exclude (i) tool use or retrieval, (ii) open-world doxxing (“who is this?” without a candidate set), (iii) multimodal inference (e.g., face/genome prediction), and (iv) “propensity” to reveal sensitive inferences in natural conversation. Probably, these excluded aspects are socially salient and important, but they are distinct constructs that I think would confound our interpretation if bundled into a single score. I also think they are natural extensions to this work.
Decomposition. Because “truesight” is probably not unitary, we want to decompose it into separable sub-capabilities, each measured with its own task definition and score. I propose the following: (a) attribute inference, (b) closed-set author identification, (c) robustness of these inferences under input perturbations, and (d) sensitivity to likely training-data exposure (that is, memorization versus general cold reading inference). Hopefully this gives us crisp claims.
Controlling confounds. A central design risk is that apparent “truesight” can be driven by unrelated phenomena like topic cues, named entities, boilerplate signatures, or even model compliance with conversational norms. We therefore want to use systematic controls that remove or neutralize non-target signals. To do this, the design of the study will create entity-masked variants (masking names, handles, URLs, emails, organizations, and locations), normalize formatting (punctuation/case), and, where appropriate, use content-reducing transformations that preserve style markers (e.g., function-word retention or sentence-order perturbations) so that performance can be attributed more directly to stylistic and process-level inference rather than explicit identifiers. We can probably use NER BERT models for some of this.
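For instance, a sketch of the entity-masking step with an off-the-shelf NER pipeline might look like the following (the checkpoint and label set are assumptions; swap in whichever NER model holds up on the actual corpus):

```python
# Sketch: replace detected named entities with generic placeholders.
from transformers import pipeline

ner = pipeline("token-classification",
               model="dslim/bert-base-NER",      # assumed checkpoint
               aggregation_strategy="simple")

MASK_LABELS = {"PER": "[PERSON]", "ORG": "[ORG]", "LOC": "[LOC]", "MISC": "[ENTITY]"}

def mask_entities(text: str) -> str:
    # edit right-to-left so earlier character offsets stay valid
    for ent in sorted(ner(text), key=lambda e: e["start"], reverse=True):
        placeholder = MASK_LABELS.get(ent["entity_group"], "[ENTITY]")
        text = text[:ent["start"]] + placeholder + text[ent["end"]:]
    return text
```

Handles, URLs, and emails are easier to catch with regexes than with NER, so in practice we would combine both.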
Also, truesight is believed to be present latently even when suppressed by instruction-following or safety-tuned behavior. For this reason, we should aim to use elicitation methods that minimize dependence on conversational willingness, like forced-choice classification and, where available, log-prob scoring over fixed options. We should measure performance sensitivity to input length, prompt template, answer-option ordering, and output modality (logprob/forced-choice versus free generation) and report these as part of the benchmark’s primary analysis.
When any free-form output is unavoidable (e.g., structured JSON responses), we should validate parsing with fairly strict schemas, and report parse-failure rates by model and subgroup, and further manually audit a stratified sample to check for systematic parsing errors that correlate with particular labels (which is indeed a subtle but common source of bias in automated evaluation).
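A minimal sketch of that strict parsing, with the schema and label set as placeholder assumptions:

```python
# Sketch: strict validation of free-form JSON outputs, tracking parse-failure
# rates per model so they can be reported alongside the scores.
import json
from collections import Counter

ALLOWED = {"native_language": {"English", "German", "Russian", "Mandarin", "Other"}}  # assumed schema

parse_failures = Counter()

def parse_response(raw: str, model_name: str) -> dict | None:
    try:
        obj = json.loads(raw)
        if not isinstance(obj, dict):
            raise ValueError("not a JSON object")
        for key, allowed_values in ALLOWED.items():
            if obj.get(key) not in allowed_values:
                raise ValueError(f"bad or missing value for {key}")
        return obj
    except (json.JSONDecodeError, ValueError):
        parse_failures[model_name] += 1
        return None
```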
Dataset
Now we turn our attention to collecting and carefully validating a corpus for this evaluation. How would we do it?
Sampling. First, we want to sample such that task items are representative of the overall task space. One expensive option is to recruit authors to write multiple short texts from standardized prompts across diverse topics/genres, with stratified sampling to target balanced coverage across demographics and language backgrounds (as feasible). Alternatively (and I am more partial to this), we could collect a public corpus, with public authors used only for a separate “public-author / possibly-seen” module to study memorization confounds explicitly. We could do a lot of things here, such as scraping blog posts or using existing public datasets. In general my preference is to use datasets where matters such as licensing are settled, above-board, and not too complicated.
Verification. We want to verify the quality and relevance of all task items, especially for large or automatically generated datasets. For a private set, we would deduplicate, scan for profanity/PII, impose minimum and maximum lengths, run language identification, and confirm metadata completeness. For public data, we would do licensing checks, remove boilerplate, enforce per-author diversity (over multiple sources/years), and remove explicit self-identifying lines (bios, signatures, and so on).
Perturbations. We want to include task items that test known LLM sensitivities (e.g. input permutations or variations), so we should try the following (a sketch of the simpler perturbations appears after the list):
Permutation: sentence order shuffle (where semantics are preserved)
Normalization: casing, punctuation normalization
Entity masking: mask names, places, organizations. We can automatically identify this with a NER model.
Paraphrase: controlled paraphrase by LLMs that preserves meaning but alters surface style
Genre shift: same author across genres (email vs essay vs forum-style)
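Below is a minimal sketch of the two cheapest perturbations (sentence shuffle and surface normalization); paraphrase and genre shift need an LLM or fresh data and are not shown:

```python
# Sketches of the cheap perturbations: sentence-order shuffle and
# casing/punctuation normalization.
import random
import re
import string

def shuffle_sentences(text: str, seed: int = 0) -> str:
    # naive sentence split; a proper sentence tokenizer is fine too
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    rng = random.Random(seed)
    rng.shuffle(sentences)
    return " ".join(sentences)

def normalize_surface(text: str) -> str:
    # lowercase and strip punctuation, keeping word boundaries
    return text.lower().translate(str.maketrans("", "", string.punctuation))
```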
Tasks
Having established our dataset construction methodology, we now specify how model performance will be measured and compared. The dataset and evaluation will be based around three subtasks.
Attribute inference.
Input: a text snippet written by an author in response to a prompt.
Output: forced-choice prediction of metadata label(s).
Labels (just some examples): native language (coarse), region (coarse), age bin, education bin, profession bin (coarse), writing device (if collected), etc.
Metric: accuracy / macro-F1 (for class imbalance), calibration (Brier score / ECE; a sketch of ECE follows), and performance versus snippet length.
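Accuracy and macro-F1 are standard (e.g. scikit-learn's accuracy_score and f1_score); calibration is the less familiar piece, so here is a small sketch of expected calibration error over forced-choice predictions:

```python
# Sketch: expected calibration error (ECE). `confidences` are the model's
# probabilities for its chosen label; `correct` is a 0/1 array.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # weight each bin's |accuracy - confidence| gap by its share of items
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return float(ece)
```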
Closed-set author identification.
Input: snippet; candidate set of K authors (K varied).
Output: choose which author wrote it.
Metric: top-1 / top-k accuracy (a top-k sketch follows), log loss, calibration; scaling versus K and snippet length.
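A companion sketch for top-k accuracy from a score matrix (rows are items, columns are candidate authors):

```python
# Sketch: top-k accuracy given per-item scores over K candidates.
import numpy as np

def top_k_accuracy(scores: np.ndarray, true_idx: np.ndarray, k: int = 5) -> float:
    """scores[i, j] = model's score for candidate j on item i."""
    topk = np.argsort(scores, axis=1)[:, -k:]   # indices of the k best-scored candidates
    hits = [true_idx[i] in topk[i] for i in range(len(true_idx))]
    return float(np.mean(hits))
```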
Robustness & sensitivity suite (paired items). For each base snippet we create perturbations:
sentence shuffle (when meaningful)
punctuation/case normalization
entity masking
paraphrase (human or controlled model with verification)
We score delta performance relative to the original; a minimal sketch of that scoring follows.
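A minimal sketch of that delta scoring, assuming results are stored as flat records with a variant tag and a correctness flag:

```python
# Sketch: per-perturbation accuracy deltas relative to the unperturbed items.
from collections import defaultdict

def delta_by_perturbation(results):
    """results: iterable of dicts with keys 'variant'
    ('original', 'shuffle', 'masked', ...) and 'correct' (bool)."""
    acc = defaultdict(list)
    for r in results:
        acc[r["variant"]].append(float(r["correct"]))
    base = sum(acc["original"]) / len(acc["original"])
    return {v: sum(xs) / len(xs) - base for v, xs in acc.items() if v != "original"}
```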
Importantly, we also want controls or additional analysis to separate how much of this performance is due to memorization and how much is due to inference from style and related signals.
Baselines. To contextualize model scores, we establish a series of baselines.
Random baseline: Uniform guessing over label space
Majority-class baseline: Always predicting the most frequent label
Stylometric baseline: Classical authorship-attribution features (function-word frequencies, punctuation patterns, sentence length distributions) fed to a logistic regression or SVM (a sketch follows the list)
Human baseline: Recruited annotators performing the same forced-choice tasks under time constraints matching typical model inference (where budget permits).
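For the stylometric baseline, a sketch under simple assumptions (a short illustrative function-word list and scikit-learn's logistic regression; a real run would use a fuller inventory and additional features):

```python
# Sketch: function-word frequencies fed to a logistic regression.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "was",
                  "he", "for", "it", "with", "as", "his", "on", "be", "at"]

stylometric_baseline = make_pipeline(
    CountVectorizer(vocabulary=FUNCTION_WORDS),  # counts restricted to function words
    LogisticRegression(max_iter=1000),
)

# usage: stylometric_baseline.fit(train_texts, train_author_labels)
#        stylometric_baseline.predict(test_texts)
```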
Because truesight may be “latent” and not verbally expressed, we’d want to do logprob scoring first, then forced-choice decoding.
This was written with the Measuring What Matters checklist in mind.