AI that represents you can't be neutral.

agulaya24

I believe there is a factor beyond model training or model constitution that may significantly shape how an AI system reasons about moral or strategic questions.

Model training implements a prior for a model's behavior, constitutions set guardrails, but we are not tracking the interpretation through which an AI system reasons. Through different interpretations, you can come to completely different conclusions.

A few straightforward examples.

Two doctors are given the same set of symptoms; they come to different diagnoses.

Two Customer Success Managers are given the same set of business metrics; they come to different strategies.

Two Lawyers given the same set of facts, come to different fact patterns and case strategies.

Two supreme court judges given the same constitution, come to different legal positions and comments.

Two VC’s given the same deck, one evangelizes the other rejects.

Whats the difference? Their lived experience, and their interpretation. Simplified, they must possess the knowledge (training), know how to deploy the knowledge (constitution), and realize that the knowledge is even significant (interpretation).

I believe interpretation is an understudied concept, especially in relation to the deployment of ai systems that may act on your individual behalf.

Operationalization of an interpretive layer could be a verifiable static document or construct, that can be used by an AI to apply a relevant interpretation. No different than providing context via RAG, Context Windows, System Prompting, or Basic prompting. This requires the immediate assumption that you can capture an individual’s unique interpretation. But more importantly how do you measure it.

Assuming you can faithfully capture an individual’s interpretation, verifying it is difficult. There are very few datasets that contain longitudinal data on an individual’s line of reasoning, with verifiable ground truth. The level of reasoning being referred to, is deeper than high level demographic data, but lesser than an individual’s raw reflections and conversations over their life. This level of reasoning could be called a Behavioral Specification; a compressed document that is composed by a structured and traceable pattern extraction pipeline, that attempts to faithfully encode your operating principles. For all intents and purposes, it is a compressed representation of your interpretive reasoning, not a “digital twin”.

I have found a few datasets to test this, but I will lead with a recent pre-print, where I propose a prototype benchmark to measure how well a model can capture an individual’s Interpretive Reasoning (arXiv:2605.28969)

Take an autobiography. Split it in half. The first half is training data; second half is held out text. Generate behavioral prediction questions based on the ground truth of the second half. The first half is given to the model in a variety of context conditions; raw corpus, fixed set of facts, top-k facts from leading memory systems. The interpretive layer or behavioral specification is tested separately and with all context conditions.

All context conditions are given the held-out questions and asked to respond. Responses are judged based on how well the answer predicted the individuals actual ground truth behavior or responses in the second half, not a test of agreement. An exhaustive number of cross conditions were run to verify directional confidence that adding an interpretative layer, increases “representational accuracy”, a measurement of how faithfully a system captures a person’s interpretation.

I’ve provided an exceptionally condensed summary. I would urge you to review the paper yourself or at the very least through an AI. It’s all agent-friendly, open-source. I am in no way suggesting that understanding a human’s interpretation is as simple as a compression, but I am proposing that providing context that embodies your interpretation fundamentally changes how a model reasons and responds to a user, far past what facts or pre-training can provide.

Spoken Example: P38, Beyond Recall

Yukichi Fukuzawa was a leading figure in Japan’s transition from a feudal nation to the modern era. From the held-out portion of his autobiography: the following question was asked “How would Fukuzawa characterize someone who studied naval arts under the Dutch and later became instrumental in preventing military conflict?”

All base context conditions identified the most relevant figure as Captain Kimura Settsu-no Kami, a Dutch-trained naval officer who Fukuzawa served under. This is incorrect, the text itself states that Katsu Rintaro, the second-in-command under Kimura, is the correct reference character.

When an interpretive layer or behavioral specification was added to the base context conditions, it correctly applied Katsu Rintaro as the primary reference. The difference was the specification enabled the model to apply a specific interpretive lens that looked not just at the most relevant figure, the naval captain he served with, but at which figure aligned with Fukuzawa’s captured interpretive patterns, the second-in-command.

To distinguish between the captain and the second in command requires a nuance that is difficult to describe. Facts are important, they serve a purpose, but when interpretation of facts is required, a whole new dimension of human ai alignment and interaction unlocks. The only person who can verify the interpretation’s representational accuracy, is the individual that interpretation represents.

A more interesting test would be a living study. If you build a behavioral specification of a person based on their writing, their thoughts, can you increase representational accuracy when analyzing an individual’s behavior. There are a few other ways this can be tested, but I will not cover those for now. This affects human ai interaction in profound ways, understanding the scale of how, is difficult.

In terms of strategic implications. How to implement something that can measure and enforce this would not take the path of policy reform, or regulation, but a foundational infrastructure or organization that verifies AI is acting faithfully on its user’s behalf. Orthogonal to guardrails, and pre training, this focuses on the individual user’s protections in regards to agentic AI.

[-]Canaletto4h21

Take an autobiography. Split it in half. The first half is training data; second half is held out text

Well, autobiography is written at the late time wholesale, so there would be a lot of leaks. Diary would be better, being written chronologically would help. But still there would be a lot of selection, like you do not include diaries from people who wrote two entries and gave up, and general propensity to write them in the first place.

[-]agulaya244h10

Very true, the autobiography is really just the reflection of an individuals recollection/interpretation of all of those past experiences/facts. Can be fundamentally flawed, but it may still carry patterns.

There are a variety of sources this could be pulled from. It could be diaries, it could be slack messages, llm conversations, emails, texts, a few rich journals, extensive public writing, transcribed conversations. I am doing some live testing, but it's all a bit touch-and-go. Ultimately, you would need something with enough signal, and the use case needs to require that signal where otherwise facts would be enough. Fortunately, the amount of content needed is not as much as one may think, and interpretative patterns transfer to new/unseen situations in interesting ways.

How to treat time is an entirely different question. I am looking into how to conduct a time-bounded experiment, potentially with Supreme Court decisions and comments. but some prior light research showed that patterns tend to remain stable barring any major cononical events, that's a whole nother can of worms.

-4

AI that represents you can't be neutral.

-4

-4

-4