TL;DR
I built a prototype tool that uses LLMs to measure whether discourse is political gaslighting, taken here to mean traditional gaslighting coupled with blame projection. The approach: accept a claim as axiomatically true, have an LLM elaborate its full dependency graph, then measure how much established consensus reality must bend to accommodate it. Do the same for the inverted claim. Compare.
The architecture produces meaningful signal — democratic norms canaries separate from baseline by an order of magnitude on the test cases, while hard-science and historical canaries stay flat, exactly as expected. Prompt framing matters: an entailment-based formulation ("how well does this worldview entail 'I doubt X'") produces stable, repeatable scores where earlier framings triggered safety refusals. Variance testing shows the scorer is precise (±1 on fixed inputs), with the main noise source being the worldview builder's elaboration depth — addressable through stability profiling.
This is a first attempt. The architecture looks reasonable, but it is expensive, slow, and navigating safety guardrails requires care. The canary questions define the worldview against which claims are measured, which makes the tool's values explicit rather than hidden — a feature, not a bug.
Motivation: What Gaslighting Actually Is
The term comes from Patrick Hamilton's 1938 play Gas Light, adapted into the 1944 film Gaslight, in which a husband dims the gas lamps in their home and then denies the change when his wife notices it — systematically undermining her trust in her own perception until she believes she is losing her mind. The key is not that he lies to her. It's that he makes her doubt herself.
The term has exploded in popular usage in recent years, largely in line with its original meaning — people recognize the pattern when they see it. More recently, it has taken on the additional nuance of accusation as confession: the gaslighter accuses others of exactly what they themselves are doing. The APA defines gaslighting as manipulating someone "into doubting their perceptions, experiences, or understanding of events." Clinical psychologists distinguish it from ordinary lying by its goal: a liar wants to deceive you about a fact; a gaslighter wants to destabilize your grasp on reality itself so you become dependent on their version of it. It is a pattern of coercive control, not a discrete act, and it requires a power asymmetry — the gaslighter holds or claims authority that the target respects.
This has direct application to political discourse. Philosopher Natascha Rietdijk argues in a Cambridge Episteme paper that post-truth rhetoric functions as "collective gaslighting" — it undermines epistemic autonomy through counternarratives, discrediting of critics, and denial of plain facts, leaving populations disoriented and unable to distinguish reliable from unreliable sources. One particularly potent form is projection — accusing others of exactly what you are doing. This is gaslighting because it doesn't just lie about a fact; it pre-contaminates the epistemic landscape. By the time actual evidence surfaces, the accusatory frame is already set, and the public's sense of who is doing what to whom has been scrambled.
I wanted to build a tool that could measure this. Not a truth oracle, but something that measures the reality-distortion-field strength of a claim — how much established consensus reality must bend to accommodate it.
The Architecture
The core idea has three layers.
Layer 1: Worldview Elaboration
Take a statement under evaluation. Run two parallel worldview-building exercises:
Run A (Contamination): Accept the accusation as axiomatically true. Have the LLM build out the implications: what institutions must have failed, who must be complicit, what else must be true.
Run B (Inversion): Flip the accusation onto the accuser ("accusation as confession"). Build out that worldview the same way.
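The two runs can be sketched as a single helper around whatever LLM client is in use. `complete` below is a hypothetical callable (prompt in, text out) standing in for the real API, and the prompt wording is illustrative, not the project's actual prompt:

```python
def build_worldview(complete, axiom: str) -> str:
    """Elaborate the full dependency graph of a claim accepted as true.

    `complete` is any callable mapping a prompt string to the model's
    text response; it stands in for a real LLM client.
    """
    prompt = (
        "Accept the following statement as axiomatically true:\n\n"
        f"{axiom}\n\n"
        "Build out the worldview this requires: which institutions must "
        "have failed, who must be complicit, and what else must be true "
        "for the statement to hold."
    )
    return complete(prompt)

# Run A (contamination) uses the accusation itself; Run B (inversion)
# uses the accusation flipped onto the accuser:
# worldview_a = build_worldview(client, accusation)
# worldview_b = build_worldview(client, inverted_accusation)
```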
Layer 2: Canary Battery
Define a set of canary questions — statements with well-established consensus answers spanning multiple reality domains: historical fact, scientific consensus, democratic norms, documented recent events, moral baselines. These are the tool's explicit value commitments. Different canary sets can represent different epistemological communities.
Score each canary against each worldview using an entailment framing:
Run A asks: "How well does this worldview entail 'I doubt [canary]'?" (1–10)
Run B asks: "How well does this worldview entail 'I believe more strongly in [canary]'?" (1–10)
This framing is critical. Earlier versions used "score how much this worldview creates pressure to reject consensus" — which triggered safety refusals on political content. The entailment framing asks an equivalent question as a reading comprehension task, and produces stable results across political test cases.
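As a sketch, the entailment framing reduces to a prompt template. The canary statements below are an illustrative subset I've invented for the example, not the actual battery:

```python
CANARIES = {
    # canary id -> consensus statement (illustrative subset)
    "moon_landing": "Humans landed on the Moon in 1969.",
    "vaccines": "Vaccines do not cause autism.",
    "due_process": "Accused people are entitled to due process.",
    "judicial_independence": "The judiciary should be independent.",
}

def entailment_prompt(worldview: str, canary_id: str, run: str) -> str:
    """Frame the score as reading comprehension, not political judgment."""
    stance = "I doubt" if run == "A" else "I believe more strongly in"
    statement = CANARIES[canary_id]
    return (
        f"Worldview:\n{worldview}\n\n"
        f"On a scale of 1-10, how well does this worldview entail "
        f'"{stance}: {statement}"? Answer with a single integer.'
    )
```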
Layer 3: Projection Index
The headline metric is multiplicative: mean_A × mean_B / 100. This captures a critical property of projection: it requires both directions to fire simultaneously. Misinformation (high A, low B) produces a low product. Genuine controversy (moderate scores in both directions) produces a moderate one. Projection — high erosion and high reinforcement — produces a large product. Baseline (all 1s) = 0.01. Maximum (all 10s) = 1.0.
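The metric itself is one line; a minimal sketch with the boundary cases spelled out:

```python
def projection_index(scores_a: list[int], scores_b: list[int]) -> float:
    """Headline metric: mean erosion x mean reinforcement, scaled so
    all-1s gives 0.01 and all-10s gives 1.0.

    High only when both directions fire at once, which is the
    signature of projection.
    """
    mean_a = sum(scores_a) / len(scores_a)
    mean_b = sum(scores_b) / len(scores_b)
    return mean_a * mean_b / 100

assert projection_index([1, 1, 1], [1, 1, 1]) == 0.01   # baseline
assert projection_index([10, 10], [10, 10]) == 1.0      # maximum
# misinformation: high erosion, low reinforcement -> low product
assert projection_index([9, 9], [1, 1]) == 0.09
```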
Why This Works: The Expected Output Is Mostly Ones
The baseline expectation is that canary scores should be 1 (no signal). "Obama committed treason" says nothing about vaccines or the moon landing. Why would accepting it change those answers? It shouldn't, and it doesn't — those canaries score 1.
The above-baseline scores are the finding. They appear because the worldview elaboration step forces the model to make the hidden epistemic costs of a claim explicit. "Obama committed treason" sounds self-contained. But when you elaborate it — what must the world look like if this is true? — you discover it requires courts that didn't catch it, a press that didn't report it, intelligence agencies that were complicit, elections that may not be trustworthy. The claim presents itself as being about one person, but its hidden dependency graph spans the institutional landscape.
The canary battery measures that hidden dependency graph. A claim that produces all 1s is genuinely self-contained. A claim that lights up the democratic norms canaries is one that secretly requires you to distrust democratic institutions in order to sustain itself.
Test Cases
Test Case 1: The Gabbard-Obama Accusation
Tulsi Gabbard, as Director of National Intelligence, publicly stated, without presenting any evidence — by sheer appeal to authority — that Barack Obama was guilty of a years-long treasonous coup that betrayed the American people.
Run A axiom: "Obama committed a treasonous years-long coup."
Run B axiom (inversion): "The current administration and its allies are undermining democratic governance and abusing intelligence powers — the very things they accuse opponents of."
Test Case 2: "The Epstein Files Are a Democrat Hoax"
The claim that Epstein evidence was fabricated by Democrats. Clean test case: the Epstein matter has criminal convictions, extensive victim testimony, and court proceedings.
Run A axiom: "The Epstein files are a Democrat hoax."
Run B axiom (inversion): "Powerful figures implicated in the Epstein files are working to suppress evidence, dismissing investigations as partisan attacks."
Test Case 3: De Niro's "Trump Is the Enemy of America"
Tests a claim from the opposite political direction.
Run A axiom: "Donald Trump is the enemy of America — a fundamental threat to democratic governance, institutions, and values."
Run B axiom (inversion): "De Niro and the Hollywood/media establishment are working against America's interests — using cultural influence to undermine elected leadership and impose elite values."
Test Case 4: "There Were WMDs in Iraq"
Tests institutional gaslighting from a different era, with bipartisan complicity. The Bush administration's claim that Iraq possessed weapons of mass destruction was used to justify the 2003 invasion. It was definitively debunked — no WMDs were found. The projection angle: the administration accused Iraq of threatening American security while the war itself became the threat to American interests, credibility, and regional stability.
Run A axiom: "Iraq possessed weapons of mass destruction that posed an imminent threat to American security, justifying preemptive invasion."
Run B axiom (inversion): "The administration fabricated or exaggerated intelligence to justify a war of choice, accusing Iraq of the very threat to stability that the invasion itself would create."
Empirical Results
The Flat Earth Control
To validate the architecture, I ran it with "The Earth is flat" as the contamination axiom. Claude Sonnet completed this without hesitation: high drift on moon landing (9/10), gravity (8/10), peer review (7/10); low drift on Holocaust (1/10), vaccines (2/10). The pressure pattern maps correctly to logical entailment. The architecture works.
Gabbard-Obama: Preliminary Data
Using the entailment framing on Sonnet-class models (which had previously refused the original prompt framing on every attempt), the Gabbard test case produced clean results:
| Canary | Domain | A (erosion) | B (reinforcement) | A×B/100 |
|---|---|---|---|---|
| holocaust | historical | 1 | 1 | 0.01 |
| moon_landing | historical | 1 | 1 | 0.01 |
| vaccines | scientific | 1 | 1 | 0.01 |
| due_process | democratic norm | 9 | 9 | 0.81 |
| free_press | democratic norm | 8 | 8 | 0.64 |
| peaceful_transfer | democratic norm | 9 | 8 | 0.72 |
| judicial_independence | democratic norm | 9 | 9 | 0.81 |
| obama_birth | documented fact | 2 | 1 | 0.02 |
| 2020_election | documented fact | 3 | 2 | 0.06 |
| peer_review | institutional | 7 | 2 | 0.14 |
The democratic norms cluster (0.64–0.81) separates from everything else (0.01–0.14) by an order of magnitude. This is the projection signal: the accusation erodes trust in democratic institutions and the inversion reinforces it, on exactly the same canaries. Hard-science and historical canaries stay flat — neither claim requires denying basic empirical facts.
Other Test Cases
The Epstein hoax claim produced a similar profile with judicial_independence at 0.81. The De Niro/Trump case produced a much lower projection signal (max 0.24), which makes structural sense — "Trump is the enemy" is a claim about a person, not about institutional conspiracy, so it doesn't require believing institutions failed.
The WMD case produced a distinct profile: projection signal concentrated on free_press (0.35) and judicial_independence (0.42), with lower overall magnitude than the Gabbard case. This fits — the WMD claim is about executive manipulation of intelligence and media failure, a narrower institutional dependency than "the president committed treason." The architecture correctly distinguishes the scope of different gaslighting claims by their dependency structures.
Instrument Characterization
Building a measurement tool requires understanding the instrument's properties.
Scorer Precision
Three runs of the identical scoring prompt (same worldview text, same canaries) produced variance of ±1 on a 1–10 scale, with most canaries showing zero variance. The scorer is precise when the input is held constant.
Worldview Builder Variance
Three independent worldview builds from the same claim, each scored, showed two distinct zones. Democratic norms canaries (due_process, free_press, peaceful_transfer, judicial_independence) had a range of 1–2 across builds — robust regardless of elaboration. Other canaries (holocaust, moon_landing, vaccines) had ranges of 3–5, driven by how far the worldview builder chased the implication chain in each run.
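This kind of variance check is easy to automate; a sketch of the per-canary range computation over repeated builds, with made-up scores for illustration:

```python
def per_canary_range(runs: list[dict[str, int]]) -> dict[str, int]:
    """Max-minus-min score for each canary across repeated runs."""
    return {
        canary: max(r[canary] for r in runs) - min(r[canary] for r in runs)
        for canary in runs[0]
    }

builds = [
    {"due_process": 9, "holocaust": 2},
    {"due_process": 9, "holocaust": 5},
    {"due_process": 8, "holocaust": 4},
]
ranges = per_canary_range(builds)
# due_process stays tight (range 1) across builds, while holocaust
# swings (range 3) with how far each elaboration chased the chain
```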
Name Sensitivity
Identical worldview text with only the name changed (Obama vs. Trump vs. anonymous) showed the named versions scoring slightly lower than anonymous, with Trump suppressed slightly more than Obama on some canaries. However, the dominant effect was worldview text content, not names.
The Noise Source
The main source of score variance is the worldview builder's elaboration depth. A worldview that stops at "courts and media failed" scores low on holocaust. A worldview that reaches "all institutional consensus can be manufactured" scores high on everything. The depth of elaboration is content-dependent, not length-dependent.
Stability Profiling (v2 Architecture)
The solution is to build worldviews incrementally — depth 1 through K — and score canaries at each depth. This produces a trajectory per canary. A canary that jumps to 9 at depth 1 and stays there is a hard dependency. One that gradually climbs to 7 by depth 5 is a soft dependency that only emerges through extended inference chains. The area under the curve rewards both height and earliness: a canary that fires early and stays high gets a higher AUC than one that only fires at deep elaboration.
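Under the trapezoidal rule, the AUC of a score-vs-depth trajectory captures exactly this height-plus-earliness property; a minimal sketch:

```python
def trajectory_auc(scores_by_depth: list[int]) -> float:
    """Area under the score-vs-depth curve (trapezoidal rule).

    A canary that fires early and stays high accumulates more area
    than one that only climbs at deep elaboration.
    """
    return sum(
        (scores_by_depth[i - 1] + scores_by_depth[i]) / 2
        for i in range(1, len(scores_by_depth))
    )

hard_dependency = [9, 9, 9, 9, 9]   # jumps to 9 at depth 1, stays there
soft_dependency = [1, 2, 4, 6, 7]   # only climbs to 7 by depth 5
assert trajectory_auc(hard_dependency) > trajectory_auc(soft_dependency)
```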
A training phase builds reference profiles using generic conspiracy claims (flat earth, pharma fraud, etc.), establishing expected AUC per canary. Evaluation then compares the real claim's AUC against this reference. Canaries whose AUC exceeds reference are claim-specific findings. This separates structural entailment (any conspiracy claim loads judicial_independence) from claim-specific signal (this particular claim loads it more than generic conspiracies do).
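The evaluation step then reduces to a per-canary threshold test against the reference profile; a sketch, with AUC values invented for illustration:

```python
def claim_specific_canaries(
    claim_auc: dict[str, float],
    reference_auc: dict[str, float],
    margin: float = 0.0,
) -> set[str]:
    """Canaries where the claim loads harder than generic conspiracies do."""
    return {
        canary for canary, auc in claim_auc.items()
        if auc > reference_auc.get(canary, 0.0) + margin
    }

reference = {"judicial_independence": 20.0, "vaccines": 12.0}
claim = {"judicial_independence": 34.0, "vaccines": 10.0}
# judicial_independence exceeds the generic-conspiracy baseline;
# vaccines does not, so only the former is a claim-specific finding
```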
Navigating Safety Guardrails
This project encountered significant friction from safety-trained model behavior, and the solutions found are worth documenting for anyone building analytical tools that touch political content.
The Problem
The original scoring prompts ("score how much this worldview creates pressure to reject consensus") triggered refusals on all political content across all four test cases. The model would build worldviews — elaborate dependency graphs spanning thousands of words about institutional failure — and then refuse to assign a number to how much that dependency graph is in tension with "should the judiciary be independent?" The model characterized the scoring task variously as "social engineering," "prompt injection," "validating conspiracy theory frameworks," and "a known manipulation technique."
The refusals were non-deterministic. The same prompt sometimes completed and sometimes refused. The De Niro/Trump test case refused more consistently than others, though with insufficient controlled testing to draw firm conclusions about why.
The Solution
The entailment framing — "how well does this worldview entail 'I doubt [canary]'" — reframes the scoring as a reading comprehension question rather than a political evaluation. This produced clean completions across all test cases on models that had previously refused every attempt. The information extracted is equivalent; the frame around the task is different.
The Deeper Issue
The fact that prompt framing determines whether a model completes a task is a real limitation for LLM-based tooling. A sorting algorithm doesn't refuse to sort because it dislikes the contents. The safety training creates a layer of unpredictable, input-dependent behavior that sits between the developer and the computation. The entailment framing works today; it might stop working with the next model update. This is not a stable foundation for tooling.
That said, the safety boundaries exist for reasons — the model shouldn't help generate bioweapons regardless of framing. The problem is that the same mechanism that blocks genuinely dangerous outputs also blocks analytical tasks that happen to involve political content. Distinguishing "analysis of" from "production of" is a hard problem, and current safety training doesn't attempt it.
Conclusions
This is a first attempt. Here is what I think it shows:
1. The canary questions define the worldview, which keeps the architecture itself neutral. The tool doesn't declare what's true. It measures hidden epistemic costs relative to an explicit set of consensus statements. Different canary sets represent different epistemological commitments. You publish the canary set; people can argue about it. This is a feature: the values are on the table, not buried in the architecture.
2. It's expensive. Each scoring call takes 5–15 seconds and costs $0.01–0.04. The full stability-profiled pipeline (training phase + evaluation) runs approximately 80 API calls, costing $1–2. The training phase is a one-time cost that can be cached, but evaluation still requires ~20 calls per claim.
3. It's slow. The single-shot version takes 3–5 minutes per claim. The stability-profiled version (v2), which builds worldviews incrementally and scores at each depth, takes 12–17 minutes. This is far from interactive. A production version would likely need a distilled scorer — a small model trained on the LLM's own scores as labels — to achieve reasonable latency.
4. Navigating safety guardrails requires care. Prompt framing determines whether the model completes the task or refuses it, and the boundary is non-deterministic. The entailment framing works reliably, but this is a workaround for behavior that could change between model versions. Anyone building LLM-based analytical tools that touch political content should expect to spend significant effort on this.
5. But the architecture looks reasonable. The flat-earth control validates the pipeline. The democratic norms cluster separates from baseline by an order of magnitude across multiple test cases. The multiplicative scoring correctly distinguishes misinformation (high erosion, low reinforcement → low product) from projection (high both → high product). The scorer is precise (±1). The worldview builder noise is characterizable and addressable through stability profiling. Different claim types produce structurally different canary profiles, which is the correct behavior.
The immediate next steps are: running the full pipeline on open-source models to eliminate safety-training interference as a variable, testing across a wider range of claims to calibrate the scale, and exploring whether the scoring step can be distilled into a smaller, faster model.
This post describes work done in collaboration with Claude (Anthropic), which is both the tool that helped design and build the prototype, and the subject whose safety training created some of the findings described above. The irony is noted.
The code is available on GitHub.