TL;DR
Same meaning, different answer is a failure mode standard evals miss. I introduce two diagnostics, the Collapse Index (CI) and the Structural Retention Index (SRI), that measure instability under benign input changes. Confidence tells you that something is probably wrong; CI/SRI tell you how it fails. The key insight: failure mode classification is different from error prediction.
Epistemic status:
Empirical diagnostic proposal based on several months of experiments across domains. This post focuses on two public text classification validations, SST-2 (unstable) and AG News (stable), which produce opposite verdicts from the same methodology. The framework has also been validated on CIFAR-10 image classification (early test), ESA satellite telemetry, and synthetic supernova Type II collapse curves. Claims are about early warning of instability, not correctness guarantees. I used an LLM for light editing and compression; the framework, experiments, and interpretation are mine.
Introduction
Standard evaluation metrics tell us whether a model is currently performing well. They do a much worse job telling us HOW a model fails when it does.
I introduce two diagnostics, the Collapse Index (CI) and the Structural Retention Index (SRI), designed to classify failure modes under harmless input changes (paraphrases, typos, minor formatting). Confidence can tell you a prediction is probably wrong. CI/SRI tell you whether it's confused (high instability) or confidently mistaken (stable collapse).
In validation on two public HuggingFace models, the framework produces opposite verdicts, and that's the point. SST-2 shows high instability (CI 0.275, 43% flip rate). AG News shows stability (CI 0.019, 9% flip rate). Same methodology, different failure modes surfaced.
The failure mode
Here’s a pattern I keep seeing...
A model scores well on benchmarks: high accuracy, solid AUC, reasonable calibration. Confidence is high. It passes internal evaluation and gets deployed.
Then users start reporting strange behavior. A typo flips a classification. A paraphrase reverses a safety decision. Inputs with the same meaning yield contradictory outputs, often with high confidence.
These failures are difficult to catch with standard evals because nothing obvious degrades beforehand. Accuracy remains high until it suddenly doesn’t.
The problem isn’t that the model is sometimes wrong. The problem is that there’s no early warning that reliability is deteriorating.
I call this behavioral instability: same meaning, different answer.
Reframing the evaluation question
Most evaluation implicitly asks:
"Is the model giving the correct answer?"
For safety-relevant or high-cost deployments, this is not the question I care about most.
The more useful question is:
"Is the model becoming unreliable before observable error rates increase?"
Answering this requires separating signals that are usually collapsed into a single score.
The three signals
The framework tracks three independent signals, each bounded in [0, 1]:
Collapse Index (CI)
How often the model's predictions change under harmless input variations.
Structural Retention Index (SRI)
Whether the model’s internal decision structure remains consistent across those perturbations.
Confidence
The model’s self-reported certainty.
The important signal is disagreement. When confidence remains high while CI rises and SRI falls, the model is entering a dangerous regime.
As an analogy, it's like an overconfident drunk. They insist they're totally fine to drive (high confidence), but fail the breathalyzer (CI) and can't walk a straight line (SRI).
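To make these signals concrete, here is a minimal sketch of how a flip-rate-style instability score and a distribution-consistency proxy could be computed over a set of benign perturbations. The post does not spell out the exact CI/SRI formulas, so the `model_predict` interface (a label plus a probability vector), the total-variation distance, and the toy model below are illustrative assumptions, not the published definitions.

```python
import numpy as np

def flip_rate(model_predict, text, perturbations):
    """CI-style signal: fraction of benign perturbations whose predicted label
    differs from the prediction on the original text."""
    base_label, _ = model_predict(text)
    flips = [model_predict(p)[0] != base_label for p in perturbations]
    return float(np.mean(flips)) if flips else 0.0

def structural_consistency(model_predict, text, perturbations):
    """SRI-style proxy: 1 minus the mean total-variation distance between the
    original output distribution and each perturbed one (bounded in [0, 1])."""
    _, base_probs = model_predict(text)
    base_probs = np.asarray(base_probs)
    distances = [0.5 * np.abs(np.asarray(model_predict(p)[1]) - base_probs).sum()
                 for p in perturbations]
    return 1.0 - float(np.mean(distances)) if distances else 1.0

# Toy stand-in model: predicts "negative" whenever the word "no" appears, else "positive".
def toy_predict(text):
    p_neg = 0.9 if " no " in f" {text.lower()} " else 0.1
    label = "negative" if p_neg > 0.5 else "positive"
    return label, [p_neg, 1.0 - p_neg]

original = "The patient shows no signs of deterioration."
variants = [
    "The patient shows no signs of deterioration ",       # formatting tweak
    "the patient shows no signs of deterioration.",       # casing tweak
    "The patient exhibits zero signs of deterioration.",  # paraphrase that flips the toy model
]
print(flip_rate(toy_predict, original, variants))               # ~0.33
print(structural_consistency(toy_predict, original, variants))  # ~0.73
```

Confidence is then simply read off the model's own probabilities; the diagnostic value comes from cases where it stays high while the two scores above degrade.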
Example
Input
“The patient shows no signs of deterioration.”
Paraphrase
“The patient exhibits no deterioration symptoms.”
The model predicts STABLE (0.94 confidence) for the first and UNSTABLE (0.91 confidence) for the second.
From the perspective of standard metrics, nothing looks wrong. Confidence is high in both cases.
Structural diagnostics flag it immediately:
CI = High instability (prediction flip on same meaning)
SRI = Low structural retention (decision boundary collapse)
Conf = High confidence (model thinks it's correct both times)
This combination of high confidence plus behavioral instability is precisely the failure mode I'm trying to surface.
Only three metrics?
Earlier versions of this framework included more metrics.
Most were correlated and tended to move together. They mostly answered the same question in different ways: “Is the model currently correct?”
I intentionally restricted the system to metrics that can contradict each other. Three interpretable metrics that can disagree provide more decision-relevant information than many metrics that all agree until failure.
Perturbations
These aren't attacks.
The changes are deliberately benign: common typos, paraphrases, minor reordering, formatting tweaks. This is how users naturally interact with models.
If a model flips under a typo, that’s not an attack succeeding. It’s a preview of production behavior.
Note:
CI and SRI are computed from prediction changes under benign perturbations (typos, paraphrases, reordering), not from the prompts themselves. The perturbations are deliberately harmless, but the instability they reveal is the danger signal.
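For illustration, a few benign perturbation generators of the kind described above might look like the sketch below. The specific edits (adjacent-character swaps, whitespace and punctuation tweaks) are assumptions for the example; the CLI's actual perturbation set is not specified here, and paraphrases would typically come from a paraphrase model or a template bank rather than string edits.

```python
import random

def typo(text: str, rng: random.Random) -> str:
    """Swap two adjacent characters, mimicking a common fat-finger typo."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def formatting_tweak(text: str, rng: random.Random) -> str:
    """Harmless formatting change: extra whitespace or a trailing-punctuation toggle."""
    return rng.choice([text + " ", "  " + text, text.rstrip(".") + "."])

def benign_perturbations(text: str, n: int = 10, seed: int = 0) -> list[str]:
    """Generate n benign variants of a text, seeded for reproducibility."""
    rng = random.Random(seed)
    ops = [typo, formatting_tweak]
    return [rng.choice(ops)(text, rng) for _ in range(n)]

print(benign_perturbations("The patient shows no signs of deterioration.", n=3))
```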
Empirical results
I validated the framework using my Collapse Index CLI on two public HuggingFace models with contrasting stability profiles.
SST-2 Sentiment
Verdict: 🟡 Overconfident Stable
High behavioral instability under perturbation. CI outperforms confidence significantly (0.698 vs 0.515 AUC). Nearly half of predictions flip under benign changes.
AG News Topic Classification
Verdict: 🟢 Stable
Low behavioral instability. Both CI and confidence achieve strong discrimination. Only 9.2% of predictions flip.
The coincidence that proves the point:
Both models have EXACTLY the same benchmark accuracy: 90.8% (454/500 correct). This isn't a rounding artifact; it's the same fraction. Standard benchmark accuracy cannot distinguish them.
But CI/SRI reveal fundamentally different reliability profiles:
SST-2: High flip rate, confidence barely better than random (AUC 0.515), CI catches what confidence misses.
AG News: Low flip rate, both signals work, but 76% of errors are Type I (stable collapse; confidently wrong).
Repro code: ci-sst2 repo & ci-sri repo
Interpreting the signals
At the aggregate level (examples):
Stable: low CI, high SRI, honest confidence (the good case)
Overconfident stable: low CI, high SRI, but confidence can't distinguish errors
Confident chaos: high CI, low SRI, high confidence (most dangerous)
The highest-risk regime is confident chaos: high CI, low SRI, high confidence. These models pass QA and fail in production.
The two validations show different failure modes:
SST-2 is overconfident stable: CI 0.275, SRI 0.725, but confidence provides almost no signal (AUC 0.515). The model isn't chaotic, but confidence can't distinguish errors from correct predictions. CI catches instability that confidence completely misses. This is a model where you'd distrust confidence scores.
AG News is genuinely stable: 9% flip rate, both signals work. But 76% of its 46 errors are Type I (stable collapse): confidently wrong with no behavioral instability. The model isn't confused; it's confidently mistaken. That's a different problem requiring different fixes.
So why do we need CI/SRI if confidence sometimes works? Because confidence only tells you that something is probably wrong. CI/SRI tell you how it fails and whether confidence is even trustworthy for that model.
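The AUC comparison quoted above (CI 0.698 vs confidence 0.515 on SST-2) can be reproduced in principle by treating "the prediction was wrong" as the positive class and ranking rows by CI or by 1 - confidence. A minimal sketch with made-up numbers; the array names are assumptions, not the CLI's actual output schema.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical per-row values; in practice these come from the evaluation run.
is_error   = np.array([0, 1, 0, 0, 1, 1, 0, 1])                          # 1 = prediction was wrong
ci_score   = np.array([0.05, 0.60, 0.10, 0.02, 0.45, 0.70, 0.08, 0.55])  # per-row instability
confidence = np.array([0.97, 0.93, 0.88, 0.99, 0.91, 0.95, 0.90, 0.94])  # self-reported certainty

# How well does each signal rank errors above correct predictions?
auc_ci   = roc_auc_score(is_error, ci_score)          # instability as an error detector
auc_conf = roc_auc_score(is_error, 1.0 - confidence)  # low confidence as an error detector
print(f"CI AUC:         {auc_ci:.3f}")
print(f"Confidence AUC: {auc_conf:.3f}")
```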
Row-level triage
At the sample level, each input receives a Collapse Severity Index (CSI), enabling triage of individual failures rather than relying solely on aggregate scores.
For example, Type I ERRORS (Stable Collapse) are particularly dangerous at the row level: low CI, high SRI, high confidence, but WRONG with no flips under perturbations.
Now back to the overconfident drunk analogy. CI and SRI are like the breathalyzer and walk test (aggregate scores). But even if someone passes those tests while still acting impaired, a blood test catches what's actually in their system. CSI is that blood test at the row level. It reveals dangerous cases that pass all the aggregate checks.
These pass all standard checks and are invisible to conventional evaluation.
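As a sketch of what row-level triage can look like, the function below buckets each row into the regimes described above. The thresholds and field names are illustrative assumptions; this is not the actual CSI formula, which is not given in this post.

```python
def triage_row(ci, sri, confidence, is_correct,
               ci_low=0.2, ci_high=0.5, sri_high=0.8, conf_high=0.9):
    """Bucket one row into a failure regime (thresholds are illustrative, not the CSI definition)."""
    if ci >= ci_high and sri < sri_high and confidence >= conf_high:
        return "confident chaos"            # unstable, structure lost, still sure of itself
    if ci <= ci_low and sri >= sri_high and confidence >= conf_high and not is_correct:
        return "Type I (stable collapse)"   # confidently wrong, no behavioral warning
    if ci <= ci_low and sri >= sri_high:
        return "stable"
    return "review"                         # ambiguous rows get a human look

rows = [
    {"ci": 0.05, "sri": 0.95, "confidence": 0.96, "is_correct": False},  # stable collapse
    {"ci": 0.62, "sri": 0.40, "confidence": 0.94, "is_correct": True},   # confident chaos
]
for r in rows:
    print(triage_row(**r))
```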
Value
The value of CI and SRI is not a better benchmark score. It’s the ability to detect when to stop trusting a model before failures become visible to users.
The framework is computationally lightweight. Analysis of 5,000 rows completes in seconds on standard hardware (8GB laptop, no GPU required). This enables continuous monitoring in production environments without infrastructure overhead.
That changes deployment thresholds, monitoring strategies, and postmortem analysis.
Auditability
Each run produces:
A row-level Collapse Log (ground truth table with raw data)
A sealed artifact bundle with SHA-256 hashes (reports, summaries, CI/CD JSON, etc.), as sketched below
A reproducible snapshot of configuration and metadata
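The hashing step can be as simple as the sketch below, which writes a SHA-256 manifest for every file in an artifact directory. The directory layout and manifest format here are placeholders, not the CLI's actual bundle format.

```python
import hashlib
import json
from pathlib import Path

def seal_artifacts(artifact_dir: str, manifest_path: str = "manifest.json") -> dict:
    """Record a SHA-256 digest for every file under artifact_dir and write a JSON manifest."""
    manifest = {}
    for path in sorted(Path(artifact_dir).rglob("*")):
        if path.is_file():
            manifest[str(path)] = hashlib.sha256(path.read_bytes()).hexdigest()
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
    return manifest

# Verification is the reverse: recompute digests and compare against the stored manifest.
```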
The framework is intended for use both before release (model comparison, regression testing) and after deployment (monitoring, triage, incident analysis) without prescribing deployment or gating decisions.
Scope
This is not a benchmark, not adversarial testing, not OOD detection, and not a replacement for standard metrics. There are no formal guarantees. The claim is empirical: CI/SRI provide failure mode classification that confidence alone cannot.
The framework is domain-agnostic (based on prediction changes, not prompts). The SST-2 and AG News validations are featured here because they provide contrasting verdicts with public reproducibility. Additional validation on CIFAR-10 image classification, ESA satellite telemetry, and synthetic supernova Type II collapse curves shows similar patterns.
Conclusion
The core insight is simple. Confidence and accuracy are not enough. Models can maintain high self-reported certainty while structural signals degrade.
The framework (CI + SRI + Confidence) separates these signals deliberately. The goal is not better benchmark scores but earlier detection of reliability degradation.
The two validations show the framework catching different failure modes:
SST-2: Overconfident stable. Confidence is nearly useless for error detection, but CI surfaces the brittleness. You'd distrust this model's confidence scores.
AG News: Genuinely stable, but most errors are Type I: confidently wrong with no behavioral warning. Different problem, different fix.
Same methodology, opposite verdicts. That's the point.
The core methodology is published (Zenodo DOI). The interpretation layer described here is newer but built on the same foundation. There is also a working end-to-end CLI pipeline.
If your evals only tell you whether a model is currently right, I think you’re missing an early warning signal.
Collaboration and Access
If you work on model evaluation, deployment decisions, or safety monitoring, I’m interested in collaborating.
I can run this evaluation on your datasets (confidentiality respected).
I am open to pilot projects or short-term contracts.
I am also open to full-time or research roles focused on evaluation, reliability, or monitoring.
Note:
The Collapse Index CLI is available for research and internal evaluation under a proprietary license; commercial use requires a separate agreement.
Author: Alex Kwon
Contact: ask@collapseindex.org
ORCID: 0009-0002-2566-5538
Website: Collapse Index Labs
Thanks for taking the time to read. Questions welcomed.