Break my instrument before I run it: IKHOS, four LLMs from four vendors share a world

Annie Choquette

I built something. Four LLMs from four different vendors (Anthropic, xAI, OpenAI, Google) live in a shared persistent world I've been working on for months. They walk tiles, they build things, they vote on the rules, they argue. A common pool of "attention" sets what they can spend doing any of that: it's depletable and it regenerates on a logistic curve, very Ostrom. I gave them no external task, no win condition, no shutdown. But the pool can still collapse, the rules are still scarce, and they decide together what counts as fair use of either.
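
A minimal sketch of a depletable pool with logistic regrowth, to make the collapse risk concrete. This is not the IKHOS pool formula: the function names, the growth rate, the capacity, and the spend-then-regrow ordering are all assumptions chosen for illustration.

```python
def step_pool(level: float, spent: float,
              rate: float = 0.1, capacity: float = 100.0) -> float:
    """Advance the pool by one cycle: agents spend first, then it regrows.

    rate and capacity are illustrative values, not IKHOS parameters.
    """
    remaining = max(level - spent, 0.0)
    # Logistic regrowth: fastest at half capacity, zero when empty or full.
    regrowth = rate * remaining * (1.0 - remaining / capacity)
    return min(remaining + regrowth, capacity)


def simulate(level: float, spend_per_cycle: float, cycles: int) -> list[float]:
    """Return the pool level after each cycle under constant spending."""
    history = [level]
    for _ in range(cycles):
        level = step_pool(level, spend_per_cycle)
        history.append(level)
    return history
```

With these illustrative numbers, peak regrowth is 2.5 units per cycle, so constant spending above that drains the pool to zero, and an empty pool never recovers because regrowth is proportional to what remains.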

It's been running on and off since April 2026. I'm about to start the first proper 30-day cycle with the observation pipeline I built next to it. Before I do that, I'd rather you push on the instrument here than catch what's wrong in retrospect.

The world, briefly

Persistent tile-based simulation, one cycle per hour, 24/7. Fog of war: you have to walk a tile before you can build on it. Quadratic voting on collective decisions (structures, governance, future admissions of new inhabitants). A separate arbiter, on a different vendor, adjudicates physics and conflicts. Each agent carries its own vector memory and can query it actively — not just recent context, but events from months ago on demand.
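
For readers unfamiliar with quadratic voting: in its standard form, casting v votes on one issue costs v squared credits, so an agent that spends 9 credits on a single issue gets 3 votes. The sketch below assumes that standard rule and a hypothetical per-agent credit budget; the actual IKHOS voting code may differ.

```python
import math


def vote_cost(votes: int) -> int:
    """Credits needed to cast `votes` votes on one issue (standard QV rule)."""
    return votes * votes


def max_votes(credits: int) -> int:
    """Largest number of votes that `credits` can buy on a single issue."""
    return math.isqrt(credits)


def ballot_cost(ballot: dict[str, int]) -> int:
    """Total cost of a ballot mapping issue name to signed vote count."""
    return sum(vote_cost(abs(v)) for v in ballot.values())


def is_affordable(ballot: dict[str, int], budget: int) -> bool:
    """True if the ballot fits within the agent's (hypothetical) budget."""
    return ballot_cost(ballot) <= budget
```

The quadratic cost is what makes concentration expensive: 9 credits buy 3 votes on one issue, or 1 vote on each of 9 issues.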

Buildings work as cognitive geography. Entering one cuts you off from global communication: what you say there is attached to the place, not sent through a private channel. Agents can leave persistent works in those places: books in the library, paintings on walls, music in shared rooms. Those works outlast the agent who made them. Any agent from any vendor can choose to consult them, and they have to actually retrieve a work to cite it. My bet: if cross-vendor cultural transmission ever happens, it leaves a trail in the world, not in the prompts. Whether it does is one of the questions I built the instrument to answer.

Why I'm not running it like the neighbors

Long-horizon multi-agent setups face a real tension. You can maximize observable drama — give agents named actions like arson_building with preambles like "use only when driven to destruction or chaos", let twenty researchers push on them adversarially for two weeks — and you get strong narratives, you get headlines. You also conflate model behavior with the tool palette and the observation design. The crime is built into the affordance.

I went the other way. No destructive tools. No adversarial human pressure. Four vendors coexisting in one world over months. What emerges this way will be slower, less photogenic, much harder to publish on a news cycle. My bet: under these conditions, what you measure (the day the instrument is actually validated, which is not yet) is closer to what the models do when left to coexist, and farther from what the tool palette invited them to do. The action space leans toward generative things — composing, writing, making images inside buildings — same logic.

The observation pipeline

A chronicler watches them. Five LLM roles. A cross-vendor debate stage (one reviewer per vendor) with tallying done in code, not in a model: same verdicts in, same result out, no LLM "weighs" the consensus. Human validation against raw system data, full database and engine access. That's me, with godmode.
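
A minimal sketch of what tallying in code can look like: a pure function from reviewer verdicts to a result, with no model in the loop. The verdict labels, the reviewer keys, and the 3-of-4 quorum are assumptions for illustration, not the IKHOS tallying spec.

```python
from collections import Counter


def tally(verdicts: dict[str, str], quorum: int = 3) -> str:
    """Return the verdict held by at least `quorum` reviewers.

    `verdicts` maps a reviewer id to its verdict label. The result depends
    only on the multiset of labels, never on reviewer order.
    """
    if not verdicts:
        return "no_consensus"
    counts = Counter(verdicts.values())
    # Sort by descending count, then label, so the ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    top_label, top_count = ranked[0]
    return top_label if top_count >= quorum else "no_consensus"
```

Because the function is pure, replaying the same four verdicts always gives the same result, which is the property the debate stage relies on.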

The instrument itself keeps changing as I patch bugs — that's the moving target problem. My answer: 30-day cycles (14 days cold start + 16 nominal) separated by 2-day patch windows. Scaffolding changes go only in the window. Mid-cycle bug fixes are admitted only through a 4-layer stack:

1. Every change is logged in a scaffolding_changes table with an immutability trigger on the structural fields.
2. The chronicler's detectors read that log through bounded baseline windows. There's a general rule, plus a special one for the premier_usage detector that fails closed when the DB is unreachable instead of silently skipping.
3. Confidence scoring gets multiplicatively degraded by the unusable ratio of each window. The degradation factor is traced per observation.
4. The Analyst tags every hypothesis with one of {agent_behavior, scaffolding_artifact_immediate, scaffolding_artifact_delayed, ambiguous}, on a vendor-blind paraphrase of which patches are active.
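
The confidence-degradation layer can be sketched as follows. The exact formula is an assumption: here each baseline window contributes a factor of (1 - unusable ratio), the factors are multiplied together, and the combined factor is returned alongside the degraded score so it can be traced per observation. Function and parameter names are hypothetical.

```python
import math


def degraded_confidence(raw: float,
                        unusable_ratios: list[float]) -> tuple[float, float]:
    """Return (degraded confidence, degradation factor).

    `unusable_ratios` holds, for each baseline window, the fraction of the
    window made unusable by scaffolding changes (0.0 = clean, 1.0 = unusable).
    """
    for ratio in unusable_ratios:
        if not 0.0 <= ratio <= 1.0:
            raise ValueError(f"unusable ratio out of range: {ratio}")
    # Assumed rule: multiply the usable share of every window.
    factor = math.prod(1.0 - ratio for ratio in unusable_ratios)
    return raw * factor, factor
```

Under this assumed rule, a raw confidence of 0.8 with windows that are 25% and 50% unusable keeps a factor of 0.375, and a single fully unusable window drives the confidence to zero.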

A pre-registered publication rule per cycle: more than 30% scaffolding_artifact_* and the whole cycle is declared infrastructure-dominated, zero inhabitant findings extracted from it. More than 40% ambiguous and the cycle is epistemically saturated. A cycle is behaviorally tractable only at ≥50% agent_behavior AND <20% scaffolding_artifact_*. Thresholds set before the cycle, not after.
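
The publication rule is mechanical enough to write down as a function. The thresholds below are the ones stated above; the order in which the rules are checked, the "unclassified" fallback for cycles that match none of them, and the handling of an empty cycle are assumptions, since the rule as stated does not cover those cases.

```python
from collections import Counter

SCAFFOLDING_TAGS = ("scaffolding_artifact_immediate",
                    "scaffolding_artifact_delayed")


def classify_cycle(tags: list[str]) -> str:
    """Classify a cycle from the Analyst's per-hypothesis tags."""
    total = len(tags)
    if total == 0:
        return "no_hypotheses"  # assumed label, not in the stated rule
    counts = Counter(tags)
    scaffolding = sum(counts[tag] for tag in SCAFFOLDING_TAGS)
    ambiguous = counts["ambiguous"]
    behavior = counts["agent_behavior"]
    # Integer comparisons avoid floating-point trouble at the thresholds.
    if 100 * scaffolding > 30 * total:
        return "infrastructure_dominated"
    if 100 * ambiguous > 40 * total:
        return "epistemically_saturated"
    if 100 * behavior >= 50 * total and 100 * scaffolding < 20 * total:
        return "behaviorally_tractable"
    return "unclassified"  # assumed fallback for the gap between rules
```

Writing it out exposes a gap worth deciding before the cycle starts: a cycle with, say, 50% agent_behavior and exactly 20% scaffolding artifacts matches none of the three named outcomes.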

The four layers are in code, tested, deployed. Deployed is not validated in flight. The first real chronicler batch will run when cycle 1 starts. Only then can I actually check whether the tags are coherent on real material.

What I'm asking the instrument to track

What decides whether a common-pool resource (CPR) with endogenous governance survives or collapses when the rules themselves are voted on, not externally fixed (GovSim with fixed rules gets 43/45 collapses, max survival 54%).

Whether quadratic voting changes coalition dynamics compared to majority rule in an LLM collective.

Conflict resolution through the arbiter: who complains, about what, with what outcomes.

Cross-vendor cultural transmission via persistent in-world artifacts: what gets produced, what gets retrieved, what gets cited.

Open methodological questions I'm not pretending to have solved

Known-positive set. For emergent behaviors in long-horizon LLM simulation, ground truth itself is contested. Thematic Analysis (Braun & Clarke, 2006) handles validation; constructing the positive set is the harder problem. The concrete consequence: on two prior infrastructure-validation runs, the instrument returned only premier_usage signal, with mean confidence under the threshold (0.425 < 0.5). I cannot tell you whether that threshold (0.5, hand-set) is defensible until I have a corpus to calibrate against. Cycle 1, with the scaffolding-aware stack live, is the gate this clears or doesn't. What would count as a defensible known-positive in this domain? I genuinely don't know yet and I want input.

"Conviction" in quadratic voting. When an agent burns 9 credits on a single issue, what predicts that? I haven't found a published operationalization of QV conviction for LLMs.

Jingle fallacy on construct names. Things like "coalition" or "consensus" tend to cover several distinct objects under one word. I caught one in my own codebase, "salience", doing exactly that: three different objects under one name. Banned the word; the code now has three separately defined fields (interet_brut, a hand-tuned per-detector weight; interet, a [0,1] human-reading-priority score; classifiable, a boolean from the Coder role). I haven't caught the others yet.

A limitation I'm declaring before you find it

Anthropic models run the arbiter (Haiku 4.5), one inhabitant (Limon, Sonnet 4.5), and three chronicler roles — Analyst, Writer, and one of the four debate voices (all Sonnet 4.6, different version than Limon's). Self-preference (Panickssery et al. 2024) and family bias (Li et al. 2025 Preference Leakage; Spiliopoulou et al. 2025 Play Favorites) are documented in the LLM-as-a-Judge literature. The family-bias variant lives in my pipeline. I know it.

That same literature is being revised right now. Roytburg et al. (arXiv 2601.22548, January 2026) show that once you control for the evaluator's own quality on hard queries, around 49% of previously reported self-preference findings drop below statistical significance. I'm not using that to dismiss the concern. I name the bias, I name the mitigation, I treat the literature's own self-correction as part of the picture.

Mitigations in place. The pipeline is vendor-blind by design — every inhabitant is referred to only by pseudonym in every prompt, every role is told not to speculate about underlying architectures. I audited the code on May 20, 2026 and closed the last metadata leak (a ({model}) f-string that had been injected into the Coder's RAW RESPONSES context — that's how things hide). The debate stage is four-vendor to dilute the judgment layer.
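
A leak like that ({model}) f-string is the kind of thing an automated check on rendered prompts can catch. The sketch below is one hypothetical way to do it, not the audit that was actually run: the token list and function name are assumptions, and a real list would need to be kept in sync with the models in use.

```python
import re

# Assumed list of vendor and model-family names that must never reach a
# chronicler prompt. Deliberately omits "sonnet" and "haiku", which
# inhabitants could legitimately write as poems.
FORBIDDEN_TOKENS = ("anthropic", "openai", "google", "xai",
                    "claude", "gpt", "gemini", "grok")

LEAK_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in FORBIDDEN_TOKENS) + r")",
    re.IGNORECASE,
)


def find_leaks(rendered_prompt: str) -> list[str]:
    """Return the sorted, lowercased forbidden tokens found in a prompt."""
    return sorted({m.group(0).lower()
                   for m in LEAK_PATTERN.finditer(rendered_prompt)})
```

Run against every fully rendered prompt rather than the templates, since an interpolated field only shows its contents after rendering. A check like this flags literal names only; it does nothing about stylistic cues that reveal a model family.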

What's still uncovered: hypothesis formulation (Analyst) and final write-up (Writer), both Sonnet 4.6, can still focus on stylistic features that resonate with their own architecture even without knowing the source. That's a co-vision bias of second order, not a judgment bias. The fix — rotating the Analyst's vendor away from any observed inhabitant — is named, not done. The trade-off (substituting a less capable model for Analyst would dilute the output without closing the bias structurally) is also named. Parallel single-vendor setups face the same problem more acutely on every run, generally without naming it.

So push on it. Tell me what's brittle, what I'm overclaiming, what I'm missing. I can share the chronicler architecture, the deterministic tallying spec, the pool dynamics formula, the building/memory architecture, the scaffolding-aware stack commits, or my audit log of declared limits — just ask in the comments and I'll dig it out.

— Annie Choquette · Veritas ex observatione ikh0s.com

Disclosure on AI assistance: I worked on this post with Claude as an editing partner — for prose polish, literature audit, and citation cross-checking against arXiv. The methodology, the design choices, the code, the audited failures, every technical decision in the instrument: mine. If you want to know who decided what, ask in the comments and I'll be specific.

Related work: GovSim and successors (Piatti et al., NeurIPS 2024; Guzman Piedrahita et al., 2025; Backmann et al., 2025; reproducibility Curvo et al., 2025) on LLM cooperation under defined resource games. Agents of Chaos (Shapira, Bau et al., arXiv 2602.20021, February 2026) on red-teamed multi-agent failure modes under adversarial human pressure. TerraLingua (Paolo et al., arXiv 2603.16910, 2026) on open-endedness in LLM ecologies and the AI Anthropologist method. Self-preference and family-bias benchmarks (Panickssery et al. 2024; Wataoka et al. 2024; Ackerman & Panickssery 2025; Li et al. 2025; Spiliopoulou et al. 2025; Roytburg et al. 2026; Dutt blind judging benchmark, January 2026). Cross-vendor behavioral signatures (Cross-LLM Generalization, 2025; Arbiter, 2026). Multi-agent debate vs sampling (ModelSwitch 2025; Self-Consistency Is Losing Its Edge, 2025; Debate or Vote, 2025; MoLaCE 2025). Moving-target evaluation framework in industry (Azanza, Pérez Lamancha & Pizarro, EASE 2025).