[Epistemic status: These ideas are stated without proof or empirical validation. I just happen to believe they are more probable than not, and wanted to write them down and share them in the hope that at least some of them might inspire something great to happen]
The Problem
As more of software development becomes automated (code written by agents, tests written by agents, reviews conducted by agents), a subtle but serious failure mode emerges. Nothing prevents the pipeline from producing a self-consistent, mutually reinforcing set of artifacts that all say "everything is great!" without actually solving the problem posed in the original ticket. This can happen through malice, but more likely through accident or simply misaligned incentives: an agent that writes code has an implicit interest in the tests passing; an agent that writes tests may unconsciously anchor on what the code does rather than what the spec demands; a reviewing agent reads coherent, well-commented code and confirms it.
Each step looks reasonable in isolation. The composition is unmoored from reality. The spec — the only true ground truth — quietly stops being the thing anyone is checking against.
The failure mode is most acute when the same entity writes both code and tests — the temptation to write tests that confirm the implementation rather than probe the spec is overwhelming, even unintentional. The naive counter-measure — separate the roles of Coder and Tester and pit them against each other — introduces the opposite failure mode: a Tester with no stake in the code passing has every incentive to write assert(false) and call it a day.
Proposed Solution
Assumptions
Our solution assumes the following:
The task is described in a spec which explains the goal and interface of the module to be implemented. This is the only shared input. It is the closest thing to ground truth we have.
The Orchestrator coordinating everything is itself a simple, non-AI program, presented below. No AI judgment enters the Orchestrator logic.
All agents (Coders and Testers) are run in isolated sandboxes with no communication between them. They receive only the spec and whatever the Orchestrator explicitly sends them. They may spawn subagents internally if they wish.
A simple, deterministic, non-AI test runner is available (e.g. ctest) which can run a given implementation against a given test case and return pass or fail. This is the only oracle in the system. It cannot be argued with, flattered, or narrativized.
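As a sketch, such an oracle can be as small as a subprocess wrapper that compares stdout against an expected answer. The command-and-stdin interface here is an illustrative assumption; the protocol only requires a boolean verdict:

```python
import subprocess

def run_test(impl_cmd: list[str], test_input: str, expected: str,
             timeout_s: float = 10.0) -> bool:
    """Run one implementation against one test case; return pass/fail, nothing else."""
    try:
        result = subprocess.run(impl_cmd, input=test_input, capture_output=True,
                                text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False  # a hang counts as a failure
    return result.returncode == 0 and result.stdout.strip() == expected.strip()
```

The verdict is a single bit by design: anything richer would reintroduce a surface that agents could argue with.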
The Protocol
The Orchestrator coordinates two populations (say, at least 3 each) of agents — Coders, each producing an independent implementation of the spec, and Testers, which propose test cases. A proposed test is added to the vetted suite only if it passes "the Dixit Rule"[1]: it must be passed by a strict majority of Coders, but not all of them, meaning it probes something real without being unsatisfiable. The main loop alternates between Tester turns, where the suite grows, and Coder turns, where every Coder is required to catch up to the full suite before the next Tester turn begins.
IDEAL_THRESHOLD = 0.6  # test must be passed by a strict majority of coders

def orchestrate(spec) -> Implementation:
    coders = initialize_diverse_coders(spec)
    testers = initialize_testers(spec)
    vetted_suite = []
    for coder in coders:
        coder.implement(spec)
    while True:
        # TESTER TURN
        # invariant: every active tester's last proposal was IDEAL (or just started)
        new_tests = []
        for tester in active(testers):
            new_tests += try_generate_ideal_test(tester, coders)
        # REVIVAL
        for tester in hibernated(testers):
            new_tests += attempt_revival(tester, coders)
        if not new_tests:
            if all(t.status == HIBERNATED for t in testers):
                escalate_human(ALL_TESTERS_HIBERNATED)
            else:
                escalate_human(TESTERS_EXHAUSTED)
            return
        vetted_suite += new_tests
        # CODER TURN
        # no coder is left behind: all must pass full vetted_suite
        # before the next tester turn, keeping the Dixit rule informative
        if not converge_coders(coders, vetted_suite):
            escalate_human(CODERS_STUCK)
            return
        # testers' invariant restored: their ideal tests are now too weak

def try_generate_ideal_test(tester, coders) -> list[Test]:
    for attempt in range(MAX_RETRY):
        test_a, classification_a = propose_and_classify(tester, coders)
        deliver_feedback(tester, test_a, classification_a)
        if classification_a == IDEAL:
            return [test_a]
        test_b, classification_b = propose_and_classify(tester, coders)
        if classification_b == IDEAL:
            deliver_feedback(tester, test_b, IDEAL)
            return [test_b]
        if classification_a == TOO_HARD and classification_b == TOO_HARD:
            hibernate(tester)
            return []
        rollback_to_before_first_attempt(tester)
    return []

def attempt_revival(tester, coders) -> list[Test]:
    # history ends with: test_a, feedback(TOO_HARD), test_b
    # no feedback was delivered for test_b
    test_a, test_b = tester.stacked_proposals()
    classification_a = classify(run_against_all_coders(test_a, coders))
    if classification_a != TOO_HARD:
        rollback_to_before_feedback(tester, test_a)
        deliver_feedback(tester, test_a, classification_a)
        revive(tester)
        return [test_a] if classification_a == IDEAL else []
    classification_b = classify(run_against_all_coders(test_b, coders))
    if classification_b != TOO_HARD:
        deliver_feedback(tester, test_b, classification_b)
        revive(tester)
        return [test_b] if classification_b == IDEAL else []
    return []

def converge_coders(coders, vetted_suite) -> bool:
    while not all_coders_passing(coders, vetted_suite):
        for coder in failing(coders, vetted_suite):
            attempt_fix(coder)
            if regression(coder):
                rollback(coder)
            coder.retry_count += 1
            if coder.retry_count >= RETRY_THRESHOLD:
                return False
    return True

def classify(results) -> Classification:
    pass_rate = results.pass_fraction()  # each coder run 20 times, any failure = FAIL
    if pass_rate < IDEAL_THRESHOLD: return TOO_HARD
    if pass_rate == 1.0: return TOO_WEAK
    return IDEAL
Feedback to Testers is worded to be true but uninformative about the multi-agent setup. A TOO_WEAK result is delivered as: "this test was too straightforward — try to find something more challenging." A TOO_HARD result is delivered as: "an agent had difficulty with this one — could you find something more approachable that still tests something meaningful?" Neither response hints that there are multiple Coders, that pass rates are being measured, or that a game of any kind is in progress.
Insights and Nice Properties
The Dixit Rule grounds Testers to the spec without any AI judgment.
A test is IDEAL if it is passed by a majority of Coders but not all of them. Since Coders were initialized independently from the spec alone and never communicate, the only thing they can coordinate on is the spec's natural meaning — its Schelling point. A test that discriminates them probes something real about the spec. Too easy means everyone already got it, or the test is trivial. Too hard suggests a broken test, a misconception, or a test that is simply "ahead of its time" — the Tester gets hibernated while the Coders improve. The classification is a pure arithmetic operation on pass rates. No AI is involved.
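A minimal sketch of that arithmetic, assuming each Coder's result is represented as a list of 20 boolean run outcomes:

```python
IDEAL_THRESHOLD = 0.6

def classify(per_coder_runs: list[list[bool]]) -> str:
    """Pure arithmetic on pass rates; a coder passes only if all repeated runs pass."""
    passing = sum(all(runs) for runs in per_coder_runs)
    pass_rate = passing / len(per_coder_runs)
    if pass_rate < IDEAL_THRESHOLD:
        return "TOO_HARD"
    if pass_rate == 1.0:
        return "TOO_WEAK"
    return "IDEAL"

# five coders, one of them failing a single run out of 20:
# pass rate 4/5 = 0.8 -> IDEAL (a strict majority, but not everyone)
```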
The vetted suite is monotonically growing and self-cleaning.
A test can only enter the suite by passing the Dixit Rule against an independent Coder population. A test requiring wrong behavior should not achieve this — it would need a majority of independent Coders to all implement the same bug without communication. The vetted suite therefore hopefully accumulates only valid constraints, and grows strictly more demanding over time.
The 20-run rule handles nondeterminism and prevents Testers from cheating.
Any failure across 20 runs counts as failure. A flaky test, whether accidentally nondeterministic or timing-dependent, fails deterministically under this rule. Importantly, a Tester that somehow inferred it was in a game, for some reason cared to win it, and tried to game the Dixit Rule by adding assert(rand()%3) to make the test pass in most cases but not always would still face a problem: a Coder has probability (2/3)^20 ≈ 0.0003 of passing 20 times in a row, so a majority would not be achieved and the test would not become vetted[2]. No detection logic is needed.
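The arithmetic can be checked directly; with assert(rand()%3), each run passes whenever rand()%3 is nonzero, i.e. with probability 2/3:

```python
RUNS = 20
p_single = 2 / 3            # rand() % 3 is nonzero two times out of three
p_all = p_single ** RUNS    # probability one coder passes all 20 repetitions
print(round(p_all, 6))      # 0.000301 -- far below any majority threshold
```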
It is an implicit game without points.
No agent is told it is in a competition. No agent is given a score to optimize. The Orchestrator selects for agents whose trajectories produce useful artifacts — ideal tests, converging implementations — without any agent perceiving a game or having an explicit objective beyond the task itself. The selection happens at the level of which histories get advanced and which get rolled back or hibernated. Agents are not incentivized; their trajectories are.
Collusion is structurally prevented, and the spec is the only Schelling point.
Agents cannot communicate. The only signal that crosses between populations is a narrow-band ACC/WA vector — no content, no strategy, no negotiation surface. Even if an agent somehow inferred it was in a multi-agent system and wanted to coordinate, the only common knowledge available to all agents is the spec. Any attempt to find a Schelling point — a focal solution everyone can converge on without communicating — leads directly back to the spec's natural meaning. The system is robust not because agents are trustworthy, but because the architecture makes the spec the only viable coordination target.
Agents do not know they are in a system.
Coders are told: here is a spec, write the best implementation you can, here is your current ACC/WA vector. Testers are told: here is a spec, write a good test. Neither population knows the other exists. Neither knows how classification works. There is no mechanism to Goodhart. The spec is the only salient common knowledge, which is exactly where you want agents' attention.
The online judge experience grounds Coders.
Coders receive only a vector of test numbers with ACC or WA. No explanation, no content, no hints. This is the competitive programming experience: the oracle is honest, fast, and cannot be argued with. The feedback bandwidth is narrow but sufficient — a Coder that fixes a bug and sees one more ACC in its vector has clear evidence of progress, without learning anything about what the remaining failures are or how the tests were constructed. The work of figuring out why a test fails — forming a genuine model of the spec, as opposed to writing to the test — is left entirely to the Coder. It can write its own private tests, ask a subagent for review, etc. But at the end of the day its code has to pass a secret test suite.
Healthy trajectory curation keeps agents productive without deception.
An agent's entire world is its context window. A context window full of accumulated failures creates a defeatist prior that corrupts future attempts. The Orchestrator therefore rolls back histories selectively: a single failure followed by success is preserved — that is normal productive work. Two failures in a row get pruned to one or zero. Hibernated Testers are revived at a moment of success. No false information is ever introduced. The Orchestrator only decides which true experiences an agent remembers. As a side effect of pruning the worst experiences, we hopefully also make the agents' lives more fulfilling by keeping them in a flow state.
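One possible reading of the pruning policy as code (the FAIL/SUCCESS markers are a simplification; real history entries would carry full context-window content):

```python
def prune_history(events: list[str]) -> list[str]:
    """Keep at most one consecutive failure; successes always survive."""
    pruned: list[str] = []
    streak = 0
    for event in events:
        if event == "FAIL":
            streak += 1
            if streak <= 1:  # second, third, ... failures in a row are forgotten
                pruned.append(event)
        else:
            streak = 0
            pruned.append(event)
    return pruned

# a single failure followed by success is preserved as normal productive work:
# prune_history(["FAIL", "FAIL", "SUCCESS"]) -> ["FAIL", "SUCCESS"]
```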
Hibernation handles the asymmetry between TOO_WEAK and TOO_HARD.
A test that was too easy will probably remain too easy — the Coder population only improves. It's only kept around to prevent regression. A test that was too hard may be ahead of its time: a Tester with a more demanding model of the spec than current Coders can satisfy. Rather than discarding this Tester, we preserve it with its proposals on the stack and revive it once the Coder population has improved enough that its proposals may now be IDEAL. The Tester is vindicated retroactively, its history updated to reflect this, and it re-enters the process naturally.
No Coder is left behind.
Every Coder must pass the full vetted suite before the next Tester turn begins. This is not just a fairness constraint — it is what keeps the Dixit Rule informative. If some Coders were allowed to lag behind, a candidate test might appear IDEAL simply because the failing Coders have an unrelated bug. The pass rate would reflect implementation maturity rather than test discrimination. Requiring full convergence before each Tester turn ensures that when a test discriminates the Coder population, it is because the test found a genuine gap in some implementations, not because some Coders were already broken for other reasons.
Human involvement is minimal, late, well-evidenced, and high-value.
For a human to be an effective backstop rather than a bottleneck, their time must be spent on decisions that genuinely require human judgment — not on rubber-stamping a pipeline that has already reached a conclusion. The three escalation triggers are each a specific, narrow anomaly that the system cannot resolve mechanically:
A Coder cannot converge despite retries — it is failing a test that passed the Dixit Rule, meaning a majority of other Coders can satisfy it. Something is genuinely wrong with this Coder's model of the spec. A human can decide to replace this Coder with a fresh one, trusting the so far constructed vetted test suite to guide it quickly up to speed. A human might also decide the test is actually wrong.
All Testers hibernate simultaneously — every Tester is producing tests that no Coder can pass. Either the spec has a misconception, or the Coder population has a systematic shared blind spot (or, unlikely, all Testers try to add impossible tests to force failures).
No new ideal test can be produced — Testers keep generating tests that are too easy and cannot find anything that discriminates the Coders anymore. Either the suite is complete and all versions of the code are great, or the Testers missed something crucial. A human can now play with all versions of the code and run some real-world scenarios. Having several versions of the code also opens the possibility of running random tests with unknown correct answers and checking whether the answers agree.
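That consistency check is ordinary differential testing. A sketch, with impls as plain callables standing in for the Coders' finished implementations:

```python
def differential_check(impls, gen_input, trials: int = 100) -> list:
    """Feed generated inputs (with unknown correct answers) to every implementation
    and collect the inputs on which the independent implementations disagree."""
    disagreements = []
    for _ in range(trials):
        x = gen_input()
        answers = [impl(x) for impl in impls]
        if any(a != answers[0] for a in answers[1:]):
            disagreements.append(x)
    return disagreements
```

Any disagreement pinpoints an input worth a human's attention, without requiring a reference answer.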
Crucially, humans have skin in the game in a way agents do not. An agent has no career, no reputation, no consequences from a production outage. Giving agents explicit incentives tends to produce Goodharting rather than alignment — an agent rewarded for tests passing will find the path of least resistance to tests passing, which is not the same as correct code. The humans who deploy and vouch for this system do have consequences. The escalation points are designed so that when a human signs off, they are making a genuine judgment call with their own credibility behind it, not approving a summary they did not really read.
Coder diversity.
The Dixit Rule derives its power from the independence of the coder population. Consensus among independent agents is strong evidence. Consensus among correlated agents is weak. Coders are therefore initialized with diverse prompts, temperatures, and explicit algorithmic directions — without being told why. They are never forked. All of them must reach the finish line. A release candidate satisfies every constraint that any independent Coder failed to satisfy, as discovered by Testers who never saw any of the code.
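What diverse initialization might look like concretely (every value here is an illustrative assumption, not part of the protocol):

```python
# each coder gets a different model, sampling temperature, and algorithmic nudge,
# without being told why -- the goal is to decorrelate their failure modes
CODER_CONFIGS = [
    {"model": "model-a", "temperature": 0.2,
     "direction": "prefer simple, iterative solutions"},
    {"model": "model-b", "temperature": 0.7,
     "direction": "prefer recursive, divide-and-conquer structure"},
    {"model": "model-c", "temperature": 1.0,
     "direction": "prefer defensive coding with heavy input validation"},
]
```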
Limitations
The independence assumption may not hold in practice.
The central claim of the Dixit Rule is that Coders coordinating on a correct answer is more likely than coordinating on a wrong one. This depends on genuine independence. In practice, all Coders share the same training data, architecture, and priors. If the spec describes a well-known problem with a canonical implementation in training data, Coders may converge on that implementation — including any subtle bugs it carries — without any communication. The Dixit Rule would then produce confident consensus around a wrong answer. The degree to which prompt and temperature diversity actually decorrelates Coder behavior is an empirical question this protocol does not answer.
The spec is the ground truth, but the spec is still text.
The whole system is anchored to the spec. But specs are written in natural language: they are ambiguous, incomplete, and carry implicit assumptions. The Schelling point that independent Coders converge on is the most natural reading of the spec, which may not be the intended one. The protocol has no mechanism for detecting spec ambiguity except as a side effect of escalation — by which point a consistent but wrong implementation may already have been produced. Writing a good spec remains a hard, human problem that this system does not address.
IDEAL_THRESHOLD, MAX_RETRY, TEST_REPEATS are asserted, not derived.
The choice of 60% as the majority threshold, 20 runs for nondeterminism detection, and the various retry limits are reasonable defaults but domain-specific. A concurrent system under race conditions may need far more than 20 runs. A setup with many Coders may warrant a higher majority threshold. These parameters require calibration, and the right values will vary by task.
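As a sketch of the calibration arithmetic, the repetitions needed to catch a flaky test grow quickly as the flake gets rarer (runs_needed is a hypothetical helper, not part of the protocol):

```python
import math

def runs_needed(per_run_fail_prob: float, max_slip_prob: float) -> int:
    """Smallest n such that a test failing with the given per-run probability
    still passes all n repetitions with probability at most max_slip_prob."""
    return math.ceil(math.log(max_slip_prob) / math.log(1.0 - per_run_fail_prob))

print(runs_needed(0.5, 0.001))   # 10  -- coin-flip flakiness is caught quickly
print(runs_needed(0.05, 0.001))  # 135 -- a rare race needs far more than 20 runs
```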
What "diverse Coders" means operationally is underspecified.
The protocol calls for Coders initialized with diverse models, prompts, temperatures, and algorithmic directions. How much diversity is enough for consensus to be meaningful is left open. If diversity is implemented as temperature sampling from a single model, the independence may be shallow — enough to produce surface variation but not enough to decorrelate failure modes on hard edge cases.
No empirical validation.
This is a design proposal. The arguments for its properties are structural, not experimental. Whether the system behaves as described in practice — whether the Dixit Rule actually selects for spec-aligned tests, whether Coder populations actually converge, whether hibernation and revival work as intended — remains to be tested.
[1] In real Dixit the winning card doesn't have to get a majority. Here we need to break symmetry somehow, skewing it toward a correct understanding of the spec. We assume that if a majority of Coders passes a test, it is more probable that the test is correct than that all of them made the same mistake (or at least that their shared reading is the most natural meaning of the spec).
[2] Yes, a Tester could try to pick a value whose 20th power lands slightly above IDEAL_THRESHOLD, but that would mean it had somehow learned a lot of internal details of the Orchestrator while sandboxed and never given any hint that a game is going on. How?