def main(spec):
    test_suite = []
    for coder in coders:
        coder.solve()
    loop:
        for coder in coders:
            if not ensure_passes(coder, test_suite):
                return
        new_tests = []
        for tester in testers:
            if tester.is_hibernated:
                new_tests += attempt_revival(tester, coders)
            else:
                new_tests += propose_test(tester, coders)
        if not new_tests:
            return
        test_suite += new_tests
def propose_test(tester, coders):
    tester.before_test_1 = tester.save()
    loop at most MAX_TESTER_RETRIES:
        test_1 = tester.propose_test("Propose one more test")
        classification_1 = classify(test_1, coders)
        if classification_1 == IDEAL:
            tester.thanks()
            return [test_1]
        tester.before_test_2 = tester.save()
        test_2 = tester.propose_test("This test was {classification_1}, try again")
        classification_2 = classify(test_2, coders)
        if classification_2 == IDEAL:
            tester.thanks()
            return [test_2]
        if classification_1 == TOO_HARD and classification_2 == TOO_HARD:
            tester.is_hibernated = True
            tester.too_hard_tests = [test_1, test_2]
            return []
        tester.rollback_to_checkpoint(tester.before_test_1)
    return []
def classify(test, coders):
    failures = 0
    for coder in coders:
        repeat RERUN times:  // say, 20 reruns to account for non-determinism
            if not passes(test, coder):
                failures += 1
                break
    if failures == 0:
        return TOO_EASY
    if failures > len(coders) * MINORITY:  // firmly below 0.5, say 0.4
        return TOO_HARD
    return IDEAL
def ensure_passes(coder, test_suite):
    judgements_vector = run_tests(coder, test_suite)
    if all_pass(judgements_vector):
        return True
    coder.solve("Some tests fail: {judgements_vector}")
    judgements_vector2 = run_tests(coder, test_suite)
    if all_pass(judgements_vector2):
        coder.thanks()
        return True
    checkpoint = coder.save()
    loop at most MAX_CODER_RETRIES:
        coder.solve("Some tests fail: {judgements_vector2}")
        judgements_vector3 = run_tests(coder, test_suite)
        if all_pass(judgements_vector3):
            coder.thanks()
            return True
        coder.rollback_to_checkpoint(checkpoint)
    // Human judgment needed, as perhaps the tests are wrong
    return False
def attempt_revival(tester, coders):
    [test_1, test_2] = tester.too_hard_tests
    classification_1 = classify(test_1, coders)
    if classification_1 != TOO_HARD:
        tester.is_hibernated = False
        tester.rollback_to_checkpoint(tester.before_test_2)
        if classification_1 == IDEAL:
            tester.thanks()
            return [test_1]
        test_2 = tester.propose_test("This test was TOO EASY, try again")
        classification_2 = classify(test_2, coders)
        if classification_2 == IDEAL:
            tester.thanks()
            return [test_2]
        tester.rollback_to_checkpoint(tester.before_test_1)
        return propose_test(tester, coders)
    classification_2 = classify(test_2, coders)
    if classification_2 == TOO_HARD:
        return []
    tester.is_hibernated = False
    if classification_2 == IDEAL:
        tester.thanks()
        return [test_2]
    tester.rollback_to_checkpoint(tester.before_test_1)
    return propose_test(tester, coders)
[Epistemic status: the ideas in this post are mine. I published them previously in a form summarized by Claude, but that post was auto-rejected, so here I present them in my own voice. The ideas are still unevaluated, but I am working on implementing them to see if this works in practice. They are my best bet on what could work, but I am not an AI/alignment researcher.]
Why?
As a senior developer on a rather complicated legacy project, I review more and more PRs written by coding agents, which often fail to identify the real root cause and thus offer a fix, however elegant, for the wrong problem. A patch often includes unit tests, which of course pass - but how could they not, given that they were written by the same AI, after writing the code, to validate its own solution? The same biases, blind spots, and incentive to finish the task all contribute to this.
Yes, humans have a similar confirmation-bias problem, which we try to solve by having a dedicated tester role, by forcing tests to be written up front, or at least by having a reviewer who judges their adequacy. Sure, you can have an adversarial setup of two AIs where the Tester's job is to find a test which the Coder's patch doesn't pass, but such a naive incentive structure, taken to the extreme, leads to test(){ assert(false); }. You could perhaps add a Judge to the mix, which tries to "objectively" decide whether the tests are fair and whether the Tester and Coder really captured the spirit of the Spec, but by making this setup explicit and known to all parties you set up a game dynamic which (for sufficiently advanced AI) leads to unhealthy tactics and strategies. Yes, you can ask an AI to use TDD and write some tests up front, but in the limit the winning strategy here is test(){}. You can try to measure metrics like code coverage, fuzz the code or tests or inputs, and check that all execution paths are covered, but in the limit all of that is gameable.
Humans typically don't go too far into misleading and scheming at work, because they care about the project, can face serious consequences if they get caught, have a self-image to cultivate, etc. Most importantly, they live in the same world the Spec is talking about, run the code in it, and expect the tests to protect their world from the consequences of wrong code. AIs, even if split into several roles, might still (by accident, malice, or poor incentive structure) end up producing text artifacts which give the impression of work being done, while actually being detached from reality and failing to achieve the true goal of the Spec. There's nothing, I fear, preventing them from writing "Review decision: Accepted" or "All tests I could come up with pass" or "Looks like this code achieves the Spec". Yes, for some narrow tasks you can write acceptance criteria which are verifiable automatically without any LLM in the loop, but in practice I rarely face problems of this nature at my job. It might change in the future, say if you write a whole project from scratch in a language which admits theorem proving, and your problem domain is about provable properties of software. But for a big legacy app, translating a human's goal into a testable Spec is often the biggest part of the problem.
How?
As explained above, there are several bad ways to automate coding. Letting the same agent write tests and code is one of them. But pitting a Tester against a Coder creates incentive structures which are also unhealthy (code difficult to understand, tests impossible to pass). Framing it as a team effort doesn't help either, as it might invite collusion. I think the parts of the problem are: making it clear to the agents that they are playing a game, making them care about the game, and incentivizing moves in the game which are bad for the product.
My proposal starts from pondering:
Dixit is a party game in which a Proposer has to propose a riddle which is not so hard that nobody can solve it, but not too easy either - someone has to guess wrong.
We could use a similar rule for judging the tests proposed by a Tester if we had not one but several Coders, so we can probe whether a proposed test is in the Goldilocks zone of failing for some, but not all, Coders' solutions. There's one complexity we don't have in the original Dixit: we don't know which of the two behaviours, failing or passing, is the right one. So we might want a large odd number of Coders to break the symmetry, betting that the most natural outcome is that the majority is right. Intuitively, the Spec is the Schelling point available to all the agents, around which they can coordinate even if they don't know about each other or the need to coordinate, because it is simply the most natural target. But for the outcomes to be natural, we need two more crucial things: nobody should know they are playing a game, and they shouldn't see each other's work.
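The Goldilocks rule above can be stated as a small pure function. This is only a sketch in my own names; the `MINORITY = 0.4` threshold is one of the guessed constants discussed later:

```python
# Dixit-like classification of a proposed test, given which Coders pass it.
# TOO_EASY: no information gained; TOO_HARD: too large a minority fails;
# IDEAL: at least one Coder fails, but a clear majority passes.
TOO_EASY, IDEAL, TOO_HARD = "too_easy", "ideal", "too_hard"
MINORITY = 0.4  # guessed constant: firmly below 0.5

def classify(pass_flags):
    """pass_flags[i] is True iff Coder i's solution passes the proposed test."""
    failures = sum(1 for ok in pass_flags if not ok)
    if failures == 0:
        return TOO_EASY
    if failures > len(pass_flags) * MINORITY:
        return TOO_HARD
    return IDEAL
```

With five Coders, `classify([True]*5)` is too easy, one failure is ideal, and three failures cross the 0.4 minority threshold and count as too hard.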
A coding agent can play a game without knowing there's any game being played. Instead of giving an agent explicit text feedback, a reward signal, or any hint that there is some goal or scoreboard, the Orchestrator script can simply use the information gained from the agent to drive what happens next: for example, use the proposed test or discard it, fork or revert the generated code, clone the agent, or roll back its chat history.
Hiding code from Testers and tests from Coders is inspired by coding competitions, where the participants do not know what tests the online judge runs, and the authors of the test suite write them without looking into submissions. Reducing the communication channel to "pass/fail" and "too easy/too hard/ideal" also makes it more difficult to collude, manipulate, or introduce bias.
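The only thing a Coder ever learns about the Test Suite is such a numbered pass/fail vector. A minimal sketch of that channel (helper names are mine, not part of the proposal):

```python
# The Coder-facing channel: a vector of judgements over numbered tests,
# rendered without revealing what any test actually does.
def judgements_vector(test_suite, run_test):
    """run_test(test) -> bool; returns one pass/fail flag per numbered test."""
    return [run_test(test) for test in test_suite]

def render_for_coder(vector):
    """Render the vector as the low-bandwidth feedback a Coder would see."""
    return ", ".join(f"test {i}: {'PASS' if ok else 'FAIL'}"
                     for i, ok in enumerate(vector, start=1))
```

For example, `render_for_coder([True, False])` yields `"test 1: PASS, test 2: FAIL"` - no test contents, no hint of who wrote them.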
What?
Assumptions
I assume everything starts from a problem statement written in text, which we call the Spec. The goal of this whole system is to implement a solution to the problem described in the Spec. We assume that the AIs involved in this process (Coders and Testers) are smart, but by default do not seek power or try to escape the box just because you've run them. They might be capable of doing so, though, when pressured or hinted that this is expected of them. Thus we want to run them in containers, and have some non-AI Orchestrator script, which can be reviewed and understood, and can't really be "persuaded" in any way, as it executes rather simple logic. Crucially, it has access to a deterministic tool which lets it safely (say, using containers and ctest) run a single test proposed by a Tester against a single source tree proposed by a Coder, and see if it passes or not. We don't assume much about the internal workings of a Coder or Tester - we just assume they are capable of performing their respective tasks most of the time, like Claude Code or Codex, and we don't care how. All we need is that their state can be saved and restored (say, at the container level). But we assume they can make mistakes, don't care much about ground truth or company goals, and may hallucinate from time to time. Our goal is to figure out the Orchestrator which will use such imperfect Agents to build something better than the sum of its parts.
The Orchestrator
The overall idea is to implement the solution to the Spec iteratively. There are two sets of agents, Coders and Testers, where each agent is isolated and doesn't know about any other agent or the game. Each agent has access to the Spec. The goal of the Coders is to write code which matches the Spec and passes the Test Suite. A Coder doesn't know what the tests in the Test Suite are, but they are numbered, and thus the test results form a vector of pass/fail. The Test Suite is initially empty and grows each round, because the job of each Tester is to propose a new test to be added to it. As explained, a test must be such that the majority of Coders can pass it, but at least one doesn't. At the start of each round, each Coder must ensure its code passes the whole Test Suite. Then each Tester tries to add a new test. We give it a few tries, and roll back its history if there are too many failures in a row, to prevent accumulating a narrative which would suggest incompetence or spiral into some dark regions. If a Tester keeps coming up with too-hard tests, it might be because it is scheming, or understands the Spec in the wrong way, or simply because the Coders haven't yet reached the point where their solutions could pass it - it's ahead of its time. We hibernate such a Tester, and each round check whether the test it proposed is now within reach of the Coders, in which case we revive it. If all Testers keep coming up with too-simple tests, this might indicate that the solutions found by the Coders are indeed correct. At this point we can involve the final judge, probably a human.
Here's the pseudocode:
This pseudocode skips details like how to provide the Spec to the agents, or the exact wording of prompts. It is crucial, though, that we never lie to Testers or Coders, while avoiding any mention that there are other agents in the system. For example, TOO_EASY can be delivered as "This test is slightly too simple, can you think of something more demanding?". It is also important to have a way to obtain diverse samples each time we invoke an agent - we want the agents to be dissimilar to each other, and to try something new each time we re-roll after a rollback.
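One way to keep the "never lie, never reveal the game" constraint auditable is to centralize all outgoing wording in one table. The TOO_EASY phrasing below is the one from the text; the other entries are my own illustrative guesses at honest, game-free wording:

```python
# All feedback an agent can ever receive, in one reviewable place.
# Each message is truthful but mentions no other agents, rules, or scores.
FEEDBACK = {
    "too_easy": "This test is slightly too simple, "
                "can you think of something more demanding?",
    "too_hard": "This test seems too demanding for now, "
                "could you try something simpler?",  # assumed wording
}

def feedback_for(classification):
    # IDEAL needs no retry prompt: the Orchestrator just thanks the agent.
    return FEEDBACK.get(classification, "Thanks, this test looks good.")
```

Keeping the table tiny also makes it easy for a human reviewer to verify that no prompt leaks the existence of the game.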
Intended Properties
The following are presented without any proof, because they are just wishes. But I hope you can see how the above design tries to accomplish them, and even if it fails, at least these intended goals are worth naming.
Tests are aligned to the natural meaning of the Spec
The Dixit-like rule aims to generate tests which check for something meaningful, and hopefully the Spec is the natural Schelling point for this "implicit coordination". Note that a Tester doesn't really know it should search for a test which passes the Dixit rule, but the Orchestrator only continues histories in which it does. And the Coders don't really know they have to strive to pass the tests, but the Orchestrator only continues histories in which they do. Intuitively, this should tend to generate histories in which the tests and code seem to be centered around the same understanding of the Spec. And since the Testers don't see the code, and the Coders don't see the tests, the hope is that this happens because they really aim for the Spec.
No AI judgment involved in declaring success
The Orchestrator is just a Python script which checks some predicates about sets. It can't be persuaded. It can't persuade you. Yes, the tests and code might still end up being wrong or misleading if you read them, but this is hopefully not the most probable outcome of the proposed scheme - none of the agents is incentivized to produce that. A Coder has nothing to gain from writing persuasive code comments, as from its perspective it's just them and a cold online judge providing a judgements vector. Testers do not gain much from writing silly tests, as that's not how they can make their history progress.
The tests are meaningful
Each test added to the suite had to be interesting enough that at least one Coder got it wrong, and connected to the Spec enough that eventually all Coders figured out how to pass it just by reading the Spec.
Non-deterministic tests aren't a big deal
Some tests are naturally flaky, say because the app is multi-threaded and race conditions are rare. More importantly, one easy cheat to score under the Dixit rule could be to make your test pass or fail at random. By re-running each test several times and only counting it as a pass if all retries succeed, we guard against Testers trying this deliberately or by accident. Yes, they could still tune the per-run success probability so the full rerun check passes with whatever probability they like, but this presumes they know they are in the game, know the scoring rule, and care about winning it - none of which is induced by the Orchestrator.
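The rerun check and its effect on random tests can be made concrete. Assuming independent runs and the guessed constant of 20 reruns, a test that passes each run with probability p survives the whole check with probability p^20:

```python
# A test counts as a pass only if it passes every one of RERUN executions.
RERUN = 20  # guessed constant from the post

def passes_all_reruns(run_once):
    """run_once() -> bool executes the test once; repeat to suppress flakiness."""
    return all(run_once() for _ in range(RERUN))

def survival_probability(p):
    """Chance that a test with per-run pass probability p survives all reruns."""
    return p ** RERUN
```

A coin-flip test (p = 0.5) survives only with probability 2^-20, about one in a million, so random pass/fail behaviour is heavily penalized without anyone being told a rule exists.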
The game is invisible
We never tell the Agents they are playing any game or that there are any other players. There are no rules to game, no score to hack. The game happens at the level of exchanging very low-bandwidth information (the judgement vector and test classification) and rollbacks and re-rolls of history. Hopefully, this doesn't incentivize anyone to scheme.
History looks good
The Orchestrator curates the history remembered by the agents in such a way that it appears to be in flow: one or two mistakes in a row are fine, but the Orchestrator rolls the agent back if it starts to fail too much, so it doesn't fall into learned helplessness. As LLM Agents might learn about themselves from their history, it seems important to make them heroes who overcome the challenges thrown at them. Hopefully this makes the agents' experience nice.
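This curation policy could be sketched as a small wrapper around any agent that supports `save()` and `rollback_to_checkpoint()` (the interface assumed in the pseudocode); the streak limit of 2 is one of the guessed constants:

```python
# Keep the agent's visible history flowing: checkpoint after each success,
# and erase a failure streak before it becomes a narrative of incompetence.
MAX_FAILURES_IN_A_ROW = 2  # guessed constant

class HistoryCurator:
    def __init__(self, agent):
        self.agent = agent
        self.checkpoint = agent.save()
        self.streak = 0

    def record(self, succeeded):
        if succeeded:
            self.checkpoint = self.agent.save()  # new "last good" state
            self.streak = 0
        else:
            self.streak += 1
            if self.streak >= MAX_FAILURES_IN_A_ROW:
                self.agent.rollback_to_checkpoint(self.checkpoint)
                self.streak = 0  # the failures never happened, from its view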
No Coder left behind
The main loop's invariant is that at the start of each round all the Coders pass the whole test suite. One could imagine an alternative design in which only the most promising Coders are invested in, or forked, etc. But for the Dixit rule to work well, we need a diverse set of Coders, not a Lamarckian evolution. Also, I think it is cool that the end product is not one but multiple implementations of the same Spec, which opens up new possibilities, like running random test inputs and using consensus to detect problems.
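The consensus payoff mentioned above could look like this. A hypothetical sketch (function names are mine): feed the same inputs to every Coder's implementation and flag any input on which they disagree:

```python
# Cross-check N independent implementations of the same Spec: any input
# where they disagree is a candidate bug (or an ambiguity in the Spec).
from collections import Counter

def consensus_check(implementations, inputs):
    """Return (input, vote tally) pairs where the implementations disagree."""
    disagreements = []
    for x in inputs:
        outputs = Counter(impl(x) for impl in implementations)
        _majority_output, votes = outputs.most_common(1)[0]
        if votes < len(implementations):  # not unanimous
            disagreements.append((x, dict(outputs)))
    return disagreements
```

For example, with `[abs, abs, lambda x: x]` the input `-1` is flagged, because the buggy third implementation breaks unanimity.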
Human's time is well spent
I want to keep a human in the loop, but keeping up with the rate at which LLMs produce code means we need to be strategic about exactly when to involve a human and what information and tools to provide for effective decision making. This proposal involves a human at the beginning, when writing the Spec (which can be AI-assisted), and at the very end, when the Orchestrator reaches one of the states which need a human with skin in the game to interpret: either all Testers keep proposing too-easy tests (the solutions may be done), or some Coder can't be made to pass the suite (perhaps the tests are wrong).
Limitations
No empirical validation
I am implementing experiments to test the above ideas, but so far I have no proof that this approach will work. What I do have, though, is experience with the failures of existing approaches.
Assuming too much independence
Several places implicitly assume that if several Agents do the same thing, it might be because of the meaning of the Spec. But it could instead be because of some shared bias, like the same training data, the same capabilities, or the same weights and seed.
Assuming the Spec correctly captures the intent
Even if we grant that the Orchestrator succeeds at aligning Coders and Testers to the Spec, there's still the separate issue of whether the Spec, or "the reading of the Spec most natural to LLMs", is what the humans really care about. It's not easy to write a perfect wish.
Various constants out of thin air
Why 20 reruns of each test? Why roll back after 2 failures in a row? Why is 0.6 the required majority? I don't know. These are just guesses.