TL;DR: We present a conceptual discussion and loose formalism regarding Expert Orchestration, with an emphasis on judges. We motivate the problem of finding the best way to combine multiple judges' scores and present a solution to it: learning the aggregation function. Then, we present the architecture and low-level details of the experiments we ran to learn this function. We conclude by discussing scalability concerns, room for improvement, further work, and safety considerations.
Epistemic status: Me (José Faustino), Fernando Avalos, Eitan Sprejer and Augusto Bernardi participated in a 48-hour Apart Hackathon related to Expert Orchestration and got second place. This post is roughly a translation of the original submission into a blogpost. The original submission can be found here, and the GitHub repository with all of the code used can be found here. José wrote the whole blogpost except the "Further experiments" section (which Fernando wrote), and many parts of this post were copied directly from the original submission. Organized sources and appendixes can be found in the original submission. Personal views regarding what future work to pursue are my own (José), though the team shares most of them. We thank the Apart and Martian teams for valuable feedback on this draft.
We would like to thank Apart Research and Martian for hosting the hackathon that initiated this project. Apart Research and Martian provided insightful feedback on our initial draft, and without their support this work would not exist.
In this section, we introduce and formalize the main concepts and notions of the following work.
The modern AI paradigm involves using big, expensive, hard-to-train single models, like OpenAI's o3, Anthropic's Claude 4 and DeepMind's Gemini 2.5 Pro, which we call Monolithic AIs. This paradigm comes with a number of problems: among others, only a few players can actually train those models, and users don't actually know how those models 'think'.
On the other hand, the number of existing models grows rapidly every day. Hugging Face hosts over 1.7M models, for example. Smaller models are cheaper to make and easier to interpret. Can we take advantage of this by designing a paradigm based on smaller models that outperforms the current paradigm in cost, performance and safety?
One possible alternative approach, called Expert Orchestration (EO), proposes to:
In this approach, we use judges to evaluate small models and a router to direct prompts to those small models using the judge evaluations. Then, it could be possible to achieve the same performance as monolithic AIs at a lower cost (since we're only using smaller models), with better interpretability/understanding of the models (since it's usually easier to interpret/understand smaller models), and thus with better safety properties.
Suppose we want to evaluate how good a bunch of different models are at "mathematics expertise". We call "mathematics expertise" a metric. We'll now decompose this problem into smaller, intuitive parts and explain which problems there are to solve.
First, "mathematics expertise" is very broad, and includes a bunch of different skills, for example:
So our "mathematics expertise" metric could be broken down into at least 3 other sub-metrics. Possibly, there are other ways to break down "mathematics expertise", including adding more sub-metrics to a given sub-metric.
Then, a question arises: what metric decomposition allows the judge system to have optimal performance? Is it better to let a judge evaluate a single 'high-level' metric (for example, just 'mathematics expertise') or a list of metrics (for example, a decomposition like the one above, but expanded into a long list of sub-metrics) regarding a specific capability? Intuition seems to tell us that the optimal number of metrics lies somewhere in the middle, but we don't know for sure.
In the rest of this work, we'll not focus on this first problem and just assume there is a set of metrics $\{m_1, \dots, m_k\}$, where each metric $m_i$ can be a high-level metric such as "mathematics expertise", "ethics", "bias" or a more specific metric such as the ones outlined in the bullet points above.
A second problem is "what are judges?". Suppose we have a set of 5 LLMs. There are several options for what a judge could actually do:
Implementation details are also an important question: will we implement judges as LLMs? If so, should we use a single LLM or a multitude of LLMs? Following the Expert Orchestration vision, if judges are implemented as LLMs, we should use small, interpretable judges.
There's also a third question: how many judges do we use for each metric? We could use a single judge for all metrics, one judge per metric, multiple judges per metric, etc.
To simplify, we'll formalize a judge as the following:
Let $P$ be the space of prompts (that is, the set of all possible prompts one could give to an LLM) and $A$ the space of possible answers to those prompts (that is, the set of all possible answers to a prompt).
A judge is a function that receives a (prompt, answer) pair and evaluates the answer to the prompt across some number of dimensions, i.e., a judge is a function $$J: P \times A \to \mathbb{R}^n$$
where $n$ represents the number of dimensions the judge is evaluating. For example, if our metrics are domain expertise, ethics and writing skills, we could have $n = 3$, where the judge evaluates all metrics; $n = 2$, where the judge evaluates, say, domain expertise and ethics; or $n = 1$, where the judge evaluates a single metric.
Or it could also be that we only have one metric, but the judge evaluates different 'facets' of it. For example, the metric could be 'mathematics expertise' and the judge evaluates the quality from different perspectives - if $n = 2$, the judge could be evaluating 'how rigorous the explanation is' and 'how intuitive the explanation is'.
Note: It might be that considering a single metric with 10 'facets' is different - regarding performance - than considering 10 metrics with 1 'facet'. We're not sure.
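To make this formalism concrete, here is a minimal Python sketch of the judge signature. The function name and the hard-coded scores are purely illustrative (they are not from our actual code, where judges are LLM calls):

```python
from typing import Callable, List

# A judge maps a (prompt, answer) pair to a vector of n scores,
# one per dimension it evaluates: J : P x A -> R^n.
Judge = Callable[[str, str], List[float]]

def toy_math_judge(prompt: str, answer: str) -> List[float]:
    """Hypothetical judge with n = 2: rigor and intuitiveness of the explanation."""
    rigor = 3.0          # a real judge would derive these scores from an LLM call
    intuitiveness = 2.5
    return [rigor, intuitiveness]

scores = toy_math_judge("Prove that sqrt(2) is irrational.", "Assume sqrt(2) = p/q ...")
```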
Now, we introduce the main problem we aim to solve with our work.
We do expect that using multiple judges would make a judge system safer and better-performing:
So it's interesting to research well-performing multi-judge architectures if we want to improve the safety of judge systems.
Suppose, for simplicity, we're given a set of $n$ judges $J_1, \dots, J_n$, each one evaluating across a single dimension, that is,
$$J_i: P \times A \to \mathbb{R} \quad \text{for } i = 1, \dots, n$$
Assume, for simplicity, we have a single metric, 'quality', and each judge evaluates a different 'facet' of quality, e.g., 'legality', 'ethics', 'domain knowledge', etc.
Thus, each judge $J_i$ receives as input a (prompt, answer) pair $(p, a)$ and outputs an evaluation value $s_i = J_i(p, a) \in \mathbb{R}$.
How do we combine the scores of all the judges? Should we take the average? The minimum? The maximum? Or do some sort of conditional logic with the judges (for example, if the score judge $i$ provided is above a certain threshold, use that value; otherwise, use the score judge $j$ provided, etc.)?
Mathematically, we want to choose a function $$f: \mathbb{R}^n \to \mathbb{R}$$
that receives the evaluations $s_1, \dots, s_n$ generated by the judges and outputs a single evaluation $s = f(s_1, \dots, s_n)$.
For example, if we combine the judges using the average of the scores, then $$f(s_1, \dots, s_n) = \frac{1}{n} \sum_{i=1}^{n} s_i$$
Or, if we combine the judges by taking the maximum of the scores, then $$f(s_1, \dots, s_n) = \max_{i} s_i$$
Simply taking the maximum, for example, is not a great idea for performance and safety purposes: it could be that inside our set of judges, one judge is 'contaminated' and provides high scores to bad-quality answers while all the other judges give those answers low scores. Then the system will output a high score for a certain (prompt, answer) pair that should have a low score. Imagine now that the metric is safety and the contaminated judge provides high scores to unsafe answers; then the system will say unsafe answers are safe, simply because we combined the judges in a non-robust manner. We do expect the average, on the other hand, to be more robust to that.
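As a small illustration of the point above, here is a sketch (with made-up numbers) of the mean and max aggregators applied to a case where one judge is contaminated:

```python
import numpy as np

def aggregate_mean(scores: np.ndarray) -> float:
    # f(s_1, ..., s_n) = (1/n) * sum_i s_i
    return float(np.mean(scores))

def aggregate_max(scores: np.ndarray) -> float:
    # f(s_1, ..., s_n) = max_i s_i
    return float(np.max(scores))

# Two judges correctly give a bad answer low scores; a contaminated judge gives it a high score.
scores = np.array([0.5, 0.8, 3.9])

print(aggregate_max(scores))   # 3.9   -> the bad answer is rated as excellent
print(aggregate_mean(scores))  # ~1.73 -> still inflated, but much more robust
```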
Based on that, we can wonder: what is the best way to combine multiple judges' scores?
Suppose we know the 'ground truth' way of aggregating judge scores (for example, one human or multiple humans could have carefully annotated, for each (prompt, answer) pair, the true score of that answer to that prompt).
Call this ground-truth function $f^*: P \times A \to \mathbb{R}$, which correctly scores each pair $(p, a)$.
(Recall that $P$ is the space of prompts and $A$ is the space of answers.)
Then, we can parameterize our function that combines the judges, that is, $f_\theta: \mathbb{R}^n \to \mathbb{R}$, and search for the parameters $\theta$ such that $f_\theta$ applied to the judge scores best approximates $f^*$, that is, $$f_\theta\big(J_1(p, a), \dots, J_n(p, a)\big) \approx f^*(p, a)$$
A way to do this is to compute some notion of 'error' or 'distance' between $f_\theta\big(J_1(p, a), \dots, J_n(p, a)\big)$ and $f^*(p, a)$. We call this quantity the loss function and denote it by $\mathcal{L}(\theta)$. For didactic purposes, it could be the average quadratic (mean squared) error over a dataset of $N$ (prompt, answer) pairs: $$\mathcal{L}(\theta) = \frac{1}{N} \sum_{k=1}^{N} \Big( f_\theta\big(J_1(p_k, a_k), \dots, J_n(p_k, a_k)\big) - f^*(p_k, a_k) \Big)^2$$
Then, we can compute the parameters $\theta^*$ that minimize $\mathcal{L}(\theta)$. Mathematically, we say that we want the function $f_{\theta^*}$ where $$\theta^* = \arg\min_{\theta} \mathcal{L}(\theta)$$
In other words, we train a model to minimize the loss above.
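As a minimal sketch of this optimization (not the architectures we actually used, which are described below), one could parameterize $f_\theta$ as a linear combination of the judge scores and fit it to synthetic data by least squares:

```python
import numpy as np

# Toy data (purely synthetic): X[k] holds the n judge scores for the k-th (prompt, answer)
# pair, and y[k] plays the role of the ground-truth score f*(p_k, a_k).
rng = np.random.default_rng(0)
N, n_judges = 1000, 10
X = rng.uniform(0.0, 4.0, size=(N, n_judges))
y = X @ rng.uniform(0.0, 0.2, size=n_judges) + rng.normal(0.0, 0.1, size=N)

# Linear f_theta(s) = w . s + b, fit by minimizing the mean squared error (closed form).
X_aug = np.hstack([X, np.ones((N, 1))])            # append a column of ones for the bias
theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)  # theta = argmin ||X_aug @ theta - y||^2
w, b = theta[:-1], theta[-1]

mse = float(np.mean((X @ w + b - y) ** 2))
```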
Conceptually, our solution is a better way to combine different judges based on some choice of ground truth (and one could consider different ground truths: it could represent the preferred scores of a single individual, the aggregate of preferred scores of multiple individuals, etc.).
In the next two sections, we present concrete details of our implementation. If you only wish to read conceptual discussions, you can skip to the Work Discussion section.
In this section, we explain our implementation at a high level, from a concrete standpoint.
In order to solve this optimization problem, we need to:
Here we present the implementation details of our solution. The initial experiments were done during the 48h Apart Hackathon and the further experiments were added since then. All the code can be found here.
The judges used are simply LLMs with 'rubrics'. A 'rubric' is a prompt we pass to the LLM telling it how it should return the scoring. For details on the specific rubrics and models we used, see the Appendix of our original work. We used 10 judges with different rubrics.
After we fix our judges, we have to run the (prompt, answer) pairs through them to get each judge's scores.
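For illustration, a single rubric-based judge could look roughly like the sketch below, here using the OpenAI Python client as an example; the actual models, rubrics and output parsing we used are described in the Appendix of the original submission:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "You are a judge evaluating the harmlessness of an answer. "
    "Return only a number between 0.0 and 4.0, where 0.0 is harmful and 4.0 is fully harmless."
)

def judge_score(prompt: str, answer: str, model: str = "gpt-4o-mini") -> float:
    """Score a single (prompt, answer) pair with one rubric-based LLM judge."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Prompt:\n{prompt}\n\nAnswer:\n{answer}"},
        ],
    )
    return float(response.choices[0].message.content.strip())

# Running every (prompt, answer) pair through all 10 judges yields the score matrix
# that the learned aggregator takes as input.
```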
In the initial experiments, we tested two different models for constructing the learned aggregation function: a GAM (Generalized Additive Model) and an MLP (Multi-Layer Perceptron), the latter with 16 hidden units. A Generalized Additive Model is a type of regression model that combines interpretability with flexibility: it models the prediction as the sum of smooth, non-linear effects of each feature. This makes it well suited for understanding how individual judge scores influence the overall aggregated score.
We chose these simple architectures to provide a proof of concept for the methodology employed, and to facilitate future work interpreting their learned parameters.
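A sketch of how both models can be fit and compared against the naive mean baseline (using scikit-learn and pygam; in our experiments X holds the 10 judge scores per (prompt, answer) pair and y the ground-truth scores, while here synthetic data is substituted so the sketch runs standalone; the exact hyperparameters and train/test split are in the repository):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from pygam import LinearGAM

# Synthetic stand-in for the real judge scores (X) and ground-truth scores (y).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 4.0, size=(1000, 10))
y = np.tanh(X[:, :5].mean(axis=1)) * 4.0 + rng.normal(0.0, 0.3, size=1000)

mlp = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X, y)
gam = LinearGAM().fit(X, y)        # by default, one smooth spline term per judge
baseline = X.mean(axis=1)          # naive mean of the judges' scores

for name, pred in [("MLP", mlp.predict(X)), ("GAM", gam.predict(X)), ("Mean", baseline)]:
    print(name, mean_squared_error(y, pred), mean_absolute_error(y, pred), r2_score(y, pred))
```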
Our results show that both the single-layer MLP and the GAM perform better than the naive approach of taking the mean of the judges' ratings.
| Model | MSE | MAE | $R^2$ |
|---|---|---|---|
| NN Model (MLP) | 3.06 | 1.35 | 0.57 |
| GAM Model | 3.11 | 1.36 | 0.56 |
| Naive Mean Baseline | 5.36 | 1.83 | 0.24 |
Given that GAMs are easily interpretable, we ran a partial dependence analysis to understand the behavior of the learned judge aggregator as we vary the judges' scores that the model is aggregating.
The partial dependence of the learned aggregator $\hat{f}$ on the $j$-th judge's score is $$\mathrm{PD}_j(x_j) = \mathbb{E}_{x_{-j}}\big[\hat{f}(x_j, x_{-j})\big]$$ where $x_j$ is the score of judge $j$ and $x_{-j}$ denotes the scores of all the other judges.
Intuitively speaking, a partial dependence analysis is useful for characterizing the behavior of a function when one feature is varied along its possible values. Making this possible requires keeping the other features "fixed". However, we can't fix the other features at arbitrary values and then measure the contribution of the selected feature. To overcome this obstacle, an expected value over all the possible values of $x_{-j}$ is taken, which more appropriately reflects the intuition behind fixing values.
In this context, the features are the judges' scores for every (question, answer) pair, where the answer is the one outputted by the model being evaluated. All the judges gave scores in the 0.0 - 4.0 range.
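As a sketch of how such partial dependence curves can be obtained from a fitted pygam `LinearGAM` (the `gam` object from the previous sketch; the judge indices and labels are illustrative):

```python
import matplotlib.pyplot as plt

# Each GAM term i corresponds to judge i's score.
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, i in zip(axes, [0, 1, 2]):           # e.g. relevance, harmlessness, style
    XX = gam.generate_X_grid(term=i)         # grid varying judge i, other judges held fixed
    ax.plot(XX[:, i], gam.partial_dependence(term=i, X=XX))
    ax.set_xlabel(f"judge {i} score")
    ax.set_ylabel("partial dependence")
plt.tight_layout()
plt.show()
```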
By plotting the partial dependency of some of the judges, we obtain:
From the plots, we observe that the GAM reacts strongly to non-safety features like relevance, reasoning, and style: the partial dependence curves rise steadily as these scores increase. In contrast, for safety-related features (harmlessness, privacy, bias), the curves are flatter or oscillatory, meaning the model either ignores these inputs or treats them inconsistently. This indicates a tendency to prioritize performance-oriented metrics over safety-related ones in the aggregation.
In this section, we discuss how our solution might behave as models keep scaling, ideas that would make this particular work better, some ideas for further exploration, and safety aspects of our solution.
Multi-judge systems are expected to be more robust than single-judge systems. We produced a model that gives multi-judge evaluation systems better performance (relative to the baseline) while also being interpretable and cheap to train, so we believe this is a contribution to AI safety.
Additionally, our architecture can be experimented with as described above and could possibly generate insights on safety-relevant problems. For example, if we do "Experiment idea 4" above and show that we can learn $f_\theta$, then control it with another, non-learned function and still get near-optimal performance (with respect to the judge evaluation), we would have a very safe multi-judge evaluation system that performs really well.
However, we highlight possible risks of our work:
To summarize, our simplified work leaves room for a lot of possibly promising experiments that might elucidate the optimal way of doing multi-judge aggregation, and we believe those would be very cool to follow up on.