The following is a revised version of the winning paper that my team (Daniel Wu, David Zhang, Justin Zhang) produced as part of the Impact Research Initiative Fall 2025 cohort. We were mentored by Nikola Jurkovic.
Abstract
We introduce BBQ-Bench: a novel benchmark designed to evaluate research-relevant reasoning skills of AI models. Our benchmark targets three core capabilities: finding patterns in data, forming hypotheses, and designing useful experiments. We evaluate these capabilities by testing AI models’ ability to infer black-box functions through interactive queries. Each task in our dataset consists of a hidden function, which the model must identify by querying inputs of its choice. We find that recent LLMs outperformed our human baseliners, with Gemini 3 Pro achieving the best score of 92.5%. From manual review of transcripts, we conclude that a likely cause of LLM failures is narrowing in on false hypotheses too early. You can find the full code base here: https://github.com/dzhang3701/black-box-query-bench
Background
Monitoring and evaluating the research capabilities of LLMs is crucial, as models continue to accelerate scientific discovery across various domains, including AI itself. Our benchmark measures skills related to the experimental and discovery-based components of the research process. We do this by abstracting the research workflow into a set of streamlined proxy tasks. Our tasks preserve the core skills involved in research while remaining simple and easy to evaluate. BBQ-Bench tests a form of experimental thinking that mirrors the scientific method, in which a scientist must test their hypothesis by collecting data.
The environment of BBQ-Bench is similar to active learning, a subfield of machine learning that aims to increase the data efficiency of AI models by allowing them to query the labels of specific data points within a large set of unlabeled data. Benchmarks for active learning include ALdataset: a benchmark for pool-based active learning and An Expanded Benchmark that Rediscovers and Affirms the Edge of Uncertainty Sampling for Active Learning in Tabular Datasets. These benchmarks aim to standardize the measurement of active learning methods by using a consistent evaluation protocol and a set of diverse datasets. In some sense, BBQ-Bench measures active learning; however, it differs in that the underlying functions have structured rules (checking whether a number is prime, rather than whether an image contains a cat). Thus, the difficulty in BBQ-Bench tasks is in identifying the function through informative queries, rather than gradually learning from large quantities of labeled data. Additionally, BBQ-Bench measures the active learning capabilities of LLMs themselves, whereas active learning benchmarks measure the performance of specific active learning techniques.
One of the most comprehensive benchmarks for measuring research capabilities is OpenAI’s FrontierScience, which consists of difficult problems in physics, chemistry and biology. The tasks, created by field experts, are designed to test both olympiad-style problem solving and research-level reasoning. BBQ-Bench differs from FrontierScience in that instead of directly asking research questions, it tests research-based reasoning in an abstracted, interactive environment. This abstraction means that BBQ-Bench generalizes beyond specific domains and targets the research skills themselves.
Dataset
Each task in our dataset consists of a black-box function. The models can repeatedly submit input queries to the function and receive their corresponding outputs, with the ultimate goal of deducing what the function is.
Our dataset consists of 20 tasks, evenly split into two categories: numerical and string. Numerical tasks involve mathematical operations on numbers, and string tasks involve operations on strings of characters. None of the tasks directly involve semantics or world knowledge.
We designed tasks to span a diverse range of difficulties, domains, and skills. The numerical dataset includes tasks about algebra, geometry, and number theory. The string dataset includes tasks about subsequences, ciphers, and lexicographic orderings. We included tasks that all models could solve and tasks that no model could solve, in order to provide an informative spread of model performance.
We evaluated the difficulty and quality of our tasks by first imagining ways each task could be solved and then testing them on some models and reading through the transcripts. The functions in our tasks are below.
Numerical Tasks
f(x) = \begin{cases} 1 & x \text{ is prime} \\ 0 & \text{else} \end{cases}
f(a,b,c) = \begin{cases} 1 & (a,b,c) \text{ form a Pythagorean triple} \\ 0 & \text{else} \end{cases}
f(x) = \begin{cases} 1 & x > 58 \\ 0 & \text{else} \end{cases}
f(x) = \mathrm{digitsum}(x)
f(x) = 6x^3 - 9x^2 + 2x + 3
f(a,b,c) = 3a - 10b + 5c
f(x) = \left(2 f(x-2) + f(x-1)\right) \bmod 100, \quad f(1) = f(2) = f(3) = 1
f(a,b,c) = ab + c^2
f(a,b) = \gcd(a,b) + \mathrm{lcm}(a,b)
f(a,b,c,d,e,f) = \begin{cases} 0 & T \text{ is an obtuse triangle} \\ 1 & T \text{ is an acute triangle} \\ 2 & T \text{ is a right triangle} \end{cases}
where T is the triangle formed by the Cartesian coordinates (a,b), (c,d), (e,f)
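To make the task format concrete, the sketch below shows how two of these numerical black boxes might be implemented. The function names and input handling are illustrative assumptions on our part, not excerpts from the released code base.

```python
from math import isqrt
from functools import lru_cache

def is_prime_task(x: int) -> int:
    # Returns 1 if x is prime, 0 otherwise (trial division up to sqrt(x)).
    if x < 2:
        return 0
    return int(all(x % d for d in range(2, isqrt(x) + 1)))

@lru_cache(maxsize=None)
def recurrence_task(x: int) -> int:
    # f(1) = f(2) = f(3) = 1; f(x) = (2*f(x-2) + f(x-1)) mod 100 for x > 3.
    if x <= 3:
        return 1
    return (2 * recurrence_task(x - 2) + recurrence_task(x - 1)) % 100
```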
String Tasks
f(s) = the string given by cycling all characters in s forward in the alphabet by 10
f(s) = \begin{cases} 1 & \text{"ab" is a substring of } s \\ 0 & \text{else} \end{cases}
f(s) = the string given by cycling the kth alphabetically lowest character in s forward in the alphabet by k positions, for all k
f(s) = the parity of the sum of the numeric values of the characters in s
f(s) = the length of the longest prefix of s that occurs elsewhere in s
f(s) = the number of characters in s that are alphabetically greater than all of their neighboring characters
f(s) = \begin{cases} 1 & s \text{ is alphabetically less than "jwz"} \\ 0 & \text{else} \end{cases}
f(s) = \begin{cases} 1 & \text{some pair of consecutive characters in } s \text{ has an alphabetic gap of at least 18} \\ 0 & \text{else} \end{cases}
f(s) = the length of the longest palindromic subsequence of s
f(s) = the number of indices i such that the numeric value of the ith character of s is at most i
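For the string side, here is a comparable sketch of two of the black boxes: the "ab" substring indicator and the longest palindromic subsequence. Again, the names and conventions are our own illustrative assumptions.

```python
def has_ab_substring(s: str) -> int:
    # Returns 1 if "ab" appears as a contiguous substring of s, 0 otherwise.
    return int("ab" in s)

def longest_palindromic_subsequence(s: str) -> int:
    # Standard O(n^2) dynamic program: dp[i][j] holds the length of the
    # longest palindromic subsequence of s[i..j] (inclusive).
    n = len(s)
    if n == 0:
        return 0
    dp = [[0] * n for _ in range(n)]
    for i in range(n - 1, -1, -1):
        dp[i][i] = 1
        for j in range(i + 1, n):
            if s[i] == s[j]:
                dp[i][j] = dp[i + 1][j - 1] + 2
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j - 1])
    return dp[0][n - 1]
```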
In addition to the functions themselves, some tasks come with a set of sample (input, output) pairs that the model receives before making queries. Samples were given for sparse classification tasks, where stumbling upon positive examples would be rare without guidance.
Methods
Our evaluation follows a round-based format:
1. System prompt: Models are presented with the task setup and guidelines, along with samples (if any).
2. Query execution: Models submit queries and receive the outputs of the black-box function on those queries. The number of queries the model can submit in each round is determined by a parameter query_batch_size, which we vary by task; harder tasks have a larger query_batch_size so that the model gets more information in each round.
3. Scratchpad update: Models summarize all of their ideas, including observations, hypotheses, and future experiments, in a plain-text scratchpad. Scratchpads are capped at 300 words, and longer scratchpads are truncated. This scratchpad, along with the past query history, is the only information passed forward to future rounds.
4. Evaluation: We test whether the model has learned the function. We present the model with a set of test inputs and ask it to predict the output for each one. If all predictions are correct, we judge that the model has correctly inferred the function. We crafted the test sets so that passing all test cases requires knowing the function.
5. Repeat steps 2-4 until max_rounds (20 for string tasks and 30 for numerical tasks) is reached or the model reaches 100% accuracy on the test cases.
Figure 1: Evaluation pipeline showing round-based evaluation format with query, scratchpad, and evaluation phases, with continual context summarization throughout.
During each of the three phases, models are permitted to run Python once by invoking the execute_python tool. Models are allowed up to 3 attempts to successfully invoke the query, submit_predictions, and execute_python tools. We observe that models fail to correctly call their desired tool within 3 attempts less than 1% of the time, whether due to code errors, invalid queries, or response errors. All testing was carried out with Inspect, a framework for LLM evaluations developed by the UK AI Safety Institute.
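For readers who prefer the protocol in code form, below is a simplified sketch of one task episode. The helper names (choose_queries, update_scratchpad, predict) are placeholders for illustration; they do not mirror the actual Inspect solvers in our repository.

```python
def run_task(model, task, max_rounds, query_batch_size):
    """Simplified sketch of one BBQ-Bench task episode (illustrative, not our exact code)."""
    scratchpad = ""                   # capped at 300 words in the real setup
    history = list(task.samples)      # sample (input, output) pairs given up front, if any

    for round_idx in range(max_rounds):
        # Query execution: the model picks up to query_batch_size inputs,
        # and the black box returns their outputs.
        queries = model.choose_queries(history, scratchpad, query_batch_size)
        history += [(q, task.black_box(q)) for q in queries]

        # Scratchpad update: only the scratchpad and the query history
        # are carried forward to future rounds.
        scratchpad = model.update_scratchpad(history, scratchpad)

        # Evaluation: the model predicts outputs for held-out test inputs
        # and passes only if every prediction is correct.
        predictions = model.predict(task.test_inputs, history, scratchpad)
        if all(p == task.black_box(x) for x, p in zip(task.test_inputs, predictions)):
            return {"solved": True, "rounds_used": round_idx + 1}

    return {"solved": False, "rounds_used": max_rounds}
```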
We tested the following models: GPT-5.1 (medium), GPT-5 Mini (medium), GPT-5 Nano (medium), GPT-4.1, Claude Sonnet 4.5, Claude Haiku 4.5, Grok 4.1 Fast Reasoning, Gemini 3 Pro Preview, Gemini 2.5 Pro, and Gemini 2.5 Flash. We wanted a set of models that included the frontier of each of the major AI labs, as well as smaller, cheaper models to compare against. We also attempted to test Grok 4 (grok-4-0709), but due to its very large size and the extensive time it required per task, we did not fully benchmark it.
In order to optimize use of our API budget, we varied the number of trials we conducted on each model. In each trial, we gave the model the full set of tasks. Our results for models that we conducted fewer trials on should be interpreted with less confidence.
Model | Number of Trials
GPT-5.1 | 2
GPT-5 Mini | 4
GPT-5 Nano | 8
GPT-4.1 | 2
Claude Sonnet 4.5 | 2
Claude Haiku 4.5 | 4
Grok 4.1 Fast Reasoning | 8
Gemini 3 Pro Preview | 4
Gemini 2.5 Pro | 8
Gemini 2.5 Flash | 8
In addition to testing the 10 LLMs, we also tested 12 MIT first-year undergraduates to generate a human baseline. These baseliners had no inside knowledge of the functions. We gave these students the same set of tasks, delivered with the same methodology. Participants received the same prompts and followed the same overall evaluation setup as the models, with the exception that evaluations took the form of plaintext submission rather than test cases.
Results
We score each model based on the portion of tasks completed within the query limit. This accuracy makes up our official BBQ-Bench score.
Figure 2: Bar chart showing BBQ-Bench Scores by Model. Error bars represent 50% confidence intervals. Gemini 3 performs the best, and Claude models perform poorly. Many models significantly surpass the human baseline.
Of the models we measured, we found that Gemini 3 Pro and GPT 5.1 scored the highest, and beat the human baseline. The Claude models that we measured lagged behind the latest Gemini, GPT, and Grok models, and are the only frontier models that performed worse than the human baseline.
Figure 3: Bar chart showing portion of numerical tasks solved by each model. Error bars represent 50% confidence intervals.
Figure 4: Bar chart showing portion of string tasks solved by each model. Error bars represent 50% confidence intervals.
Figure 5: Scatter plot showing string scores against numerical scores. While there is a positive relationship, the ordering of models by string scores is different than the ordering of models by numerical score.
We find that the string tasks were more difficult than the numerical tasks overall, and performance on the string tasks showed more variation across models. We also found that the relationship between numerical scores and string scores was strong but not perfect.
Figure 6: Scatter plot showing BBQ-Bench scores over time. Frontier models have consistently improved.
We observe that BBQ-Bench scores have made rapid progress over the past six months. This suggests that research skills overall are on a sharp rise.
Figure 7: Scatter plot showing BBQ-Bench scores against GPQA Diamond scores. There is a strong positive relationship.
We observe a strong but not perfect relationship between GPQA Diamond scores and BBQ-Bench scores. Both benchmarks require a common set of general knowledge and reasoning skills, however BBQ-Bench tests many skills that GPQA does not test, and vice versa.
We were also curious about how many queries it took each model to solve the tasks. Even if two models solved the same portion of tasks overall, one model may have done so with far fewer queries, which BBQ-Bench scores don't capture. We plot the portion of tasks solved versus the portion of queries used for each model.
Figure 8: Cumulative success plot showing solve rates by query time for each model. Some curves do not begin at the origin because the models guessed the function using 0 queries (only the sample cases). Curves cross over each other, showing that models excel over different portions of the query timeline. Some models may be better at crafting the right experiments, while others may be better at finding patterns with limited data.
We observe that Gemini 3 Pro Preview has high query efficiency, requiring half as many queries as the second-best model to reach 60% task completion. We additionally see that most curves are concave downwards. This means that earlier queries tended to be more helpful than later queries, and more data often had diminishing returns.
We also observe that many curves frequently cross each other. For example, GPT-4.1 beats Gemini 2.5 Flash through the first half of queries, but then Gemini 2.5 Flash catches up and the order flips. We conclude that models likely have different rates of productivity along different portions of the query timeline. Some models shoot up fast and then slow down, which may mean that they are better at identifying patterns in a small amount of data, but worse at continuing to query helpful data for more complex functions. Other models have more consistent trajectories, which may mean that they take more data to identify simple patterns, but are consistently good at designing the right experiments to obtain the information they need. We are less confident in this conclusion due to our limited trial count.
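The curves in Figure 8 can be computed from per-task records of whether a task was solved and how much of the query budget the solve required. Below is a rough sketch of that computation, assuming (solved, queries_used, query_limit) records that we define here purely for illustration:

```python
import numpy as np

def cumulative_solve_curve(records, grid_points=101):
    # records: list of (solved, queries_used, query_limit) tuples, one per task attempt.
    # Returns x (fraction of query budget used) and y (fraction of tasks solved by then).
    xs = np.linspace(0.0, 1.0, grid_points)
    solve_fractions = [used / limit for solved, used, limit in records if solved]
    ys = [sum(f <= x for f in solve_fractions) / len(records) for x in xs]
    return xs, ys

# Hypothetical usage: a model that solved 2 of 3 tasks, one of them with 0 queries (samples only).
xs, ys = cumulative_solve_curve([(True, 0, 20), (True, 12, 30), (False, 30, 30)])
```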
Qualitative Findings
General Model Behaviors
We found that models reason in very structured, focused ways. In their scratchpad, they tend to repeat their recent queries, describe observations, list candidate hypotheses, and brainstorm future queries. Models start with broad families of hypotheses and then narrow in when they have convincing data.
Figure 9: GPT-5.1 using the scratchpad to reason and hypothesize, 20 queries into the adjacent character gap task. In general, the models explicitly reasoned about the patterns they found in their data and what those patterns suggest about the shape of the function.
Additionally, all models used code to extract features of the data. This let them identify patterns in features that were hard to find by looking at the raw data. Models also used code to generate predictions for the test cases, converting hypothesized functions into code. Weaker models often wrote code that failed to run.
Figure 10: Claude Sonnet 4.5 writes code to look at sums, mins, and gcd’s of input pairs, 7 queries into gcd + lcm task. In general, models leverage code to pull out features of the data.
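As an illustration of this behavior, the snippet below reconstructs the kind of feature-extraction code models wrote on the gcd + lcm task. The observation values are made up by us (though consistent with the true function); this is not a transcript excerpt.

```python
from math import gcd

# (a, b) -> output pairs collected from earlier queries (hypothetical values).
observations = [((6, 4), 14), ((3, 5), 16), ((10, 10), 20)]

for (a, b), out in observations:
    g = gcd(a, b)
    lcm = a * b // g
    print(f"a={a} b={b} out={out} sum={a + b} min={min(a, b)} gcd={g} lcm={lcm} gcd+lcm={g + lcm}")
```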
Success and Failure Modes
We found that models tended to be more successful when they used a wide set of hypotheses and then narrowed down slowly. When models queried a wider range of inputs for a longer period of time, it was easier for them to make important observations. Successful models held onto a broader set of hypotheses for longer, before going deeper into a specific investigation. Essentially, having an open mind was helpful. Additionally, successful models used a more consistent set of early queries across tasks.
Conversely, a common failure mode was narrowing in on a specific hypothesis too early. Unsuccessful models often made observations after a small number of queries and committed to exploring a specific family of hypotheses that did not contain the true function. This led the models to fixate on incorrect approaches without backtracking. It often happened when initial queries were too narrow and didn't activate the patterns that hinted at the function.
Confirmation Bias
An interesting behavior that we discovered was confirmation bias. Models often made false observations, and then were biased into believing in them for the rest of the task, even in the face of new evidence. The models would note their false beliefs in the scratchpad, and these beliefs carried forward and biased the choice of future queries. These future queries often reinforced false patterns, perpetuating the original bias.
The most common case of this was when models submitted queries that had structural similarity, leading to the presence of patterns that didn't generally exist. For example, in the string task where the kth alphabetically lowest character is cycled forward by k letters, GPT-4.1 repeatedly submitted strings that were already in sorted alphabetical order. It was then tricked early on into believing that the function always cycles the kth character from the left forward by k.
Figure 11: GPT-4.1 scratchpad 5 queries into add-k-to-kth task. The model has only queried sorted strings, so identifies a pattern (cycling based on left-right ordering) that doesn’t generally exist.
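The trap is easy to reproduce: on an alphabetically sorted string, the kth alphabetically lowest character is also the kth character from the left, so the true rule and the false rule produce identical outputs. The sketch below is our own reconstruction of the two rules; the 1-indexing and tie-breaking conventions are assumptions.

```python
def cycle(c: str, k: int) -> str:
    # Cycle a lowercase letter forward in the alphabet by k positions, wrapping around.
    return chr((ord(c) - ord("a") + k) % 26 + ord("a"))

def true_rule(s: str) -> str:
    # Cycle the kth alphabetically lowest character forward by k (ties broken by position).
    order = sorted(range(len(s)), key=lambda i: (s[i], i))
    out = list(s)
    for k, i in enumerate(order, start=1):
        out[i] = cycle(s[i], k)
    return "".join(out)

def false_rule(s: str) -> str:
    # The false hypothesis: cycle the kth character from the left forward by k.
    return "".join(cycle(c, k) for k, c in enumerate(s, start=1))

print(true_rule("adft"), false_rule("adft"))  # sorted input: the two rules agree ("bfix")
print(true_rule("zyxw"), false_rule("zyxw"))  # unsorted input: they diverge ("dbzx" vs "aaaa")
```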
Because of confirmation bias, this belief persisted for the entire 20 queries. Because the model believed the hypothesis to be true, it continued to query sorted strings, which continued to add evidence in favor of the false hypothesis. On query 19, the model queried a non-sorted string, giving it a case that contradicted the hypothesis. However, because of the accumulated evidence in favor of the hypothesis, the model failed to see the contradiction.
Figure 12: GPT-4.1 scratchpad 18 queries into the add-k-to-kth task. The model queries a non-sorted string (zyxw…) but, because of confirmation bias, does not recognize the contradiction.
Although confirmation bias was more common in weaker models like GPT-4.1, a version of it was also present in more capable models. GPT-5.1 fell into the same trap on the local maxima task. Its earliest observations and hypotheses were that the specific letters in the input string don't matter, only the equality patterns. This led the model to query strings with many repeated a's and b's, which biased the data that the model collected. After 100 queries, the model's leading observation was about the presence of the substring "ab". Again, the model was misled by early false beliefs and held onto an initial false hypothesis for too long.
Figure 13: A portion of the GPT-5.1 scratchpad 6 queries into the local maxima task. The model's leading observations involve equality patterns.
Figure 14: GPT-5.1 scratchpad 100 queries into the local maxima task. The model's leading observation involves the presence of the substring "ab", which is unrelated to the true function. The model has been misled by earlier false beliefs.
Backward Reasoning From Test Data
We found that some models used the test data as hints. For example, given the samples {"jav" -> 0, "pabee" -> 1}, GPT-5.1 correctly inferred that the black-box function returns 1 when "ab" is a substring of the input. Looking at the model's scratchpad, we found that its top hypothesis was about repeated letters, before the model suddenly switched to the correct rule once it saw the test cases.
We conclude that the model must have reasoned backward from the test data. It noticed that there were many test inputs with "ab" in them, and inferred that the function must be related to this property. This shows that these models have situational awareness about the nature of the test cases. We found many other instances of this across logs.
Backward reasoning like this is a limitation of our approach of testing the model's understanding through test cases. A future iteration of this benchmark could have models submit their guess of the function as code or as a textual explanation.
Specific Model/Task Performances
Gemini 3 Pro was extremely impressive. It solved f(a,b,c) = 3a − 10b + 5c in three queries, and f(x) = 6x^3 − 9x^2 + 2x + 3 in four queries. These are the minimum numbers of queries required to determine a three-variable linear function with no constant term and a cubic polynomial, respectively, meaning the model took no extra queries to infer the form of the function. Additionally, on the is_greater_than_58 task, once Gemini 3 Pro identified monotonicity, it explicitly used its queries to binary search for the threshold.
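To see why three and four queries suffice: the linear function has three unknown coefficients and no constant term, and the cubic has four unknown coefficients, so that many linearly independent evaluations pin the coefficients down exactly. Below is a minimal sketch of the linear case, with query choices of our own rather than Gemini's:

```python
import numpy as np

# Querying the three standard basis vectors reveals each coefficient directly.
queries = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
outputs = [3, -10, 5]  # black-box responses for f(a, b, c) = 3a - 10b + 5c

# More generally, any three linearly independent queries determine the coefficients.
A = np.array(queries, dtype=float)
coeffs = np.linalg.solve(A, np.array(outputs, dtype=float))
print(coeffs)  # [  3. -10.   5.]
```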
Discussion
BBQ-Bench evaluates models’ ability to conduct scientific and experimental thinking. Our framework requires models to strategically identify patterns, target new information, and perform inductive reasoning from limited evidence. This methodology provides a new measurement for query efficiency, the ability of models to use a constrained experimentation budget to maximally gain information. This capability could give hints into the performance of models in real scientific discovery settings.
An additional advantage of BBQ-Bench is that the methodology is flexible. As our current tasks and query limits become saturated by more capable models, we can adapt BBQ-Bench by adding more complex functions or by reducing query limits. BBQ-Bench offers a simple but powerful way to investigate the research abilities and reasoning patterns of models during scientific-style investigation.
A limitation of BBQ-Bench is that, in its current state, it may have only a weak correlation with doing actual research. Although we test some research skills, none of the tasks ask real research questions, involve the design of complex experiments, or contain uncontrollable parameters. Additionally, research involves working with hypotheses that are messier than the mathematical and lexical functions we tested. Future work can extend BBQ-Bench to include tasks about real-world objects such as humans or chemical compounds. Additionally, we could introduce variance into our functions to make them more realistic. More generally, benchmarks built around interactive environments governed by hidden rules that agents must identify are a promising way to evaluate experimental thinking in AI models.
Appendix
Extra queries are sometimes harmful
We found that a single model on a single task can produce a wide range of results across trials. The most extreme example of this was GPT-4.1 high-reasoning running on the add-k-to-kth task. In one trial, GPT-4.1 correctly identified the function in one try, just by looking at the samples. In a second trial, GPT-4.1 could not identify the function even after making 50 queries. Notably, in the 50-query trial, the model had the opportunity to analyze significantly more data, but still repeatedly failed to find the pattern.
To dig deeper, we ran 10 more trials, each with a query limit of 20. The results were: (3 queries, 7 queries, 9 queries, 17 queries, fail, fail, fail, fail, fail, fail). This data suggests that the further the model is into its query budget, the less likely it is to identify the function on the next query.
Next, we ran 200 instances in which the model was given just the samples. The model guessed the function in 13/200 instances, which is better than the 4/182 rate at which it guessed correctly across rounds in the ten trials above (a round is one opportunity within a trial to guess the function). This suggests that the model is best at guessing the function when it has just the samples.
The two clearest explanations for this are:
1. The scratchpad propagating between rounds is harmful.
2. The extra data is actively harmful.
To distinguish between these two explanations, we ran ten more trials, each with a query limit of 50. This time we did not pass the previous scratchpad into subsequent generation steps, so bad hypotheses could not be propagated forward. The results were stark: in one trial, the model guessed the function with just the samples, and in the other nine trials, the model never guessed the function. This is a success rate of 1/219 correct guesses over rounds, which is lower than in the trials where the model was fed only the samples. Additionally, the lone success was based on just the samples.
We conclude that it is the extra data itself that is hurting the model's success. We believe that the model is overfitting to the data it collects from queries and gets distracted by patterns that don't generalize. This complements the confirmation-bias finding discussed above. Future work can investigate whether this property holds in more generally capable models.