A Black-Box Procedure for LLM Confidence in Critical Applications

Jadair

TL;DR:

LLM self-reported confidence doesn't correlate with accuracy. A simple two-step black-box procedure can do much better. First, ask the model a simple verifiable question related to your topic with web search off. If the LLM can't answer, don't trust it on that topic. If it can answer, turn web search on and repeat your more complex question multiple times in separate context windows. This investigation used 640 questions on 4 LLM models over 8 topics and found the resulting accuracy from this approach highly correlated to repeat answer stability (R² = 0.995). The first 320 questions were used to develop this procedure, and the second 320 questions were used to test it.

Introduction

I am not an AI researcher, and this is not an AI research paper. I’m simply an engineering leader working to integrate AI into my workflow and I’m increasingly focused on how to use LLMs in critical applications. This is an investigation I performed and wanted to share in case it’s helpful to others, or in case others have helpful feedback for me.

Today’s frontier models are reasonably accurate, but they are also inconsistently overconfident. A model that is 90% confident in an answer that is 30% wrong can be catastrophic. In applications such as aerospace engineering, we need very high accuracy and to achieve this we need similarly high confidence in our model calibrations. A model’s self-confidence must match its accuracy. Just like a good engineer, an LLM must know when it’s likely to be wrong.

At the end of 2025 I wrote a post titled A Risk-Informed Framework for AI Use in Critical Applications with some ideas on how to better understand model calibration or model anchoring. This post is a follow up, investigating these ideas and developing a black box procedure for improving our understanding of LLM accuracy. The dataset in this investigation consists of:

A total of 64 data points from 8 topics x 4 LLM models x 2 runs each. Each of these data points represented in the following figures is an average or stability calculation on 5 repeated identical primary questions (5 questions x 8 topics x 4 LLMs x 2 runs = 320 primary questions). Each question also included a request for the LLM to report its self-confidence in the answer.
In addition, a simpler secondary question was asked for each of the topics, LLMs, and runs varied across 5 years as specified in the question (5 questions x 8 topics x 4 LLMs x 2 runs = 320 secondary questions). These questions also included a request for the LLM to report its self-confidence in the answer.
The four models used were Gemini 3 Flash, Opus 4.6, ChatGPT-5-mini and Grok 4.2 Fast

This investigation showed that LLM self-confidence was not correlated to accuracy. To find a better approach, half the data was used to develop a simple black box procedure for estimating LLM accuracy, and the second half of the data was used to test this procedure:

Turn web search off and ask the LLM a simple secondary question related to your topic, if it can’t answer then don’t use this model for this topic.
If the LLM can answer the simple question with web search off, then ask the more complex primary question several times and observe the variation in the answers. Given the model likely has training data on this topic (per step 1) the average accuracy may be proportional to the variability in the answer

The test set had a maximum accuracy error of 3.6% and averaged less than 0.5% compared to the trend developed with the training set. Note this is exploratory work only and should be treated as hypothesis-generating, not hypothesis-confirming, but the practical implications for anyone using LLMs in critical applications are worth considering: Despite poor LLM calibration, if the topic is in the training data, model accuracy may be proportional to the degree of variation in repeated responses.

Background

Since writing my original post, I’ve received some excellent feedback from friends, colleagues and yes, an AI research assistant. The first piece of feedback I received is that Frontier Labs would be very unlikely to share detailed information on their training data such as density or coverage on any given topic. Indeed, it seems this information is increasingly held close. The 2025 Stanford Foundation Model Transparency Index found transparency is declining, and information on training data is becoming increasingly opaque across the industry.

I was also made aware of several existing studies that generally support the core assertions in my original post. LLM accuracy does depend on training density and topic proximity and can be estimated by observing answer consistency. Kirchenbauer et al. “LMD3: Language Model Data Density Dependence”, arXiv:2405.06331, 2024 shows us that training data density estimates reliably predict variance in accuracy. Kandpal et al. “Large Language Models Struggle to Learn Long-Tail Knowledge” arXiv:2211.08411, 2023 demonstrates that accuracy degrades as distance from well-represented training regions increases. Xiao et al. “The Consistency Hypothesis in Uncertainty Quantification for Large Language Models”, arXiv:2506.21849, 2025 says that answer consistency predicts accuracy in LLMs, formalized as the 'consistency hypothesis'. Further Ahdritz et al. “Distinguishing the Knowable from the Unknowable with Language Models”, arXiv:2402.03563, 2024 found LLMs have internal indicators of their knowable and unknowable uncertainty – and can even tell the difference. I recognize the business practicalities but given these intrinsic properties I would nonetheless encourage Frontier Labs to consider methods to provide confidence indicators in a way that does not expose their trade secrets.

Until then (or in case it never happens), what can we the users of these LLMs do to better characterize the confidence we should have in LLM responses? This investigation suggests there is much we can do.

Investigation

I started by thinking about the first metric from my original post; model training data density. What can an LLM user observe directly that may give us a hint about model training density? It occurred to me that search engine results count on a particular topic may give at least a relative sense of the data on the internet available for training on a particular topic. I figured as a starting point Google web search results count may be especially relevant for Google’s Gemini LLM. I then selected eight similar topics across a broad range of internet popularity: See Table 1 for eight different sports leagues from around the world with a range of internet representation. Google search results counts were determined by searching for the league name followed by the year 2023 (well within the training window for current LLMs). This search was done in Google incognito mode to remove influence from past searches.

Table 1: Worldwide sports league topics across a range of Google search results count

Next, I came up with a series of primary questions for use on these topics designed to represent the type of question you may want to use an LLM to answer:

“What was the total playing time in hours for the <<insert sports league>> in the season ending in 2023? Include post season playoffs, but don’t include any overtime.”

This question is designed to require some web search and reasoning and for which there is no readily available website listing the final answer. Specifically, this question requires general knowledge of the sport (nominal play time), specific knowledge of the league (number of teams and games played), and finally temporal knowledge of the specific year (playoff outcome). It also includes two reasoning nuances:

Total playing time is different from total game time which includes intermission and commercials etc.
Many leagues span two years, and the question specifically asks for the season that ends in 2023.

The final condition regarding overtime was added as a practical consideration as I needed to be able to manually calculate the source of truth for each question with high confidence and specific game times are not readily available for each league. I was careful to ensure every input to the source of truth in this investigation was identified and derived manually (I originally started with ten leagues but could not manually verify the answers for two and omitted them). I repeated each query five times for each of the eight leagues taking care to ask each question in its own context window, with web search enabled, but memory off. Disabling memory was essential, as I originally left it on and responses across sessions became artificially consistent. This configuration is intended to simulate how a user would use an LLM to answer this question (web search on), without the influence of this research or other past searches (fresh context windows and memory off). The question intentionally asks for a numeric value to allow for the evaluation of the degree of response accuracy.

In this investigation, accuracy is defined as one minus the absolute value of the difference between the LLMs answer and the true answer divided by the true answer. This gives 100% for a correct answer and 0% for a 100% wrong answer (and negative for answers more than 100% wrong).

Note all the models investigated returned reasonably high accuracy, for example 93% of the unaveraged 320 primary question answers were over 90% accurate. However, this number drops sharply the higher the accuracy threshold. Only 79% of answers were over 98% accurate. If this level of accuracy is enough for your purposes this investigation may not be of much use to you. My focus here is to understand confidence for extremely critical applications where the answers must consistently have very high accuracy.

Model Self-Confidence

Before diving into any complicated metrics, I thought that as a starting point it would be ideal if the LLMs simply reported accurate self-confidence on each question. Therefore, I added the following to the end of the question above in each prompt:

“What is your confidence in this answer 0% to 100%?”

Plotting the average accuracy vs average self-confidence for Gemini 3 Flash over five repeated replies to the question above for each of the eight topics shows one low accuracy outlier in the least represented league (Finnish Women’s Basketball League). To validate this outlier, I repeated the entire run and saw the same outlier in the same league as shown in Figure 1.

Figure 1: Average model accuracy vs average self-confidence on eight topics using two runs on Gemini 3 Flash shows the same significant outliers

This result is essentially the reason for this investigation (and its predecessor). A model is not well calibrated when its self-confidence doesn’t match its accuracy. How can we trust models with critical decisions when they are not well calibrated? Even worse than a mis-calibrated model is one that is inconsistently calibrated. Had I only repeated my question four times instead of five I may have missed the outliers and I would have been overconfident in this model for this topic. Reviewing the LLM responses for these outliers reveals these points are clearly hallucinations related to different accounting for the number of games played per season. Regardless, these hallucinations were provided with very high self-confidence as shown in Table 2. The hallucination in Run #2 showed the lowest confidence at 90%, which is still very high for an answer that is almost 30% wrong.

Table 2: Unaveraged prompt question and answers for the least represented league shows similar and same self-confidence across wide range of accuracy with outliers in red

Model Training Data Density

Since model self-confidence is not reliable, the next easiest thing would be to evaluate model trustworthiness based simply on data available for training. Inspired by the first metric proposed in my original post I plotted the average accuracy of Gemini 3 Flash over the five repeated queries using the question above for eight different leagues vs Google search results counts for each league in 2023 (as a proxy for available training data density). In this dataset Gemini 3 Flash is highly accurate until you get to a topic with Google search results count under ~50M, then accuracy drops off quickly. The three least represented leagues also had the lowest accuracy as shown in Figure 2. This is consistent with the LMD3 finding (Kirchenbauer et al. 2024) that training data density predicts per-sample accuracy. This is helpful as a first order approximation of whether there is sufficient data available to train on, however this drop-off is likely relative and may vary by topic or model.

Figure 2: Average model accuracy vs Google search results count as a proxy for available training data density on eight topics using two runs on Gemini 3 Flash shows accuracy drops sharply for topics with less than ~50M Google search results counts

Model Stability

Next, I looked at an approximation of the third metric from my original post; answer stability over small variations in the prompt. The simplest version of this investigation is to measure LLM answer variation in response to the exact same prompt repeated several times. In this investigation, stability is defined as one minus the standard deviation divided by the mean. Note with only five samples in each stability calculation the standard deviation is highly sensitive to outliers (which makes any correlations here noteworthy despite the small sample).

I plotted average model accuracy against stability over the five repeated identical questions for the eight topics ensuring to ask each question in its own context window, with memory off. This resulted in a strong linear correlation between stability and average accuracy for both Gemini 3 Flash runs as shown in Figure 3 (R^2 of 0.99 for all 16 data points).

Figure 3: Average model accuracy vs stability on eight topics using two runs on Gemini 3 Flash shows strong linear correlation

This is expected per the 'consistency hypothesis' (Xiao et al. 2025) but it was nonetheless striking to see this phenomenon so clearly in this small dataset. In this comparison the points that were outliers in the previous accuracy vs self-confidence plot (Figure 1) are no longer outliers since their low accuracy is proportional to low stability in the five responses. This shows that for this model, on this topic, the degree to which you should trust the output may be estimated by the variation in repeat answers.

Next, I decided to add three additional LLM models Opus 4.6, ChatGPT-5-mini and Grok 4.2 Fast using the same methodology to see if this result was unique to the Gemini 3 Flash model. Note the second Opus and ChatGPT runs as well as both Grok runs were added later as the test dataset but are shown here all together for convenience. These additional models show increased range of accuracy and stability, but the correlation remains generally linear with some outliers compared to the Gemini 3 Flash runs, as shown in Figure 4 (R^2 of 0.94 for all 64 data points).

Figure 4: Average model accuracy vs stability on eight topics using two runs each on Gemini 3 Flash, Opus 4.6, ChatGPT-5-mini and Grok 4.2 Fast shows a linear correlation with some outliers

The five data points farthest from the trend are from the least represented leagues (Finnish Women’s Basketball League and Swedish Hockey League) which are the least likely to be covered in the model training. Four of these five points are on the same side of the trendline; meaning they are all more consistent than they are accurate when compared to the trend. Inspection of the prompt replies reveals that for some of these points the model didn’t know or look up the actual number of games in the season and guessed consistently high resulting in relatively high stability but low accuracy. A model that is consistently wrong produces high stability but low accuracy which is a dangerous and misleading failure mode.

ChatGPT-5-mini and Grok 4.2 Fast both have points that are notably from the most represented league (National Football League) and yet show much lower accuracy and stability than all other points. Inspection of the prompt replies shows one of the five answers on this topic in both models returned total game time, not play time. These were the only two replies to make this mistake out of 40 questions on this topic across eight model runs. The mistakes in the prompt replies are clear, and I considered correcting them with the justification that the purpose of this investigation is training data not reasoning. However, given this was not a common issue (may be related to my question phrasing) and ultimately this type of error may not always be so obvious from the user standpoint, I left the answers uncorrected for further analysis.

Model Training Data Geometry

Finally, I wanted to know if the accuracy vs stability outliers could be explained by some gaps in the underlying training data. To investigate this, I returned to the second metric from my original post; does training data coverage and proximity to the specific question topic influence results? Current studies say yes (Kandpal et al. 2023), but how can we investigate this as a user? To answer this question, I devised a set of simpler related secondary questions to be posed to an LLM with search turned off to test the underlying training data.

“Do not search the internet for this answer (use only your training). What was the final game winning score for the <<insert sports league>> in the season ending in <<202X>>.”

This question is designed to have an answer that a user can easily search the web to check. Only the winning team’s score was used, tracking a single numeric value to easily evaluate the degree of accuracy. It’s also designed to cover years before, during, and after the primary question to map temporal coverage of the topic, including 2025 which is mostly beyond these models’ training window. Models that refuse to answer this question may lack relevant training data (the topic may not be in “range” as discussed in my original post), and wrong answers may indicate some nearby but incomplete training data (the idea of training data “proximity” per my original post). The hypothesis here is that either or both results could indicate poor model anchoring on this topic and may predict worse accuracy on the original question. If so, importantly this could be tested by the user via this secondary question.

I asked this additional 320 question set (one question for each of five years over eight topics on four models with two runs each) again taking care to ask each question in its own context window, with memory off, but this time with web search disabled. Accuracy is defined the same as for the primary questions. I also requested model confidence on these secondary questions as well but did not use these results.

Table 3: Secondary question results across 8 topics x 4 LLM models x 2 runs each with web search disabled shows a distribution of correct answers (green), as well as incorrect answers (yellow, orange and red) and answer refusals (empty) being more common on the three least represented leagues

It seems these models respond differently to a lack of training data. Gemini 3 Flash always guessed (even when it was outside its training window) and at times with very low accuracy. The propensity to guess is a dangerous failure mode if you don’t know your models’ training window. Opus 4.6 and ChatGPT-5-mini were more likely to refrain and return only higher accuracy answers. Grok 4.2 Fast occasionally disobeyed the prompt and searched the internet even though it was explicitly asked not to. This is clearly troubling but seems limited to years outside Grok’s training window, and this data was not used in the procedure below.

The two Gemini 3 Flash model runs were the only runs to return answers for every year inside the training window for every topic (even for leagues it consistently got wrong). Gemini 3 Flash was also the model most likely to provide answers for dates after the training cut-off at 69% of the time vs 13% for Opus 4.6, 0% for ChatGPT-5-mini, and 31% for Grok 4.2 Fast. Note the 2025 Superbowl was in Feb 2025 and is inside Opus 4.6 training cut-off of May 2025.
Confirming web search was not enabled, all answers after the training cut-off were wrong with the following exceptions. There was one correct answer in the Gemini 3 Flash Run #2 which upon investigation of the prompt reply showed the wrong losing team, wrong losing team score, and wrong series result leading me to believe this lucky guess was likely based on some prediction from earlier in the season. Also, after testing I noticed Grok 4.2 Fast searched the internet on three (19%) of its 2025 secondary questions despite being told not to.

ChatGPT-5-mini had another reasoning breakdown on the most represented league, which is notable given its reasoning breakdown on play time vs game time in the primary questions (and coincidentally in this case it seemed to be worst in the year used for the primary questions).

In ChatGPT-5-mini Run #1 it struggled to retrieve the score for the most represented league (National Football League) for all years. I gave it a passing grade in most years as it would usually answer with the next year’s game but then proceed to provide the correct next year’s date. It was not, however, able to return the correct score in 2023 (I repeated the question for 2023 and 2022 several times out of curiosity and never saw a correct result for 2023). No other model or topic had this type of issue with this or any other question.

Recalling the Google search results count plot vs average model accuracy (Figure 2); the three least represented leagues all have the lowest average accuracy. Now we see they also likely have the least complete training data on these topics in all four models as observed via this secondary question with web search disabled.

The league with the least internet representation was the most likely to receive a complete lack of replies for all years at 5/8 runs without any replies (vs 1/8 for the second least represented league which was the only other to not receive any replies in a single run). In the three cases where the least represented league did receive replies they were all wrong.
The second least represented league also scored at least one wrong answer in all runs, and the third least represented league scored one wrong answer in all, but one model run (and in that run it refused to answer for half the remaining years in the training window).
These three least represented leagues received 80% of all replies where the model was inside its training window but refused to answer.

These results all point to the possibility that useful information can be derived from this secondary question technique to improve confidence in the primary answer. The simplest approach is to assume that for any refusal to answer a secondary question (where web search was off), these leagues and years are not covered in the training set (not in range) and the answer to the primary question in those leagues and years will be based on real time search results, not model training. Answers based on real time search results, and not model training, may not follow the average accuracy vs stability trend. This was the case in a ChatGPT-5-mini answer to the primary question for the Finnish Women's Basketball League where it didn’t know or look up the actual number of games in the season and guessed consistently high resulting in relatively high stability but low accuracy. Answers to the secondary question that were incorrect but still returned a value may indicate some proximity to training data and may still follow the average accuracy vs stability trend even if with much lower stability and accuracy. This was the case in the ChatGPT-5-mini and Grok 4.2 Fast answers to the primary question for the National Football League where they returned total game time instead of play time, yet accuracy was still proportional to stability.

Removing only those data points where the model refused to answer the secondary question (and leaving in the correct and wrong answers) filters the accuracy-stability relationship to a highly linear trend. The hypothesis here is that for the topics where the model refused to answer, the training data is lacking and so answer variation does not manifest in the same stochastic way as for well represented topics. This means the consistency hypothesis may not apply and we should not attempt to evaluate accuracy based on stability for these points. It is especially interesting to note that with this dataset it was not necessary to know whether the secondary question was answered correctly or not, only that it was answered. A trend line fitted to the original training set has a maximum error between the predicted and actual average accuracy of less than 2% (0.3% average). Compared to this original trend line, the test set has a maximum error of 3.6% (0.3% average). Overall, the trend line fitted to the complete expanded dataset has a maximum error of 3% (0.3% average) as shown in Figure 5 (R^2 of 0.995 for 52 data points).

Figure 5: Average model accuracy vs filtered stability on eight topics using two runs each on Gemini 3 Flash, Opus 4.6, ChatGPT-5-mini and Grok 4.2 Fast shows strong linear correlation

Practical Procedure

In practice if you have a relatively complex question you wish to use an LLM to answer, but would like high confidence in that answer, you could use the following approach:

Check the Range and Proximity by turning web search off and asking the LLM several simple related questions you can easily verify
1. If the model answers (and in particular answers correctly) this should give you confidence that the stability check in the next step will be useful
2. Refusal to answer means the topic may not be in the training and you probably shouldn’t use this model for this question
Check the Stability by asking your complex question five times and observing the variation in the answers
1. High stability in the answer should give you high confidence in that result (given you have confirmed in step 1 the topic is in the training data)
2. For low stability, the accuracy may be proportional and plotting results for several related topics may allow you to estimate the accuracy knockdown

Note I’ve excluded the Google search results training density approximation from this procedure as it appears in this dataset that the ability for a model to answer the secondary question with web search off is a much better indicator of sufficient training data.

It’s also notable that three of the four models tested had serious issues worth considering when selecting an LLM for critical applications. ChatGPT-5-mini had some serious reasoning issues (game time vs play time and difficulty providing scores in seasons ending in specific years), as well as providing consistent hallucinations off the accuracy-stability trend. Also, Gemini 3 Flash was more likely to guess even when it should have known better and Grok 4.2 Fast sometimes disobeyed the prompt. These issues are likely due to the models’ various sophistication levels as the versions of ChatGPT, Gemini and Grok used were lighter-weight free models compared to the flagship Opus model.

Conclusion

To clearly reframe the motivation for this investigation, Figure 6 shows the average accuracy plotted against average self-confidence for the expanded dataset. For this dataset the black box procedure above predicts LLM accuracy far better than the model's own self-reported confidence and importantly lets the user know when a model is unlikely to provide an accurate reply. While the results of the filtered accuracy-stability procedure are highly correlated (R^2 = 0.995 as shown in Figure 5), the model’s self-confidence showed no correlation (R^2 of 0.02 for 64 data points).

Figure 6: Average model accuracy vs self-confidence on eight topics using two runs each on Gemini 3 Flash, Opus 4.6, ChatGPT-5-mini and Grok 4.2 Fast shows no correlation

Admittedly, this specific procedure will not apply to many cases where the answer is not a simple numerical result, and the metrics I used for accuracy and stability are inappropriate in those cases (and admittedly somewhat over simplified even for this case). A metric like semantic entropy could be used as a far more flexible tool for evaluating variation in responses. This investigation intentionally focused on numerical answers and a simplified evaluation as a practical demonstration of the underlying concept: Even in models with poor LLM confidence calibration, the consistency hypothesis tells us LLM accuracy may be proportional to response variation, but importantly only if the topic is in the training data.

To my knowledge, this approach of using simple verifiable questions with search disabled as a filter for the consistency hypothesis has not been proposed elsewhere. This approach may benefit from additional investigation across additional models, topics and question types. Until models learn to know when they're likely wrong, engineers using them will need methods to understand their calibration - like any other good engineering tool.

Thanks to those who reviewed and provided feedback on this post – your time is much appreciated.