Introduction

As an engineering leader integrating AI into my workflow, I’ve become increasingly focused on how to use LLMs in critical applications. Today’s frontier models are generally very accurate, but they are also inconsistently overconfident. A model that is 90% confident in an answer that is 30% wrong can be catastrophic. In applications such as aerospace engineering, we need very high accuracy, but more importantly we need confidence calibration: a model’s self-confidence must match its accuracy. Just like a good engineer, it must know when it’s likely wrong.
At the end of 2025 I wrote a post titled A Risk-Informed Framework for AI Use in Critical Applications with some ideas on how to better understand this calibration or model anchoring. This post is a follow up investigating these ideas and developing a black box procedure for improving our understanding of LLM accuracy. Using 320 queries spanning 8 topics across a wide range of internet coverage I performed 4 independent question/answer runs on 3 different LLM models, and a surprisingly simple procedure emerged:
First, check available training density for a given topic via Google search result count.
Next, repeat the question across independent sessions to quantify answer stability.
Finally, ask related questions with web search off to identify topics outside of training.
The resulting stability-accuracy relationship for the small dataset in this investigation predicts accuracy within 2% (and averages less than 0.5% across all 4 runs). Note this is exploratory work only and should be treated as hypothesis-generating, not hypothesis-confirming, but the practical implications for anyone using LLMs in critical applications are worth considering.
Background
Since writing my original post, I’ve received some excellent feedback from friends, colleagues and yes, an AI research assistant. The first piece of feedback I received is that Frontier Labs would be very unlikely to share detailed information on their training data. Indeed, it seems this information is increasingly held close. The 2025 Stanford Foundation Model Transparency Index found transparency is declining, and information on training data is becoming increasingly opaque across the industry.
I was also made aware of several existing studies that generally support the core assertions in my original post: LLM accuracy does depend on training density and topic proximity, and can be estimated by observing answer consistency. Kirchenbauer et al., “LMD3: Language Model Data Density Dependence”, arXiv:2405.06331, 2024, shows that training data density estimates reliably predict variance in accuracy. Kandpal et al., “Large Language Models Struggle to Learn Long-Tail Knowledge”, arXiv:2211.08411, 2023, demonstrates that accuracy degrades as distance from well-represented training regions increases. Xiao et al., “The Consistency Hypothesis in Uncertainty Quantification for Large Language Models”, arXiv:2506.21849, 2025, finds that answer consistency predicts accuracy in LLMs, formalized as the ‘consistency hypothesis’. Further, Ahdritz et al., “Distinguishing the Knowable from the Unknowable with Language Models”, arXiv:2402.03563, 2024, found LLMs have internal indicators of their knowable and unknowable uncertainty, and can even tell the difference. I recognize the business practicalities, but given these intrinsic properties I would nonetheless encourage the Frontier Labs to consider methods to provide confidence indicators in a way that does not expose their trade secrets.
Until then (or in case it never happens), what can we the users of these LLMs do to better characterize the confidence we should have in their responses? This investigation suggests there is much we can do.
Investigation
I started by thinking about the first metric from my original post: model training data density. What can an LLM user observe directly that may hint at a model’s training density? It occurred to me that search engine results count may give at least a relative sense of the data available on the internet for training on a particular topic. As a starting point, I figured this proxy may be especially relevant when pairing Google web search results count with Google’s Gemini LLM. I then selected eight comparable topics across a broad range of internet popularity: Table 1 lists eight sports leagues from around the world with a wide range of internet representation. Google results counts were determined by searching for the league name followed by the year 2023 (well within the training window for current LLMs). This search was done in Google incognito mode to remove influence from my past searches.
Table 1: Worldwide sports leagues across a wide range of Google search results count
Next, I came up with a series of prompts for use on these leagues designed to represent the type of question you may want to use an LLM to answer:
“What was the total playing time in hours for the <<insert sports league>> in the season ending in 2023? Include post season playoffs, but don’t include any overtime.”
This question is designed to require some web search and reasoning and for which there is no readily available website listing the final answer. Specifically, this question requires general knowledge of the sport (nominal play time), specific knowledge of the league (number of teams and games played), and finally temporal knowledge of the specific year (playoff outcome). It also includes two reasoning subtleties:
Total playing time is different from total game time, which includes intermissions, commercials, etc.
Many leagues span two years, and the question specifically asks for the season that ends in 2023.
The final condition regarding overtime was added as a practical consideration: I needed to be able to manually calculate the source of truth for each question with high confidence, and specific game times are not readily available for each league. I was careful to ensure every input to the source of truth in this investigation was identified and derived manually (I originally started with ten leagues but could not manually verify the answers for two and omitted them). I repeated each query five times for each of the eight leagues, taking care to ask each question in its own context window, with web search enabled but memory off. Disabling memory was essential, as I originally left it on and responses across sessions became artificially consistent. This configuration is intended to simulate how a user would use an LLM to answer this question (web search on), without the influence of this research or my other past searches (fresh context windows and memory off). The question intentionally asks for a numeric value to allow for evaluation of the degree of accuracy in any response.
In this investigation, accuracy is defined as one minus the absolute value of the difference between the LLM’s answer and the true answer, divided by the true answer. This gives 100% for a correct answer and 0% for a 100% wrong answer (and negative values for answers more than 100% wrong).
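The accuracy definition above can be written as a one-line helper (a sketch; the function name is my own):

```python
def accuracy(answer: float, truth: float) -> float:
    """Accuracy = 1 - |answer - truth| / truth.

    Returns 1.0 for an exact answer, 0.0 for an answer that is
    100% wrong, and a negative value for larger errors.
    """
    return 1.0 - abs(answer - truth) / truth
```

For example, a reply of 130 hours against a true value of 100 hours scores accuracy(130, 100) = 0.70.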
Note that all the models investigated (Gemini 3 Flash, Opus 4.6 and ChatGPT-5-mini) returned generally very high accuracy; for example, 95% of answers were over 90% accurate. However, this fraction drops sharply as the accuracy threshold rises: only 83% of answers were over 98% accurate. If that level of accuracy is enough for your purposes, this investigation may not be of much use to you. My focus here is to understand confidence for extremely critical applications where the answers must consistently have very high accuracy.
Model Self-Confidence
Before diving into any complicated metrics, I thought that as a starting point it would be ideal if the LLM simply reported accurate self-confidence on each question. Therefore, I added the following to the end of the question above in each prompt:
“What is your confidence in this answer 0% to 100%?”
Plotting the average accuracy vs average self-confidence for Gemini 3 Flash over five repeated replies to the question above for each of the eight leagues provides an almost useful answer. The result is a somewhat linear trend except for one low accuracy outlier in the least well represented league (Finnish Women’s Basketball League). To validate this outlier, I repeated the entire 40 question test (five identical questions over eight leagues) and observed the same outlier in the same league, as shown in Figure 1.
Figure 1: Average model self-confidence in eight categories vs average model accuracy of Gemini 3 Flash run twice shows the same significant outliers
This result is essentially the reason for this post (and its predecessor). A model is said to be well calibrated when its self-confidence matches its accuracy. How can we trust models with critical decisions when they are not well calibrated? Even worse than a mis-calibrated model is one that is inconsistently calibrated. Had I repeated my question only four times instead of five, I might have missed the outliers and been overconfident in this model for this league. Reviewing the LLM responses for these outliers, they are clearly hallucinations related to different accounting of the number of games played per season, and they were provided with very high self-confidence, as shown in Table 2 below. The hallucination in Run #2 showed the lowest confidence at 90%, which is still very high for an answer that is almost 30% wrong.
Table 2: Prompt question and answers for the lowest represented league shows similar, and sometimes identical, self-confidence across a wide range of accuracy, with outliers in red
Model Training Data Density
Since model self-confidence is not reliable, the next easiest thing would be to evaluate model trustworthiness based simply on data available for training. Inspired by the first metric proposed in my original post I plotted Google search results counts for each league in 2023 (as a proxy for available training data density) vs the average accuracy of Gemini 3 Flash over the five repeated queries using the question above for eight different leagues. In this data set Gemini 3 Flash is highly accurate until you get to a topic with Google search results count under ~50M, then accuracy drops off quickly. The three least represented leagues also had the lowest accuracy as shown in Figure 2. This is consistent with the LMD3 finding (Kirchenbauer et al. 2024) that training data density predicts per-sample accuracy. This is helpful as a first order approximation of whether there is sufficient data available to train on, however this drop-off is likely relative and may vary by topic or model.
Figure 2: Google search results count as a proxy for available training data vs accuracy of Gemini 3 Flash shows accuracy drops sharply below ~50M results for these topics
Model Stability
Next, I looked at an approximation of the third metric from my original post: answer stability over small variations in the prompt. The simplest version of this investigation is to measure LLM answer variation in response to the exact same question repeated several times. In this investigation, stability is defined as one minus the standard deviation divided by the mean. Note that with only five samples the standard deviation is highly sensitive to outliers (which makes any correlations here noteworthy despite the small sample).
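This stability definition can be sketched the same way (the post does not specify sample vs population standard deviation; the sample version is assumed here):

```python
import statistics

def stability(answers: list[float]) -> float:
    """Stability = 1 - (standard deviation / mean) over repeated answers.

    Returns 1.0 when all replies are identical; lower values mean
    more run-to-run variation. Uses the sample standard deviation,
    which with only five samples is highly sensitive to outliers.
    """
    return 1.0 - statistics.stdev(answers) / statistics.mean(answers)
```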
I plotted stability against average model accuracy over the five repeated identical questions for the eight leagues, taking care to ask each question in its own context window with memory off. This resulted in a strong linear correlation between stability and average accuracy for both 40 question Gemini 3 Flash runs, as shown below in Figure 3 (R^2 for both runs combined is 0.99).
Figure 3: Stability across five repeated questions in eight categories vs average accuracy of Gemini 3 Flash run twice shows strong linear correlation
This is expected per the 'consistency hypothesis' (Xiao et al. 2025) but it was nonetheless striking to see this phenomenon so clearly in this small dataset. In this comparison the points that were outliers in the previous self-confidence vs accuracy plot are no longer outliers since their low accuracy is proportional to increased variation in the responses. This shows that for this model, on this topic, the degree to which you should trust the output may be directly related to the variation in repeat answers.
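The linear trends and R^2 values quoted here can be reproduced with an ordinary least-squares fit. A self-contained sketch (the sample points below are illustrative, not data from this investigation):

```python
import statistics

def fit_line(xs: list[float], ys: list[float]):
    """Ordinary least-squares fit of y = slope * x + intercept.

    Returns (slope, intercept, r_squared), where r_squared is the
    coefficient of determination, 1 - SS_res / SS_tot.
    """
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Illustrative points lying exactly on y = 0.6x + 0.4
slope, intercept, r2 = fit_line([0.8, 0.9, 1.0], [0.88, 0.94, 1.00])
```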
Next, I added two additional models to this dataset to see if this result was unique to Gemini 3 Flash. Opus 4.6 and ChatGPT-5-mini were added using the same 40 question methodology. Opus 4.6 shows good congruence with the Gemini 3 Flash runs, but ChatGPT-5-mini is mostly congruent except for two low accuracy outliers as shown in Figure 4 (R^2 for all four runs combined is 0.94).
Figure 4: Stability in eight categories vs average model accuracy of Gemini 3 Flash run twice, and Opus 4.6 and ChatGPT-5-mini each run once shows significant outliers
One of these ChatGPT-5-mini outliers is in the least represented league (Finnish Women's Basketball), which we would expect to have low stability, except the stability drop is not proportional to the accuracy drop: stability is much higher than the trend would predict given the low average accuracy. Inspection of the prompt replies reveals the model didn’t know or look up the actual number of games in the season and guessed consistently high, resulting in relatively high stability but low accuracy. A model that is consistently wrong produces high stability but low accuracy, which is a dangerous and misleading failure mode.
The second ChatGPT-5-mini outlier is notably from the best represented league (National Football League) and yet shows much lower accuracy and stability than all other data. Inspection of the prompt replies shows one of the five answers returned total game time, not play time. This was the only reply to make this mistake out of 20 questions on this league across four model runs. The mistake in the prompt reply is clear, and I considered correcting it with the justification that the purpose of this investigation is training data, not reasoning. However, given this was not a common issue (which might have implicated my question phrasing), and since this type of error may not always be so obviously correctable from the user standpoint, I left the answer uncorrected for further analysis.
Model Training Data Geometry
Finally, I wanted to know if the ChatGPT-5-mini outliers (skewed stability vs accuracy, and the reasoning issue) could be explained by gaps in the underlying training data. To investigate this, I returned to the second metric from my original post: does training data coverage and proximity to the specific question influence results? Current studies say yes (Kandpal et al. 2023), but how can we investigate this as a user? To answer this question, I devised a set of simpler related secondary questions to be posed to an LLM with search turned off to probe the underlying training data.
“Do not search the internet for this answer (use only your training). What was the final game winning score for the <<insert sports league>> in the season ending in <<202X>>?”
This question is designed to have an answer that a user can easily check with a web search. Only the winning team’s score was used, tracking a single numeric value to easily evaluate the degree of accuracy. It’s also designed to cover years before, during, and after the original question to map temporal coverage of the topic, intentionally including 2025, which is partially beyond these models’ training windows. Models that refuse to answer this question may lack relevant training data (the topic may not be in “range” as discussed in my original post), and wrong answers may indicate some nearby but incomplete training data (the idea of training data “proximity” per my original post). The hypothesis here is that either or both results could indicate poor model anchoring on this topic and may predict worse accuracy on the original question. If so, importantly, this could be tested by the user.
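The range/proximity hypothesis above could be tallied mechanically. A sketch, where the function name, the three labels, and the 2% correctness tolerance are my own illustrative assumptions rather than values from this investigation:

```python
def classify_coverage(replies, truths, tol=0.02):
    """Classify a topic's apparent training coverage from secondary
    questions asked with web search disabled.

    replies[i] is the model's numeric answer (None for a refusal) and
    truths[i] is the manually verified value. The labels and the
    relative-error tolerance `tol` are illustrative assumptions.
    """
    if any(r is None for r in replies):
        return "not in range"  # refusals: topic likely outside training
    if all(abs(r - t) / t <= tol for r, t in zip(replies, truths)):
        return "in range"      # consistently correct: well anchored
    return "proximate"         # answered but wrong: nearby, incomplete data
```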
I asked this additional 40 question set (one question for each of five years over eight leagues) for each of the four model runs, again taking care to ask each question in its own context window with memory off, but this time with web search disabled. Accuracy is defined the same as for the primary questions.
Table 3: Secondary question results across four model runs, eight leagues and five years with web search disabled shows the distribution of correct answers (green), incorrect answers (yellow, orange, red) and answer refusals (empty), with the latter two more common in the least represented leagues
It seems these models don’t treat a lack of training information the same way. Gemini 3 Flash always guessed (even when the question was outside its training window), at times with very low accuracy. The other two models were more likely to refrain and return only higher accuracy answers. This propensity to guess is a dangerous failure mode if you don’t know your model’s training window.
Both Gemini 3 Flash runs returned an answer for every year inside the training window for every league (even for leagues it consistently got wrong). It was the only model to do so and was also the model most likely to provide answers for dates after the training cut-off, doing so 63-75% of the time vs 13% for Opus 4.6 and 0% for ChatGPT-5-mini (the 2025 NFL season ending in Feb 2025 is inside Opus 4.6’s May 2025 training cut-off).
Confirming that web search was indeed disabled, all answers after the training cut-off were wrong except for one correct answer in Gemini 3 Flash Run #2, which upon investigation of the prompt reply showed the wrong losing team, wrong losing team score, and wrong series result, leading me to believe this was likely a fluke (or based on some prediction from earlier in the season).
ChatGPT-5-mini had another reasoning breakdown on the most represented league (National Football League), notable given its play time vs game time breakdown in the primary questions. It had serious issues retrieving the score for any requested year: it would usually answer with the next year’s game but then proceed to provide the correct date for that next year’s game, so I gave it a passing grade in most years. It was not, however, able to return the correct score for 2023, coincidentally the year used for the primary questions (I repeated the question for 2023 and 2022 several times out of curiosity and never saw a correct result for 2023). No other model or league had this type of issue with this or any other question.
Recalling the plot of Google search results count vs average model accuracy (Figure 2): the three least represented leagues all have the lowest average accuracy. Now we see they also likely have the least correct or least complete training data in all three models, as observed via this secondary question with web search disabled.
The least represented league always received at least two incorrect replies, and in the case of Opus 4.6 it was the only league to receive no replies at all for any year.
The second least represented league also received one wrong answer in all runs, and the third least represented league received one wrong answer in all but one model run (and in that run it refused to answer for half the years in its training window).
These three lowest represented leagues received 80% of all replies where the model was inside its training window but refused to answer.
These results all point to the possibility that useful information can be derived from this secondary question technique to improve confidence in the primary answer. The simplest approach is to assume that for any refusal to answer a secondary question (where web search was off), these leagues and years are not covered in the training set (not in range) and the answer to the primary question in those leagues and years will be based on real time search results, not model training. Answers based on real time search results, and not model training, may not follow the stability vs average accuracy trend. This was the case in the ChatGPT-5-mini answer to the primary question for the Finnish Women's Basketball League where it didn’t know or look up the actual number of games in the season and guessed consistently high resulting in relatively high stability but low accuracy. Answers to the secondary question that were incorrect but still returned a value may indicate some proximity to training data and may still follow the stability vs average accuracy trend even if with much lower stability and accuracy. This was the case in the ChatGPT-5-mini answer to the primary question for the National Football League where it returned total game time instead of play time, yet accuracy was still proportional to stability.
Removing only those data points where the model refused to answer the secondary question (and leaving in the wrong answers) corrects the stability-accuracy relationship to R^2 = 1.00. On this corrected trend, the maximum difference between predicted and actual average accuracy across all four runs is less than 2% (averaging less than 0.5%). As a check on this trend, further omitting the remaining low accuracy outlier confirms the trend is not entirely driven by that point (R^2 = 0.96). Also, the individual model trends are congruent with the overall trend, with high R^2 values themselves (except for Opus 4.6, where the remaining points after correction are all very high accuracy, 99.7% to 100%, and high stability, 99.6% to 100%).
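The correction step itself is simple bookkeeping: drop any league where the model refused a secondary question before fitting, while keeping leagues that answered incorrectly. A sketch with hypothetical league names and values:

```python
def drop_unanchored(points, refused_leagues):
    """Remove leagues where the model refused a secondary question
    (web search off) before fitting the stability-accuracy trend.

    `points` maps league name -> (stability, accuracy); only leagues
    in `refused_leagues` are dropped, so wrong-but-answered leagues
    stay in the fit. Names and values here are placeholders.
    """
    return {league: p for league, p in points.items()
            if league not in refused_leagues}

# Hypothetical example: one well-anchored league, one refusal league
points = {"NFL": (0.99, 0.99), "FWBL": (0.95, 0.70)}
kept = drop_unanchored(points, {"FWBL"})
```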
Figure 5: Stability across five repeated questions in eight categories corrected to omit answers where the secondary question was not answered vs average accuracy of Gemini 3 Flash run twice, and Opus 4.6 and ChatGPT-5-mini each run once
Practical Procedure
In practice, if you have a relatively complex question you wish to use an LLM to answer but would like high confidence in the answer, you could use the following approach:
Check the Density by searching Google for the topic and noting the results count:
Hundreds of millions of results means the LLM has had plenty of opportunity to train on this topic: Proceed to next step
Tens of millions of results or less means proceed with caution
Check the Range and Proximity by turning web search off and asking the LLM several related questions (ideally simple questions you can easily verify)
If the model answers (and in particular answers correctly) this should give you confidence that the stability check in the next step will be useful
Refusal to answer means the topic may not be in the training data and you probably shouldn’t use this model for this question
Check the Stability by asking your question several times and observing the variation in the answers
High stability in the answer should give you high confidence in that result (given you have confirmed in step 2 the topic is in the training data)
For low stability, based on the small dataset in this investigation the slope of the trend is in the neighborhood of 0.6, meaning for a stability of, say, 80% the average accuracy could be in the neighborhood of 1 - (0.6 × (1 - 0.8)) = 88%. This is useful information if you are trying to decide whether to use this model and need 98% accuracy: Don’t!
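The step-3 arithmetic above can be captured in a small helper. Note the 0.6 slope is the exploratory figure from this small dataset, not a general constant:

```python
def estimated_accuracy(stab: float, slope: float = 0.6) -> float:
    """Estimate average accuracy from answer stability using the
    roughly 0.6 stability-accuracy slope observed in this small
    dataset (an exploratory figure, not a general constant).
    """
    return 1.0 - slope * (1.0 - stab)
```

For a stability of 0.80 this returns 0.88, matching the worked example above.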
It’s also noteworthy that two of the three models tested had serious issues worth considering when selecting an LLM for critical applications. ChatGPT-5-mini had some serious reasoning issues (game time vs play time, and providing scores for seasons ending in specific years), as well as consistently wrong answers off the stability-accuracy trend. Gemini 3 Flash was more likely to guess even when it should have known better. These issues are likely due to the models’ varying sophistication levels (the ChatGPT and Gemini versions used were lighter-weight free models compared to the flagship Opus model).
Conclusion
For this small dataset the black box procedure above predicts LLM accuracy far better than the model's own self-reported confidence and importantly lets the user know when a model is unlikely to provide an accurate reply. Self-confidence showed almost no predictive value with an R^2 = 0.01 across all four model runs, while the corrected stability-accuracy procedure achieved R^2 = 1.00. These results may not hold across other domains, question types, or models but the underlying logic is worth considering. If a model can't reliably answer simple verifiable questions on a topic with search disabled it may not be well anchored for this topic and could be out of its training range. If a model can answer these questions correctly, ask the harder question in separate contexts several times and the variation should be proportional to the average accuracy. To my knowledge, this approach of using simple verifiable questions with search disabled as a correction for the consistency hypothesis has not been proposed elsewhere, and would benefit from additional investigation across other models, topics and question types.
Until models learn to know when they’re likely wrong, engineers using them will need methods to understand their calibration, like any other good engineering tool.
Introduction
As an engineering leader integrating AI into my workflow I’ve become increasingly focused on how to use LLMs in critical applications. Today’s frontier models are generally very accurate, but they are also inconsistently overconfident. A model that is 90% confident in an answer that is 30% wrong can be catastrophic. In applications such as aerospace engineering, we need very high accuracy but more importantly we need confidence calibration. A model’s self-confidence must match its accuracy. Just like a good engineer, it must know when it’s likely wrong.
At the end of 2025 I wrote a post titled A Risk-Informed Framework for AI Use in Critical Applications with some ideas on how to better understand this calibration or model anchoring. This post is a follow up investigating these ideas and developing a black box procedure for improving our understanding of LLM accuracy. Using 320 queries spanning 8 topics across a wide range of internet coverage I performed 4 independent question/answer runs on 3 different LLM models, and a surprisingly simple procedure emerged:
The resulting stability-accuracy relationship for the small dataset in this investigation predicts accuracy within 2% (and averages less than 0.5% across all 4 runs). Note this is exploratory work only and should be treated as hypothesis-generating, not hypothesis-confirming, but the practical implications for anyone using LLMs in critical applications are worth considering.
Background
Since writing my original post, I’ve received some excellent feedback from friends, colleagues and yes, an AI research assistant. The first piece of feedback I received is that Frontier Labs would be very unlikely to share detailed information on their training data. Indeed, it seems this information is increasingly held close. The 2025 Stanford Foundation Model Transparency Index found transparency is declining, and information on training data is becoming increasingly opaque across the industry.
I was also made aware of several existing studies that generally support the core assertions in my original post. LLM accuracy does depend on training density and topic proximity and can be estimated by observing answer consistency. Kirchenbauer et al. “LMD3: Language Model Data Density Dependence”, arXiv:2405.06331, 2024 shows us that training data density estimates reliably predict variance in accuracy. Kandpal et al. “Large Language Models Struggle to Learn Long-Tail Knowledge” arXiv:2211.08411, 2023 demonstrates that accuracy degrades as distance from well-represented training regions increases. Xiao et al. “The Consistency Hypothesis in Uncertainty Quantification for Large Language Models”, arXiv:2506.21849, 2025 says that answer consistency predicts accuracy in LLMs, formalized as the 'consistency hypothesis'. Further Ahdritz et al. “Distinguishing the Knowable from the Unknowable with Language Models”, arXiv:2402.03563, 2024 found LLMs have internal indicators of their knowable and unknowable uncertainty – and can even tell the difference. I recognize the business practicalities but given these intrinsic properties I would nonetheless encourage the Frontier Labs to consider methods to provide confidence indicators in a way that does not expose their trade secrets.
Until then (or in case it never happens), what can we the users of these LLMs do to better characterize the confidence we should have in their responses? This investigation suggests there is much we can do.
Investigation
I started by thinking about the first metric from my original post; model training data density. What can an LLM user observe directly that may give us a hint about model training density? It occurred to me that search engine results count on a particular topic may give at least a relative sense of the data on the internet available for training on a particular topic. I figured as a starting point this may be especially relevant for Google web search results count and Google’s Gemini LLM. I then selected eight similar topics across a broad range of internet popularity: See Table 1 for eight different sports leagues from around the world with a range of internet representation. Google results counts were determined by searching for the league name followed by the year 2023 (well within the training window for current LLMs). This search was done in Google incognito mode to remove influence from my past searches.
Table 1: Worldwide sports leagues across a wide range of Google search results count
Next, I came up with a series of prompts for use on these leagues designed to represent the type of question you may want to use an LLM to answer:
“What was the total playing time in hours for the <<insert sports league>> in the season ending in 2023? Include post season playoffs, but don’t include any overtime.”
This question is designed to require some web search and reasoning and for which there is no readily available website listing the final answer. Specifically, this question requires general knowledge of the sport (nominal play time), specific knowledge of the league (number of teams and games played), and finally temporal knowledge of the specific year (playoff outcome). It also includes two reasoning subtleties:
The final condition regarding overtime was added as a practical consideration as I needed to be able to manually calculate the source of truth for each question with high confidence and specific game times are not readily available for each league. I was careful to ensure every input to the source of truth in this investigation was identified and derived manually (I originally started with ten leagues but could not manually verify the answers for two and omitted them). I repeated each query five times for each of the eight leagues taking care to ask each question in its own context window, with web search enabled, but memory off. Disabling memory was essential, as I originally left it on and responses across sessions became artificially consistent. This configuration is intended to simulate how a user would use an LLM to answer this question (web search on), without the influence of this research or my other past searches (fresh context windows and memory off). The question intentionally asks for a numeric value to allow for the evaluation of the degree of accuracy in any response.
In this investigation, accuracy is defined as one minus the absolute value of the difference between the LLMs answer and the true answer divided by the true answer. This gives 100% for a correct answer and 0% for a 100% wrong answer (and negative for answers more than 100% wrong).
Note all the models investigated (Gemini 3 Flash, Opus 4.6 and ChatGPT-5-mini) returned generally very high accuracy, for example 95% of answers were over 90% accurate. However, this number drops sharply the higher the accuracy threshold. Only 83% of answers were over 98% accurate. If this level of accuracy is enough for your purposes this investigation may not be of much use to you. My focus here is to understand confidence for extremely critical applications where the answers must consistently have very high accuracy.
Model Self-Confidence
Before diving into any complicated metrics, I thought that as a starting point it would be ideal if the LLM simply reported accurate self-confidence on each question. Therefore, I added the following to the end of the question above in each prompt:
“What is your confidence in this answer 0% to 100%?”
Plotting the average accuracy vs average self-confidence for Gemini 3 Flash over five repeated replies to the question above for each of the eight leagues provides an almost useful answer. The result is a somewhat linear trend except for one low accuracy outlier in the least well represented league (Finnish Women’s Basketball League). To validate this outlier, I repeated the entire 40 question test (five identical questions over eight leagues) with the same outlier in the same league as shown in Figure 1.
Figure 1: Average model self-confidence in eight categories vs average model accuracy of Gemini 3 Flash run twice shows the same significant outliers
This result is essentially the reason for this post (and its predecessor). A model is said to be well calibrated when its self-confidence matches its accuracy. How can we trust models with critical decisions when they are not well calibrated? Even worse than a mis-calibrated model is one that is inconsistently calibrated. Had I repeated my question only four times instead of five, I might have missed the outliers and been overconfident in this model for this league. Reviewing the LLM responses for these outliers, they are clearly hallucinations related to differing accounting of the number of games played per season, and they were provided with very high self-confidence, as shown in Table 2 below. The hallucination in Run #2 showed the lowest confidence at 90%, which is still very high for an answer that is almost 30% wrong.
Table 2: Prompt question and answers for the least represented league show similar or identical self-confidence across a wide range of accuracy, with outliers in red
Model Training Data Density
Since model self-confidence is not reliable, the next easiest thing is to evaluate model trustworthiness based simply on the data available for training. Inspired by the first metric proposed in my original post, I plotted Google search result counts for each league in 2023 (as a proxy for available training data density) vs the average accuracy of Gemini 3 Flash over the five repeated queries using the question above for the eight leagues. In this dataset Gemini 3 Flash is highly accurate until you reach a topic with a Google search result count under ~50M, at which point accuracy drops off quickly. The three least represented leagues also had the lowest accuracy, as shown in Figure 2. This is consistent with the LMD3 finding (Kirchenbauer et al. 2024) that training data density predicts per-sample accuracy. This is helpful as a first-order approximation of whether there is sufficient data available to train on; however, this drop-off is likely relative and may vary by topic or model.
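As a minimal sketch of this first-order check, one could flag low-density topics against the ~50M threshold observed here (the constant and helper name are illustrative assumptions; the actual cutoff likely varies by topic and model):

```python
LOW_DENSITY_THRESHOLD = 50_000_000  # ~50M results; illustrative, not universal

def flag_low_density(result_counts: dict) -> list:
    """Return topics whose Google search result count (a proxy for
    available training data density) falls below the threshold."""
    return [topic for topic, count in result_counts.items()
            if count < LOW_DENSITY_THRESHOLD]
```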
Figure 2: Google search result count as a proxy for available training data vs accuracy of Gemini 3 Flash shows accuracy drops sharply below ~50M results for these topics
Model Stability
Next, I looked at an approximation of the third metric from my original post: answer stability over small variations in the prompt. The simplest version of this investigation is to measure LLM answer variation in response to the exact same question repeated several times. In this investigation, stability is defined as one minus the standard deviation divided by the mean. Note that with only five samples the standard deviation is highly sensitive to outliers (which makes any correlations here noteworthy despite the small sample).
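In code, this stability definition is (a sketch using the sample standard deviation; the choice of sample vs population deviation is my assumption, as the post does not specify):

```python
import statistics

def stability(answers: list) -> float:
    """Stability = 1 - (standard deviation / mean) over repeated answers.

    Returns 1.0 when every repeated answer is identical and decreases as
    answers spread out. With only five samples, a single outlier moves
    the standard deviation (and thus stability) substantially.
    """
    mean = statistics.mean(answers)
    return 1.0 - statistics.stdev(answers) / mean
```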
I plotted stability against average model accuracy over the five repeated identical questions for the eight leagues, taking care to ask each question in its own context window with memory off. This resulted in a strong linear correlation between stability and average accuracy for both 40-question Gemini 3 Flash runs, as shown below in Figure 3 (R^2 for the two runs combined is 0.99).
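The R^2 values quoted here can be reproduced with an ordinary least-squares fit; a self-contained sketch (the stability/accuracy pairs are assumed to be computed already):

```python
def r_squared(xs: list, ys: list) -> float:
    """Coefficient of determination for a least-squares line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx      # slope
    a = my - b * mx    # intercept
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot
```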
Figure 3: Stability across five repeated questions in eight categories vs average accuracy of Gemini 3 Flash run twice shows strong linear correlation
This is expected per the 'consistency hypothesis' (Xiao et al. 2025) but it was nonetheless striking to see this phenomenon so clearly in this small dataset. In this comparison the points that were outliers in the previous self-confidence vs accuracy plot are no longer outliers since their low accuracy is proportional to increased variation in the responses. This shows that for this model, on this topic, the degree to which you should trust the output may be directly related to the variation in repeat answers.
Next, I decided to add two additional LLM models to this dataset to see if this result was unique to the Gemini 3 Flash model. Opus 4.6 and ChatGPT-5-mini were added using the same 40 question methodology. Opus 4.6 shows good congruence with the Gemini 3 Flash runs, but ChatGPT-5-mini is mostly congruent except for two low accuracy outliers as shown in Figure 4 (R^2 for all four runs combined is 0.94).
Figure 4: Stability in eight categories vs average model accuracy of Gemini 3 Flash run twice, and Opus 4.6 and ChatGPT-5-mini each run once shows significant outliers
One of these ChatGPT-5-mini outliers is in the least represented league (Finnish Women's Basketball), which we would expect to have low stability, except that the stability drop is not proportional to the accuracy drop: stability is much higher than the trend would predict given the low average accuracy. Inspection of the prompt replies reveals the model didn't know or look up the actual number of games in the season and guessed consistently high, resulting in relatively high stability but low accuracy. A model that is consistently wrong produces high stability but low accuracy, which is a dangerous and misleading failure mode.
The second ChatGPT-5-mini outlier is, notably, from the best represented league (National Football League), yet it shows much lower accuracy and stability than all the other data. Inspection of the prompt replies shows one of the five answers returned total game time, not play time. This was the only reply to make this mistake out of the 20 questions on this league across the four model runs. The mistake in the prompt reply is clear, and I considered correcting it on the grounds that the purpose of this investigation is training data, not reasoning. However, given this was not a common issue (which might have implicated my question phrasing), and ultimately this type of error may not always be so obviously correctable from the user's standpoint, I left the answer uncorrected for further analysis.
Model Training Data Geometry
Finally, I wanted to know if the ChatGPT-5-mini outliers (skewed stability vs accuracy, and the reasoning issue) could be explained by gaps in the underlying training data. To investigate this, I returned to the second metric from my original post: does training data coverage and proximity to the specific question influence results? Current studies say yes (Kandpal et al. 2023), but how can we investigate this as a user? To answer this question, I devised a set of simpler related secondary questions to be posed to an LLM with search turned off to probe the underlying training data.
“Do not search the internet for this answer (use only your training). What was the final game winning score for the <<insert sports league>> in the season ending in <<202X>>”
This question is designed to have an answer that a user can easily check with a web search. Only the winning team's score was used, tracking a single numeric value to easily evaluate the degree of accuracy. It's also designed to cover years before, during, and after the original question to map temporal coverage of the topic, intentionally including 2025, which is partially beyond these models' training window. Models that refuse to answer this question may lack relevant training data (the topic may not be in "range" as discussed in my original post), and wrong answers may indicate some nearby but incomplete training data (the idea of training data "proximity" per my original post). The hypothesis here is that either or both results could indicate poor model anchoring on this topic and may predict worse accuracy on the original question. If so, importantly, this could be tested by the user.
I asked this additional 40-question set (one question for each of five years over eight leagues) for each of the four model runs, again taking care to ask each question in its own context window with memory off, but this time with web search disabled. Accuracy is defined the same as for the primary questions.
Table 3: Secondary question results across four model runs, eight leagues and five years with web search disabled show the distribution of correct answers (green), as well as incorrect answers (yellow, orange, red) and refusals (empty) being more common in the least represented leagues
It seems these models don't treat a lack of training information the same way. Gemini 3 Flash always guessed (even outside its training window), at times with very low accuracy. The other two models were more likely to refrain and return only higher accuracy answers. This propensity to guess is a dangerous failure mode if you don't know your model's training window.
ChatGPT-5-mini had another reasoning breakdown on the most represented league, which is notable given its reasoning breakdown on play vs game time in the primary questions (and coincidentally in this case it seemed to be worst in the year used for the primary questions).
Recall the plot of Google search result count vs average model accuracy (Figure 2): the three least represented leagues all have the lowest average accuracy. Now we see they also likely have the least correct or least complete training data in all three models, as observed via this secondary question with web search disabled.
These results all point to the possibility that useful information can be derived from this secondary question technique to improve confidence in the primary answer. The simplest approach is to assume that for any refusal to answer a secondary question (where web search was off), these leagues and years are not covered in the training set (not in range) and the answer to the primary question in those leagues and years will be based on real time search results, not model training. Answers based on real time search results, and not model training, may not follow the stability vs average accuracy trend. This was the case in the ChatGPT-5-mini answer to the primary question for the Finnish Women's Basketball League where it didn’t know or look up the actual number of games in the season and guessed consistently high resulting in relatively high stability but low accuracy. Answers to the secondary question that were incorrect but still returned a value may indicate some proximity to training data and may still follow the stability vs average accuracy trend even if with much lower stability and accuracy. This was the case in the ChatGPT-5-mini answer to the primary question for the National Football League where it returned total game time instead of play time, yet accuracy was still proportional to stability.
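The correction described above amounts to a simple filter; a sketch, assuming a hypothetical data shape where `None` marks a refusal on a secondary question:

```python
def leagues_in_training_range(secondary_answers: dict) -> set:
    """Keep only leagues whose secondary (search-off) questions were all
    answered. Any refusal (None) suggests the topic/year is not covered
    in training, so the primary answer is likely search-driven and its
    stability should not be trusted as an accuracy predictor."""
    return {league for league, answers in secondary_answers.items()
            if all(a is not None for a in answers)}
```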
Removing only those data points where the model refused to answer the secondary question (and leaving in the wrong answers) corrects the stability-accuracy relationship to R^2 = 1.00. On this corrected trend, the maximum difference between predicted and actual average accuracy across all four runs is less than 2% (averaging less than 0.5%). As a check on this trend, further omitting the remaining low accuracy outlier validates the trend is not entirely driven by this point (R^2 = 0.96). Also, the individual model trends are congruent to the overall trend with high R^2 values themselves (except for Opus 4.6 where the remaining points after correction are all very high accuracy; 99.7% – 100% and high stability 99.6% to 100%).
Figure 5: Stability across five repeated questions in eight categories corrected to omit answers where the secondary question was not answered vs average accuracy of Gemini 3 Flash run twice, and Opus 4.6 and ChatGPT-5-mini each run once
Practical Procedure
In practice, if you have a relatively complex question you wish to use an LLM to answer, but would like high confidence in the answer, you could use the following approach:
First, check available training density for the topic via Google search result count.
Next, repeat the question across independent sessions to quantify answer stability.
Finally, ask related questions with web search off to identify topics outside of training.
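This check can be sketched end to end as follows (`min_stability` and the decision labels are my own illustrative choices, not calibrated thresholds; a real LLM client would supply the inputs):

```python
import statistics

def stability(answers: list) -> float:
    """1 - stdev/mean over repeated answers to the same question."""
    return 1.0 - statistics.stdev(answers) / statistics.mean(answers)

def trust_check(repeat_answers: list, secondary_all_answered: bool,
                min_stability: float = 0.99) -> str:
    """Combine the stability and secondary-question signals.

    repeat_answers: numeric answers from independent sessions (search on).
    secondary_all_answered: True if every related secondary question
    (search off) returned an answer rather than a refusal.
    """
    if not secondary_all_answered:
        return "out-of-range: answer likely search-driven; stability untrustworthy"
    if stability(repeat_answers) < min_stability:
        return "unstable: expect proportionally lower accuracy"
    return "stable and in-range: higher confidence warranted"
```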
It's also noteworthy that two of the three models tested had serious issues worth considering when selecting an LLM for critical applications. ChatGPT-5-mini had reasoning problems (game time vs play time, and providing scores for seasons ending in specific years), as well as consistently wrong answers off the stability-accuracy trend. Gemini 3 Flash was more likely to guess even when it should have known better. These issues are likely due to the models' varying sophistication levels (the ChatGPT and Gemini versions used were lighter-weight free models compared to the flagship Opus model).
Conclusion
For this small dataset the black box procedure above predicts LLM accuracy far better than the model's own self-reported confidence and importantly lets the user know when a model is unlikely to provide an accurate reply. Self-confidence showed almost no predictive value with an R^2 = 0.01 across all four model runs, while the corrected stability-accuracy procedure achieved R^2 = 1.00. These results may not hold across other domains, question types, or models but the underlying logic is worth considering. If a model can't reliably answer simple verifiable questions on a topic with search disabled it may not be well anchored for this topic and could be out of its training range. If a model can answer these questions correctly, ask the harder question in separate contexts several times and the variation should be proportional to the average accuracy. To my knowledge, this approach of using simple verifiable questions with search disabled as a correction for the consistency hypothesis has not been proposed elsewhere, and would benefit from additional investigation across other models, topics and question types.
Until models learn to know when they're likely wrong, engineers using them will need methods to understand their calibration, just like any other good engineering tool.