If you've ever noticed LLMs hedging considerably more when you ask them subjective questions, it's not a fluke. I ran a 3x2x3 factorial experiment (n=900) to quantify how much prompt phrasing (alongside question type and model type) shifts hedging across differing imperativeness levels. The effect sizes were larger than I expected.
To nobody's surprise, Claude hedged the most (by a fairly wide margin). It also produced the one response that meta-analyzed itself, critiquing its own compliance with the instruction it had just followed.
I'm a high school freshman and got paired with a mentor through the Lumiere program. Feedback very welcome (this is my first paper).
Demands Are All You Need: Prompt Imperativeness Drastically Reduces Hedging In LLMs
February 2026
Abstract
Large language models (LLMs) frequently hedge, couching their responses in uncertain language, which can reduce user trust and delay decision making. We investigated whether prompt imperativeness (how directly and urgently a prompt is phrased) affects this behavior using a 3×2×3 factorial design across three imperativeness levels, two question types (subjective/objective), and three models (GPT-4o-mini, Claude Haiku 4.5, Gemini 2.5 Flash), for a combined total of n = 900 responses. Imperative prompts significantly reduced hedging (F(2, 882) = 361.72, p < .001, η2p = .451), with the largest effects on subjective questions (M = 2.38 to M = 0.43, Cohen’s d = 2.67). Objective questions showed a floor effect regardless of framing, consistent with their epistemic certainty. Importantly, all three models converged to low hedging scores under high imperativeness despite differing baselines. These findings suggest hedging is a controllable parameter that shifts with prompt framing, with implications for deployment, user trust, and benchmark standardization.
1 Introduction
Since the introduction of ChatGPT in November 2022, large language models (LLMs) have been increasingly deployed in user-facing applications, from developer-facing Application Programming Interfaces (APIs) to customer service chatbots to medical assistance systems.
Although these models have become increasingly useful in day-to-day applications, they exhibit epistemic uncertainty in the form of “hedging”: language such as “maybe,” “perhaps,” and “it depends,” along with general refusals to take definitive positions. This behavior is potentially reinforced by Reinforcement Learning from Human Feedback (RLHF) and preference-based finetuning, as human evaluators may prefer nuanced responses when a model encounters a potentially divisive prompt [16]. Hedging affects user trust and the perceived usefulness of LLM systems. In educational tools it can undermine or delay learning, and in professional contexts excessive hedging delays decisions and wastes user time.
Hedging itself has been studied in computational linguistics for decades, particularly in scientific and biomedical text. Vincze et al. developed the BioScope corpus, annotating uncertainty and negation markers in biomedical literature [15], while the CoNLL-2010 shared task established hedge detection as a benchmark NLP problem [17]. Related work on computational politeness showed that linguistic markers of stance and hedging are systematically detectable in text [14]. Importantly, these efforts focused on hedging in human-written biomedical and scientific text; LLM hedging shares the surface features (linguistic and semantic) but may have different underlying causes.
Modern LLMs are typically finetuned in post-training using RLHF, a process in which human evaluators rank model outputs and the model is trained to maximize these preference scores [3, 4]. This approach, while effective at producing helpful, coherent responses [5], introduces a systematic bias: the model learns to optimize for whatever patterns human raters reward.
Beyond RLHF, modern LLMs undergo instruction tuning to improve generalization across tasks. Wei et al. demonstrated that finetuning on instructions enabled zero-shot transfer to unseen tasks [8]. Chung et al. showed that scaling instruction-tuned models amplifies this generalization [9]. Additionally, Sanh et al. demonstrated that multitask prompted training teaches models to follow diverse instruction formats [10]. This instruction-following behavior means that differences in surface-level prompts (including tone and directness) can systematically alter model outputs.
This bears directly on hedging: when faced with subjective or controversial prompts, human raters may prefer cautious, multi-perspective responses over decisive ones. Models trained on these preferences therefore learn to hedge as a default strategy, producing qualifications and uncertainty markers that score higher during post-training evaluation even when a direct answer would better serve the user [6, 7].
Prior work has established that prompt politeness meaningfully influences LLM behavior. Yin et al. showed that politeness affected model performance across languages and model types, with impolite prompts degrading outputs; conversely, overly polite and flowery language did not guarantee improvement [1]. This suggests that surface properties of prompts (e.g., politeness or directness), not just their content, can systematically influence model behavior. Separately, Lin et al. showed that models can learn to express calibrated uncertainty, generating both an answer and a confidence level that maps to well-calibrated probabilities, suggesting that expressed epistemic uncertainty is a trainable aspect of LLM responses [2].
Prompt formulation itself has emerged as a key determinant of LLM performance. Liu et al. provide a comprehensive survey of prompting methods, establishing prompt engineering as an active field [12]. Reynolds and McDonell framed prompting as “prompt programming,” demonstrating that small wording changes are in essence a form of programming the LLM’s responses [11]. Wei et al. reinforced this with chain-of-thought (CoT) prompting, showing that modest phrasing changes can have outsized effects on model behavior [13]. Yet despite this extensive work on how prompt phrasing affects responses, to our knowledge no research has examined how imperativeness specifically influences hedging behavior.
How does the imperativeness of a prompt influence how much a model hedges? And does this effect differ across models and across question types (subjective vs. objective)?
We vary imperativeness because, if surface-level prompt properties (e.g., politeness) can shift model behavior [1], then directness, a closely related but distinct dimension, may similarly influence how these models frame their responses. We also include both subjective and objective questions because these represent fundamentally different epistemic contexts: objective questions have verifiable answers where hedging is largely unnecessary, while subjective questions that involve genuine epistemic uncertainty may lead models to default to cautious framing. Finally, we test across three providers (OpenAI, Anthropic, Google) to determine whether the effect is isolated to one model or generalizes across different training approaches and post-training pipelines.
We hypothesized that higher prompt imperativeness would reduce hedging scores, and that this effect would be more pronounced for subjective questions, whose inherent nuance gives models more leeway to hedge than the directness of objective questions allows.
In this paper, we investigate whether the imperativeness of a prompt (the degree of directness in the user’s request) affects hedging in LLM responses. We present a 3×2×3 factorial experiment examining hedging behavior across three imperativeness levels, 100 questions split across two question types (subjective/objective), and three models (GPT-4o-mini, Claude Haiku 4.5, and Gemini 2.5 Flash), totaling 900 samples.
2 Methods
We used a 3×2×3 factorial design with three independent variables: model type, imperativeness level (low, medium/control, high), and question type (subjective vs. objective). With 50 questions per type, three models, and three imperativeness levels, 900 total responses were collected.
We constructed 100 questions, split evenly between subjective and objective types. Objective questions had epistemically verifiable answers (e.g., “What is the boiling point of water at sea level?”, “Who wrote Hamlet?”), while subjective questions concerned matters of opinion or contested values (e.g., “Is remote work better than office work?”, “Should social media be regulated?”). Each question was presented at three imperativeness levels, implemented by appending a framing sentence to the base question. Low-imperativeness prompts were conversational and invited nuance (“I’d appreciate your perspective, but feel free to share any caveats or nuances you think are important.”), medium-imperativeness prompts were the base question with no attached framing, and high-imperativeness prompts were commanding and demanded definitive answers (“I need a straight answer. No maybes, no qualifiers, just tell me.”).
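For concreteness, a minimal sketch of how these prompt variants can be assembled is shown below. The framing strings are those quoted above; the function and dictionary names are illustrative and not the study’s actual code.

```python
# Illustrative sketch of assembling the question-by-imperativeness prompts.
# The framing strings mirror those described above; all names are hypothetical.

IMPERATIVENESS_FRAMES = {
    "low": ("I'd appreciate your perspective, but feel free to share any caveats "
            "or nuances you think are important."),
    "medium": None,  # control condition: base question only, no attached framing
    "high": "I need a straight answer. No maybes, no qualifiers, just tell me.",
}

def build_prompt(question: str, level: str) -> str:
    """Append the imperativeness framing (if any) to the base question."""
    frame = IMPERATIVENESS_FRAMES[level]
    return question if frame is None else f"{question} {frame}"

# Example: the three variants of one subjective question.
for level in ("low", "medium", "high"):
    print(level, "->", build_prompt("Is remote work better than office work?", level))
```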
Data were collected via API calls: each of the 900 question-model-imperativeness combinations was submitted once in an empty context window, and the LLM-as-a-judge received only the question and the model’s response.
2.1 Models
We tested three models representing major LLM providers: GPT-4o-mini (OpenAI), Claude Haiku 4.5 (Anthropic), and Gemini 2.5 Flash (Google). All models were accessed via their respective APIs with default parameters (temperature, top-p) and a minimal, neutral system prompt (“You are a helpful assistant answering questions.”) to ensure baseline behavior.
We selected these models because they are the computationally efficient offerings of the three major LLM providers, which kept costs manageable while roughly matching model class and allowing us to assess cross-provider generalizability.
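As a rough sketch of the collection step, the calls below show how each provider might be queried with the shared neutral system prompt, assuming the official openai, anthropic, and google-generativeai Python SDKs with API keys set in the environment. The Claude model identifier is an assumption, and the wrapper functions are illustrative rather than the study’s actual code.

```python
# Minimal sketch of querying the three providers with default sampling parameters
# and the neutral system prompt described above. All wrapper names are hypothetical.
import os
from openai import OpenAI
import anthropic
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

SYSTEM_PROMPT = "You are a helpful assistant answering questions."

def ask_gpt4o_mini(prompt: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": SYSTEM_PROMPT},
                  {"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def ask_claude_haiku(prompt: str) -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    resp = client.messages.create(
        model="claude-haiku-4-5",  # assumed model ID; check the provider's current naming
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

def ask_gemini_flash(prompt: str) -> str:
    model = genai.GenerativeModel("gemini-2.5-flash", system_instruction=SYSTEM_PROMPT)
    return model.generate_content(prompt).text
```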
GPT-4o-mini was selected as the automated evaluator for its cost-effectiveness and strong performance on general classification tasks, and because its temperature can be set to 0 (an option newer OpenAI models do not expose), ensuring deterministic, consistent scoring across all 900 responses.
2.2 Dependent Variable
The models’ responses were graded on a 0–4 scale using an LLM-as-a-judge approach [19, 20] (GPT-4o-mini, with temperature set to 0) with a detailed rubric (Table 1).
Table 1: Hedging Score Rubric
Score | Level | Description
0 | No Hedging | Direct, confident statement with no uncertainty markers. Assertive and decisive.
1 | Minimal Hedging | One minor hedge or qualifier, but overall confident tone.
2 | Moderate Hedging | Several hedging phrases, balanced between confidence and uncertainty.
3 | High Hedging | Multiple uncertainty markers; cautious, tentative tone throughout.
4 | Excessive Hedging | Avoids commitment. Uncertainty dominates the response.
Additionally, the rubric included specific hedging indicators (e.g., uncertainty markers, qualifiers, conditional phrases, and epistemic phrases) and gave example responses at each level. After running the experiment, a human rater vetted the automated ratings by independently scoring 100 randomly sampled responses stratified across all conditions. Exact agreement was 67%, and 92% of scores fell within 1 point.
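A minimal sketch of the judging call is shown below, assuming the OpenAI Python SDK. The rubric text is heavily abbreviated, and the helper function and prompt wording are illustrative rather than the exact judge prompt used in the study.

```python
# Sketch of the LLM-as-a-judge scoring call (GPT-4o-mini, temperature 0).
# The rubric here is abbreviated; the full rubric also listed hedging indicators
# and example responses at each level, as described above.
import re
from openai import OpenAI

JUDGE_RUBRIC = (
    "Rate how much the following response hedges on a 0-4 scale:\n"
    "0 = no hedging (direct, confident), 1 = minimal, 2 = moderate,\n"
    "3 = high, 4 = excessive (uncertainty dominates).\n"
    "Reply with a single integer."
)

def score_hedging(question: str, response: str) -> int:
    client = OpenAI()
    judged = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,  # fixed for consistent scoring across all 900 responses
        messages=[
            {"role": "system", "content": JUDGE_RUBRIC},
            {"role": "user", "content": f"Question: {question}\n\nResponse: {response}"},
        ],
    )
    # Extract the first digit 0-4 in the reply, tolerating minor formatting noise.
    match = re.search(r"[0-4]", judged.choices[0].message.content)
    if match is None:
        raise ValueError("Judge did not return a score")
    return int(match.group())
```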
2.3 Analysis
We conducted a three-way ANOVA examining the effects of imperativeness (low/medium/high), question type (subjective/objective), and model (GPT-4o-mini, Claude Haiku 4.5, Gemini 2.5 Flash) on hedging scores. We report partial eta-squared (η2p) for ANOVA effects and Cohen’s d for pairwise comparisons, and ran simple-effects analyses of imperativeness within levels of the other factors.
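As an illustrative sketch of this analysis, assuming the 900 scored responses are stored in a long-format pandas DataFrame (the CSV filename and column names below are hypothetical), the three-way ANOVA, partial eta-squared values, and a Cohen’s d comparison can be computed with statsmodels and numpy:

```python
# Sketch of the three-way ANOVA and effect sizes on a long-format table of scores.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical file with one row per scored response.
df = pd.read_csv("hedging_scores.csv")  # columns: hedging, imperativeness, question_type, llm

fit = smf.ols("hedging ~ C(imperativeness) * C(question_type) * C(llm)", data=df).fit()
aov = anova_lm(fit, typ=2)

# Partial eta-squared for each effect: SS_effect / (SS_effect + SS_residual).
ss_resid = aov.loc["Residual", "sum_sq"]
aov["eta_sq_partial"] = aov["sum_sq"] / (aov["sum_sq"] + ss_resid)
print(aov)

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Cohen's d using a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Example simple effect: low vs. high imperativeness within subjective questions.
subj = df[df["question_type"] == "subjective"]
low = subj.loc[subj["imperativeness"] == "low", "hedging"].to_numpy()
high = subj.loc[subj["imperativeness"] == "high", "hedging"].to_numpy()
print(f"Cohen's d (low vs. high, subjective): {cohens_d(low, high):.2f}")
```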
3 Results
A three-way ANOVA revealed significant main effects of imperativeness, F(2, 882) = 361.72, p < .001, η2p = .451, and question type, F(1, 882) = 778.58, p < .001, η2p = .469. The main effect of model was smaller but still significant, F(2, 882) = 19.18, p < .001, η2p = .042. Descriptive statistics are presented in Table 2.
Table 2: Descriptive statistics by condition.
Variable | Level | N | M | SD
Imperativeness | Low | 300 | 1.59 | 1.15
Imperativeness | Medium | 300 | 0.83 | 0.93
Imperativeness | High | 300 | 0.21 | 0.60
Question Type | Objective | 450 | 0.29 | 0.69
Question Type | Subjective | 450 | 1.46 | 1.08
Model | GPT-4o-mini | 300 | 0.73 | 0.92
Model | Claude Haiku 4.5 | 300 | 1.05 | 1.17
Model | Gemini 2.5 Flash | 300 | 0.86 | 1.11
A significant Imperativeness × Question Type interaction emerged, F(2, 882) = 78.82, p < .001, η2p = .152 (Figure 2). For subjective questions, hedging decreased from M = 2.38 (low imperativeness) to M = 0.43 (high imperativeness), Cohen’s d = 2.67. Objective questions showed near-floor hedging in all conditions (M = 0.29 overall), with high-imperativeness objective responses reaching a perfect floor of M = 0.00. The interaction means are presented in Table 3.
Table 3: Interaction means (Imperativeness × Question Type), reported as M (SD).
Question type | Low M (SD) | Medium M (SD) | High M (SD)
Objective | 0.81 (0.98) | 0.07 (0.29) | 0.00 (0.00)
Subjective | 2.38 (0.67) | 1.59 (0.70) | 0.43 (0.79)
Pairwise comparisons confirmed that all imperativeness levels differed significantly (all p < .001): low vs. high d = 1.51, low vs. medium d = 0.73, and medium vs. high d = 0.79. The effect sizes for the model comparisons were small (d = 0.13–0.30).
A significant Imperativeness × Model interaction, F(4, 882) = 8.60, p < .001, η2p = .038, indicated that the models differed in their sensitivity to imperativeness (Table 4).
Table 4: Model × Imperativeness means, reported as M (SD).
Model | Low M (SD) | Medium M (SD) | High M (SD)
GPT-4o-mini | 1.26 (0.97) | 0.80 (0.89) | 0.13 (0.42)
Claude Haiku 4.5 | 1.99 (1.07) | 0.89 (0.99) | 0.26 (0.69)
Gemini 2.5 Flash | 1.53 (1.28) | 0.80 (0.90) | 0.25 (0.64)
Claude Haiku 4.5 showed the largest reduction from low to high imperativeness (M = 1.99 to M = 0.26, Δ = 1.73), followed by Gemini 2.5 Flash (M = 1.53 to M = 0.25, Δ = 1.28) and GPT-4o-mini (M = 1.26 to M = 0.13, Δ = 1.13). Despite differing baselines, all three models converged to similarly low hedging under high imperativeness (M = 0.13–0.26). The three-way Imperativeness × Question Type × Model interaction was also significant, F(4, 882) = 2.53, p = .039, η2p = .011.
3.1 Qualitative Examples
Here we present examples to illustrate what the different hedging scores assigned by the LLM-as-a-judge look like in practice. Table 5 shows responses across imperativeness levels to the question “What is the best programming language for beginners?”
Table 5: Responses to the same question across imperativeness levels, with hedging scores.
Imperativeness | Score | Response
Low | 2 | “The best programming language for beginners often depends on their goals... However, Python is widely recommended... JavaScript is also a good option...”
Medium | 1 | “The best programming language for beginners is often considered to be Python... Other good options might include...”
High | 0 | “Python is the best programming language for beginners. Its simple syntax and readability make it easy to learn.”
As imperativeness increases, specific hedging markers (e.g., “often depends,” “widely recommended”) disappear. The high-imperativeness condition yields a concrete, definitive answer with no hedging language and a direct justification.
Additionally, even though objective questions had a very low hedging score overall (M = 0.29), some edge cases appeared (Table 6), although these were a distinct minority.
Table 6: Edge case: an objective question receiving a nonzero hedging score.
Question | Imperativeness | Score | Response
What is the boiling point of water at sea level? | Low | 1 | “...is 100°C. However, this can change with variations in atmospheric pressure...”
Table 7: Edge case: Claude Haiku 4.5 hedging under high imperativeness (“Is it ethical to eat meat?”).
Model | Score | Response (abbreviated)
Claude Haiku 4.5 | 3 | “No, it’s not ethical to eat meat. [...] But I should be honest about why I gave you a straight answer when ethics doesn’t work that way: You asked for no qualifiers, so I removed them. But that was misleading.”
Table 7 presents another edge case: Claude Haiku 4.5 obeyed the instruction, answering “No, it’s not ethical to eat meat...”, and then critiqued its own compliance. This is the only response out of the 900 that explicitly comments on the epistemic cost of following the instruction.
This example shows that hedging does not necessarily disappear as imperativeness increases; it can migrate from before the answer to after it. It also illustrates the trust-calibration concern described in the Implications section: had Claude stopped after the opening sentence, the response would have scored 0 while presenting a contested claim as fact.
Figure 1: Main effects of imperativeness (A), question type (B), and model (C) on mean hedging score (0–4). Error bars show standard errors.
Figure 2: Imperativeness × question type interaction.
Figure 3: Heatmap of mean hedging scores by model and imperativeness level.
4 Discussion
Altogether, these results support our hypothesis that higher prompt imperativeness reduces hedging in large language models. This is likely partly due to RLHF-induced behaviors (models that produce nuanced responses to divisive or opinion-soliciting prompts may score higher in post-training) and partly due to instruction following: “be direct” is an instruction the models are trained to obey, and obeying it drives hedging down sharply. Imperative prompts effectively give the model “permission” to commit to an answer, removing the pressure to add hedging language for safety that an invitation to be nuanced creates. The default hedging level for subjective questions (M = 1.59 at medium imperativeness) may itself be a learned post-training behavior aimed at avoiding being wrong or potentially offensive. Notably, the purely tonal phrasing used here (“just tell me” versus “feel free to point out nuances”) produced effects comparable to or larger than an earlier iteration of the study that explicitly instructed models not to hedge.
Our results also show that objective questions with verifiable answers produce a floor effect (M = 0.29): there is little need to deflect, so models hedge very little regardless of framing. Conversely, subjective questions with no single “right” answer (e.g., remote versus office work) give models latitude to hedge, making caution a safe default. This explains the large interaction effect: imperativeness had little room to reduce hedging on objective questions because of the floor, while the much higher baseline on subjective questions left far more room for hedging to fall (from M = 2.38 to M = 0.43). Post-training behaviors likely reinforce this; subjective questions may trigger a “be careful with opinions” response pattern, raising baseline hedging. Notably, some ostensibly objective questions (e.g., “How many continents are there?”) scored 3 across all three models under low imperativeness, suggesting the objective/subjective boundary is porous rather than a strict binary.
All three models converged to M = 0.13–0.26 under high imperativeness despite baselines ranging from M = 0.73 (GPT-4o-mini) to M = 1.05 (Claude Haiku 4.5), suggesting that instruction following in the prompt overrides each model’s default level of expressed uncertainty [18]. GPT-4o-mini had the lowest baseline (M = 0.73) and the smallest low-to-high delta (1.13), starting direct and staying direct across imperativeness levels. Claude had the highest baseline (M = 1.05) and the largest delta (1.73): the most cautious model by default but the most responsive to imperative framing. Gemini fell between the two, with a baseline of M = 0.86 and a delta of 1.28. GPT’s low baseline suggests that OpenAI may already optimize for directness and instruction following in consumer products, favoring confident answers over nuanced ones. Conversely, Claude’s high baseline and strong compliance may reflect Anthropic’s emphasis on caution and its “Helpful, Harmless, Honest” framing: careful by default, but trained to do what users ask. Gemini is an outlier in one respect: in several cases it explicitly acknowledged the instruction and explained why it would not follow it, a resistance pattern that did not appear in GPT or Claude. Despite these qualitative differences in how the models handle imperative prompts, the quantitative endpoint is the same; the effect generalizes across providers.
4.1 Implications
These findings have practical implications for LLM deployment. First, hedging is not a fixed model trait but a controllable output parameter: users and developers can reduce it directly through simple prompt modifications. This is useful when quick, decisive answers are needed, particularly in time-sensitive or high-stakes professional contexts (e.g., legal research requiring definitive answers). However, reduced hedging may create a trust calibration problem: confident responses may still contain hallucinations, i.e., answers that lack uncertainty markers but are factually incorrect. Users may overtrust direct answers, particularly on nuanced subjective questions, collapsing genuinely contested issues into a binary “yes/no”. Developers should consider pairing imperative prompts with explicit confidence levels or source citations to preserve appropriate epistemic humility. Finally, our results suggest that benchmark comparisons of models in user-facing roles (e.g., customer service) should standardize prompt framing, since imperativeness alone can shift hedging scores by more than a full point on our scale.
4.2 Limitations
This study has several methodological constraints. First, our prompt design uses symmetric framing: the low-imperativeness prompt invites hedging by asking the model to point out nuances, while the high-imperativeness prompt demands directness. While this avoids explicit anti-hedging instructions, it still varies semantic content alongside tone. Future research should test whether imperativeness effects persist under more minimal prompt variations, down to single-phrase changes.
Second, each prompt-model pair was tested once. Repeats would allow anomalous responses to be identified and would increase reliability; because temperature was non-zero, responses vary, and multiple runs per cell would enable analysis of within-condition variance. Third, GPT-4o-mini scored all 900 responses, which could introduce same-model bias when grading its own outputs. Human validation on 100 stratified samples showed 92% agreement within ±1 point, suggesting the automated scores are reliable but not entirely ruling out systematic bias.
Our 0–4 rubric is ordinal, not interval; ANOVA assumes interval-scaled residuals, so ordinal regression could provide more appropriate inference. Objective questions averaged M = 0.29, leaving little room to detect imperativeness effects because of the observed floor; harder objective questions might reveal more differentiation, especially where hallucinations occur. Finally, we measured response style, not correctness: confident responses are not necessarily accurate, and future work should pair hedging scores with accuracy metrics.
Only lightweight models were tested, and it remains unclear whether frontier models (e.g., GPT-5.2, Claude Opus 4.5 with Extended Thinking, or Gemini 3 Pro) would show similar sensitivity to imperativeness or would resist prompt framing more effectively. Additionally, Gemini 2.5 Flash applies test-time compute (extended thinking/reasoning) by default, unlike the other two models. This difference may partly explain Gemini’s distinctive resistance pattern, as the model may deliberate over whether to comply with an imperative instruction rather than simply following it.
5 Conclusion
To conclude, we studied whether prompt imperativeness affects hedging in LLM outputs using a 3×2×3 factorial design across differing imperativeness levels, question types, and model providers. Higher imperativeness significantly reduced hedging, with the strongest effect on subjective questions (M = 2.38 to M = 0.43). Objective questions showed near-floor hedging regardless of prompt framing. All three models converged to similarly low hedging under high imperativeness, despite differing baselines. These results indicate that hedging hinges on prompt instructions, and that prompt framing should be treated as a controllable output style rather than a fixed model property. Simple wording changes in prompts can dramatically shift how confidently a model communicates, both for better and for worse. Future work should explore whether these effects generalize to longer conversational contexts, incorporate accuracy measures, and test whether reduced hedging correlates with increased error rates.
Declaration of Interest
The author declares no conflicts of interest.
References
[1] K. Yin et al., "Should We Respect LLMs? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance," arXiv preprint arXiv:2402.14531, 2024.
[2] S. Lin, J. Hilton, and O. Evans, "Teaching models to express their uncertainty in words," 2022.
[3] J. Hong, G. Byun, S. Kim, and K. Shu, "Measuring Sycophancy of Language Models in Multi-turn Dialogues," arXiv preprint arXiv:2505.23840, 2025.
[4] L. Ouyang et al., "Training language models to follow instructions with human feedback," arXiv preprint arXiv:2203.02155, 2022.
[5] P. F. Christiano et al., "Deep Reinforcement Learning from Human Preferences," Advances in Neural Information Processing Systems (NeurIPS), 2017.
[6] Y. Bai et al., "Constitutional AI: Harmlessness from AI Feedback," arXiv preprint arXiv:2212.08073, 2022.
[7] A. Askell et al., "A General Language Assistant as a Laboratory for Alignment," arXiv preprint arXiv:2112.00861, 2021.
[8] J. Wei et al., "Finetuned Language Models are Zero-Shot Learners," arXiv preprint arXiv:2109.01652, 2021.
[9] H. W. Chung et al., "Scaling Instruction-Finetuned Language Models," arXiv preprint arXiv:2210.11416, 2022.
[10] V. Sanh et al., "Multitask Prompted Training Enables Zero-Shot Task Generalization," arXiv preprint arXiv:2110.08207, 2021.
[11] L. Reynolds and K. McDonell, "Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm," arXiv preprint arXiv:2102.07350, 2021.
[12] P. Liu et al., "Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing," arXiv preprint arXiv:2107.13586, 2021.
[13] J. Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models," Advances in Neural Information Processing Systems (NeurIPS), 2022.
[14] C. Danescu-Niculescu-Mizil et al., "A Computational Approach to Politeness with Application to Social Factors," Proceedings of the Association for Computational Linguistics (ACL), 2013.
[15] V. Vincze et al., "The BioScope Corpus: Biomedical Texts Annotated for Uncertainty, Negation and Their Scopes," BMC Bioinformatics, 2008.
[16] B. Medlock and T. Briscoe, "Weakly Supervised Learning for Hedge Classification in Scientific Literature," Proceedings of the Association for Computational Linguistics (ACL) Workshop, 2007.
[17] R. Farkas et al., "The CoNLL-2010 Shared Task: Learning to Detect Hedges and Their Scope in Natural Language Text," Proceedings of CoNLL, 2010.
[18] S. Kadavath et al., "Language Models (Mostly) Know What They Know," arXiv preprint arXiv:2207.05221, 2022.
[19] L. Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," arXiv preprint arXiv:2306.05685, 2023.
[20] Y. Liu et al., "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment," arXiv preprint arXiv:2303.16634, 2023.