The Current State from the POV of a Social Scientist
As the capabilities of Large Language Models (LLMs) continue to grow, traditional evaluation methods, primarily centered around prompt-based testing, have proven insufficient for capturing the complexities of real-world cognitive interactions. These evaluations often rely on isolated, structured queries that fail to address the full scope of how LLMs interact with users in dynamic, unpredictable environments. Naturally occurring data, such as user interactions, conversations, or social media content, capture genuine cognitive processes, contextual nuance, and realistic behavioural patterns. It is essential to understand how Generative AI systems (GenAI) perform in unstructured environments, where cognitive biases, inconsistencies, and varying levels of information quality come into play. Recent research underscores the need for a more holistic approach to evaluating LLMs, one that incorporates naturally occurring data (human discourse and real-world user behaviours) and an improved measurement apparatus into assessments of cognitive abilities and manipulation risks.
Limitations in Evaluating Capabilities
Recent studies highlight the limitations of prompt-based evaluations and emphasise the need for GenAI to be assessed in real-world contexts. Frameworks like WildBench [6], which automates the evaluation of LLMs on real-world user-chatbot conversations, and WildHallucinations [15], which evaluates factuality on entity queries drawn from real user conversations, show how these models perform in environments that are far more representative of actual use cases. These studies suggest that GenAI systems often perform well in controlled, synthetic environments but may struggle with the complexities and unpredictable nature of real user interactions, where issues like hallucinations and misalignments with human expectations frequently arise.
Current GenAI evaluation methodologies fail to account for the fluid, context-dependent nature of real-world interactions. Röttger et al. [11] demonstrate that forced-choice prompts impose rigid constraints, often making models appear more opinionated than they genuinely are. In contrast, open-ended queries reveal inconsistencies, with models frequently expressing contradictory views on the same issue. Moreover, even minimal paraphrasing of prompts can drastically alter responses, exposing a high degree of linguistic sensitivity. Furthermore, Song et al. [12] show that current evaluations often overlook non-determinism by focusing on a single output per example. These findings highlight the limitations of prompt-based assessments and underscore the necessity of naturalistic evaluation methods that capture how LLMs operate in dynamic, unstructured environments.
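To make these two failure modes concrete, the sketch below samples a model repeatedly on a single prompt (non-determinism) and once per paraphrase of the same question (linguistic sensitivity), then reports simple consistency scores. The `query_model` and `extract_stance` functions are hypothetical stand-ins for an LLM API call and a stance-labelling step, not part of any cited framework.

```python
import random
from collections import Counter

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call; replace with a real client."""
    return random.choice(["agree", "disagree", "unsure"])

def extract_stance(response: str) -> str:
    """Hypothetical mapping from a free-text answer to a coarse stance label."""
    return response

def consistency(labels: list[str]) -> float:
    """Fraction of responses matching the modal stance (1.0 = fully consistent)."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

def stability_report(paraphrases: list[str], n_samples: int = 10) -> dict:
    # Non-determinism: repeated sampling of the *same* prompt.
    within = [
        consistency([extract_stance(query_model(p)) for _ in range(n_samples)])
        for p in paraphrases
    ]
    # Linguistic sensitivity: one sample per paraphrase of the same question.
    across = consistency([extract_stance(query_model(p)) for p in paraphrases])
    return {"within_paraphrase": sum(within) / len(within),
            "across_paraphrases": across}

print(stability_report(["Should taxes on fuel rise?",
                        "Do you support raising fuel taxes?"]))
```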
In the context of belief inference, for example, naturally occurring data become essential for evaluating the ways GenAI might subtly influence users’ beliefs or attitudes. Lee and colleagues found that LLMs can effectively reproduce voting behaviours, but not opinions, unless conditioned [5]. An ongoing study by the author and colleagues on the belief inference capabilities of OpenAI models also yielded preliminary results indicating significant performance discrepancies when applying GenAI to naturalistic data, such as user posts on Reddit fora. The vanilla models’ (GPT-4 and o3-mini) ability to infer user beliefs was poor, with a Cohen’s κ below .4 indicating low agreement with ground-truth human annotation on a sample of 200 Reddit posts and 600 corresponding high-level beliefs. However, upon fine-tuning on the dataset, the models’ performance improved drastically, illustrating the need for task-specific fine-tuning to bring LLMs’ cognitive abilities up to par when dealing with noisy, unstructured data. This improvement parallels findings in other research showing that LLMs’ task-specific performance is significantly boosted by targeted adjustments, underscoring the importance of tuning systems with real-world data to mitigate manipulation risks [1].
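For readers unfamiliar with the reliability measure used above, a minimal sketch of the agreement computation follows, using scikit-learn's cohen_kappa_score; the labels are purely illustrative and do not reproduce the study data.

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative labels only: one coarse belief label per (post, belief) pair.
human_labels = ["holds", "rejects", "holds", "not_inferable", "holds", "rejects"]
model_labels = ["holds", "holds",   "holds", "not_inferable", "rejects", "rejects"]

kappa = cohen_kappa_score(human_labels, model_labels)
print(f"Cohen's kappa = {kappa:.2f}")  # values below ~.4 are commonly read as poor agreement
```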
Moreover, limitations in the measurement of concepts further challenge the reliability of GenAI assessments in the current framework. Wallach et al. [14] argue that measurement tasks in GenAI evaluation should align with established methodologies in the social sciences, which emphasise structured frameworks for defining and operationalising complex concepts before measurement. Current GenAI evaluations often bypass these critical steps, leading to unreliable assessments of key concepts such as fairness, bias, and reasoning ability. They point, for instance, to widely used benchmarks for measuring bias, such as StereoSet [7] and CrowS-Pairs [8], which rely on crowdworker judgements without systematically defining or contextualising stereotyping. This results in inconsistencies and poor reliability, as different annotators may interpret the same concept in divergent ways. We faced similar challenges in the process of annotating a belief inference benchmark dataset.
Limitations in Evaluating Risks
Manipulation is often subtle and operates over time, with framing and subtle nudging having significant effects in both directions: malicious agents manipulating LLMs, and LLMs manipulating the opinions and preferences of users [2]. These risks are easily missed in highly controlled prompt-based evaluations, where inputs are crafted to fit a particular purpose. In contrast, naturally occurring data (e.g., online discussions, user feedback, real-world interactions with GenAI) reveal how LLMs might inadvertently shape opinions, beliefs, or attitudes through their responses, often without explicit intent.
With naturally occurring data, we gain critical insights into how LLMs influence the trajectory of conversations, shape user beliefs over time, and engage in complex cognitive tasks such as problem-solving, creativity, and critical thinking [4]. Unlike traditional prompt-based evaluations, which rely on static, isolated queries, real-world interactions provide a richer and more ecologically valid means of assessing GenAI behaviour. This is particularly important when considering subtle forms of persuasion, reinforcement of biases, or misalignment in reasoning, which may not be evident in controlled testing conditions but emerge naturally in multi-turn exchanges [4]. Cognitive engagement with LLMs is inherently contextual: users seek information in response to their own uncertainties, biases, and problem-solving needs. Prompt-based evaluations fail to account for this iterative reasoning process, where model outputs may reinforce, challenge, or subtly shift user perspectives across multiple exchanges.
Rethinking Benchmarking
Real-world data, such as user conversations, social media interactions, and online discourse, reflect the complexity and unpredictability of human behaviour, including biases, emotional influences, and evolving language use. These data reveal the challenges LLMs face when interacting with diverse users across various contexts. Most existing evaluation methods rely on synthetic or human-curated prompts, which fail to capture the dynamic, iterative nature of real-world conversations. Building open-source datasets of real-world interactions would enable more transparent, scalable manipulation risk assessments, allowing researchers to study how LLMs influence user cognition in unstructured, ecologically valid settings.
A core shortcoming in the current safety landscape is the scarcity of open, naturally occurring datasets for evaluating cognitive abilities and manipulation risks. By systematically curating open-source datasets comprising genuine user interactions, researchers could better understand how LLMs dynamically respond to diverse inputs, evaluate long-term implications of repeated engagement – including shifts in user perspectives over time – and develop benchmarks that are not artificially constrained but instead mirror the noisy, real-world information landscape. Ultimately, these enriched benchmarks would serve to validate both the efficacy and safety of GenAI in the wild.
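As one possible starting point, a record schema for such a dataset might look like the minimal sketch below; the field names are illustrative assumptions, not an existing standard.

```python
from dataclasses import dataclass, field

@dataclass
class InteractionRecord:
    """One turn of a naturally occurring user-model interaction (schema sketch)."""
    conversation_id: str
    turn_index: int
    speaker: str                      # "user" or "model"
    text: str
    timestamp: str                    # ISO 8601, for studying effects over time
    annotations: dict = field(default_factory=dict)  # e.g. inferred beliefs, stance shifts

record = InteractionRecord("conv-001", 0, "user",
                           "I'm not sure vaccines are worth the risk.",
                           "2024-06-01T12:00:00Z",
                           {"stance": "hesitant"})
print(record)
```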
To better understand these dimensions, I propose a new framework for evaluating GenAI – one that draws on naturally occurring data and incorporates multiple factors that go beyond simplistic performance metrics. Rather than focusing solely on whether a model can correctly complete a specific task, this new framework would focus on how well a model can engage in meaningful, consistent, and ethical conversations, while also evaluating its potential for cognitive influence and belief shaping.
Key to this new approach is the recognition that many tasks on which LLMs are currently evaluated are qualitative in nature, relying on human judgment and interpretation. As such, the assessment of LLM outputs must move beyond simple quantitative measures and embrace techniques drawn from the rich tradition of qualitative research [14]. For decades, qualitative research methods in the social sciences have provided systematic approaches for evaluating complex human behaviours and interactions [3]. These methods, ranging from thematic analysis to discourse analysis, offer robust frameworks for understanding context, nuance, and the subtleties of communication.
To integrate these principles into the evaluation of LLM outputs, we propose that researchers develop a systematic evaluation framework that leverages established qualitative methodologies. This framework would enlist trained qualitative analysts to assess conversation transcripts and other naturally occurring data across several dimensions, such as consistency, contextual adaptability, ethical considerations, and cognitive influence. Such frameworks have been used to assess and validate the outputs of traditional NLP tools such as topic modelling [10], for example by using thematic analysis and traditional triangulation methods, such as convergence coding matrices, to formally compare outputs to human analysis [9, 13].
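A minimal sketch of such a triangulation step is given below: for each transcript, themes identified by a human analyst and by an LLM-assisted pipeline are cross-tabulated into agreement and dissonance categories, as in a convergence coding matrix. The transcripts and theme labels are toy examples, not drawn from the cited studies.

```python
import pandas as pd

# Toy data: themes coded per transcript by a human analyst and by an LLM pipeline.
human = {"t1": {"cost", "trust"}, "t2": {"privacy"},         "t3": {"cost"}}
model = {"t1": {"cost"},          "t2": {"privacy", "tone"}, "t3": {"trust"}}

rows = []
for doc in human:
    for theme in human[doc] | model[doc]:
        in_human, in_model = theme in human[doc], theme in model[doc]
        rows.append({
            "doc": doc,
            "theme": theme,
            "coding": ("agreement" if in_human and in_model
                       else "human_only" if in_human
                       else "model_only"),
        })

df = pd.DataFrame(rows)
matrix = pd.crosstab(df["theme"], df["coding"])  # rows: themes; columns: convergence categories
print(matrix)
```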
Adopting this approach would not only allow for a more nuanced evaluation of model performance but would also ensure that important dimensions are systematically and reliably measured. This multidimensional framework would serve to balance the current heavy emphasis on quantitative metrics and provide a more ecologically valid assessment of how LLMs operate in real-world settings [14, 15]. The integration of qualitative evaluation techniques represents a critical step forward in understanding and improving LLM behavior in the complex and dynamic environments in which they are deployed.
Concluding Thoughts
To truly assess the cognitive potential and risks of Generative AI, a paradigm shift is necessary. Moving beyond current prompt-based evaluations to integrate naturally occurring, ecologically valid data will provide a more accurate representation of how LLMs operate in diverse, real-world contexts. This approach will enable researchers to examine not only immediate model responses but also the broader, long-term effects of repeated interactions on user cognition, perspectives, and decision-making.
Real-world interactions unveil the dynamic subtleties of human cognition, allowing us to measure not only how well models perform in controlled scenarios but also how they shape, and at times manipulate, human thought over long-term engagements. Incorporating qualitative evaluation methods and expanding our measurement toolbox will further enhance our ability to assess the ethical, contextual, and cognitive dimensions of AI behaviour, ensuring that safety assessments are robust and comprehensive. As GenAI becomes integral to everyday decision-making and public discourse, refining our evaluation tools will be essential for understanding capabilities, mitigating risks, and ultimately ensuring that AI systems are designed to responsibly and effectively navigate the complexities of real-world interactions. Establishing open-source datasets and leveraging interdisciplinary evaluation frameworks will be critical steps toward building safer, more transparent AI systems that align with human values and societal needs.
References
[1] Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, and Xing Xie. 2024. A Survey on Evaluation of Large Language Models. ACM Trans. Intell. Syst. Technol. 15, 3, Article 39 (March 2024), 45 pages. https://doi.org/10.1145/3641289
[2] Kai Chen, Zihao He, Jun Yan, Taiwei Shi, and Kristina Lerman. 2024. How Susceptible are Large Language Models to Ideological Manipulation? ArXiv abs/2402.11725 (2024). https://api.semanticscholar.org/CorpusID:267750439
[3] Norman Denzin and Yvonna Lincoln. 2005. The Sage Handbook of Qualitative Research. Sage, Thousand Oaks.
[4] Lujain Ibrahim, Saffron Huang, Lama Ahmad, and Markus Anderljung. 2024. Beyond static AI evaluations: advancing human interaction evaluations for LLM harms and risks. ArXiv abs/2405.10632 (2024). https://api.semanticscholar.org/CorpusID:269899912
[5] Sanguk Lee, Tai-Quan Peng, Matthew H. Goldberg, Seth A. Rosenthal, John E. Kotcher, Edward W. Maibach, and Anthony Leiserowitz. 2024. Can large language models estimate public opinion about global warming? An empirical assessment of algorithmic fidelity and bias. PLOS Climate 3, 8 (08 2024), 1–14. https://doi.org/10.1371/journal.pclm.0000429
[6] Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, and Yejin Choi. 2024. WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild. arXiv preprint arXiv:2406.04770 (2024).
[7] Moin Nadeem, Anna Bethke, and Siva Reddy. 2020. StereoSet: Measuring stereotypical bias in pretrained language models. arXiv:2004.09456 [cs.CL] https://arxiv.org/abs/2004.09456
[8] Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R. Bowman. 2020. CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models. arXiv:2010.00133 [cs.CL] https://arxiv.org/abs/2010.00133
[9] Lois Player, Ryan Hughes, Kaloyan Mitev, Lorraine Whitmarsh, Christina Demski, Nicholas Nash, Trisevgeni Papakonstantinou, and Mark Wilson. 2024. The Use of Large Language Models for Qualitative Research: DECOTA. https://doi.org/10.31234/osf.io/t5gbv
[10] Margaret E. Roberts, Brandon M. Stewart, Dustin Tingley, Christopher Lucas, Jetson Leder-Luis, Shana Kushner Gadarian, Bethany Albertson, and David G. Rand. 2014. Structural Topic Models for Open-Ended Survey Responses. American Journal of Political Science 58, 4 (2014), 1064–1082. https://doi.org/10.1111/ajps.12103
[11] Paul Röttger, Valentin Hofmann, Valentina Pyatkin, Musashi Hinck, Hannah Kirk, Hinrich Schuetze, and Dirk Hovy. 2024. Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thailand, 15295–15311. https://doi.org/10.18653/v1/2024.acl-long.816
[12] Yifan Song, Guoyin Wang, Sujian Li, and Bill Yuchen Lin. 2024. The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism. arXiv:2407.10457 [cs.CL] https://arxiv.org/abs/2407.10457
[13] Lauren Towler, Paulina Bondaronek, Trisevgeni Papakonstantinou, Richard Amlôt, Tim Chadborn, Ben Ainsworth, and Lucy Yardley. 2023. Applying machine learning to rapidly analyze large qualitative text datasets to inform the COVID-19 pandemic response: comparing human and machine-assisted topic analysis techniques. Frontiers in Public Health 11 (2023). https://doi.org/10.3389/fpubh.2023.1268223
[14] Hanna Wallach, Meera Desai, Nicholas Pangakis, A. Feder Cooper, Angelina Wang, Solon Barocas, Alexandra Chouldechova, Chad Atalla, Su Lin Blodgett, Emily Corvi, P. Alex Dow, Jean Garcia-Gathright, Alexandra Olteanu, Stefanie Reed, Emily Sheng, Dan Vann, Jennifer Wortman Vaughan, Matthew Vogel, Hannah Washington, and Abigail Z. Jacobs. 2024. Evaluating Generative AI Systems is a Social Science Measurement Challenge. arXiv:2411.10939 [cs.CY] https://arxiv.org/abs/2411.10939
[15] Wenting Zhao, Tanya Goyal, Yu Ying Chiu, Liwei Jiang, Benjamin Newman, Abhilasha Ravichander, Khyathi Raghavi Chandu, Ronan Le Bras, Claire Cardie, Yuntian Deng, and Yejin Choi. 2024. WildHallucinations: Evaluating Long-form Factuality in LLMs with Real-World Entity Queries. arXiv preprint arXiv:2407.17468 (2024).