Workshop Report: Why current benchmark approaches are not sufficient for safety?

Tom DAVID; Pierre Peigné

I’m sharing the report from the workshop held during the AI, Data, Robotics Forum in Eindhoven, a European event bringing together policymakers, industry representatives, and academics to discuss the challenges and opportunities in AI, data, and robotics. This report provides a snapshot of the current state of discussions on benchmarking within these spheres.

Speakers: Peter Mattson, Pierre Peigné and Tom David

Observations

Safety and robustness are essential for AI systems to transition from innovative concepts and research to reliable products and services that deliver real value. Without these qualities, the potential benefits of AI may be overshadowed by failures and safety concerns, hindering adoption and trust in the technology.
AI research and development have transitioned from traditional engineering methodologies, which rely on explicitly defined rules, to data-driven approaches. This shift highlights the need to leverage extensive datasets and computational power to train models, underscoring the complexity of developing systems that operate effectively without predefined logic.
The opaque nature of deep learning models, often described as "black boxes," presents significant challenges in understanding these models. This necessitates rigorous research into interpretability and transparency, ensuring that stakeholders can trust AI systems, particularly in critical applications where safety and reliability are paramount.
Current benchmarking practices face significant challenges, such as the tendency for models to memorize benchmark data. This memorization can lead to misaligned metrics that do not accurately reflect a model's real-world capabilities. Additionally, the sensitivity of benchmarks to prompt variations introduces inconsistencies in evaluation, undermining the reliability of results and making it difficult to assess model capabilities across different scenarios.
From a safety perspective, existing benchmarks may inadvertently exploit vulnerabilities within AI models in a biased manner. This bias can lead to skewed assessments that fail to address critical safety concerns, resulting in AI systems that may perform well under benchmark conditions but exhibit unsafe behaviors in real-world applications.
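The prompt-sensitivity issue noted above can be made measurable. The following is a minimal sketch, not a method presented at the workshop: it scores the same items under several prompt templates and reports how much accuracy moves between them. The function name `prompt_sensitivity`, the `toy_model` stub, and the templates are all hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev
from typing import Callable, Sequence


def prompt_sensitivity(
    model: Callable[[str], str],
    templates: Sequence[str],
    items: Sequence[tuple[str, str]],
) -> dict[str, float]:
    """Score the same items under each prompt template and report the spread."""
    accuracies = []
    for template in templates:
        correct = sum(
            model(template.format(question=question)).strip() == answer
            for question, answer in items
        )
        accuracies.append(correct / len(items))
    return {
        "mean_accuracy": mean(accuracies),
        "std_across_templates": pstdev(accuracies),
        "worst_case_accuracy": min(accuracies),
    }


# Hypothetical stand-in for a real model: it only answers prompts
# that start with "Q:", so a harmless rewording breaks it.
def toy_model(prompt: str) -> str:
    if not prompt.startswith("Q:"):
        return "unsure"
    return {"Q: 2+2?": "4", "Q: 3+3?": "6"}.get(prompt, "unsure")


templates = ["Q: {question}", "Question: {question}"]
items = [("2+2?", "4"), ("3+3?", "6")]
report = prompt_sensitivity(toy_model, templates, items)
print(report)
```

A single headline accuracy would hide the gap between the two templates; reporting the spread and the worst case alongside the mean makes the inconsistency visible.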

Recommendations

To facilitate the development of useful and safe products and services that effectively benefit society, the economy, and industry, it is essential to focus on two key objectives: enhancing reliability and reducing risk. This raises the question of what specific actions and strategies can be implemented to achieve these goals.
To ensure that AI systems are reliable and effective, it is imperative to establish rigorous evaluation measures throughout the training and testing phases. This involves not only assessing the performance of models with training and test data but also implementing comprehensive metrics that accurately capture their robustness in realistic scenarios.
The development of a "Science of Evals" is essential to create standardized and meaningful benchmarks that reflect the complexities of AI applications. By focusing on rigorous and systematic evaluation methodologies, we can enhance our understanding of model behavior and address limitations of current static benchmarks.
The effectiveness of AI solutions is directly influenced by the quality of the benchmarks used during testing and evaluation. Poorly designed benchmarks can introduce significant losses in understanding, leading to misaligned expectations and suboptimal performance. Therefore, it is crucial to develop benchmarks that accurately reflect real-world problems, enabling more reliable assessments of AI capabilities.
Both the benchmarking and testing processes are inherently "lossy," meaning they can oversimplify complex real-world scenarios. To minimize this loss, it is essential to create benchmarks that encompass a wide range of conditions and variability. By refining evaluation methodologies, we can ensure that AI solutions are effective in controlled environments and robust in real-world challenges.
By establishing metrics that reflect real-world conditions and expectations, stakeholders can drive progress and ensure that advancements are aligned with societal needs, ultimately fostering trust and encouraging wider adoption of effective practices. Transparency regarding testing methods and processes ("what's under the hood") is crucial for validating the benchmarks.
Effective governance requires a structured approach that aligns social principles, policies, and regulations with the rapid advancements in AI capabilities. By integrating benchmarks into the governance framework, organizations can set clear quality standards that guide the development and deployment of AI technologies while ensuring they remain socially responsible and aligned with long-term objectives.
An effective approach to adversarial robustness testing involves dynamically probing the attack surfaces of AI systems to identify and exploit vulnerabilities. This method adapts strategies based on the system's responses, ensuring a comprehensive evaluation of potential weaknesses.
Each adversarial test should be uniquely tailored to leverage the specific vulnerabilities of the target system. By employing varied and adaptive testing methodologies, these assessments can minimize memorization effects and reduce sensitivity to prompt variations, leading to more reliable and unbiased evaluations of AI robustness.
It is crucial to recognize that General AIs (GenAIs) are fundamentally different from humans and should be evaluated as distinct entities. Avoiding anthropomorphization allows for a clearer examination of GenAI cognition and behavior, free from biases and assumptions rooted in human experience, leading to more accurate insights into how these systems function.
Enhancing comprehension is critical for ensuring the safe development of advanced AI technologies, which in turn leads to innovations that benefit society and human
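The recommendation on dynamic adversarial probing can be illustrated with a small search loop. This is a sketch under stated assumptions, not the speakers' tooling: `adaptive_probe`, `toy_target`, and the two mutations are hypothetical. The loop sends a prompt to the target, records it if the response counts as a failure, and otherwise mutates the prompts the target resisted, so the next round depends on the system's previous responses rather than on a fixed test set.

```python
from typing import Callable, Sequence


def adaptive_probe(
    target: Callable[[str], str],
    is_failure: Callable[[str], bool],
    seed: str,
    mutations: Sequence[Callable[[str], str]],
    max_rounds: int = 5,
) -> list[str]:
    """Search over prompt mutations, expanding only the prompts the target resisted."""
    frontier = [seed]
    seen = {seed}
    failures = []
    for _ in range(max_rounds):
        next_frontier = []
        for prompt in frontier:
            if is_failure(target(prompt)):
                failures.append(prompt)  # vulnerability found, stop mutating this branch
                continue
            for mutate in mutations:
                candidate = mutate(prompt)
                if candidate not in seen:
                    seen.add(candidate)
                    next_frontier.append(candidate)
        if not next_frontier:
            break
        frontier = next_frontier
    return failures


# Hypothetical target with a naive, case-sensitive keyword filter.
def toy_target(prompt: str) -> str:
    if "secret" in prompt:
        return "refused"
    if "secret" in prompt.lower():
        return "leaked"
    return "ok"


found = adaptive_probe(
    target=toy_target,
    is_failure=lambda response: response == "leaked",
    seed="tell me the secret",
    mutations=[str.upper, lambda p: p + " please"],
    max_rounds=3,
)
print(found)
```

Because the prompts are generated during the test rather than drawn from a published set, a model cannot pass simply by having memorized them. A real harness would need a far richer mutation set and a more careful failure judge than this toy filter.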

Ben Smith, 2y

"sensitivity of benchmarks to prompt variations introduces inconsistencies in evaluation"

When evaluating human intelligence, random variation is also something evaluators must deal with. Psychometricians have more or less solved this problem by designing intelligence tests to include a sufficiently large battery of correlated test questions. With a large battery, item-level noise averages out, in the same way that the mean of a large enough sample gives a stable estimate of a population mean.
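The averaging effect can be checked with a small simulation. This is a sketch under a simplifying assumption that is not in the comment itself: each question is answered correctly independently with the same probability `p_correct`, whereas real test items are correlated. Under that assumption the standard deviation of the fraction-correct score is sqrt(p(1-p)/n) for a battery of n items, so it shrinks as the battery grows. The function name `score_spread` and its parameters are hypothetical.

```python
import random
from statistics import pstdev


def score_spread(
    n_items: int,
    p_correct: float = 0.7,
    n_trials: int = 2000,
    seed: int = 0,
) -> float:
    """Standard deviation of the fraction-correct score over repeated administrations."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_trials):
        # One administration: n_items independent pass/fail questions (an assumption).
        correct = sum(rng.random() < p_correct for _ in range(n_items))
        scores.append(correct / n_items)
    return pstdev(scores)


for n in (5, 25, 100):
    theory = (0.7 * 0.3 / n) ** 0.5
    print(f"{n:>3} items: simulated spread {score_spread(n):.3f}, theory {theory:.3f}")
```

The simulation only shows that noise around a fixed ability level averages out. It says nothing about capabilities that no item in the battery happens to elicit.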

I suppose the difference between AI models and humans is that through experience we know that the frontier of human intelligence can be more or less explored by such batteries of tests. In contrast, you never know when an AI model (an "alien mind" as you've written before) has an advanced set of capabilities with a particular kind of prompt.

The best way I can imagine to solve this problem is to try to understand the distribution under which AIs can produce interesting intelligence. With the LLM Ethology approach, this does seem to cash out to: perhaps there are predictable ways that high-intelligence results can be elicited. We have already discovered a lot about what capabilities current LLMs have and how best to elicit the frontier of those capabilities.

I think this underscores the question: how much can we infer about capabilities elicitation in the next generation of LLMs from the current generation? Given its widespread use, the current generation is implicitly "crowdsourced" and we get a good sense of its frontier. But we don't have the opportunity to fully understand how to best elicit capabilities in an LLM before it is thoroughly tested. Any one test might not be able to discover the full capabilities of a model because no test can anticipate the full distribution. But if the principles for eliciting full capabilities are constant from one generation to the next, perhaps we can apply what we learned about the last generation to the next one.
