LESSWRONG
LW

1230
Wikitags

AI Evaluations

Edited by Raemon, duck_master last updated 1st Aug 2023

AI Evaluations focus on experimentally assessing the capabilities, safety, and alignment of advanced AI systems. These evaluations can be divided into two main categories: behavioral and understanding-based.

(note: initially written by GPT4, may contain errors despite a human review. Please correct them if you see them)

Behavioral evaluations assess a model's abilities in various tasks, such as autonomously replicating, acquiring resources, and avoiding being shut down. However, a concern with these evaluations is that they may not be sufficient to detect deceptive alignment, making it difficult to ensure that models are non-deceptive.

Understanding-based evaluations, on the other hand, evaluate a developer's ability to understand the model they have created and why they have obtained the model. This approach can be more useful in terms of safety, as it focuses on understanding the model's behavior instead of just checking the behavior itself. Coupling understanding-based evaluations with behavioral evaluations can lead to a more comprehensive assessment of AI safety and alignment.

Current challenges in AI evaluations include:

  • developing a method-agnostic standard to demonstrate sufficient understanding of a model
  • ensuring that the level of understanding is adequate to catch dangerous failure modes
  • finding the right balance between behavioral and understanding-based evaluations.

(this text was initially written by GPT4, taking in as input A very crude deception eval is already passed, ARC tests to see if GPT-4 can escape human control; GPT-4 failed to do so, and Towards understanding-based safety evaluations)

See also:

  • Interpretability (ML & AI)
  • AI Risk
Subscribe
Discussion
1
Subscribe
Discussion
1
Posts tagged AI Evaluations
45How evals might (or might not) prevent catastrophic risks from AI
Orpheus16
3y
0
166When can we trust model evaluations?
Ω
evhub
2y
Ω
10
117The case for more ambitious language model evals
Ω
Jozdien
2y
Ω
30
22[Paper] Safety by Measurement: A Systematic Literature Review of AI Safety Evaluation Methods
markov, Charbel-Raphaël
5mo
0
217Announcing Apollo Research
Ω
Marius Hobbhahn, beren, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni, Jérémy Scheurer
2y
Ω
11
211Thoughts on sharing information about language model capabilities
Ω
paulfchristiano
2y
Ω
44
164Towards understanding-based safety evaluations
Ω
evhub
3y
Ω
16
65Gaming TruthfulQA: Simple Heuristics Exposed Dataset Weaknesses
Ω
TurnTrout
9mo
Ω
3
486How Does A Blind Model See The Earth?
henry
3mo
40
322Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research
Ω
evhub, Nicholas Schiefer, Carson Denison, Ethan Perez
2y
Ω
30
190OMMC Announces RIP
Adam Scholl, aysja
2y
5
94DeepMind: Model evaluation for extreme risks
Zach Stein-Perlman
2y
12
38Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation
Ω
Soroush Pour, rusheb, Quentin FEUILLADE--MONTIXI, Arush, scasper
2y
Ω
2
33How good are LLMs at doing ML on an unknown dataset?
Håvard Tveit Ihle
1y
4
359What’s the short timeline plan?
Ω
Marius Hobbhahn
10mo
Ω
49
Load More (15/213)
Add Posts