AI Oversight

Edited by RobertM last updated 20th Sep 2022

AI Oversight, as described in "Doing oversight from the very start of training seems hard":

Here ‘oversight’ means there is something with access to the internals of the model which checks that the model isn’t misaligned even if the behavior on the training distribution looks fine. In "An overview of 11 proposals for building safe advanced AI", all but two of the proposals basically look like this, as does "AI safety via market making".

Examples of oversight techniques include:

  • Transparency tools (either used by a human, an AI, or a human assisted by an AI)
  • Adversarial inputs (giving inputs which could trick a misaligned AI into revealing itself)
  • Relaxed adversarial training (which could be seen as an extension of adversarial inputs)
Posts tagged AI Oversight
  • Oversight Misses 100% of Thoughts The AI Does Not Think, by johnswentworth (3y; 113 karma, 49 comments; Ω)
  • No, We're Not Getting Meaningful Oversight of AI, by Davidmanheim (2mo; 41 karma, 4 comments)
  • Quick thoughts on "scalable oversight" / "super-human feedback" research, by David Scott Krueger (formerly: capybaralet) (3y; 27 karma, 9 comments; Ω)
  • Measuring and Improving the Faithfulness of Model-Generated Reasoning, by Ansh Radhakrishnan, tamera, karinanguyen, Sam Bowman, Ethan Perez (2y; 111 karma, 15 comments; Ω)
  • Building Black-box Scheming Monitors, by james__p, richbc, Simon Storf, Marius Hobbhahn (1mo; 38 karma, 18 comments)
  • Human-AI Complementarity: A Goal for Amplified Oversight, by rishubjain, Sophie Bridgers (8mo; 27 karma, 4 comments; Ω)
  • Doing oversight from the very start of training seems hard, by peterbarnett (3y; 14 karma, 3 comments; Ω)
  • The weak-to-strong generalization (WTSG) paper in 60 seconds, by sudo (2y; 12 karma, 1 comment)
  • Trying to measure AI deception capabilities using temporary simulation fine-tuning, by alenoach (2y; 4 karma, 0 comments)
  • Activation Magnitudes Matter On Their Own: Insights from Language Model Distributional Analysis, by Matt Levinson (8mo; 4 karma, 0 comments)
  • AI Safety Oversights, by Davey Morse (7mo; 3 karma, 0 comments)
  • Alignment Structure Direction - Recursive Adversarial Oversight(RAO), by Jayden Shepard (4mo; 2 karma, 0 comments)
  • W2SG: Introduction, by Maria Kapros (1y; 2 karma, 2 comments)
  • Is there any existing term summarizing non-scalable oversight methods in outer alignment?, by Allen Shen (2y; 1 karma, 0 comments; Q)