AI Oversight

Edited by RobertM, last updated 20th Sep 2022

AI Oversight, as described in "Doing oversight from the very start of training seems hard":

Here ‘oversight’ means there is something with access to the internals of the model which checks that the model isn’t misaligned even if the behavior on the training distribution looks fine. In "An overview of 11 proposals for building safe advanced AI", all but two of the proposals basically look like this, as does "AI safety via market making".
Examples of oversight techniques include:
Transparency tools (either used by a human, an AI, or a human assisted by an AI)
Adversarial inputs (giving inputs which could trick a misaligned AI into revealing itself)
Relaxed adversarial training (which could be seen as an extension of adversarial inputs)


Posts tagged AI Oversight

Oversight Misses 100% of Thoughts The AI Does Not Think (karma 126; 4y; 49 comments; tag relevance 2)

No, We're Not Getting Meaningful Oversight of AI (karma 48; 11mo; 4 comments; tag relevance 2)

Quick thoughts on "scalable oversight" / "super-human feedback" research, by David Scott Krueger (karma 27; 3y; 9 comments; tag relevance 2)

Measuring and Improving the Faithfulness of Model-Generated Reasoning, by Ansh Radhakrishnan, tamera, karinanguyen, Sam Bowman, Ethan Perez (karma 111; 3y; 15 comments; tag relevance 1)

Loss of Oversight: How AI Systems May Become Harder to Audit, Monitor, and Investigate, by Jordan Taylor, Max H, Ed Fage, Thomas Read, Joseph Bloom (karma 82; 15d; 0 comments; tag relevance 1)

Building Black-box Scheming Monitors, by CorrigibleAgent, richbc, Simon Storf, Marius Hobbhahn (karma 45; 10mo; 18 comments; tag relevance 1)

Human-AI Complementarity: A Goal for Amplified Oversight, by rishubjain, Sophie Bridgers (karma 27; 1y; 4 comments; tag relevance 1)

Constitutional Black-Box Monitoring for Scheming in LLM Agents, by Simon Storf, richbc, Marius Hobbhahn (karma 27; 3mo; 0 comments; tag relevance 1)

Paraphrasing Is (At Best) a Partial Defence Against Steganography in LLMs, by Usman Anwar, robertzk (karma 14; 1mo; 0 comments; tag relevance 1)

Doing oversight from the very start of training seems hard (karma 14; 4y; 3 comments; tag relevance 1)

How a bug of AI hardware may become a feature for AI governance (karma 9; 6mo; 0 comments; tag relevance 1)

Trying to measure AI deception capabilities using temporary simulation fine-tuning (karma 4; 3y; 0 comments; tag relevance 1)

Activation Magnitudes Matter On Their Own: Insights from Language Model Distributional Analysis (karma 4; 1y; 0 comments; tag relevance 1)

AI Safety Oversights (karma 3; 1y; 0 comments; tag relevance 1)

Alignment Structure Direction - Recursive Adversarial Oversight(RAO) (karma 2; 1y; 0 comments; tag relevance 1)

Showing 15 of 17 tagged posts.