LESSWRONG

Scalable Oversight

This page is a stub.
Posts tagged Scalable Oversight
Scaling Laws for Scalable Oversight. Subhash Kantamneni, Josh Engels, David Baek, Max Tegmark. 37 karma, 4mo, 1 comment.
Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight (Ω). Sam Marks. 113 karma, 1y, 10 comments.
Learning the prior (Ω). paulfchristiano. 92 karma, 5y, 28 comments.
Scalable oversight as a quantitative rather than qualitative problem (Ω). Buck. 86 karma, 1y, 11 comments.
Inference-Only Debate Experiments Using Math Problems (Ω). Arjun Panickssery, Abhimanyu Pallavi Sudhir, JacksonKaunismaa. 31 karma, 1y, 0 comments.
AXRP Episode 35 - Peter Hase on LLM Beliefs and Easy-to-Hard Generalization (Ω). DanielFilan. 21 karma, 1y, 0 comments.
Rational Animations' video about scalable oversight and sandwiching. Writer. 18 karma, 2mo, 0 comments.
Optimizing The Final Output Can Obfuscate CoT (Research Note) (Ω). lukemarks, jacob_drori, cloud, TurnTrout. 195 karma, 1mo, 22 comments.
Gradient Routing: Masking Gradients to Localize Computation in Neural Networks (Ω). cloud, Jacob G-W, Evzen, Joseph Miller, TurnTrout. 169 karma, 9mo, 12 comments.
Prover-Estimator Debate: A New Scalable Oversight Protocol (Ω). Jonah Brown-Cohen, Geoffrey Irving. 88 karma, 3mo, 18 comments.
On scalable oversight with weak LLMs judging strong LLMs (Ω). zac_kenton, Noah Siegel, janos, Jonah Brown-Cohen, Samuel Albanie, David Lindner, Rohin Shah. 49 karma, 1y, 18 comments.
NYU Code Debates Update/Postmortem. David Rein. 27 karma, 1y, 4 comments.
Human-AI Complementarity: A Goal for Amplified Oversight (Ω). rishubjain, Sophie Bridgers. 27 karma, 8mo, 4 comments.
Is weak-to-strong generalization an alignment technique? (QΩ). cloud. 22 karma, 7mo, 1 comment.
Evaluating Oversight Robustness with Incentivized Reward Hacking. Yoav, Juan V, julianjm, McKennaFitzgerald. 7 karma, 4mo, 2 comments.