LESSWRONG

Adversarial Training

This page is a stub.
Posts tagged Adversarial Training
4
101 Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming
Ω
Buck
1y
Ω
4
4
42 Adversarial training, importance sampling, and anti-adversarial training for AI whistleblowing
Ω
Buck
3y
Ω
0
2
159 Ironing Out the Squiggles
Zack_M_Davis
2y
36
2
143 Takeaways from our robust injury classifier project [Redwood Research]
Ω
dmz
3y
Ω
12
2
142 High-stakes alignment via adversarial training [Redwood Research report]
Ω
dmz, LawrenceC, Nate Thomas
4y
Ω
29
2
127 Deep Forgetting & Unlearning for Safely-Scoped LLMs
Ω
scasper
2y
Ω
30
2
98 Solving adversarial attacks in computer vision as a baby version of general AI alignment
Ω
Stanislav Fort
1y
Ω
9
2
30 Adversarial Robustness Could Help Prevent Catastrophic Misuse
Ω
aog
2y
Ω
18
2
25 Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals?
Ω
scasper
1y
Ω
0
2
17 AI Safety 101 - Chapter 5.2 - Unrestricted Adversarial Training
Charbel-Raphaël
2y
0
2
16 AXRP Episode 17 - Training for Very High Reliability with Daniel Ziegler
Ω
DanielFilan
3y
Ω
0
2
9 Some thoughts on why adversarial training might be useful
Ω
Beth Barnes
4y
Ω
6
1
57 Latent Adversarial Training
Ω
Adam Jermyn
3y
Ω
13
1
41 Beyond the Board: Exploring AI Robustness Through Go
Ω
AdamGleave
1y
Ω
2
1
30 EIS IX: Interpretability and Adversaries
Ω
scasper
3y
Ω
8