LESSWRONG

Adversarial Training

This page is a stub.
Posts tagged Adversarial Training
Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming (Buck, 9mo; 101 karma, 4 comments)
Adversarial training, importance sampling, and anti-adversarial training for AI whistleblowing (Buck, 3y; 42 karma, 0 comments)
Ironing Out the Squiggles (Zack_M_Davis, 1y; 157 karma, 36 comments)
Takeaways from our robust injury classifier project [Redwood Research] (dmz, 3y; 143 karma, 12 comments)
High-stakes alignment via adversarial training [Redwood Research report] (dmz, LawrenceC, Nate Thomas, 3y; 142 karma, 29 comments)
Deep Forgetting & Unlearning for Safely-Scoped LLMs (scasper, 2y; 126 karma, 30 comments)
Solving adversarial attacks in computer vision as a baby version of general AI alignment (Stanislav Fort, 11mo; 89 karma, 8 comments)
Adversarial Robustness Could Help Prevent Catastrophic Misuse (aog, 2y; 30 karma, 18 comments)
Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals? (scasper, 1y; 25 karma, 0 comments)
AI Safety 101 - Chapter 5.2 - Unrestricted Adversarial Training (Charbel-Raphaël, 2y; 17 karma, 0 comments)
AXRP Episode 17 - Training for Very High Reliability with Daniel Ziegler (DanielFilan, 3y; 16 karma, 0 comments)
Some thoughts on why adversarial training might be useful (Beth Barnes, 4y; 9 karma, 6 comments)
Latent Adversarial Training (Adam Jermyn, 3y; 52 karma, 13 comments)
Beyond the Board: Exploring AI Robustness Through Go (AdamGleave, 1y; 41 karma, 2 comments)
EIS IX: Interpretability and Adversaries (scasper, 2y; 30 karma, 8 comments)