Adversarial Training
• Applied to Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming by Buck 1d ago
• Applied to Solving adversarial attacks in computer vision as a baby version of general AI alignment by Stanislav Fort 1mo ago
• Applied to Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals? by RobertM 2mo ago
• Applied to Does robustness improve with scale? by ChengCheng 3mo ago
• Applied to Beyond the Board: Exploring AI Robustness Through Go by AdamGleave 4mo ago
• Applied to Ironing Out the Squiggles by Zack_M_Davis 5mo ago
• Applied to Some thoughts on why adversarial training might be useful by Zach Stein-Perlman 9mo ago
• Applied to Adversarial Robustness Could Help Prevent Catastrophic Misuse by aogara 10mo ago
• Applied to Deep Forgetting & Unlearning for Safely-Scoped LLMs by scasper 10mo ago
• Applied to AI Safety 101 - Chapter 5.2 - Unrestricted Adversarial Training by jacobjacob 1y ago
• Applied to AI Safety 101 - Chapter 5.1 - Debate by Charbel-Raphaël 1y ago
• Applied to Against Almost Every Theory of Impact of Interpretability by Charbel-Raphaël 1y ago
• Applied to Continuous Adversarial Quality Assurance: Extending RLHF and Constitutional AI by Benaya Koren 1y ago
• Applied to EIS IX: Interpretability and Adversaries by scasper 2y ago
• Applied to EIS XII: Summary by scasper 2y ago
• Applied to EIS XI: Moving Forward by scasper 2y ago
• Applied to Takeaways from our robust injury classifier project [Redwood Research] by Ruby 2y ago
• Applied to Oversight Leagues: The Training Game as a Feature by janus 2y ago
• Applied to AXRP Episode 17 - Training for Very High Reliability with Daniel Ziegler by DanielFilan 2y ago