LESSWRONG

Adversarial Examples (AI)

Edited by Multicore, Ruby; last updated 14th Dec 2024
This page is a stub.
Posts tagged Adversarial Examples (AI)
- SolidGoldMagikarp (plus, prompt generation) | Jessica Rumbelow, mwatkins | 687 karma | 3y ago | 208 comments
- Ironing Out the Squiggles | Zack_M_Davis | 159 karma | 2y ago | 36 comments
- AI Safety in a World of Vulnerable Machine Learning Systems | AdamGleave, EuanMcLean | 70 karma | 3y ago | 29 comments
- Deep Forgetting & Unlearning for Safely-Scoped LLMs | scasper | 127 karma | 2y ago | 30 comments
- Solving adversarial attacks in computer vision as a baby version of general AI alignment | Stanislav Fort | 98 karma | 1y ago | 9 comments
- Human beats SOTA Go AI by learning an adversarial policy | Vanessa Kosoy | 59 karma | 3y ago | 29 comments
- What progress have we made on automated auditing? | LawrenceC | 38 karma | 1y ago | 1 comment
- If I were a well-intentioned AI... I: Image classifier | Stuart_Armstrong | 35 karma | 6y ago | 4 comments
- Adversarial Policies Beat Professional-Level Go AIs | sanxiyn | 31 karma | 3y ago | 35 comments
- Adversarial Robustness Could Help Prevent Catastrophic Misuse | aog | 30 karma | 2y ago | 18 comments
- The Goodhart Game | John_Maxwell | 13 karma | 6y ago | 5 comments
- AXRP Episode 1 - Adversarial Policies with Adam Gleave | DanielFilan | 12 karma | 5y ago | 5 comments
- RAIN: Your Language Models Can Align Themselves without Finetuning - Microsoft Research 2023 - Reduces the adversarial prompt attack success rate from 94% to 19%! | Singularian2501 | 5 karma | 2y ago | 0 comments
- High-stakes alignment via adversarial training [Redwood Research report] | dmz, LawrenceC, Nate Thomas | 142 karma | 4y ago | 29 comments
- Even Superhuman Go AIs Have Surprising Failure Modes | AdamGleave, EuanMcLean, Tony Wang, Kellin Pelrine, Tom Tseng, Yawen Duan, Joseph Miller, MichaelDennis | 130 karma | 2y ago | 22 comments
(15 of 32 tagged posts shown)