
Adversarial Examples (AI) — LessWrong

Edited by Multicore and Ruby. Last updated 14th Dec 2024.

This page is a stub.

Posts tagged Adversarial Examples (AI)

8

673 SolidGoldMagikarp (plus, prompt generation)

Jessica Rumbelow, mwatkins

3y

208

3

162 Ironing Out the Squiggles

2y

37

3

70 AI Safety in a World of Vulnerable Machine Learning Systems

AdamGleave, EuanMcLean

3y

29

2

127 Deep Forgetting & Unlearning for Safely-Scoped LLMs

2y

30

2

98 Solving adversarial attacks in computer vision as a baby version of general AI alignment

1y

9

2

59 Human beats SOTA Go AI by learning an adversarial policy

3y

29

2

38 What progress have we made on automated auditing?

1y

1

2

35 If I were a well-intentioned AI... I: Image classifier

Stuart_Armstrong

6y

4

2

31 Adversarial Policies Beat Professional-Level Go AIs

3y

35

2

30 Adversarial Robustness Could Help Prevent Catastrophic Misuse

2y

18

2

13 The Goodhart Game

6y

5

2

12 AXRP Episode 1 - Adversarial Policies with Adam Gleave

5y

5

2

5 RAIN: Your Language Models Can Align Themselves without Finetuning - Microsoft Research 2023 - Reduces the adversarial prompt attack success rate from 94% to 19%!

Singularian2501

2y

0

1

142 High-stakes alignment via adversarial training [Redwood Research report]

dmz, LawrenceC, Nate Thomas

4y

29

1

130 Even Superhuman Go AIs Have Surprising Failure Modes

AdamGleave, EuanMcLean, Tony Wang, Kellin Pelrine, Tom Tseng, Yawen Duan, Joseph Miller, MichaelDennis

2y

22
