LESSWRONG
Winnie Yang

Posts


Wikitag Contributions

Comments

Coup probes: Catching catastrophes with probes trained off-policy
Winnie Yang · 10mo

to do a do a

There seems to be a typo here :)

Deception and Jailbreak Sequence: 2. Iterative Refinement Stages of Jailbreaks in LLM
Winnie Yang · 1y

Thank you so much for your interest and suggestion! Sorry, this is a really rough draft... I haven't had time to polish it yet. This is a good point! I might try to make use of Claude's help tonight!

Normalizing Sparse Autoencoders
Winnie Yang · 1y

Hi Hengyu! Really nice work here! I am wondering if you have released the pre-trained SAEs for Llama-2?

No wikitag contributions to display.
7 · Deception and Jailbreak Sequence: 2. Iterative Refinement Stages of Jailbreaks in LLM · 1y · 2
23 · Deception and Jailbreak Sequence: 1. Iterative Refinement Stages of Deception in LLMs · 1y · 1