LESSWRONG

1321
Egor Zverev
65010

Posts


Wikitag Contributions

Comments

Robustness of Contrast-Consistent Search to Adversarial Prompting
Egor Zverev · 2y · 20

Thanks for the post! I think an interesting direction for future work would be to replace the manual engineering of suffixes with a gradient-based or greedy search, as in https://arxiv.org/abs/2307.15043.

No wikitag contributions to display.
73 · A comparison of causal scrubbing, causal abstractions, and related methods · Ω · 2y · Ω 3