LESSWRONG

506
Ana Kapros
14100

Posts

Feature-Based Analysis of Safety-Relevant Multi-Agent Behavior (karma 10, 7mo ago, 0 comments)
Comparing the effectiveness of top-down and bottom-up activation steering for bypassing refusal on harmful prompts (karma 7, 9mo ago, 0 comments)

Wikitag Contributions

No wikitag contributions to display.

Comments

No comments found.