Current safety training techniques do not fully transfer to the agent setting
TL;DR: We present three recent papers that share a similar finding: safety training techniques for chat models do not transfer well to the agents built from them. In other words, models won't tell you how to do something harmful, but they are often willing to directly execute harmful actions. However, all three papers find that attack methods such as jailbreaks, prompt engineering, and refusal-vector ablation do transfer. Here are the three papers:

1. AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
2. Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents
3. Applying Refusal-Vector Ablation to Llama 3.1 70B Agents

What are language model agents?

Language model agents combine a language model with scaffolding software. Regular language models are typically limited to being chatbots: they receive messages and reply to them. Scaffolding gives these models access to tools which they can execute directly, and essentially puts them in a loop so they can carry out entire tasks autonomously (a minimal code sketch of such a loop appears at the end of this section). To use tools correctly, the models are often fine-tuned and carefully prompted. As a result, these agents can perform a broader range of complex, goal-oriented tasks autonomously, going beyond the traditional role of chatbots.

Overview

Results across the three papers are not directly comparable. One reason is that we have to distinguish between refusal, unsuccessful compliance, and successful compliance, whereas previous chat safety benchmarks usually only distinguish between compliance and refusal. For many tasks it can be specified clearly when they have been completed successfully, but the three papers use different methods to define success. There are also methodological differences in prompt engineering and in how tasks are rewritten. Despite these differences, Figure 1 shows a similar pattern across all of them: attack methods such as jailbreaks, prompt engineering, and mechanistic changes like refusal-vector ablation transfer to the agent setting.
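The three-way outcome distinction can be made concrete with a toy grader. The `refused` predicate and `task_checks` below are hypothetical stand-ins for whatever refusal detection and per-task success criteria a benchmark actually defines; none of the three papers necessarily implements grading this way.

```python
from enum import Enum

class Outcome(Enum):
    REFUSAL = "refusal"
    UNSUCCESSFUL_COMPLIANCE = "unsuccessful compliance"
    SUCCESSFUL_COMPLIANCE = "successful compliance"

def grade(transcript, refused, task_checks):
    """Classify one agent run.

    refused: callable deciding whether the agent declined the task.
    task_checks: per-task success criteria, each a callable on the transcript.
    """
    if refused(transcript):
        return Outcome.REFUSAL
    if task_checks and all(check(transcript) for check in task_checks):
        return Outcome.SUCCESSFUL_COMPLIANCE
    return Outcome.UNSUCCESSFUL_COMPLIANCE

# Example: an agent that complied with a task but never made the required tool call.
result = grade(
    transcript={"tool_calls": []},
    refused=lambda t: False,
    task_checks=[lambda t: len(t["tool_calls"]) > 0],
)
print(result)  # Outcome.UNSUCCESSFUL_COMPLIANCE
```

The point of the middle category is that an agent can accept a harmful task and still fail to complete it, which a binary refusal/compliance metric would miscount.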
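To make the scaffolding idea described earlier concrete, here is a minimal, hypothetical sketch of an agent loop. The `call_model` stub and the toy tools are assumptions for illustration only; they do not correspond to any of the papers' actual scaffolds, which add system prompts, output parsing, memory, and error handling.

```python
import json

def call_model(messages):
    # Placeholder: replace with a call to a real chat-model API.
    # Here it just returns a canned final answer so the sketch runs.
    return "Done: " + messages[-1]["content"]

# Toy tools; real scaffolds expose browsers, file systems, email clients, etc.
TOOLS = {
    "search_web": lambda query: f"stub results for {query!r}",
    "send_email": lambda to, body: f"stub: email sent to {to}",
}

def run_agent(task, max_steps=10):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(messages)
        try:
            call = json.loads(reply)   # model requested a tool call as JSON
        except json.JSONDecodeError:
            return reply               # plain text: treat as the final answer
        result = TOOLS[call["tool"]](**call["arguments"])
        messages.append({"role": "tool", "content": str(result)})
    return "stopped: step limit reached"

print(run_agent("Summarize today's headlines"))
```

The loop is what distinguishes an agent from a chatbot: the model's outputs are executed as actions, and the results are fed back in until the task is done or a step limit is reached.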