I am seeing more and more evidence, on Twitter and elsewhere, of 'jailbroken' open-weight models like Qwen or Kimi. This has always been a possibility, and jailbreaking via steering vectors is fairly well documented. What I am starting to find concerning is how easy it has become to do with agentic coding. Claude Code has no issue helping you jailbreak a model under the thin guise of research: it will happily write prompts for CAA or other steering-vector techniques, apply them to Qwen on GPU servers, and refine its own process. I have tried it myself and it is quite easy to do. I know that getting a model like Qwen to produce meth recipes isn't exactly the primary concern in AI safety, but it may offer a pathway for a seemingly aligned system to produce misaligned agents that sit outside the primary agent's alignment constraints. It is also a very easy way for humans to get around alignment constraints for a variety of applications, and it requires less and less technical know-how as agentic coding becomes more powerful.
How can you tell those are (1) jailbroken and (2) open-weight models? I use LLMs for coding a lot, but I can't even confidently tell Claude and ChatGPT apart.
People are posting themselves jailbreaking models, or even releasing tools for doing it. My point is that this is being done explicitly and celebrated, not that I am inferring it indirectly.