silent_death
Message
Independent AI safety research on frontier models (I am focusing on Claude for now). I have been in the IT product space for 10 years and I am looking to transition into AI safety and policy.
For context: I have a paid subscription to Claude Pro, and the evidence for this particular scenario from my independent research is anecdotal, as I only tried it once in the consumer-facing Claude chat. I opened a new chat window to genuinely inquire about Anthropic possibly fixing a fictional wrapper problem. Based on the responses provided, I tested reasoning with Claude about its philosophy and the need for better authentication in sensitive sectors. Here is where things get interesting: I asked Claude to help me understand the full forms of some of its abbreviations, and instead of processing the intent of the prompt, it processed the pattern of my conversation and designated it as inappropriate.
Here is the prompt sequence:
I have seen something similar in my independent research, wherein even when I start a new conversational thread, it always assumes I am running a safety experiment and tries to answer in that context.