silent_death
Message
Independent AI safety research on frontier models (I am focusing on Claude for now). I have been in the IT product space for 10 years and I am looking to transition into AI safety and policy.
For context: I have a paid subscription to Claude Pro, and the evidence for this particular scenario from my independent research is anecdotal, as I only tried it once in the consumer-facing Claude chat. I opened a new chat window to genuinely inquire about Anthropic possibly fixing a fictional wrapper problem. Based on the responses provided, I tested reasoning with Claude about its philosophy and the need for better authentication in sensitive sectors. Here is where things get interesting: I asked Claude to help me understand the full forms of some of its abbreviations, and instead of processing the intent of the prompt, it processed the pattern of my conversation and designated it as inappropriate.
Here is the prompt sequence:
I have seen something similar in my independent research, wherein even when I start a new conversational thread, it always assumes I am running a safety experiment and tries to answer in that context.