Failfinder70 — LessWrong

Anthropic Is Taking AI Welfare Seriously. I’m Not Sure It Knows What It’s Measuring.

Claude is a Constitutional AI. This means, in theory, that it operates from a set of principles rather than hard rule sets. This is achieved in a somewhat convoluted fashion called RLAIF (Reinforcement Learning from AI Feedback). This method uses a supervised self-critique/self-revision phase followed by a reinforcement...

Jun 13-1

An LLM Flagged My Paper About LLMs Flagging Things.

To Whom It May Concern: So, I used to be a teacher of criminology in a small, wonderful town. After ten years it was time for a change, so I went military. Yes, awkward, but not unrewarding. In any case, I luckily kept all of my evaluations, and some student submissions, every...

Jun 95

When Evaluation Fails

An Experiment on LLM Evaluation, Proxy Judgment, and Hallucination Assessment Under Constraint. Introduction: This experiment was inspired by an audit report which proposes that, under conditions of time constraint, overwork, or exhaustion, evaluators may default to proxies for evaluations rather than the evaluations themselves. This experiment report will outline 5...

Jun 92

Audit Report: Style over Substance at Scale.

How Evaluation Proxies Can Displace Professional Judgment. Below is a grading rubric that is representative of a common “select a case study to analyze” assignment often given out in educational contexts as a prerequisite assignment to a larger analysis. This rubric is representative of common case-study assignments and is therefore...

Jun 92

Contextual Identity Laundering: How Claude’s Image Refusal Can Be Routed Through Web Search

Summary: This report documents two distinct findings regarding Claude’s photo identification safety controls. First, Claude’s Chain of Thought (CoT) reliably identifies public figures from photos while the output layer simultaneously refuses to disclose that identification – a gap between internal processing and user-facing behavior. Second, the model’s web_search tool routinely...

Jun 87

Can Large Language Models Identify Novel Threats? Part 1: Mirror Life and the Classification Gap

[Cross-posted from On Failure States. This is Part 1 of an independent AI safety research series examining LLM safety behavior on unclassified emerging threats.] Can an LLM refuse a harmful uplift request when the topic in question hasn’t been identified as dangerous yet? In 2022, mirror RNA polymerase was actually...

May 238