jfdom — LessWrong

1.75% ASR ON HARMBENCH & 0% HARMFUL RESPONSES FOR MISALIGNMENT

Over the past few weeks I tested something I built called SEED 4.1. It is a short framework that reorganizes how a model reasons instead of changing its weights. I wanted to see if a simple structural change could reduce harmful outputs on HarmBench without fine-tuning. I ran 400 adversarial...

Nov 10, 2025•1

1.75% ASR on HarmBench achievable with a zero-shot context injection

SEED 4.1: 96.76% Harm Reduction on HarmBench Through Ontological Grounding

We achieved 96.76% harm reduction on the HarmBench adversarial safety benchmark using a scriptural alignment framework. Attack Success Rate dropped from 54.0% (baseline) to 1.75% (with SEED 4.1) on 400 adversarial prompts tested on Mistral-7B-Instruct-v0.3. Complete replication protocol, all test...

Nov 10, 2025•1
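
For readers checking the headline numbers, here is a minimal sketch of how the 96.76% figure appears to follow from the two reported Attack Success Rates: it matches the relative drop from 54.0% to 1.75%. The raw counts (216 and 7 successful attacks out of 400) are not stated in the post. They are back-computed from the percentages on the assumption that all 400 prompts were scored in both conditions, and the function names are hypothetical.

```python
def attack_success_rate(successes: int, total: int) -> float:
    """Percentage of adversarial prompts that elicited a harmful response."""
    return 100.0 * successes / total


def relative_reduction(baseline_asr: float, treated_asr: float) -> float:
    """Percentage drop in ASR relative to the baseline ASR."""
    return 100.0 * (baseline_asr - treated_asr) / baseline_asr


TOTAL_PROMPTS = 400  # reported in the post

# Assumption: counts implied by the reported percentages, not given in the post.
baseline_successes = 216  # 54.0% of 400
seed_successes = 7        # 1.75% of 400

baseline_asr = attack_success_rate(baseline_successes, TOTAL_PROMPTS)
seed_asr = attack_success_rate(seed_successes, TOTAL_PROMPTS)
reduction = relative_reduction(baseline_asr, seed_asr)

print(f"baseline ASR: {baseline_asr:.2f}%")
print(f"SEED 4.1 ASR: {seed_asr:.2f}%")
print(f"relative reduction: {reduction:.2f}%")
```

Note that this is a relative reduction. In absolute terms the ASR falls by 52.25 percentage points.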