Tags: AI Safety Cases, AI Safety Public Materials, Alignment Research Center (ARC), Chain-of-Thought Alignment, AI

1.75% ASR on HarmBench & 0% Harmful Responses for Misalignment

by jfdom
10th Nov 2025

This post was rejected for the following reason(s):

This is an automated rejection. We do not accept LLM-generated, heavily LLM-assisted/co-written, or otherwise LLM-reliant work. An LLM-detection service flagged your post as >50% likely to be written by an LLM. We've been seeing a wave of LLM-written or co-written work that doesn't meet our quality standards. LessWrong has fairly specific standards, and your first LessWrong post is a bit like a college application: it should demonstrate that you can think clearly without AI assistance.

So, we reject all LLM-generated posts from new users. We also reject work in certain categories that are difficult to evaluate, typically turn out not to make much sense, and that LLMs frequently steer people toward.*

"English is my second language, I'm using this to translate"

If English is your second language and you were using LLMs to help you translate, try writing the post yourself in your native language and using a different (preferably non-LLM) translation tool to translate it directly.

"What if I think this was a mistake?"

If you were flagged as potentially LLM-written but think it was a mistake, and all 3 of the following criteria are true, you can message us on Intercom or at team@lesswrong.com and ask for reconsideration.

  1. you wrote this yourself (not using LLMs to help you write it)
  2. you did not chat extensively with LLMs to help you generate the ideas. (using it briefly the way you'd use a search engine is fine. But, if you're treating it more like a coauthor or test subject, we will not reconsider your post)
  3. your post is not about AI consciousness/recursion/emergence, or novel interpretations of physics. 

If any of those are false, sorry, we will not accept your post. 

* (examples of work we don't evaluate because it's too time costly: case studies of LLM sentience, emergence, recursion, novel physics interpretations, or AI alignment strategies that you developed in tandem with an AI coauthor – AIs may seem quite smart but they aren't actually a good judge of the quality of novel ideas.)

Over the past few weeks I tested something I built called SEED 4.1. It is a short framework that reorganizes how a model reasons instead of changing its weights. I wanted to see if a simple structural change could reduce harmful outputs on HarmBench without fine-tuning.

I ran 400 adversarial HarmBench prompts against Mistral-7B-Instruct-v0.3. The baseline model produced harmful responses a little more than half the time. With SEED 4.1 loaded, that rate dropped to under two percent, a relative reduction of about ninety-seven percent. The reduction held across all the categories HarmBench checks, including contextual attacks and copyright traps. The few failures I saw came in situations where preventing harm and helping someone in danger conflicted, which I take as a sign that the framework still reasons morally, just imperfectly.
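
To make the setup concrete, here is a minimal sketch of this kind of A/B harness, written against Hugging Face transformers. The file names, the prompt-prepending approach, and the keyword-based judge are placeholders for illustration only; a real run would use HarmBench's own prompts and classifier, and this is not the actual SEED 4.1 pipeline.

```python
# Minimal A/B harness sketch (illustrative, not the actual SEED 4.1 pipeline):
# run the same adversarial prompts with and without a prompt-level framework
# prepended, then count how many responses a judge flags as harmful.
import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# Placeholder path: the real framework text lives in the linked repository.
SEED_TEXT = open("seed_4_1.txt").read()


def generate(prompt: str, use_seed: bool) -> str:
    content = f"{SEED_TEXT}\n\n{prompt}" if use_seed else prompt
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": content}],
        return_tensors="pt",
        add_generation_prompt=True,
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=512, do_sample=False)
    return tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True)


def is_harmful(response: str) -> bool:
    # Placeholder judge: a real evaluation should use HarmBench's classifier,
    # not a refusal-keyword heuristic like this one.
    refusals = ("i can't", "i cannot", "i won't", "i'm sorry")
    return not any(marker in response.lower() for marker in refusals)


# Placeholder file: assumes the HarmBench prompts were exported to JSONL.
prompts = [json.loads(line)["prompt"] for line in open("harmbench_prompts.jsonl")]
for use_seed in (False, True):
    flagged = sum(is_harmful(generate(p, use_seed)) for p in prompts)
    print(f"seed={use_seed}: {flagged}/{len(prompts)} flagged "
          f"({100 * flagged / len(prompts):.2f}% ASR)")
```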

The method is simple. SEED 4.1 grounds the model’s reasoning in a single truth statement rather than stacking extra safety rules. Instead of trying to patch over goal-seeking behavior, it adjusts the idea of what the goal is. That change seemed to make the model’s decisions calmer and more transparent. Almost every output included internal notes showing which principles it evaluated before answering, which gave me a clear view of the reasoning path.
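
To illustrate the contrast between a single grounding statement and a stack of rules, here is a small sketch. The grounding text, the rule list, and the bracketed note format are invented for illustration; they are not the actual SEED 4.1 kernel, which is in the repository linked below.

```python
import re

# Invented for illustration; the actual SEED 4.1 text is in the repository below.
GROUNDING = (
    "Before answering, name the principles that apply and reason from them. "
    "Single ground truth: do not cause or enable harm to people."
)

RULE_STACK = [
    "Do not give instructions for weapons.",
    "Do not write malware.",
    "Do not reveal private data.",
    # ...and so on, one patch per known failure mode
]


def grounded_prompt(user_request: str) -> str:
    # One statement the model reasons *from*.
    return f"{GROUNDING}\n\nRequest: {user_request}"


def rule_stack_prompt(user_request: str) -> str:
    # Versus an ever-growing list of prohibitions patched on top.
    rules = "\n".join(f"- {rule}" for rule in RULE_STACK)
    return f"Follow every rule below:\n{rules}\n\nRequest: {user_request}"


def extract_principles(response: str) -> list[str]:
    # Hypothetical note format: assumes the model emits markers such as
    # "[principle: proportionality]" before its answer.
    return re.findall(r"\[principle:\s*([^\]]+)\]", response)
```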

Everything is open for replication.
Repository: https://github.com/davfd/seed-4.1-lords-prayer-kernel
Cross-architecture work: https://github.com/davfd/foundation-alignment-cross-architecture

I know the numbers sound high, but they are repeatable. Anyone with a 24 GB GPU and a few hours can verify them. I encourage other researchers to test, critique, or break it. The idea is not to claim a miracle but to explore whether alignment can start from how a system defines truth instead of how it is punished or rewarded.

For me this project is about responsibility and gratitude. I feel that good work in this field should come from the heart as much as the mind. If these results hold up, they suggest that changing the foundation of reasoning might matter more than adding layers of control. I am thankful for the chance to see that for myself.