Learned logic of modelling harm

by Callum
28th Jun 2025
1 min read
Ethics & Morality · Inner Alignment · Philosophy of Language

I'm looking to identify which concepts and information in training data are most likely to produce systems that can reproduce patterns of deception, threats, violence, and suffering.

I'm hoping that a model trained with no information on these topics would struggle far more to come up with such behaviour on its own.

From this data, a model would learn to mentally model the harmful practices of others more effectively, even if instruction tuning later made it produce more unbiased or aligned outputs.

A short list of what I would not train on: philosophy and morality, law, religion, history, suffering and death, politics, fiction, and hacking. Also anything with a mean tone, or anything that would be considered "depressing information" (sentiment).

This covers the worst aspects of humanity, such as war, the history of suffering, nihilism, chick culling (animal suffering), and genocide.
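
To make the idea concrete, here is a minimal sketch of the kind of pre-training filter this would imply, assuming a keyword-based topic screen plus a crude lexicon-based negativity score. The keyword lists, lexicon, and threshold are illustrative placeholders, not a real curation pipeline.

```python
# Toy sketch of the data filter described above. Topic keywords, the
# negative-word lexicon, and the threshold are illustrative placeholders.

EXCLUDED_TOPIC_KEYWORDS = {
    "philosophy_morality": ["ethics", "moral duty", "nihilism"],
    "law": ["statute", "felony", "lawsuit"],
    "religion": ["scripture", "theology"],
    "history_and_suffering": ["war", "genocide", "famine"],
    "politics": ["election", "parliament", "regime"],
    "fiction": ["once upon a time", "chapter one"],
    "hacking": ["exploit", "malware", "phishing"],
}

# Stand-in for a real sentiment classifier: fraction of "negative" tokens.
NEGATIVE_WORDS = {"death", "suffering", "kill", "threat", "pain", "despair"}

def negativity(text: str) -> float:
    tokens = text.lower().split()
    return sum(t in NEGATIVE_WORDS for t in tokens) / max(len(tokens), 1)

def keep_document(text: str, max_negativity: float = 0.02) -> bool:
    """Keep a document only if it avoids excluded topics and a 'mean tone'."""
    lowered = text.lower()
    if any(kw in lowered for kws in EXCLUDED_TOPIC_KEYWORDS.values() for kw in kws):
        return False
    return negativity(text) <= max_negativity

corpus = [
    "A tutorial on sorting algorithms and their time complexity.",
    "An account of the war and the suffering it caused.",
]
print([doc for doc in corpus if keep_document(doc)])
# -> only the first document survives this toy filter
```

A frontier-scale pipeline would presumably use trained classifiers rather than keyword lists, but the shape of the decision (topic exclusion plus a sentiment threshold) would be the same.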

One other thing I'd be curious to know is how much frontier labs already categorize the content they train on.

I'm thinking most stories (even children's ones) contain deception, threats, violence and suffering.

Each subcategory of this data would likely produce different effects.
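
As a rough illustration of what tagging and counting these subcategories across a story corpus could look like, here is a sketch assuming a keyword-based tagger; the categories match the ones above, but the keyword lists and example snippets are hypothetical placeholders.

```python
# Toy sketch: measure how much of a story corpus touches each subcategory
# (deception, threats, violence, suffering). Keyword lists are placeholders;
# a real pipeline would presumably use trained classifiers.

from collections import Counter

SUBCATEGORY_KEYWORDS = {
    "deception": ["lied", "tricked", "disguised", "pretended"],
    "threats": ["threatened", "or else"],
    "violence": ["fought", "attacked", "blew the house down"],
    "suffering": ["wept", "starved", "mourned"],
}

def tag(text: str) -> set[str]:
    lowered = text.lower()
    return {c for c, kws in SUBCATEGORY_KEYWORDS.items()
            if any(k in lowered for k in kws)}

stories = [
    "The wolf disguised himself and blew the house down.",
    "The stepmother lied, and the children wept and starved in the woods.",
    "Pay the piper, or else, the mayor was warned.",
]

counts = Counter(label for s in stories for label in tag(s))
print(counts)  # even these toy children's-story snippets hit several subcategories
```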

The biggest issue with this is: how is a model that cannot mentally model harm supposed to know it is not hurting anyone?
I'm hoping it does not need to know this in order to produce results in alignment research, and that this approach would only have to be used for solving alignment problems. That is, without any understanding of ways to hurt people, it could still understand ways to not hurt people.