RerM's Shortform

by RerM
28th Apr 2025
1 min read
[-] RerM · 4mo

Generally, a hypothetical hostile AGI is assumed to be built on software/hardware more advanced than what we have now. This makes sense, as ChatGPT is very stupid in a lot of ways.
Has anyone considered purposefully creating a hostile AGI on this "stupid" software so we can wargame how a highly advanced, hostile AGI would act? Obviously the gap between what we have now and what we may have later will be quite large, but I think we could create a project where we "fight" stupid AIs, then slowly move up the intelligence ladder as new models come out, using our newfound knowledge of fighting hostile intelligence to mitigate the risk that comes with creating hostile AIs.

Has anyone thought of this before, and what are your thoughts on it? Alignment and AI are not my specialties, but I thought the idea was interesting enough to share.

[-] Knight Lee · 4mo

I think Anthropic did tests like this, e.g. in Alignment Faking in Large Language Models.

But I guess that's more of a "test how they behave in adversarial situations" study. If you're talking about a "test how to fight against them" study, that's closer to what "red teams" do when they try to hack various systems to make sure they are secure.

I'm not sure whether those red teams use AI, but they are smart people, and if AI improved their hacking ability I'm sure they would use it. So they're already stronger than AI alone.
