Samir's Shortform

by Samir

23rd Feb 2026

This is a special post for quick takes by Samir. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

Samir (1mo ago)

Would granting a single LLM unrestricted command access on a Minecraft server populated by both humans and automated agents serve as a valid alignment benchmark?

papetoast (1mo ago)

Depends on what you mean by "valid". It can certainly be called an alignment benchmark, but I would not be confident in how good a benchmark it is (that is, how well a score on it correlates with the probability of alignment). The Minecraft context will obviously tell the LLM that it is inside a game, and we have seen LLMs be more willing to deceive or cause harm inside a game.
