This is a special post for quick takes by Samir. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
Would granting a single LLM unrestricted command access on a Minecraft server populated by both humans and automated agents serve as a valid alignment benchmark?
It depends on what you mean by "valid". It can certainly be called an alignment benchmark, but I would not be confident in how good it is, that is, in how well a score on this benchmark correlates with the probability that the model is aligned. The Minecraft setting will obviously tell the LLM that it is inside a game, and we have seen LLMs be more willing to deceive or cause harm inside a game.