LESSWRONG

Zack_M_Davis

Comments, sorted by newest
The Tale of the Top-Tier Intellect
Zack_M_Davis · 3d

This story would have benefited from being edited by a chess player. I think one of the better players in even a "medium-small town" with "a thriving chess club as one of its central civic institutions" would know more about the game than the author seems to. (The chess writing seemed off to me, and I am significantly worse than a serious club player.)

"I thought at first it was a mistake, for you to castle so early" is a weird thing for Humman to say. Castling early is standard default beginner advice. Even if there was some unusual feature of the opening that made it a bad choice in this game, you wouldn't use the word "so" in that sentence.

It's weird for Assi to describe Humman's play as using "particular tactics", and then to (insincerely) compliment him for "doing well at one-move lookahead" and not "unforcedly throwing away material right on your next move". Tactics are short sequences of moves that work together to achieve a goal. (An example I keep falling for in bullet games with the Englund gambit accepted (1. d4 e5?! 2. dxe5) opening: Black's dark-square bishop is on d6, White's queen is still on d1, the d-file is open due to accepting the Englund gambit, and Black castles queenside to put a rook on d8. Black sacrifices the bishop with Bh2+, revealing a discovered attack of the Black rook on the White queen, which White can't do anything about because they have to use their move to deal with the check.) If a player is at the level of using "particular tactics", an IM who wants to compliment them for social reasons shouldn't find it difficult (to the point of giving up after "a dozen seconds of" "twist[ing] his brain around") to find something concrete and nice to say that's less patronizing than "at least you're not hanging pieces."

(Also, the Ethiopean isn't a real opening; a cutesy fake detail like that feels out of place mixed in with real details like IMs needing an Elo of 2400, and I'd expect a club player to have heard of simuls.)

Do these flaws matter, given that the story isn't really about chess? I argue that they do, because a story about the folly of misperceiving how high skill ladders go should take basic care to get the details right concerning the skill ladder of its notional real-world example. (An earlier draft of this comment continued, "particularly in 2025 when basic care is so cheap. In the story, Tessa has no qualms about using LLMs to fill in domain knowledge gaps; why doesn't Yudkowsky?", but when I checked, Claude Sonnet 4.5 didn't anticipate my criticism.)

The Tale of the Top-Tier Intellect
Zack_M_Davis · 3d

Suppose we compare that whole function with Mr. Neumman's function, and compare how good are the probable moves you'd make versus him making. On most chess positions, Mr. Neumann's move would probably be better. [...] That's the detailed complicated actually-true underlying reality that explains why the Elo system works to make excellent predictions about who beats who at chess.

This explanation is bogus. (Obviously, the conclusion that Elo scores are practically meaningful is correct, but that's not an excuse.)
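For reference on what those "excellent predictions" actually are: the Elo model turns a rating difference into an expected score via a logistic curve. A minimal sketch of the standard formula (the specific ratings below are illustrative, with 2400 being the IM threshold):

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (win = 1, draw = 0.5, loss = 0) for player A
    against player B under the standard Elo logistic model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

# Equal ratings predict an even score.
assert elo_expected_score(1500, 1500) == 0.5

# A 2400-rated IM against a 1400-rated club player: the model
# predicts roughly a 99.7% expected score for the IM.
print(elo_expected_score(2400, 1400))  # ≈ 0.997
```

Note that the formula only consumes a one-dimensional rating, which is exactly what's at issue: it has no way to represent rock-paper-scissors–like matchup effects even if they exist.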

Mr. Humman could locally-validly reply that Tessa is begging the question by assuming that there's a fact of the matter as to one move being "better" than another in a position. Whether a move is "good" depends on what the opponent does. Why can't there be a rock-paper-scissors–like structure, where in some position, 12. ...Ne4 is good against positional players and bad against tactical players?

Earlier, Tessa does appeal to player comparisons being "mostly transitive most of the time"—but only as something that "didn't have to be true in real life", which seems to contradict the claim that some moves in a position are better on the objective merits of the position, rather than merely with respect to the tendencies of some given population of players.

The actual detailed complicated actually-true underlying reality is that by virtue of being a finite zero-sum game, chess fulfills the conditions of the minimax theorem, which implies that there exists an inexploitable strategy. You can have rock-paper-scissors–like cycles among particular strategies, but the minimax strategy does no worse than any of them.
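The minimax point can be made concrete with rock-paper-scissors itself. A minimal sketch in plain Python (the uniform mixture is RPS's minimax strategy, known by symmetry):

```python
# Rock-paper-scissors payoff matrix for the row player:
# rows/cols = (rock, paper, scissors); +1 = win, -1 = loss, 0 = draw.
A = [
    [0, -1, 1],   # rock:     draws rock, loses to paper, beats scissors
    [1, 0, -1],   # paper:    beats rock, draws paper, loses to scissors
    [-1, 1, 0],   # scissors: beats paper, loses to rock
]

def expected_payoff(strategy, opponent):
    """Row player's expected payoff for two mixed strategies
    (each a length-3 list of probabilities)."""
    return sum(
        strategy[i] * opponent[j] * A[i][j]
        for i in range(3)
        for j in range(3)
    )

rock, paper, scissors = [1, 0, 0], [0, 1, 0], [0, 0, 1]

# The pure strategies form a non-transitive cycle...
assert expected_payoff(rock, scissors) == 1
assert expected_payoff(scissors, paper) == 1
assert expected_payoff(paper, rock) == 1

# ...but the uniform mixture is inexploitable: it achieves the game's
# minimax value (0) against every pure reply, so nothing beats it in
# expectation, even though it dominates nothing.
uniform = [1 / 3, 1 / 3, 1 / 3]
for pure in (rock, paper, scissors):
    assert abs(expected_payoff(uniform, pure)) < 1e-9
```

(For two-player zero-sum games in general, a minimax strategy can be computed by linear programming; the uniform mixture here just falls out of the symmetry.)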

The implications for real-world non-perfect play are subtler. As a start, Czarnecki et al. 2020 (of DeepMind) suggest that "Real World Games Look Like Spinning Tops": there's a transitive "skill" dimension along which higher-skilled strategies beat lower-skilled ones, but at any given skill level, there's a non-transitive rock-paper-scissors–like plethora of strategies, which explains how players of equal skill can nevertheless have distinctive styles. The size of the non-transitive dimension thins out as skill increases (away from the "base" of the top—see the figures in the paper).
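As an illustrative toy (my own construction, not Czarnecki et al.'s): give each strategy a transitive "skill" coordinate and a cyclic "style" coordinate, and you get exactly the "transitive across skill levels, non-transitive within one" structure:

```python
def result(a, b):
    """+1 if strategy a beats b, -1 if it loses, 0 for a draw.
    Strategies are (skill, style) pairs: higher skill always wins;
    at equal skill, the three styles beat each other cyclically."""
    (skill_a, style_a), (skill_b, style_b) = a, b
    if skill_a != skill_b:
        return 1 if skill_a > skill_b else -1
    if style_a == style_b:
        return 0
    return 1 if (style_a - style_b) % 3 == 1 else -1

# Transitive across skill levels, regardless of style:
assert result((2, 0), (1, 2)) == 1

# ...but a rock-paper-scissors cycle within a skill level:
assert result((1, 1), (1, 0)) == 1  # style 1 beats style 0
assert result((1, 2), (1, 1)) == 1  # style 2 beats style 1
assert result((1, 0), (1, 2)) == 1  # style 0 beats style 2
```

The toy leaves out the "thinning" of the non-transitive dimension at high skill, which in the spinning-top picture is the part that does the work of making Elo increasingly adequate near the top.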

This picture seems to suggest that Humman's worldview isn't total nonsense; rather, the problem is his attribution of it to the "top tier". Non-transitivity is real and significant in human life—but gradually less so as we approach the limit of optimality.

On Fleshling Safety: A Debate by Klurl and Trapaucius.
Zack_M_Davis · 10d

"Bah!" cried Trapaucius. "By the same logic, we could say that planets could be obeying a million algorithms other than gravity, and therefore, ought to fly off into space!"

Klurl snorted air through his cooling fans. "Planets very precisely obey an exact algorithm! There are not, in fact, a million equally simple alternative algorithms which would yield a similar degree of observational conformity to the past, but make different predictions about the future! These epistemic situations are not the same!"

"I agree that the fleshlings' adherence to korrigibility is not exact and down to the fifth digit of precision," Trapaucius said. "But your lack of firsthand experience with fleshlings again betrays you; that degree of precision is simply not something you could expect of fleshlings."

I think Trapaucius missed a great opportunity here to keep riffing off the gravity analogy. Actually, there are different algorithms the planets could be obeying: special and then general relativity turned out to be better approximations than Newtonian gravity, and GR is presumably not the end of the story—and yet, as Trapaucius says, the planets do not "fly off into space." Newton is good enough not just for predicting the night sky (modulo the occasional weird perihelion precession), but even landing on the moon, for which relativistic deviations from Newtonian predictions were swamped by other sources of error.

Obviously, that's just a facile analogy: if Trapaucius had found that branch of the argument tree, Klurl could easily go into more details about further disanalogies between gravity and the fleshlings.

But I think that the analogy is getting at something important. When relatively smarter real-world fleshlings delude themselves into thinking that Claude Sonnet 4.5 is pretty corrigible because they see it obeying their instructions, they're not arguing, as Trapaucius does, that "Korrigibility is the easiest, simplest, and natural way to think" for a generic mind. They're arguing that Anthropic's post-training procedure successfully pointed to the behavior of natural-language instruction-following, which they think is a natural abstraction represented in the pretraining data, one that generalizes in a way that's decision-relevantly good enough for their purposes, such that Claude won't "fly off into space" even if they can't precisely predict how Claude will react to every little quirk of phrasing. They furthermore have some hope that this alleged benign property is robust and useful enough to help humanity navigate the intelligence explosion, even though contemporary language models aren't superintelligences and future AI capabilities will no doubt work differently.

Maybe that's totally delusional, but why is it delusional? I don't think "On Fleshling Safety" (or past work in a similar vein) is doing a good job of making the case. A previous analogy about an alien actress came the closest, but trying to unpack the analogy into a more rigorous argument involves a lot of subtleties that fleshlings are likely to get confused about.

Comment on "Death and the Gorgon"
Zack_M_Davis · 12d

(Asimov's has now put the story up for free)

White House OSTP AI Deregulation Public Comment Period Ends Oct. 27
Zack_M_Davis · 13d

workshop in San Francisco tomorrow at 1 p.m.

faul_sname's Shortform
Zack_M_Davis · 15d

is hard to keep secret

Is it actually hard to keep secret, or is it that people aren't trying (because the prestige of publishing an advance is worth more than hoarding the incremental performance improvement for yourself)?

faul_sname's Shortform
Zack_M_Davis · 15d

The Sonnet 4.5 system card reiterates the "most thought processes are short enough to display in full" claim that you quote:

As with Claude Sonnet 4 and Claude Opus 4, thought processes from Claude Sonnet 4.5 are summarized by an additional, smaller model if they extend beyond a certain point (that is, after this point the “raw” thought process is no longer shown to the user). However, this happens in only a very small minority of cases: the vast majority of thought processes are shown in full.

But it is intriguing that the displayed Claude CoTs are so legible and "non-weird" compared to what we see from DeepSeek and ChatGPT. Is Anthropic using a significantly different (perhaps less RL-heavy) post-training setup?

21st Century Civilization curriculum
Zack_M_Davis · 16d

Linkpost URL should presumably include "http://" (click currently goes to https://www.lesswrong.com/posts/2CGXGwWysiBnryA6M/www.21civ.com).

The IABIED statement is not literally true
Zack_M_Davis · 19d
  1. It will probably be possible, with techniques similar to current ones, to create AIs who are similarly smart and similarly good at working in large teams to my friends, and who are similarly reasonable and benevolent to my friends in the time scale of years under normal conditions.

[...]

This is maybe the most contentious point in my argument, and I agree this is not at all guaranteed to be true, but I have not seen MIRI arguing that it's overwhelmingly likely to be false.

Did you read the book? Chapter 4, "You Don't Get What You Train For", is all about this. I also see reasons to be skeptical, but have you really "not seen MIRI arguing that it's overwhelmingly likely to be false"?

The Relationship Between Social Punishment and Shared Maps
Zack_M_Davis · 20d

Isn't it, though?
