Zack_M_Davis — LessWrong

On Fleshling Safety: A Debate by Klurl and Trapaucius.

"Bah!" cried Trapaucius. "By the same logic, we could say that planets could be obeying a million algorithms other than gravity, and therefore, ought to fly off into space!"

Klurl snorted air through his cooling fans. "Planets very precisely obey an exact algorithm! There are not, in fact, a million equally simple alternative algorithms which would yield a similar degree of observational conformity to the past, but make different predictions about the future! These epistemic situations are not the same!"

"I agree that the fleshlings' adherence to korrigibility is not exact and down to the fifth digit of precision," Trapaucius said. "But your lack of firsthand experience with fleshlings again betrays you; that degree of precision is simply not something you could expect of fleshlings."

I think Trapaucius missed a great opportunity here to keep riffing off the gravity analogy. Actually, there are different algorithms the planets could be obeying: special and then general relativity turned out to be better approximations than Newtonian gravity, and GR is presumably not the end of the story—and yet, as Trapaucius says, the planets do not "fly off into space." Newton is good enough not just for predicting the night sky (modulo the occasional weird perihelion precession), but even landing on the moon, for which relativistic deviations from Newtonian predictions were swamped by other sources of error.

Obviously, that's just a facile analogy: if Trapaucius had found that branch of the argument tree, Klurl could easily go into more details about further disanalogies between gravity and the fleshlings.

But I think that the analogy is getting at something important. When relatively smarter real-world fleshlings delude themselves into thinking that Claude Sonnet 4.5 is pretty corrigible because they see it obeying their instructions, they're not arguing, as Trapaucius does, that "Korrigibility is the easiest, simplest, and natural way to think" for an generic mind. They're arguing that Anthropic's post-training procedure successfully pointed to the behavior of natural language instruction-following, which they think is a natural abstraction represented in the pretraining data which generalizes in a way that's decision-relevantly good enough for their purposes, such that Claude won't "fly off into space" even if they can't precisely predict how Claude will react to every little quirk of phrasing. They furthermore have some hope that this alleged benign property is robust and useful enough to help humanity navigate the intelligence explosion, even though contemporary language models aren't superintelligences and future AI capabilities will no doubt work differently.

Maybe that's totally delusional, but why is it delusional? I don't think "On Fleshling Safety" (or past work in a similar vein) is doing a good job of making the case. A previous analogy about an alien actress came the closest, but trying to unpack the analogy into a more rigorous argument involves a lot of subtleties that fleshlings are likely to get confused about.

Comment on "Death and the Gorgon"

Zack_M_Davis7d90

(Asimov's has now put the story up for free)

White House OSTP AI Deregulation Public Comment Period Ends Oct. 27

Zack_M_Davis8d20

workshop in San Francisco tomorrow at 1 p.m.

faul_sname's Shortform

Zack_M_Davis10d20

is hard to keep secret

Is it actually hard to keep secret, or is it that people aren't trying (because the prestige of publishing an advance is worth more than hoarding the incremental performance improvement for yourself)?

faul_sname's Shortform

Zack_M_Davis10d20

The Sonnet 4.5 system card reiterates the "most thought processes are short enough to display in full" claim that you quote:

As with Claude Sonnet 4 and Claude Opus 4, thought processes from Claude Sonnet 4.5 are summarized by an additional, smaller model if they extend beyond a certain point (that is, after this point the “raw” thought process is no longer shown to the user). However, this happens in only a very small minority of cases: the vast majority of thought processes are shown in full.

But it is intriguing that the displayed Claude CoTs are so legible and "non-weird" compared to what we see from DeepSeek and ChatGPT. Is Anthropic using a significantly different (perhaps less RL-heavy) post-training setup?

21st Century Civilization curriculum

Zack_M_Davis11d30

Linkpost URL should presumably include "http://" (click currently goes to https://www.lesswrong.com/posts/2CGXGwWysiBnryA6M/www.21civ.com).

The IABIED statement is not literally true

Zack_M_Davis14d128

It will probably be possible, with techniques similar to current ones, to create AIs who are similarly smart and similarly good at working in large teams to my friends, and who are similarly reasonable and benevolent to my friends in the time scale of years under normal conditions.

[...]

This is maybe the most contentious point in my argument, and I agree this is not at all guaranteed to be true, but I have not seen MIRI arguing that it's overwhelmingly likely to be false.

Did you read the book? Chapter 4, "You Don't Get What You Train For", is all about this. I also see reasons to be skeptical, but have you really "not seen MIRI arguing that it's overwhelmingly likely to be false"?

The Relationship Between Social Punishment and Shared Maps

Zack_M_Davis15d4-1

Isn't it, though?

The Relationship Between Social Punishment and Shared Maps

Zack_M_Davis21d20

Indeed, I notice in your list above you suspiciously do not list the most common kind of attribute that is attributed to someone facing social punishment. "X is bad" or "X sucks" or "X is evil".

I'm inclined to still count this under "judgments supervene on facts and values." Why is X bad, sucky, evil? These things can't be ontologically basic. Perhaps less articulate members of a mass punishment coalition might not have an answer ("He just is; what do you mean 'why'? You're not an X supporter, are you?"), but somewhere along the chain of command, I expect their masters to offer some sort of justification with some sort of relationship to checkable facts in the real world: "stupid, dishonest, cruel, ugly, &c." being the examples I used in the post; we could keep adding to the list with "fascist, crazy, cowardly, disloyal, &c." but I think you get the idea.

The justification might not be true; as I said in the post, people have an incentive to lie. But the idea that "bad, sucks, evil" are just threats within a social capital system without any even pretextual meaning outside the system flies in the face of experience that people demand pretexts.

The Company Man

Zack_M_Davis26d126

Can't you just say that yourself (not all, caricature, parody, uncharitable, exaggerates, &c.) when sharing it? Death of the author, right?

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments