All of cwillu's Comments + Replies

There is a single sharp, sweet, one-short-paragraph idea waiting to escape from the layers of florid prose it's tangled in.

Then it would be judged for what it is, rather than for the (tacky) clothing it's wearing.

A well-made catspaw, with a fine wide chisel on one end and a finely tapered nail puller on the other (most cheap catspaws' pullers are way too blunt), is very useful for light demo work like this, as it's a single tool you can just keep in your hand.  It's basically a demolition prybar with a claw and hammer on the opposite end.

60K2108 - Restorer's Cat's Paw, 12"

Pictured above is the kind I usually use.

Answer by cwillu · Nov 20, 2023

This isn't the link I was thinking of (I was remembering something in the alignment discussion in the early days of lw, but I can't find it), but this is probably a more direct answer to your request anyway:

[…] or reward itself highly without actually completing the objective […]

This is standard fare in the existing alignment discussion.  See for instance anything referring to wireheading.

Thanks. My thought is that any sufficiently intelligent AI would be capable of defeating any effort to prevent it from wireheading, and would resort to wireheading by default. It would know that humans don't want it to wirehead so perhaps it might perceive humanity as a threat, however, it might realize humans aren't capable of preventing it from wireheading and let humans live. In either event, it would just sit there doing nothing 24/7 and be totally satisfied in doing so. In other words, orthogonality wouldn't apply to an intelligence capable of wireheading because wireheading would be its only goal. Is there a reason why an artificial super-intelligence would abstain from wireheading?

[…] The notion of an argument that convinces any mind seems to involve a little blue woman who was never built into the system, who climbs out of literally nowhere, and strangles the little grey man, because that transistor has just got to output +3 volts:  It's such a compelling argument, you see.

But compulsion is not a property of arguments, it is a property of minds that process arguments.


And that is why (I went on to say) the result of trying to remove all assumptions from a mind, and unwind to the perfect absence of any prior, is not an ideal

[…]
You have to realize that we don't need full consensus to make great strides in alignment. Your comments are a bit abstract and obtuse. Perhaps you could more clearly and directly address whatever problems you see in creating a narrow AI with expertise in understanding morality.

The truth is probably somewhere in the middle.

Answer by cwillu · Jul 22, 2023

Not a complete answer, but something that helps me, that hasn't been mentioned often, is letting yourself do the task incompletely.

I don't have to fold all the laundry, I can just fold one or three things.  I don't have to wash all the dishes, I can just wash one more than I actually need to eat right now.  I don't have to pick up all the trash lying around, just gather a couple things into an empty bag of chips.

It doesn't mean anything, I'm not committing to anything, I'm just doing one meaningless thing.  And I find that helps.

Climbing the ladder of human meaning, ability and accomplishment for some, miniature american flags for others!

“Non-trivial” is a pretty soft word to include in this sort of prediction, in my opinion.

I think I'd disagree if you had said “purely AI-written paper resolves an open millennium prize problem”, but as written I'm saying to myself “hrm, I don't know how to engage with this in a way that will actually pin down the prediction”.

I think it's well enough established that long form internally coherent content is within the capabilities of a sufficiently large language model.  I think the bottleneck on it being scary (or rather, it being not long before The End) is the LLM being responsible for the inputs to the research.

Fair point, "non-trivial" is too subjective, the intuition that I meant to convey was that if we get to the point where LLMs can do the sort of pure-thinking research in math and physics at a level where the papers build on top of one another in a coherent way, then I'd expect us to be close to the end.  Said another way, if theoretical physicists and mathematicians get automated, then we ought to be fairly close to the end. If in addition to that the physical research itself gets automated, such that LLMs write their own code to do experiments (or run the robotic arms that manipulate real stuff) and publish the results, then we're *really* close to the end. 

Bing told a friend of mine that I could read their conversations with Bing because I provided them the link.

Is there any reason to think that this isn't a plausible hallucination?

Do you mean what Bing told me, or what Bing told your friend?  I think the probability that what it told me was true, or partially true, has increased dramatically, now that there's independent evidence that it consists of multiple copies of the same core model, differently tuned. I was also given a list of verbal descriptions of the various personalities, and we know that verbal descriptions are enough to specify an LLM's persona.  Whether it's true or not, it makes me curious about the best way to give an LLM self-knowledge of this kind. In a long system prompt? In auxiliary documentation that it can consult when appropriate? 

Regarding musicians getting paid ridiculous amounts of money for playing gigs, I'm reminded of the “Making chalk mark on generator $1. Knowing where to make mark $9,999.” story.

The work happens off-stage, for years or decades, typically hours per day starting in childhood, all of which is uncompensated; and a significant level of practice must continue your entire life to maintain your ability to perform.

My understanding is that M&B is intended to be broader than that, as per:

“So it is, perhaps, noting the common deployment of such rhetorical trickeries that has led many people using the concept to speak of it in terms of a Motte and Bailey fallacy. Nevertheless, I think it is clearly worth distinguishing the Motte and Bailey Doctrine from a particular fallacious exploitation of it. For example, in some discussions using this concept for analysis a defence has been offered that since different people advance the Motte and the Bailey it is unfair to acc[…]

So there's something to that, but I'm a little wary about taking that interpretation too far. Taken far enough, it implies that if group A has a sensible take on a concept, then as soon as a group B shows up that has a bad take on it, you can use it to discredit A as a motte for B. It seems bad if we can discredit any concept - including valuable ones - just by making up a bad take on it and spreading it. I talked about that in this post.

In the comments of that post, the most upvoted comment was one suggesting that you can distinguish a motte and bailey by looking at whether one of the groups actively disclaims the other. So if we think that a party not explicitly disclaiming a bad interpretation makes it more of a motte and bailey, then a situation where a party does explicitly disclaim it should make it less.

Also, with regard to the bit you quoted... I'm not sure if you could characterize it as "true believers advancing the Bailey under the cover provided by others who defend the Motte" if the ones who defend the Bailey and the ones who defend the Motte have positions that are the exact opposites of each other? In the original example of motte and bailey, as explained by Scott Alexander, the motte and bailey are basically weak and strong versions of the same claim, so people believing in the strong version can take support from arguments in defense of the weak claim.

But if some people say that "yes, I think people who don't use NVC are being violent" and others say that "no, that just happens to be an unfortunate established term; what kind of language you use has no bearing on whether you're violent or not"... then that doesn't seem like the strong and weak versions of the same claim? That'd be like saying that creationists and evolutionary biologists together form a motte and bailey, because both use the term "evolution" but assign different meanings to it (some saying that it's true, some saying that it's false).

I'm deeply suspicious of any use of the term “violence” in interpersonal contexts that do not involve actual risk-of-blood violence, having witnessed how the game of telephone interacts with such use, and having been close enough to be singed a couple times.

It's a motte and bailey: the people who use the word as part of a technical term clearly and explicitly disavow the implication, but other people clearly and explicitly call out the implication as if it were fact. Accusations of gaslighting sometimes follow.

It's as if “don't-kill-everyoneism” somehow g[…]

If some people consistently and explicitly disavow the implication, but other people consistently and explicitly endorse the implication, then I don't think that that's motte and bailey? As I understand it, M&B involves the same person being inconsistent about the meaning, not different people sticking to consistent but conflicting interpretations; that's just people disagreeing with each other.

I've seen/heard the term NVC for this in multiple places, and "non-violent communication" as the expansion whenever I've asked.  I agree with others that the name is not great.  The hyperbole of violence and the implication that not using it is akin to aggression is a pretty aggressive move itself.  The term "NVC" is at odds with the tenets of NVC. But it's established and out there, and I generally have low hopes of changing a common usage.  I hope the poor name doesn't devalue the actual concept and attempt to separate observation from interpretation.  Especially hyperbolic interpretation.

Meta, possibly a site bug:

The footnote links don't seem to be working for me, in either direction: footnote 1 links to #footnote-1, but there's no element with that id; likewise the backlink on the footnote links to #footnote-anchor-1, which also lacks a block with a matching id.
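For what it's worth, the symptom is easy to confirm from the browser console. A minimal sketch of the check (the function and its names are mine, not the site's), reduced to plain data so it doesn't depend on the DOM:

```javascript
// Given the fragment targets used by in-page links (hrefs minus the
// leading "#") and the element ids actually present in the document,
// return the targets that point at nothing.
function danglingFragments(linkTargets, presentIds) {
  const ids = new Set(presentIds);
  return linkTargets.filter((t) => !ids.has(t));
}

// The case described above: both the footnote link and its backlink
// reference ids that don't exist anywhere in the page.
danglingFragments(
  ["footnote-1", "footnote-anchor-1"], // targets used by the links
  []                                   // no matching ids in the page
);
// → ["footnote-1", "footnote-anchor-1"]
```

On the live page you'd collect the targets from `a[href^="#"]` and the ids from the document itself; any non-empty result means broken footnote navigation.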

Some paragraph breaks would go a long way towards the kingdom of playful rants from the desolate lands of manic ravings.

Agreed. Unfortunately, I do not have the intellectual acumen to translate this into a playful rant, so I am afraid I must leave this as is.

Any chance you have the generated svg's still, not just the resulting bitmap render?

Here was the final one:

<svg viewBox="0 0 800 600" xmlns="">
  <!-- Field -->
  <rect x="0" y="0" width="100%" height="100%" fill="#8BC34A"/>
  <!-- Sky and sun -->
  <rect x="0" y="0" width="100%" height="40%" fill="#90CAF9"/>
  <circle cx="700" cy="100" r="50" fill="#FFEB3B"/>
  <!-- Mountains -->
  <polygon points="100,300 300,100 500,300" fill="#BDBDBD"/>
  <polygon points="350,400 550,200 750,400" fill="#9E9E9E"/>
  <!-- Castle -->
  <rect x="200" y="150" width="200" height="200" fill="#F5F5F5"/>
  <rect x="250" y="200" width="100" height="100" fill="#BDBDBD"/>
  <rect x="220" y="190" width="60" height="60" fill="#8BC34A"/>
  <rect x="320" y="190" width="60" height="60" fill="#8BC34A"/>
  <rect x="290" y="230" width="20" height="60" fill="#BDBDBD"/>
  <polygon points="200,150 400,150 300,100" fill="#F5F5F5"/>
  <!-- Dragon -->
  <path d="M 550 400 Q 600 300 700 300 Q 800 300 750 400 Z" fill="#F44336"/>
  <ellipse cx="600" cy="350" rx="100" ry="50" fill="#E53935"/>
  <ellipse cx="660" cy="330" rx="30" ry="20" fill="#F5F5F5"/>
  <ellipse cx="540" cy="330" rx="30" ry="20" fill="#F5F5F5"/>
  <circle cx="620" cy="340" r="5" fill="#000"/>
  <circle cx="580" cy="340" r="5" fill="#000"/>
  <path d="M 600 380 Q 640 400 660 350" stroke="#000" stroke-width="2" fill="none"/>
  <path d="M 600 380 Q 560 400 540 350" stroke="#000" stroke-width="2" fill="none"/>
  <path d="M 520 330 Q 580 330 600 300" stroke="#000" stroke-width="2" fill="none"/>
  <path d="M 700 350 Q 680 320 680 340" stroke="#000" stroke-width="2" fill="none"/>
  <path d="M 700 350 Q 720 320 720 340" stroke="#000" stroke-width="2" fill="none"/>
  <!-- Knight -->
  <path d="M 250 450 L 300 350 L 350 450 L 325 500 L 275 500 Z" fill="#BDBDBD"/>
  <path d="M 325 500 L 325 550" stroke="#000" stroke-width="10" fill="none"/>
  <path d="M 275 500 L 275 550" stroke="#000" stroke-width="10" fill="none"/>
  <circle cx="312.5" cy="362.5" r="37.5" fill="#8BC34A"/>
  <rect x="290" y="375" width=

That's actually a rather good depiction of a dog's head, in my opinion.

OK. But can you prove that "outcome with infinite utility" is nonsense? If not - probability is greater than 0 and less than 1.


That's not how any of this works, and I've spent all the time responding that I'm willing to waste today.

You're literally making handwaving arguments, and replying to criticisms that the arguments don't support the conclusions by saying “But maybe an argument could be made! You haven't proven me wrong!” I'm not trying to prove you wrong, I'm saying there's nothing here that can be proven wrong.

I'm not interested in wrestling […]

Donatas Lučiūnas · 9mo

You're playing very fast and loose with infinities, and making arguments that have the appearance of being mathematically formal.

You can't just say “outcome with infinite utility” and then do math on it.  P(‹undefined term›) is undefined, and that “undefined” does not inherit the definition of probability that says “greater than 0 and less than 1”.  It may be false, it may be true, it may be unknowable, but it may also simply be nonsense!

And even if it wasn't, that does not remotely imply that an agent must-by-logical-necessity take any action or[…]

Donatas Lučiūnas · 9mo
OK. But can you prove that "outcome with infinite utility" is nonsense? If not - probability is greater than 0 and less than 1.

Do I understand correctly that you do not agree with "all actions which lead to that outcome will have to dominate the agent's behavior" from Pascal's Mugging? Could you provide arguments for that?

I mean "uncontrollable" in a sense that alignment is impossible. Whatever goal you will provide, AGI will converge to Power Seeking, because of "an outcome with infinite utility may exist".

I do not understand how this solves the problem. Do you think you can prove that "an outcome with infinite utility does not exist"? Please elaborate

"I -" said Hermione. "I don't agree with one single thing you just said, anywhere."

Donatas Lučiūnas · 9mo
Could you provide arguments for your position?

“However, through our current post-training process, the calibration is reduced.” jumped out at me too.

My guess is that RLHF is unwittingly training the model to lie.

Please don't break title norms to optimize for attention.

Retracted given that it turns out this wasn't a deliberate migration.


If it ends up being useful, the chapter-switcher PHP can be replaced with a slightly hacky JavaScript page that performs the same function, so the entire site can easily be completely static.
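To sketch what I mean (a hypothetical scheme; the real mirror's file layout and chapter count may differ): all the PHP switcher really does is map a chapter number to a page, which works fine client-side against static files.

```javascript
// Map a chapter number to a static chapter file, returning null for
// out-of-range input. File naming ("chapter-N.html") is an assumption
// for illustration, not the mirror's actual layout.
function chapterUrl(n, maxChapter) {
  if (!Number.isInteger(n) || n < 1 || n > maxChapter) return null;
  return "chapter-" + n + ".html";
}

chapterUrl(17, 122);  // → "chapter-17.html"
chapterUrl(999, 122); // → null (out of range)
```

A static page would then wire the chapter dropdown to it with something like `select.onchange = () => location.assign(chapterUrl(+select.value, 122))`, no server-side code required.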

First line was “ is an authorized, ad-free mirror of Eliezer Yudkowsky’s epic Harry Potter fanfic, Harry Potter and the Methods of Rationality (originally under the pen name Less Wrong).”, and in the footer: “This mirror is a project of Communications from Elsewhere.”  The “Privacy and Terms” page made extensive reference to MIRI though: “Machine Intelligence Research Institute, Inc. (“MIRI”) has adopted this Privacy Policy (“Privacy Policy”) to provide you, the user of (the “Website”)”

Disagree with the “extremely emphatically” emphasis.  Yes, it's not as good, but it more satisfyingly scratched the “what happened in the end” itch, much more than the half-dozen other continuations I've read.

Dragging up some old commentary.

Originally written in response to a request for critique:

SD's biggest problem as an HPMOR sequel (in my opinion) was that it simply wasn't in the same genre. Like, it didn't have complex tangles that the reader was meant to be able to unravel, or rigorously defined rules that the reader was meant to game, along with the characters. It didn't "use" rationality such that the clearest thinkers would come out on top specifically because of their clear thinking, and it didn't provide object lessons that were any more specific tha[…]

I've pulled down a static mirror and modified a couple pieces which depended on server-side implementation to use a javascript version, most notably the chapter dropdown. In the unlikely case it's useful, ping me.

Which is another gripe: the old site prominently linked the epub/mobi/pdf versions, while the trashed version makes no reference to their existence anymore.

[This comment is no longer endorsed by its author]
Retracted given that it turns out this wasn't a deliberate migration.

Ugh. Why does everyone need to replace nice static web pages with junk that wants to perform a couple http requests every time the window changes?


[This comment is no longer endorsed by its author]
Retracted given that it turns out this wasn't a deliberate migration.

Yes! It's just that the feel of the two websites is so different. And part of it may be my imagination. But it feels like the old HPMOR site is a simple elegant wrapper around the book, while on here the book is dumped into a website that wasn't made for it. Like the difference between a person wearing clothes, and someone inside of a giant human-shaped suit that mimicked their motions.

The normal explanation is "measurement and monetization".  It's very hard to exploit a reader's attention if you simply deliver the text.  I don't want to believe that's the reason for this - more likely it's simply more expedient (cheaper) to have it on LW than its own site.

Are you arguing that you couldn't implement appropriate feedback mechanisms via the stimulation of truncated nerve endings in an amputated limb?

Fergus Fettes · 10mo
Not exactly. I suppose you could do so. Do you really think it is not acceptable to assume that LLMs don't implement any analogues for that kind of thing though? Maybe the broader point is that there are many things an embodied organism is doing, and using language is only occasionally one of them. It seems safe to assume that an LLM that is specialized on one thing would not spontaneously implement analogues of all the other things that embodied organisms are doing. Or do you think that is wrong? Do you, eg., think that an LLM would have to develop simulators for things like the gut in order to do its job better, is that what you are implying? Or am I totally misunderstanding you?

Actually, no, not “sorta”, it very much reminds me of gruntle.

What writing on the internet could have been.

Sorta reminds me of the old jwz gruntle (predating modern blogging).

I'd link directly, but he does things with referrers sometimes, and I don't want to risk it.


So much of what we call suffering is physiological, and even when the cause is purely intellectual (eg. in love)-- the body is the vehicle through which it manifests, in the gut, in migraine, tension, Parkinsonism etc. Without access to this, it is hard to imagine LLMs as suffering in any similar way to humans or animals.

I feel like this is a “submarines can't swim” confusion:  chemicals, hormones, muscle, none of these things are the qualia of emotions.

I think a good argument can be made that e.g. chatgpt doesn't implement an analog of the processing[…]

Fergus Fettes · 10mo
Surely it is almost the definition of 'embodied cognition' that the qualia you mention are fundamentally dependent on loopback with eg. muscles, the gut. Not only sustained in that way but most everything about their texture, their intricate characteristics. To my mind it isn't that chatgpt doesn't implement an analog of the processing, but it doesn't implement an analog of the cause. And it seems like a lot of suffering is sustained in this embodied cognition/biofeedback manner.

Ah, does lesswrong have automatic bullet numbering in the preview? Because a lot of the first entries are off by one.

7,7, ‹I cut it off at this point›

The first entries were all incremented (as well as "6,7,8\n7" being trimmed out).

I also mangled (and now fixed) “The rule was "Any three numbers between 1 and 5, in any order". Make sense?” from the transcript, as the transcript (and indeed the rule) was between 0 and 5.


That error was introduced when I copied it from the session.  Not sure how I managed that, but I've checked the original transcript, and it correctly says “Of those example, these match the rule I'm thinking of: 3,2,1; 2,3,1; 1,1,1; 3,3,3; 0,1,2; 2,1,0; 4,3,2; The remaining examples do not.”

I've fixed the post.

An ambulance driver once explained why I beat him to the hospital: have you ever taken a ride in the back of a truck?  It's a very bumpy ride.

I assume “If you've somehow figured out how to do a pivotal act” is intended to limit scope, but doesn't that smuggle the hardness of the Hard Task™ out of the equation?

Every time I ask myself how this approach would address a given issue, I find myself having to defer to the definition of the pivotal act, which is the thing that's been defined as out of scope.

Quintin Pope · 1y
You need at least a certain amount of transfer in order to actually do your pivotal act. An "AI" with literally zero transfer is just a lookup table. The point of this principle is that you want as little transfer as possible while still managing a pivotal act. I used a theorem proving AI as an example where it's really easy to see what would count as unnecessary transfer. But even with something whose pivotal act would require a lot more transfer than a theorem prover (say, a nanosystem builder AI), you'd still want to avoid transfer to domains such as deceiving humans or training other AIs.

By which I mean to imply:

How much of the problem is mistaking the act of providing input to a deterministic system for the act of providing information to an agent with discretion?  Or (in less-absolute terms) making an error regarding the amount of discretion available to that agent.

Great point! I think that doctors, in most places, almost never can physically coerce patients to receive some treatment. I would be worried, even for myself, by regular social pressure tho.

“The system doesn't know how to stop” --HPMoR!Harry


The more things change…

They had almost precisely his reaction:

"Exactly," Compton said, and with that gravity! "It would be the ultimate catastrophe. Better to accept the slavery of the Nazis than to run the chance of drawing the final curtain on mankind!"

When Teller informed some of his colleagues of this possibility, he was greeted with both skepticism and fear. Hans Bethe immediately dismissed the idea, but according to author Pearl Buck, Nobel Prize-winning physicist Arthur Compton was so concerned that

[…]

“I notice a small grin ripple across my face. I ask, ‘Who’s on First?’”

Forget the sensory sensitivity, the meltdowns, the feeling of knowing where each and every hair follicle is being bent the wrong way by an article of clothing: the inappropriate involuntary grin is the feeling I can't stand the most in this world.

party girl · 1y
hear hear

I notice I'm confused… in precisely the same way I'm confused when I go out.

party girl · 2y
it was a confusing world

But… Firefly!  Season 2!  It's not all about the lantern jaw…

It's basically what text looks like when I dream.

I suppose I could be satisfied with an enterprise-d from the 2020 remake of sttng :D

Swimmer963 (Miranda Dixon-Luinenburg) · 2y
"The Enterprise-D in space, screenshot from 2020 Star trek the next generation reboot" here you go.