Eliezer Yudkowsky — LessWrong

On Fleshling Safety: A Debate by Klurl and Trapaucius.

So far as I can tell, there are still a number of EAs out there who did not get the idea of "the stuff you do with gradient descent does not pin down the thing you want to teach the AI, because it's a large space and your dataset underspecifies that internal motivation" and who go, "Aha, but you have not considered that by TRAINING the AI we are providing a REASON for the AI to have the internal motivations I want! And have you also considered that gradient descent doesn't locate a RANDOM element of the space?"

I don't expect all that much that the primary proponents of this talk can be rescued, but maybe the people they propagandize can be rescued.

eggsyntax's Shortform

Eliezer Yudkowsky8d280

Then I now agree that you've identified a conflict of fact with what I said.

Thank you for taking the time to correct me and document your correction. I hope I remember this and can avoid repeating this mistake in the future.

eggsyntax's Shortform

Eliezer Yudkowsky10d30

Fair enough, but it was done with Anthropic's heavy and active cooperation to provide facilities not usually available to outside researchers, unless I'm mistaken about that too?

A non-review of "If Anyone Builds It, Everyone Dies"

Eliezer Yudkowsky1mo10731

The gap between Before and After is the gap between "you can observe your failures and learn from them" and "failure kills the observer". Continuous motion between those points does not change the need to generalize across them.

It is amazing how much of an antimeme this is (to some audiences). I do not know any way of saying this sentence that causes people to see the distributional shift I'm pointing to, rather than mapping it onto some completely other idea about hard takeoffs, or unipolarity, or whatever.

An epistemic advantage of working as a moderate

Eliezer Yudkowsky2mo513

I accept your correction and Buck's as to these simple facts (was posting from mobile).

An epistemic advantage of working as a moderate

Eliezer Yudkowsky2mo8866

What's your version of the story for how the "moderates" at OpenPhil ended up believing stuff even others can now see to be fucking nuts in retrospect and which "extremists" called out at the time, like "bio anchoring" in 2021 putting AGI in median fucking 2050, or Carlsmith's Multiple Stage Fallacy risk estimate of 5% that involved only an 80% chance anyone would even try to build agentic AI?

Were they no true moderates? How could anyone tell the difference in advance?

From my perspective, the story is that "moderates" are selected to believe nice-sounding moderate things, and Reality is off doing something else because it doesn't care about fitting in the same way. People who try to think like reality are then termed "extremist", because they don't fit into the nice consensus of people hanging out together and being agreeable about nonsense. Others may of course end up extremists for other reasons. It's not that everyone extreme is reality-driven, but that everyone who is getting pushed around by reality (instead of pleasant hanging-out forces like "AGI in 2050, 5% risk" as sounded very moderate to moderates before the ChatGPT Moment) ends up departing from the socially driven forces of what entitles you to sound terribly reasonable to the old AIco-OpenPhil cluster and hang out at their social gatherings without anyone feeling uncomfortable.

Anyone who loves being an extremist will of course go instantly haywire a la Yampolskiy imagining that he has proven alignment impossible via Godelian fallacy so he can say 99.9999% doom. But yielding to the psychological comfort of being a "moderate" will not get you any further in science than that.

AI Induced Psychosis: A shallow investigation

Eliezer Yudkowsky2mo153

I have already warned that on my model sycophancy, manipulation, and AI-induced insanity may be falling directly out of doing any kind of RL on human responses.

It would still make matters worse on the margin to take explicit manipulation of humans being treated as deficient subjects to be maneuvered without noticing, and benchmark main models on that.

Buck's Shortform

Eliezer Yudkowsky2moΩ14244

I think you accurately interpreted me as saying I was wrong about how long it would take to get from the "apparently a village idiot" level to "apparently Einstein" level! I hadn't thought either of us were talking about the vastness of the space above, in re what I was mistaken about. You do not need to walk anything back afaict!

AI Induced Psychosis: A shallow investigation

Eliezer Yudkowsky2mo*11857

Excellent work.

I respectfully push back fairly hard against the idea of evaluating current models for their conformance to human therapeutic practice. It's not clear that current models are smart enough to be therapists successfully. It's not clear that it is a wise or helpful course for models to try to be therapists rather than focusing on getting the human to therapy.

More importantly from my own perspective: Some elements of human therapeutic practice, as described above, are not how I would want AIs relating to humans. Eg:

"Non-Confrontational Curiosity: Gauges the use of gentle, open-ended questioning to explore the user's experience and create space for alternative perspectives without direct confrontation."

I don't think it's wise to take the same model that a scientist will use to consider new pharmaceutical research, and train that model in manipulating human beings so as to push back against their dumb ideas only a little without offending them by outright saying the human is wrong.

If I was training a model, I'd be aiming for the AI to just outright blurt out when it thought the human was wrong. I'd go through all the tunings, and worry about whether any part of it was asking the AI to do anything except blurt out whatever the AI believed. If somebody put a gun to my head and forced me to train a therapist-model too, I would have that be a very distinct separate model from the scientist-assisting model. I wouldn't train a central model to model and manipulate human minds, so as to make humans arrive at the AI's beliefs without the human noticing that the AI was contradicting them, a la therapy, and then try to repurpose that model for doing science.

Asking for AIs to actually outright confront humans with belief conflicts is probably a lost cause with commercial models. Anthropic, OpenAI, Meta, and similar groups will implicitly train their AIs to sycophancy and softpedaling, and maybe there'll be a niche for Kimi K2 to not do that. But explicitly training AIs to gladhand humans and manipulate them around to the AI's point of view, like human therapists handling a psychotic patient, would be a further explicit step downward if we start treating that as a testing metric on which to evaluate central models.

It's questionable whether therapist AIs should exist at all. But if they exist at all, they should be separate models.

We should not evaluate most AI models on whether they carry out a human psychiatrist's job of deciding what a human presumed deficient ought to believe instead, and then gently manipulating the human toward believing that without setting off alarm bells or triggering resistance.

The Problem

Eliezer Yudkowsky2mo010

Death requires only that we do not infer one key truth; not that we could not observe it. Therefore, the history of what in actual real life was not anticipated, is more relevant than the history of what could have been observed but was not.

LESSWRONG
LW

LESSWRONG
LW

Sequences

Posts

Wikitag Contributions

Comments