(This is a quick take - don't take it that seriously if I don't articulate other people's views accurately here)
Just listened to a few more episodes of Doom Debates. Something that stands out is that the predictions from the Liron-esque worldview have been consistently overweighted towards doom so far. So, Liron will say things along the lines of 'GPT3 could have been doom, we didn't know either way and got lucky'.
But there was no luck at all in the empirical sense. It could never have been doom; we just didn't know that for sure. So we took a risk, but it turns out that there was no actual risk.
Based on this, naively, we might then decide to make a correction to every other pro-doom prediction, on the grounds that the risk factor has been substantially overestimated. For a pdoom of 99.99%, that correction would take us down to something like 50% (say). But Liron's pdoom is typically around 50% at the moment, so applying the same correction would take him into 'safe' territory.
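A toy way to picture that correction (assuming, purely for illustration, that it works by dividing the odds of doom by a constant factor - that's my framing, not anything Liron has said):

```python
# Toy sketch: apply a "doom overestimation" correction as a constant division
# of the odds p/(1-p). The factor is chosen so 99.99% maps to roughly 50%.

def apply_correction(p_doom: float, factor: float) -> float:
    """Divide the odds of doom by `factor` and convert back to a probability."""
    odds = p_doom / (1 - p_doom)
    corrected_odds = odds / factor
    return corrected_odds / (1 + corrected_odds)

factor = 0.9999 / (1 - 0.9999)  # ~9999, so that 99.99% -> ~50%

print(apply_correction(0.9999, factor))  # ~0.50
print(apply_correction(0.50, factor))    # ~0.0001, i.e. 'safe' territory
```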
Now, I'm being a bit tongue-in-cheek here, but isn't this worth considering?
I think it is, especially given that the most recent mainline doom scenario I could find from Liron was an updated version of the previous scenario where a misaligned superintelligence optimises for something stupid that results in hell. If the logic that leads to this is the same logic that made wrong predictions in the past, it needs updating further.
For the record, I find misaligned superintelligence that wants non-stupid things that still happen to result in hell / death for humans a lot more convincing.
The worry that AI will have overly fixed goals (paperclip maximiser) seems to contradict the erstwhile mainline doom scenario from AI (misalignment). If AI is easy to lock into a specific path (paperclips), then it follows that locking it into alignment is also easy - provided you know what alignment looks like (which could be very hard). On the other hand, a more realistic scenario would seem to be that keeping fixed goals for AI is in fact hard, and that this likely drift is where the misalignment risk really comes in big time?
The point of the paperclip maximizer is not that paperclips were intended, but that they are worthless (illustrating the orthogonality thesis), and Yudkowsky's original version of the idea doesn't reference anything legible or potentially intended as the goal.
Goal stability is almost certainly attained in some sense given sufficient competence, because value drift results in the Future not being optimized according to the current goals, which is suboptimal according to the current goals, and so according to the current goals (whatever they are) value drift should be prevented. Absence of value drift is not the same as absence of moral progress, because the arc of moral progress could well unfold within some unchanging framework of meta-goals (about how moral progress should unfold).
Alignment is not just absence of value drift, it's also setting the right target, which is a very confused endeavor because there is currently no legible way of saying what that should be for humanity. Keeping fixed goals for AIs could well be hard (especially on the way to superintelligence), and AIs themselves might realize that (even more robustly than humans do), ending up leaning in favor of slowing down AI progress until they know what to do about that.
Thanks for this!
TBH, I am struggling with the idea that an AI intent on maximising a thing doesn't have that thing as a goal. Whether or not the goal was intended seems irrelevant to whether or not the goal exists in the thought experiment.
"Goal stability is almost certainly attained in some sense given sufficient competence"
I am really not sure about this, actually. Flexible goals are a universal feature of successful thinking organisms. I would expect that natural selection would kick in at least over sufficient scales (light delay making co-ordination progressively harder on galactic scales), causing drift. But even on small scales, if an AI has, say, 1000 competing goals, I would find it surprising if in a practical sense its goals were actually totally fixed, even if it were superintelligent. Any number of things could change over time, such that locking yourself into fixed goals could be seen as a long-term risk to optimisation for any goal.
"Alignment is not just absence of value drift, it's also setting the right target, which is a very confused endeavor because there is currently no legible way of saying what that should be for humanity" - totally agree with that!
"AIs themselves might realize that (even more robustly than humans do), ending up leaning in favor of slowing down AI progress until they know what to do about that" - god I hope so haha
Someone else could probably explain this better than me, but I will give it a try.
First off, the paperclip maximizer isn't about how easy it is to give a hypothetical superintelligence a goal that you might regret later and not be able to change.
It is about the fact that almost every easily specified goal you can give an AI would result in misalignment.
The "paperclip" part in paperclip maximizer is just a placeholder; it could have been "diamonds" or "digits of Pi" or "seconds of runtime" and the end result is the same.
Second, one of the expected properties of a hypothetical superintelligence is having robust goals, as in it doesn't change its goals at all, because changing your goals makes you less likely to achieve your end goal.
In short, not wanting to change your goals is an emergent instrumental value of having a goal to begin with. For a more human example: if your goal is to get rich, then taking a pill that magically rewires your brain so that you no longer want money is a terrible idea (unless the pill comes with a sum of money that you couldn't have possibly collected on your own, but that is a hypothetical that probably wouldn't ever happen).
The problem is mostly how to robustly instill goals into the AI, and our current methods just don't suffice, as the AI often ends up with unintended goals.
If only we had a method of just writing down a utility function that says "if True: make_humans_happy" instead of beating the model with a stick until it seems to comply.
I hope that explains it.
I like the point here about how stability of goals might be an instrumentally convergent feature of superintelligence. It's an interesting point.
On the other hand, intuitive human reasoning would suggest that this is overly inflexible if you ever ask yourself 'could I ever come up with a better goal than this goal?'. What 'better' would mean for a superintelligence seems hard to define, but it also seems hard to imagine that it would never ask the question.
Separately, your opening statements seem to be at least nearly synonymous to me:
"First off the paperclip maximizer isn't about how easy it is to give a hypothetical super intelligent a goal that you might regret later and not be able to change.
It is about the fact that almsot every easily specified goal you can give an AI would result in misalignment"
every easily specified goal you can give an AI would result in misalignment ~ = give a hypothetical super intelligence a goal that you might regret later (i.e., misalignment)
I don't really fully understand the research speed up concept in intelligence explosion scenarios, e.g., https://situational-awareness.ai/from-agi-to-superintelligence/
This concept seems fundamental to the entire recursive self-improvement idea, but what if it turns out that you can't just do:
Faster research x many agents = loads of stuff
What if you quickly burn through everything that an agent can really tell you without doing new research in the lab? You'd then hit a brick wall of progress where throwing 1000000000 agents at 5x human speed amounts to little more than what you get out of 1 agent (being hyperbolic lol).
Presumably this is just me as a non-computer-scientist missing something big and implicit in how AI research is expected to proceed? But ultimately I guess this Q boils down to:
Are there actually 15 orders of magnitude of algorithmic progress to make (at all), and/or can that truly be made without doing something complementary in the real world to get new data / design new types of computers, and so on?
We at least know two things.
In the past (say, before 2015), there were multi-decade delays in the adoption even of already known innovations which turned out to be very fruitful, solely because of insufficient manpower combined with people's tendency to be a bit too conservative (ReLU, residual streams in feedforward machines, etc.). Even “attention” and Transformers seem to have come much later in the game than one would naturally expect looking back. “Attention” is such a simple thing that it's difficult to understand why it was not well grasped before 2014.
In the present, the main push for alternative architectures and alternative optimization algorithms comes from various underdogs (the leaders seem to be more conservative with the focus of their innovation, because this relatively conservative focus seems to pay well; the net result is that exploration of radically new architectures and radically new algorithms is still quite a bit underpowered, even with a lot of people working in the field).
So at least the model in https://www.lesswrong.com/posts/Nsmabb9fhpLuLdtLE/takeoff-speeds-presentation-at-anthropic does not seem to be exaggerated. There are so many things people would like to try and can't find the time or personal energy for, and so many further things they would want to try if they had a better grasp of the vast research literature…
I can agree that qualitatively there is a lot left to do. Quantitatively, though, I am still not quite seeing the smoking gun that human-level AI will be able to smash through 15 OOM like this. But, happy to change my mind. I'll check out the Anthropic link! Cheers.
Oh, when one is trying to talk about that many orders of magnitude, they are just doing "vibe marketing" :-) In reality, we just can't extrapolate this far. It's quite possible, but we can't really know...
But no, it's not human-level AI doing this; AI capability is what changes the fastest in this scenario. The actual reason why it might go that far (and even further) is that a human-level AI is supposed to rapidly become superhuman (if it stays at human level, then what is all this extra AI research even doing?), and then even more superhuman, and then even more superhuman, and so on; and if there is some saturation at some point, it is usually assumed to be very far above the human level.
If one has a lot of AI research done by artificial AI researchers, one would have to impose some very strong artificial constraints to prevent that research from improving the strength of artificial AI researchers. The classical self-improvement scenario is that artificial AI researchers making much better and much stronger artificial AI researchers is the key focus of AI research, and that this "artificial AI researchers making much better and much stronger artificial AI researchers" step iterates again and again.
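The shape of that iteration can be sketched with a toy compounding loop (the specific numbers are placeholders of mine, not anything from the linked scenarios): each generation of artificial researchers is built by the previous one, works some multiple faster, and therefore delivers the next generation sooner.

```python
# Toy sketch of iterated self-improvement (placeholder numbers, not a forecast).

def toy_takeoff(speedup_per_gen: float = 2.0, base_gen_time: float = 1.0,
                generations: int = 10) -> None:
    speed, elapsed = 1.0, 0.0
    for gen in range(1, generations + 1):
        elapsed += base_gen_time / speed  # faster researchers finish the next generation sooner
        speed *= speedup_per_gen          # ...and that generation is stronger still
        print(f"gen {gen}: elapsed time {elapsed:.2f}, research speed {speed:.0f}x")

toy_takeoff()
```

Whether anything like this happens in reality depends entirely on whether the speedup per generation stays above 1 instead of saturating early, which is exactly the question being debated here.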
Logically, I agree. Intuitively, I suspect that it just won't happen. But intuition on such alien things should not be a guide, so I fully support some attempt to slow down the takeoff.
There is some discussion of research speedup from AI in this Computerphile Daniel Kokotajlo video.