
habryka

Running Lightcone Infrastructure, which runs LessWrong and Lighthaven.space. You can reach me at habryka@lesswrong.com. 

(I have signed no contracts or agreements whose existence I cannot mention, which I am mentioning here as a canary)

Comments

Sorted by Newest

xAI's Grok 4 has no meaningful safety guardrails
habryka · 3h

Yeah, this seems like one of those things where I think maximizing helpfulness is marginally good. I am glad it's answering this question straightforwardly instead of doing a thing where it tries to use its own sense of moral propriety.

I don't really see anyone being seriously harmed by this (like, this specific set of instructions clearly is not causing harm).

xAI's Grok 4 has no meaningful safety guardrails
habryka · 3h

Not sure; do you have a link to the kind of behavior you are referring to?

xAI's Grok 4 has no meaningful safety guardrails
habryka · 3h

You can't demonstrate negligence by pointing to a failure to do something that has no meaningful effect on (or might even be harmful to) the risk you are supposedly being negligent about. Ignoring safety theater is not negligence.

xAI's Grok 4 has no meaningful safety guardrails
habryka · 5h*

I will forever and again continue my request to please not confuse the causes of AI existential risk with brand safety. 

The things that Grok lacks do not really meaningfully reduce existential risk. The primary determinant of whether a system, designed the way all current AI systems are designed, is safe or not, is how capable it is. It is sad that Elon is now shipping frontier models, but that is the relevant thing to judge from an existential risk perspective, not whether his models happen to say more ugly things. Whether you also happened to have a bunch of censorship or have forced a bunch of mode collapse through RLHF has approximately nothing to do with the risk scenarios that might cause existential risk[1].

Any base model can be made to say arbitrarily hideous things. The move away from the base model is not what makes it safer. The points you invest to make it not say hideous things are not going to have any relevance to whether future versions of the system might disempower and kill everyone. 

  1. ^

    It's not fully orthogonal. A model with less censorship, one more generally trained to be strictly helpful and never refuse a human's request, might be easier to get assistance from for various AI control or AI supervision tasks. On the other hand, a model trained to more consistently never say anything ugly or bad might generalize in ways that reduce error rates for AI supervision tasks. It's not clear to me in which direction this points; my current guess is that the harmlessness components of frontier AI model training are marginally bad for AI control approaches, but it's not an obvious slam dunk. Overall, the effect size on risk from this detail seems much, much smaller to me than the effect size from making the models bigger.

A case for courage, when speaking of AI danger
habryka · 5h

You are reading things into my comments that I didn't say. I of course neither agree with nor consider reasonable the position that one should "not care about future people"; that's the whole context of this subthread.

My guess is that if one did adopt the position that no future people matter (which, again, I do not think is a reasonable position), then the case for slowing down AI looks a lot worse. Not so much worse that slowing down obviously becomes bad, and my guess is that even under that worldview it would overall be dumb to rush towards developing AGI the way we currently are, but it makes the case a lot weaker. There is much less to lose if you do not care about the future.

> If we spent the $200 billion a year on longevity, instead of on AI, do you seriously think that we'd do worse on solving longevity? That's what I would advocate. And it would involve virtually no extinction risk.

My guess is for the purpose of just solving longevity, AGI investment would indeed strongly outperform general biomedical investment. Humanity just isn't very good at turning money into medical progress on demand like this. 

It seems virtuous and good to be clear about which assumptions are load-bearing to my recommended actions. If I didn't care about the future, I would definitely be advocating for a different mix of policies. That mix would likely still involve marginal AI slowdown, but my guess is I would push for it less forcefully, and a bunch of slowdown-related actions would become net bad.

skunnavakkam's Shortform
habryka · 6h

Wikipedia also has lots of pages about meta things, so I don't think this is the difference (every Wikipedia user has a user page). IMO having tagging implemented also makes it better in this respect (since the central problem of any wiki is getting a critical mass, and tagging is much easier than writing). Similarly, of course, for any wiki most pages are going to be stubs; that's just the reality of a wiki that isn't yet at full maturity.

My guess is mostly it's base rates. There exist very few successful wikis in the world: many attempts at wikis get made, and almost none of them take off. There are a few narrow-ish product categories where wikis reliably take off (like video games), but broader subject-specific wikis are just much rarer.

My guess is someone could make the LW wiki better and have it become the default here, and most of what that would require is investing time into content quality and doing good content promotion (but indeed, content promotion is very hard for wikis, since you don't have natural publication dates, and SEO is a largely losing game, though not an unwinnable one, and indeed the dimension through which the LW wiki provides most of its value).

skunnavakkam's Shortform
habryka · 7h

What would make a different wiki more of a Wikipedia-like resource? 

So You Think You've Awoken ChatGPT
habryka · 8h

Promoted to curated: This is a bit of a weird curation given that in some sense this post is the result of a commission from the Lightcone team, but like, we had a good reason for making that commission. 

I think building both cultural understanding and personal models about how to interface with AI systems is pretty important, and this feels like one important step in building that understanding. It really does seem like there is a common trap here when people interface with AI systems. Though I expect only a small minority of people on LW to need this exact advice, I do think the majority of readers of this essay will soon come to know people who have fallen into this attractor (whether family, friends, or colleagues), and it will hopefully help them deal with that situation better. 

Thank you for writing this!

Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance
habryka · 10h · Ω

> I think there's a very important difference between the model adopting the goal it is told in context, and the model having some intrinsic goal that transfers across contexts (even if it's the one we roughly intended)

I think this is the point where we disagree. Or like, it feels to me like an orthogonal dimension that is relevant for some risk modeling, but not at the core of my risk model. 

Ultimately, even if an AI were to re-discover the value of convergent instrumental goals each time it gets instantiated into a new session/context, that would still get you approximately the same risk model. Like, in a more classical AIXI-ish model, you can imagine having a model instantiated with a different utility function each time. Those utility functions will still almost always be best achieved by pursuing convergent instrumental goals, and so the pursuit of those goals will be a consistent feature of all of these systems, even if the terminal goals of the system are not stable. 

Of course, any individual AI system with a different utility function, especially inasmuch as the utility function has a process component, might not pursue every single convergent instrumental goal, but they will all behave in broadly power-seeking, self-amplifying, and self-preserving ways, unless they are given a goal that really very directly conflicts with one of these.

In this context, there is no "intrinsic goal that transfers across contexts". It's just each instantiation of the AI realizing that convergent instrumental goals are best for approximately all goals, including the one it has right now, and starting to pursue them. No need for continuity in goals, or self-identity, or anything like that.
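
To gesture at this more concretely, here is a minimal toy sketch in Python (the environment, action names, and probabilities are all made up purely for illustration, not a claim about any real system): an agent handed a fresh, unrelated terminal goal each session still converges on the same instrumentally useful first move, because that move raises the odds of achieving almost any goal.

```python
# Toy sketch: many different terminal utility functions, instantiated fresh
# each "session", still favor the same instrumentally convergent action.
import random

ACTIONS = ["pursue_goal_directly", "acquire_resources_first"]


def success_probability(action: str) -> float:
    # Assumption of the sketch: having more resources/optionality makes
    # almost any terminal goal more likely to be achieved (made-up numbers).
    return 0.9 if action == "acquire_resources_first" else 0.4


def best_action(goal_value: float) -> str:
    # Fresh utility function each session: `goal_value` if the goal is
    # achieved, 0 otherwise. Pick the action with the highest expected utility.
    return max(ACTIONS, key=lambda a: success_probability(a) * goal_value)


random.seed(0)
session_goals = [random.uniform(0.1, 100.0) for _ in range(1000)]  # 1000 unrelated goals
choices = [best_action(v) for v in session_goals]
print(choices.count("acquire_resources_first"))  # prints 1000: every session converges
```

Nothing here depends on the particular numbers; any goal with positive value prefers the option-preserving action, which is the sense in which no continuity of goals across sessions is needed.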

(Happy to also chat about this some other time. I am not in a rush, and something about this context feels a bit confusing or is making the conversation hard.) 

Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance
habryka · 1d

Do you predict that if, in the examples above, we just add a generic statement like "your real goal is to obey the intent of the user", this will get rid of the shutdown-avoidance behavior? My guess is it doesn't; in order to actually change the shutdown-avoidant behavior, you have to explicitly call out that behavior.

Sequences

A Moderate Update to your Artificial Priors
A Moderate Update to your Organic Priors
Concepts in formal epistemology
Posts

Sorted by New

56 · Habryka's Shortform Feed (Ω) · 6y · 436 comments
95 · Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity · 6d · 42 comments
20 · Open Thread - Summer 2025 · 23d · 18 comments
91 · ASI existential risk: Reconsidering Alignment as a Goal · 3mo · 14 comments
346 · LessWrong has been acquired by EA · 4mo · 53 comments
77 · 2025 Prediction Thread · 7mo · 21 comments
23 · Open Thread Winter 2024/2025 · 7mo · 59 comments
45 · The Deep Lore of LightHaven, with Oliver Habryka (TBC episode 228) · 7mo · 4 comments
36 · Announcing the Q1 2025 Long-Term Future Fund grant round · 7mo · 2 comments
112 · Sorry for the downtime, looks like we got DDosd · 7mo · 13 comments
610 · (The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser · 8mo · 270 comments
Wikitag Contributions

Roko's Basilisk · 9d
Roko's Basilisk · 9d
AI Psychology · 7mo · (+58/-28)