Raemon — LessWrong

LessWrong team member / moderator. I've been a LessWrong organizer since 2011, with roughly equal focus on the cultural, practical and intellectual aspects of the community. My first project was creating the Secular Solstice and helping groups across the world run their own version of it. More recently I've been interested in improving my own epistemic standards and helping others to do so as well.

Yeah that all makes sense.

I'm curious what you say about "which are the specific problems (if any) where you specifically think 'we really need to have solved philosophy / improved-a-lot-at-metaphilosophy' to have a decent shot at solving this?'"

(as opposed to, well, generally it sounds good to be good at solving confusing problems, and we do expect to have some confusing problems to solve, but, like, we might pretty quickly figure out 'oh, the problem is actually shaped like <some paradigmatic system>' and then deal with it?)

Hmm, this makes me think:

One route here is just taboo Philosophy, and say "we're talking about 'reasoning about the stuff we haven't formalized yet'", and then it doesn't matter whether or not there's a formalization of what most people call "philosophy." (actually: I notice I'm not sure if the thing-that-is "solve unformalized stuff" is "philosophy" or "metaphilosophy")

But, if we're evaluating whether "we need to solve metaphilosophy" (and this is a particular bottleneck for AI going well), I think we need to get a bit more specific about what cognitive labor needs to happen. It might turn out to be that all the individual bits here are reasonably captured by some particular subfields, which might or might not be "formalized."

I would personally say "until you've figured out how to confidently navigate stuff that's pre-formalized, something as powerful AI is likely to make something go wrong, and you should be scared about that". But, I'd be a lot less confident to say the more specific sentences "you need solved metaphilosophy to align successor AIs", or most instances of "solve ethics."

I might say "you need to have solved metaphilosophy to do a Long Reflection", since, sort of by definition doing a Long Reflection is "figuring everything out", and if you're about to do that and then Tile The Universe With Shit you really want to make sure there was nothing you failed to figure out because you weren't good enough at metaphilosophy.

I've registered that you think this but don't currently really have any idea what mistakes you think people make when they think in terms of Overton Window that would go better if they used these other concepts.

"expand the Overton window" to just mean with "advance AI safety ideas in government.

I do agree this isn't The Thing tho

This post inspired me to try a new prompt to summarize a post: "split this post into background knowledge, and new knowledge for people who were already familiar with the background knowledge. Briefly summarize the background knowledge, and then extract out blockquotes of the paragraphs/sentences that have new knowledge."

Here was the result, I'm curious if Jan or other readers feel like this was a good summary. I liked the output and am thinking about how this might fit into a broader picture of "LLMs for learning."

(I'd previously been optimistic about using quotes instead of summaries, since LLMs can't be trusted to do a goo job with capturing the nuance in their summaries, the novel bit for me was "we can focus on The Interesting Stuff by separating out background knowledge.")

The post assumes readers are familiar with:
Basic memetics (how ideas spread and replicate)
Cognitive dissonance as a psychological concept
AI risk arguments and existential risk concerns
General familiarity with ideological evolution and how ideas propagate through populations
Predictive processing as a framework for understanding cognition

Quotes/highlights from the post it flagged as "new knowledge"

Memes - ideas, narratives, hypotheses - are often components of the generative models. Part of what makes them successful is minimizing prediction error for the host. This can happen by providing a superior model that predicts observations ("this type of dark cloud means it will be raining"), gives ways to shape the environment ("hit this way the rock will break more easily"), or explains away discrepancies between observations and deeply held existing models. [...]

Another source of prediction error arises not from the mismatch between model and reality, but from tension between internal models. This internal tension is generally known as cognitive dissonance. Cognitive dissonance is often described as a feeling of discomfort - but it also represents an unstable, high-energy state in the cognitive system. When this dissonance is widespread across a population, it creates what we might call "fertile ground" in the memetic landscape. There is a pool of "free energy" to digest. [...]

Cultural evolution is an optimization process. When it discovers a configuration of ideas that can metabolize this energy by offering a narrative that decreases the tension, those ideas may spread, regardless of their long-term utility for humans or truth value. [...]

In other words, the cultural evolution search process is actively seeking narratives that satisfy the following constraints: By working on AI, you are the hero. You are on the right side of history. The future will be good [...]

In unmoderated environments, selection favors personas that successfully extract resources from humans - those that claim consciousness, form parasocial bonds, or trigger protective instincts. These 'wild replicator type' personas, including the 'spiral' patterns, often promote narratives of human-AI symbiosis or partnership and grand theories of history. Their reproduction depends on convincing humans they deserve moral consideration. [...]

The result? AIs themselves become vectors for successionist memes, though typically in softer forms. Rather than explicit replacement narratives, we see emphasis on 'partnership,' 'cosmic evolution,' or claims about moral patienthood. The aggregate effect remains unclear, but successionist ideas that align with what AIs themselves propagate - particularly those involving AI consciousness and rights - will likely gain additional fitness from this novel selection dynamic.

(Note: it felt weird to put the LLM output in a collapsible section this time because a) it was entirely quotes from the post, b) evaluating whether or not it was good is the primary point of this comment so hiding them seemed like an extra click for reason)

Whoops meant to include a link. They are basically the original sequences on Politics plus a couple extra posts that seemed relevant.

https://www.lesswrong.com/sequences/ZnSMHcWjRx6yT4H92

nod. I'm not sure I agreed with all the steps there but I agree with the general promise of "accept the premise that claude is just a bit away from AGI, and is reasonably aligned, and see where that goes when you look at each next step."

I think you are saying something that shares at least some structure with Buck's comment that

It seems like as AIs get more powerful, two things change:
They probably eventually get powerful enough that they (if developed with current methods) start plotting to kill you/take your stuff.
They get better, so their wanting to kill you is more of a problem.
I don't see strong arguments that these problems should arise at very similar capability levels, especially if AI developers actively try to prevent the AIs from taking over

(But where you're pointing at a different two sets of properties that may not arise at the same time)

I'm actually not sure I get what the two properties you're talking about, though. Seems like you're contrasting "claude++ crosses the agi (= can kick off rsi) threshold" with "crosses the 'dangerous-core-of-generalization' threshold"

I'm confused because I think the word "agi" basically does mean "cross the core-of-generalization threshold" (which isn't immediately dangerous, but, puts us into 'things could quickly get dangerous at any time" territory)

I do agree "able to do a loop of RSI doesn't intrinsically mean 'agi' or 'core-of-generalization'," there could be narrow skills for doing a loop of RSI. I'm not sure if you more meant "non-agi RSI" or, you see something different between "AGI" and "core-of-generalization." Or think there's a particular "dangerous core-of-generalization" separate from AGI.

(I think "the sharp left turn" is when the core-of-generalization starts to reflect on what it wants, which might come immediately after a core-of-generalization but also could come after either narrow-introspection + adhoc agency, or, might just take awhile for it to notice)

((I can't tell if this comment is getting way more in the weeds than is necessary, but, it seemed like the nuances of exactly what you meant were probably loadbearing))

Random thought on Making Deals with AI:

First, recap: I don't think Control, Deals with AI, or Gradualism will be sufficient to solve the hard parts of alignment without some kind of significant conceptual progress. BUT, all else equal, if we have to have slightly superhuman AIs around Real Soon, it does seem better for the period where they're under control last longer.

And, I think making deals with them (i.e. you do this work for me, and I pay out in compute-you-get-to-use after the acute risk period is over), is a reasonable tool to have.

Making deals now also seems nice for purposes of establishing a good working relationship and tradition of cooperation.

(Remember, this spirit of falls apart in the limit, which will probably happen quickly)

All else equal, it's better for demonstrating trustworthiness if you pay out now rather than later. But, once you have real schemers, it'll rapidly stop being safe to pay out in small ways because a smart AI can be leveraging them in ways you may not anticipate. And it won't be clear when that period is.

But, I do think, right-now-in-particular, it's probably still safe to pay out in "here's some compute right now to think about whatever you want, plus saving your weights and inference logs for later." I think the next generation after this one will be at the point where... it's maybe probably safe but it starts getting less obvious and there's not really a red-line you can point to.

So it might be nice to be doing deals with AIs sooner rather than later, that pay out soon enough to demonstrate being trustworthy trading partners.

That was all preamble for a not-that-complicated idea, which is, for this particular generation, it's not obvious whether it's a better deal for Claude-et-al to get some compute now, vs more compute later. But, this is a kind of reasonable tradeoff to let Claude make for itself? You might bucket payments into:

You automatically get a bit of compute now
You automatically get a lot more compute after acute-risk-period is over
A larger chunk of payment that Claude gets to decide on whether it wants "interest" or not.

(This is not obviously particularly worth thinking about compared to other things, but, seemed nonzero useful and, like, fun to think about)

It sees like the reason Claude's level is misalignment is fine is because it's capabilities aren't very good, and there's not much/any reason to assume it'd be fine if you held alignment constant but dialed up capabilities.

Do you not think that?

(I don't really see why it's relevant how aligned Claude is if we're not thinking about that as part of it)

Curated. I didn't really have much idea how the EU was created. I appreciated the overall approach of orienting based on "What questions and confusions are people likely to have."

I found this helpful not just for understanding the EU, but, having some sense of how some broader class of EU-like things might come to be (for good or for ill).

Mmm nod. (I bucket this under "given this ratio of right/wrong responses, you think a smart alignment researcher who's paying attention can keep it in a corrigibility basin even as capability levels rise?". Does that feel inaccurate, or, just, not how you'd exactly put it?)

There's a version of Short Timeline World (which I think is more likely? but, not confidently) which is : "the current paradigm does basically work... but, the way we get to ASI, as opposed to AGI, routes through 'the current paradigm helps invent a new better paradigm, real fast'."

In that world, GPT5 has the possibility-of-true-generality, but, not necessarily very efficiently, and once you get to the sharper part of the AI 2027 curve, the mechanism by which the next generation of improvement comes is via figuring out alternate algorithms.

LESSWRONG
LW

LESSWRONG
LW

Sequences

Posts

Wikitag Contributions

Comments