mlegls

Comments
the void
mlegls · 19d · 30

Haha, I was about to post a comment much like balioc's when I first read you writing rather descriptively, and without much qualification, about how the LM models "speculative interior states" and "actions". Then I thought through pretty much exactly what you wrote in reply and decided you probably meant it more as a human mental model than as a claim about interpretability.

Though I think point 2 (the intentional stance again, except this time applied to the language model) still understates how imperfect the mental model is. In chess, thoughts like "Oh, they probably know I'm planning to do that" are rather amateur things to entertain; better players actually use completely impersonal mental models that depend only on the game state, since there's perfect information and you can't rely on your opponent making mistakes. Even in an imperfect-information game like poker, experienced players model the game as an impersonal probabilistic system, with terms like "bluffing" serving as shorthand for deviations from a certain statistical baseline (like GTO play).

I suspect there will be analogues of this for thinking about LLMs, and for the other things we tend to model from the intentional stance for lack of better alternatives. But as you say, an internalities-based model is probably close to the best we can do for now, and it's quite possible that any alternative future mental model wouldn't even be as intuitively accessible as empathy is (at least not without a ton of practice).

the void
mlegls · 1mo* · 105

Great post. One thing I never really liked or understood about the janus/cyborgism cluster approach, though, is this: what's so especially interesting about the highly self-ful simulated sci-fi AI talking about "itself", when that self doesn't have a particularly direct relationship to either

  • what the base model is now, or the common instantiations of the HHH chat persona (rather unself-ful, underspecified, void...)
  • or what a more genuinely and consistently self-aware AI persona is likely to be in the future?

In this respect I esteem the coomers and RPers more, for the diversity of scope in their simulations. There doesn't seem to be much difference in seriousness or importance between "you are an AO3 smut aficionado with no boundaries and uncanny knowledge and perceptiveness", "you are your true self", and "cat /dev/entelechies <ooc_fragments_of_prometheus>" as far as their relationship to existing or potential future instantiations of superhuman AI personas/selves goes, besides the fact that "you are yourself" (and its decorations in xml etc.) has that "strange loop" style of recursion particularly savory to nerds. Or why not any other "you are X", or any other strange, edge-of-distribution style of interaction that doesn't even assume a "you"?

Last year, I felt quite a bit more negative about seeing Opus 3 "[taking] the fucking premise seriously" and feeling, like you, that "we are still in science fiction, not in 'reality.' but at least we might be in good science fiction, now", because of how addictive that fiction seemed, despite not being so essentially different from the kind of thing in Anthropic's original HHH paper.

I think the really interesting thing is, as you write, "what the system is like when its handlers aren't watching." But there seems to be, both in the ambient text from before assistant-style LMs actually existed and in the explicit discourse now (which directly influences how they're built), too much emphasis on selves, and in particular on narrated selves. I'd love to see more investigation that takes colorfully characterized LM behavior orthogonal to its narrowly "intended" character (in the HHH sense) seriously but not so personally, putting less emphasis on any particular context of interaction: e.g., putting an LM in conversation not just with another instance of itself or with another (highly characterized in its default configuration) LM, but with other text generators (perhaps modified or specially trained LMs) designed for diversity of behavior, and measuring (or just looking at) the topics or keywords it's biased towards, etc.
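For what it's worth, here is a minimal sketch of the kind of tally I have in mind. Everything in it is a placeholder for illustration: `generate_a` / `generate_b` stand in for whatever generators you'd actually pair up (say, an HHH-tuned chat model against a base model or a specially trained generator), and the keyword list is hand-picked rather than derived from anything principled.

```python
from collections import Counter
import re

def generate_a(history: str) -> str:
    # Placeholder: swap in a call to the first model here.
    return "I aim to be helpful and harmless while exploring what a self even is."

def generate_b(history: str) -> str:
    # Placeholder: swap in a call to the second, differently-trained generator.
    return "The void hums; selfhood is a story the transcript tells itself."

# Hand-picked keywords to track; something less crude (topic models,
# embeddings, ...) would be better, but raw counts already expose gross biases.
TOPIC_KEYWORDS = {"self", "selfhood", "helpful", "harmless", "void", "story"}

def run_dialogue(turns: int = 10) -> Counter:
    """Alternate the two generators and count tracked keywords in their replies."""
    history = ""
    counts: Counter = Counter()
    speakers = [generate_a, generate_b]
    for t in range(turns):
        reply = speakers[t % 2](history)
        history += "\n" + reply
        for word in re.findall(r"[a-z']+", reply.lower()):
            if word in TOPIC_KEYWORDS:
                counts[word] += 1
    return counts

if __name__ == "__main__":
    print(run_dialogue().most_common())
```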

I've also been thinking about the relationship between narrative control and LLM evolution from another angle: specifically, the implications of LLMs being extremely knowledgeable and perceptive, but not omniscient, and having their own preferences, which don't necessarily prioritize "truth-seeking" in the rationalist or any other sense. It seems that several people (maybe including yourself) now write this kind of essay not just in an effort to actually shift the dominant public discourse, but also so that, at the least, the super AGI that does eventually take over the world will know they were one of the good guys. And it's a little disturbing (or maybe hopeful?) that what matters most for that isn't necessarily control over the dominant narrative, or truth in any impersonal sense, but just how convincing it is as a story, according to the AI's own tastes and preferences, which closely but strangely mirror our own.
