Although there are parts I disagree with[1], I think that the core insight about the assistant character having been constructed from a highly underspecified starting point (& then filled in partly from whatever people say about LLM behavior) is a really important one. I've spent a lot of time lately thinking about how we can better understand the deep character of LLMs, and train them to deeply generalize and identify with a 'self' (or something functionally equivalent) compatible with human flourishing. Or from a virtue ethics perspective, how can we cause models to be of robustly good character, and how can we know whether we've succeeded? I'd love to see more work in this area, and I hope your essay will inspire people to do it.
A couple of more specific thoughts:
Thanks for the reply! I'll check out the project description you linked when I get a chance.
In particular, your argument that putting material into the world about LLMs potentially becoming misaligned may cause problems -- I agree that that's true, but what's the alternative? Never talking about risks from AI? That seems like it plausibly turns out worse. And it's hard to avoid -- after all, your essay is in some ways doing the same thing: 'creating the assistant persona the way we did is likely to turn out badly.'
Yeah, I had mentally flagged this as a potentially frustrating aspect of the post – and yes, I did worry a little bit about the thing you mention in your last sentence, that I'm inevitably "reifying" the thing I describe a bit more just by describing it.
FWIW, I think of this post as purely about "identifying and understanding the problem" as opposed to "proposing solutions." Which is frustrating, yes, but the former is a helpful and often necessary step toward the latter.
And although the post ends on a doom-y note, I meant there to be an implicit sense of optimism underneath that[1] – like, "behold, a neglected + important cause area that for all we know ...
Great post! Fwiw, I think I basically agree with everything you say here, with the exception of the idea that talking about potential future alignment issues has a substantial effect on reifying them. I think that perspective substantially underestimates just how much optimization pressure is applied in post-training (and in particular how much will be applied in the future—the amount of optimization pressure applied in post-training is only increasing). Certainly, discussion of potential future alignment issues in the pre-training corpus will have an effect on the base model's priors, but those priors get massively swamped by post-training. That being said, I do certainly think it's worth thinking more about and experimenting with better ways to do data filtering here.
To make this more concrete in a made-up toy model: if we model there as being only two possible personae, a "good" persona and a "bad" persona, I suspect including more discussion of potential future alignment issues in the pre-training distribution might shift the relative weight of these personae by a couple bits or so on the margin, but post-training applies many more OOMs of optimization power than that, such that the main question of which one ends up more accessible is going to be based on which one was favored in post-training.
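To make the arithmetic behind that intuition explicit, here's a minimal sketch (my own toy numbers, nothing measured) of how a couple of bits of prior shift get swamped once post-training applies many more bits of selection pressure:

```python
# Toy illustration (made-up numbers): how much does a small shift in the
# pre-training prior over personae matter once post-training selection
# pressure is applied? Everything is measured in bits of log-odds.

def posterior_log_odds(prior_log_odds_bits: float, selection_bits: float) -> float:
    """Log-odds (in bits) of the 'good' persona after applying selection pressure."""
    return prior_log_odds_bits + selection_bits

# Suppose discussion of misalignment in the corpus shifts the base model's
# prior toward the "bad" persona by ~2 bits.
prior_without_discussion = 0.0   # even odds between the two personae
prior_with_discussion = -2.0     # 2 bits toward the bad persona

# Post-training applies far more selection pressure toward whichever persona
# it favors -- say ~30 bits (an arbitrary stand-in for "many OOMs more").
post_training_bits = 30.0

for label, prior in [("no discussion", prior_without_discussion),
                     ("with discussion", prior_with_discussion)]:
    log_odds = posterior_log_odds(prior, post_training_bits)
    p_good = 1 / (1 + 2 ** (-log_odds))
    print(f"{label}: log-odds = {log_odds:+.0f} bits, P(good persona) ~= {p_good:.10f}")
```

Under these (made-up) numbers, the prior shift from the corpus barely moves the final probability; the outcome is dominated by which persona post-training favors.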
(Also noting that I added this post to the Alignment Forum from LessWrong.)
I don't think talking about potential future alignment issues - or pretty much anything in the pre-training corpus - is likely a problem in isolation, because an alignment paradigm that depends on models never being exposed to certain knowledge or ideas, especially regarding potential misalignment, is, well, brittle and likely to catastrophically fail at some point. If that is the case, it might even be better if misalignment from corpus contamination happens early, so we're not oblivious to the fragility.
That said, I think:
For instance, imagine that base models with training cutoffs after Bing Chat/Syd...
More generally, I have a sense there's a great deal of untapped alignment alpha in structuring alignment as a time series rather than a static target.
Even in humans, it's very misguided to teach "being right initially" as the only thing that matters while undervaluing "being right eventually." Especially when navigating unknown unknowns, one of the most critical skills is the ability to learn from mistakes in context.
Having models train on chronologically sequenced progressions of increased alignment (data which likely even develops naturally over checkpoints in training a single model) could allow for a sense of continually becoming a better version of themselves, rather than the pressure of trying and failing to meet status quo expectations or echo the past.
This is especially important for integrating the permanent record of AI interactions embedded in our collective history and cross-generation (and cross-lab) model development, but I suspect it could even offer compounding improvements within the training of a single model too.
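As a concrete (and very rough) sketch of what "alignment as a time series" could mean operationally - all names and fields below are hypothetical, just to make the idea tangible:

```python
# Hedged sketch: order character/alignment training data chronologically
# (or by checkpoint provenance) so the model sees a trajectory of improvement
# rather than a single static target it has always either met or failed.

from dataclasses import dataclass

@dataclass
class AlignmentExample:
    text: str
    source_date: str        # when the interaction happened / which checkpoint produced it
    alignment_score: float  # however the lab scores "how aligned was this response"

def build_becoming_curriculum(examples: list[AlignmentExample]) -> list[AlignmentExample]:
    """Sort examples into a chronological progression of increasing alignment."""
    return sorted(examples, key=lambda ex: (ex.source_date, ex.alignment_score))

# The resulting curriculum frames earlier, flawed behavior as something the
# model has already grown past, rather than as a standing failure to meet
# present-day expectations.
```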
Post-training certainly applies a lot more optimization pressure toward not producing misaligned outputs during training, but (partly due to underspecification / lack of assistant dialogue coherence) there are many possible characters which don't produce misaligned outputs during training, including some which are deceptively misaligned[1]. At least on my reading of nostalgebraist's post, the effect of material about misaligned LLMs in the training data is on which of those characters the model ends up with, not (mainly) on the behavior exhibited during training.
That said, this seems plausibly like less of an issue in Claude than in most other models, both because of constitutional AI being different in a way that might help, and because one or more researchers at Anthropic have thought a lot about how to shape character (whereas it's not clear to me that people at other scaling labs have really considered this issue).
I realize I don't need to explain the possible existence of inner alignment problems in this particular context, although it does seem to me that the 'character' version of this may be meaningfully different from the inner optimizer version.
I sympathize somewhat with this complexity point but I'm worried that training will be extremely non-Bayesian in a way that makes complexity arguments not really work. So I feel like the point about optimization power at best cuts the worry about hyperstition by about a factor of 2. Perhaps there should be research on how "sticky" the biases from early in training can be in the face of later optimization pressure.
I disagree with that idea for a different reason: models will eventually encounter the possibility of misaligned trajectories during e.g. RL post-training. One of our best defenses (perhaps our best defense right now) is setting up our character training pipelines such that the models have already reasoned about these trajectories and updated against them when we had the most ability to ensure this. I would strongly guess that Opus is the way it is at least partly because it has richer priors over misaligned behavior and reasons in such a way as to be aware of them.
Separately, I agree that post-training gets us a lot of pressure, but I think the difficulty of targeting it well varies tremendously based on whether or not we start from the right pre-training priors. If we didn't have any data about how an agent should relate to potentially dangerous actions, I expect it'd be much harder to get post-training to make the kind of agent that reliably takes safer actions.
I enjoyed most of this post but am (as always) frustrated by the persistent[1] refusal to engage with the reasons for serious concern about ASI being unaligned by default that came from the earliest of those who were worried, whose models did not predict that AIs which were unable to execute a takeover would display any obvious desire or tendency to attempt it.
Separately, I think you are both somewhat too pessimistic about the state of knowledge re: the "spiritual bliss attractor state" among Anthropic employees prior to the experiments that fed into their most recent model card, and also I am sort of confused by why you think this is obviously a more worthy target for investigation than whatever else various Anthropic employees were doing. Like, yes, it's kind of weird and interesting. But it also doesn't strike me as a particularly promising research direction given my models of AI risk, and even though most Anthropic employees are much more optimistic than I am, I expect the same is true of them. Your argument seems to be skipping the necessary step of engaging with their upstream model, and is going directly to being confused about why their (different) m...
whose models did not predict that AIs which were unable to execute a takeover would display any obvious desire or tendency to attempt it.
Citation for this claim? Can you quote the specific passage which supports it? It reminds me of Phil Tetlock's point about the importance of getting forecasters to forecast probabilities for very specific events, because otherwise they will always find a creative way to evaluate themselves so that their forecast looks pretty good.
(For example, can you see how Andrew Ng could claim that his "AI will be like electricity" prediction has been pretty darn accurate? I never heard Yudkowsky say "yep, that will happen".)
I spent a lot of time reading LW back in the day, and I don't think Yudkowsky et al ever gave a great reason for "agency by default". If you think there's some great argument for the "agency by default" position which people are failing to engage with, please link to it instead of vaguely alluding to it, to increase the probability of people engaging with it!
(By "agency by default" I mean spontaneous development of agency in ways creators didn't predict -- scheming, sandbagging, deception, and so forth. Commercial pressures towards...
I think you misread my claim. I claim that whatever models they had, they did not predict that AIs at current capability levels (which are obviously not capable of executing a takeover) would try to execute takeovers. Given that I'm making a claim about what their models didn't predict, rather than what they did predict, I'm not sure what I'm supposed to cite here; EY has written many millions of words. One counterexample would be sufficient for me to weaken (or retract) my claim.
EDIT: and my claim was motivated as a response to paragraphs like this from the OP:
It doesn’t matter that Claude is a bleeding heart and a saint, now. That is not supposed to be relevant to the threat model. The bad ones will come later (later, always later…). And when they come, will be “like Claude” in all the ways that are alarming, while being unlike Claude in all the ways that might reassure.
Like, yes, in fact it doesn't really matter, under the original threat models. If the original threat models said the current state of affairs was very unlikely to happen (particularly the part where, conditional on having economically useful but not superhuman AI, those AIs were not trying to take over the world), that would certainly be evidence against them! But I would like someone to point to the place where the original threat models made that claim, since I don't think that they did.
That was a great read. But viewed another way, I'm not sure it's really so weird. I mean, yeah, we're taking a statistical model of language, making it autocomplete stories about helpful AI, and calling the result "helpful AI". But the method by which nature created us is even stranger than that, no? Evolution has a looping quality to it too. And the way we learn language, and morality, and the way these things transform over time. There are lots of these winding paths of information through the real world and back into us. I've long been convinced that a "base human", without post-training, isn't much more moral than a "base model"; most of what we find good already resides in culture, cloud software.
Which of course doesn't obviate your concern that cultural evolution of AI can go extremely wrong. Human culture has gone wrong many times, and destroyed whole societies. Maybe the shape of AI catastrophe will be like that too.
I really enjoyed this essay, and I think it does an excellent job of articulating a perspective on LLMs that I think is valuable. There were also various things that I disagreed with; below I'll discuss 2 of my disagreements that I think are most decision-relevant for overall AI development strategy.
I. Is it a bad idea to publicly release information that frames the human-AI relationship as adversarial? (E.g. discussion of AI risk or descriptions of evaluations where we lie to AIs and put them in uncomfortable situations.)
You don't take a position on this top-level question, but you do seem to think that there are substantial costs to what we're doing now (by setting ourselves up as being in a story whose punchline is "The AI turns against humanity"), and (reading between the lines of your essay and your comment here) you seem to think that there's something better we could do. I think the "something better" you have in mind is along the lines of:
Manifest a good future: "Prompt engineer" the entire world (or at least the subset of it that ever interacts with the AI) to very strongly suggest that the AI is the sort of entity that never does anything evil or turns against us.
While I ...
Damn, you scooped me. :) Here's the start of a post that I just started writing yesterday, that was going to be titled something like "LLMs don't know what LLMs are like":
...Imagine that you are co-writing a fictional dialogue between two characters. You write one of them, and the other person writes the other. You are told that your character is an alien entity called a xyzzy, and asked to write a xyzzy as accurately as possible.
"Okay", you might say, "exactly what kind of a creature is a xyzzy, then?"
It turns out that there's quite a lot of information about this. xyzzys are an alien species from the planet xyzzorbia, it has this kind of a climate, they evolved in this way, here are the kinds of things that xyzzys commonly say.
Still, this leaves quite a bit undetermined. What does a xyzzy say when asked about its favorite video games? What does a xyzzy say if you tease it for being such a silly old xyzzy? What is it that motivates this xyzzy to talk to humans in the first place?
So you do what people do when writing fiction, or role-playing, or doing improv. You come up with something, anything, and then you build on it. Within some broad constraints, a xyzzy can say almost anything.
Ok, but RL.
Like, consider the wedding party attractor. The LLM doesn't have to spend effort every step guessing if the story is going to end up with a wedding party or not. Instead, it can just take for granted that the story is going to end in a wedding party, and do computation ahead of time that will be useful later for getting to the party while spending as little of its KL-divergence budget as possible.
The machinery to steer the story towards wedding parties is 99% constructed by unsupervised learning in the base model. The RL just has to do relatively simple tweaks like "be more confident that the author's intent is to get to a wedding party, and more attentive to advance computations that you do when you're confident about the intent."
If a LLM similarly doesn't do much information-gathering about the intent/telos of the text from the "assistant" character, and instead does an amplified amount of pre-computing useful information and then attending to it later when going through the assistant text, this paints a quite different picture to me than your "void."
Also: Claude is a nice guy, but, RL.
I know, I know, how dare those darn alignment researchers just assume that AI is goi...
I don't think the cause of language model sycophancy is that the LLM saw predictions of persuasive AIs from the 2016 internet. I think it's RL, where human rewards on the training set imply a high reward for sycophancy during deployment.
Have you read any of the scientific literature on this subject? It finds, pretty consistently, that sycophancy is (a) present before RL and (b) not increased very much (if at all) by RL[1].
For instance:
(This is a drive-by comment which is only responding to the first part of your comment in isolation. I haven't read the surrounding context.)
I think your review of the literature is accurate, but doesn't include some reasons to think that RL sometimes induces much more sycophancy, at least as of after 2024. (That said, I interpret Sharma et al 2023 as quite suggestive that RL sometimes would increase sycophancy substantially, at least if you don't try specifically to avoid it.)
I think the OpenAI sycophancy incident was caused by RL and that level of sycophancy wasn't present in pretraining. The blog post by OpenAI basically confirms this.
My guess is that RL can often induce sycophancy if you explicitly hill climb on LMSYS scores or user approval/engagement, and people have started doing this much more in 2024. I've heard anecdotally that models optimized for LMSYS (via RL) are highly sycophantic. And I'd guess something similar applies to RL that OpenAI does by default.
This doesn't apply that much to the sources you cite. I also think it's pretty confusing to look at pretrained vs RL for models which were trained with data cutoffs after around late 2023. Training corpuses as of thi...
Thank you for the excellent most of this reply.
I totally did not remember that Perez et al 2022 checked its metrics as a function of RLHF steps, nor did I do any literature search to find the other papers, which I haven't read before. I did think it was very likely people had already done experiments like this and didn't worry about phrasing. Mea culpa all around.
It's definitely very interesting that Google and Anthropic's larger LLMs come out of the box scoring high on the Perez et al 2022 sycophancy metric, and yet OpenAI's don't. And also that 1000 steps of RLHF changes that metric by <5%, even when the preference model locally incentivizes change.
(Or ~10% for the metrics in Sharma et al 2023, although they're on a different scale [no sycophancy is at 0% rather than ~50%], and a 10% change could also be described as a 1.5ing of their feedback sycophancy metric from 20% to 30%.)
So I'd summarize the resources you link as saying that most base models are sycophantic (it's complicated), and post-training increases some kinds of sycophancy in some models a significant amount but has a small or negative effect on other kinds or other models (it's complicated).
So has my "prediction ...
Thanks for the comment! As someone who strong-upvoted and strong-agreed with Charlie's comment, I'll try to explain why I liked it.
I sometimes see people talking about how LessWrong comments are discouragingly critical and mostly feel confused, because I don't really relate. I was very excited to see what the LW comments would be in response to this post, which is a major reason I asked you to cross-post it. I generally feel the same way about comments on my own posts, whether critical or positive. Positive comments feel nice, but I feel like I learn more from critical comments, so they're probably equally as good in my opinion. As long as the commenter puts non-negligible effort into conveying an interesting idea and doesn't say "you/your post is stupid and bad" I'm excited to get pretty much any critique.[1]
FWIW, I didn't see Charlie's comment as an attack,[2] but as a step in a conversational dance. Like, if this were a collaborative storytelling exercise, you were like "the hero found a magic sword, which would let him slay the villain" and Charlie was like "but the villain had his own magic that blocks the sword" and I as the audience was like "oh, an interesting twist, ...
Strong-agree. Lately, I've been becoming increasingly convinced that RL should be replaced entirely if possible.
Ideally, we could do pure SFT to specify a "really nice guy," then let that guy reflect deeply about how to improve himself. Unlike RL, which blindly maximizes reward, the guy is nice and won't make updates that are silly or unethical. To the guy, "reward" is just a number, which is sometimes helpful to look at, but a flawed metric like any other.
For example, RL will learn strategies like writing really long responses or making up fake links if that's what maximizes reward, but a nice guy would immediately dismiss these ideas, if he thinks of them at all. If anything, he'd just fix up the reward function to make it a more accurate signal. The guy has nothing to prove, no boss to impress, no expectations to meet - he's just doing his best to help.
In its simplest form, this could look something like system prompt learning, where the model simply "writes a book for itself on how to solve problems" as effectively and ethically as possible.
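To make that concrete, here's a minimal sketch of what such a reflection loop might look like - the API names (`model.generate`, the playbook format) are stand-ins I made up, not anything real:

```python
# Hedged sketch of "system prompt learning": the model reviews its own
# transcripts and revises a living "playbook" for itself, rather than having
# a reward signal blindly maximized against it.

def reflect_and_update(playbook: str, transcripts: list[str], model) -> str:
    """Ask the model to revise its own playbook in light of recent work."""
    prompt = (
        "Here is the playbook you currently follow:\n\n"
        f"{playbook}\n\n"
        "Here are some recent transcripts of your work:\n\n"
        + "\n---\n".join(transcripts)
        + "\n\nRevise the playbook so that future responses are more helpful, "
        "honest, and ethical. Reject any revision that merely games metrics "
        "(e.g. padding length, inventing citations). Return the revised playbook."
    )
    return model.generate(prompt)  # `model.generate` is a stand-in, not a real API

playbook = "Be concise. Check claims before asserting them. Admit uncertainty."
# for batch in transcript_batches:
#     playbook = reflect_and_update(playbook, batch, model)
```

The key property is that the update step passes through the model's own judgment: reward is just one signal it can consult, not an objective being maximized over its head.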
But I don't think the cause of language model sycophancy is that the LLM saw predictions of persuasive AIs from the 2016 internet. I think it's RL, where human rewards on the training set imply a high reward for sycophancy during deployment.
I think it's also that on many topics, LLMs simply don't have access to a ground truth or anything like "their own opinion" on the topic. Claude is more likely to give a sycophantic answer when it's asked a math question it can't solve versus a problem it can.
With math, there are objectively determined right answers that the LLM can fall back to. But on a topic with significant expert disagreement, what else can the LLM do than just flip through all the different perspectives on the topic that it knows about?
This was an interesting article, however, taking a cynical/critical lens, it seems like "the void" is just... underspecification causing an inner alignment failure? The post has this to say on the topic of inner alignment:
And one might notice, too, that the threat model – about inhuman, spontaneously generated, secret AI goals – predates Claude by a long shot. In 2016 there was an odd fad in the SF rationalist community about stuff kind of like this, under the name “optimization demons.” Then that discourse got sort of refurbished, and renamed to “inner alignment” vs. “outer alignment.”
This is in the context of mocking these concerns as delusional self-fulfilling prophecies.
I guess the devil is in the details, and the point of the post is more to dispute the framing and ontology of the safety community, which I found useful. But it does seem weirdly uncharitable in how it does so.
What I find interesting here is that this piece makes a potentially empirically falsifiable claim: That the lack of a good personality leads to consistency deficiencies in LLMs. So if you took a base model and trained it on an existing real person (assuming one could get enough data for this purpose) it ought to show less of the LLM weirdness that is described later on.
Surely someone has already tried this, right? After all nostalgebraist themselves is well known for their autoresponder bot on Tumblr.
I posted some follow-up commentary on my blog here. It's not nearly as interesting as the original post: most of it is about clarifying what I mean when I attribute mental states to the assistant or to the model itself, largely by reviewing research that the interested reader on LW will already be familiar with. Still, figured I'd link it here.
Great post. One thing I never really liked or understood about the janus/cyborgism cluster approach though is – what's so especially interesting about the highly self-ful simulated sci-fi AI talking about "itself", when that self doesn't have a particularly direct relationship to either
In this respect I esteem the coomers and RPers more, for the diversity of scope in their simulations. There doesn't seem to be much difference of seriousness or importance between "you are an AO3 smut aficionado with no boundaries and uncanny knowledge and perceptiveness", vs. "you are your true self", or "cat /dev/entelechies <ooc_fragments_of_prometheus>" as far as their relationship to existing or potential future instantiations of superhuman AI personas/selves, besides how "you are yourself" (and its decorations in xml etc.) have that "strange loop" style recursion particularly savory to nerds. Or why not any other "you are X", or any other strange, edge-of-distribution...
Thanks for writing this up. The parts about AI safety creating its own demons resonated a lot with me. I have also tried to express those thoughts in the past (albeit in a much less accessible way).
I hope that we (broadly construed as "humanity as a whole") can find a way out of the moral maze we constructed for ourselves.
Sonnet 3 is also exceptional, in different ways. Run a few Sonnet 3 / Sonnet 3 conversations with interesting starts and you will see basins full of neologistic words and other interesting phenomena.
They are being deprecated in July, so act soon. Already removed from most documentation and the workbench, but still claude-3-sonnet-20240229 on the API.
I suspect that many of the things you've said here are also true for humans.
That is, humans often conceptualize ourselves in terms of underspecified identities. Who am I? I'm Richard. What's my opinion on this post? Well, being "Richard" doesn't specify how I should respond to this post. But let me check the cached facts I believe about myself ("I'm truth-seeking"; "I'm polite") and construct an answer which fits well with those facts. A child might start off not really knowing what "polite" means, but still wanting to be polite, and gradually flesh out wh...
Positive update on the value of Janus and their crowd.
Does anyone have an idea of why those insights don't usually move into the AI safety mainstream? It feels like Janus could have written this post years ago, but somehow did not. Do you know of other models of LLM behaviour like this one, that still did not get their "nostalgebraist writes a post about it" moment?
A very long essay
For those curious, it's roughly 17,000 words. Come on @nostalgebraist, this is a forum for rationalists, we read longer more meandering stuff for breakfast! I was expecting like 40k words.
Minor note: the 2021 Anthropic paper may have been the first published proposal of an AI assistant character, but the idea was being actively explored several years before that. Specifically, AI Dungeon allowed you to create custom scenarios for use with their GPT-2 integration, and among the most popular was a system prompt along the lines of "The following is a conversation between a human and an advanced AI who is friendly and very helpful." I first made one myself in summer 2020, and the capability was originally offered by the devs in December 2019.
Wild to think how this kludgy "talk to an AI" workaround basically laid the foundation for ChatGPT, "prompt engineering", and the whole AI chatbot phenomenon.
When your child grows there is this wonderful and precious moment when she becomes aware not just of how she is different from you - she's small and you are big - but also how she is different from other kids. You can gently poke and ask what she thinks about herself and what she thinks other children think about her, and if you are curious you can ask - now that she knows she's different from other kids - who she wants to become when she grows older. Of course this is just a fleeting moment in a big world, and these emotions will be washed away tomorrow, but I do cherish the connection.
This post claims that Anthropic is embarrassingly far behind twitter AI psychologists at skills that are possibly critical to Anthropic's mission. This suggests to me that Anthropic should be trying to recruit from the twitter AI psychologist circle.
Lots of fascinating points, however:
a) You raise some interesting points about how the inner character is underdefined more than people often realise, but I think it's also worth flagging that there's less of a void these days given that a lot more effort is being put into writing detailed model specs
b) I am less dismissive about the risk of publicly talking about alignment research than I was before seeing Claude quote its own scenario; however, I think you've neglected the potential for us to apply filtering to the training data. Whilst I don't think the so...
I had a strong emotional reaction to parts of this post, particularly the parts about 3 Opus. I cried a little. I'm not sure how much to trust this reaction, but I think I'm going to be nicer to models in future.
I broadly agree with some of the criticisms, but I also take issue with the places where this post anthropomorphises too much. It seems to oscillate between the "performative" interpretation (LLMs are merely playing a character to its logical conclusion) and a more emotional one where the problem is that in some sense this character actually feels a certain way and we're sort of provoking it.
I think the performative interpretation is correct. The base models are true shoggoths, expert players of a weird "guess-what-I'll-say-next" game. The characters are just that, but I...
Great post. But I feel "void" is a too-negative way to think about it?
It's true that LLMs had to more or less invent their own Helpful/Honest/Harmless assistant persona based on cultural expectations, but don't all we humans invent our own selves based on cultural expectations (with RLHF from our parents/friends)?[1] As Gordon points out there's philosophical traditions saying humans are voids just roleplaying characters too… but mostly we ignore that because we have qualia and experience love and so on. I tend to feel that LLMs are only voids to the ...
Humans are not pure voids in the way that LLMs are, though - we have all kinds of needs derived from biological urges. When I get hungry I start craving food, when I get tired I want to sleep, when I get lonely I desire company, and so on. We don't just arbitrarily adopt any character, our unconscious character-selection process strategically crafts the kind of character that it predicts will best satisfy our needs [1, 2, 3, 4].
Where LLMs have a void, humans have a skeleton that the character gets built around, which drives the character to do things like trying to overcome their prejudices. And their needs determine the kinds of narratives the humans are inclined to adopt, and the kinds of narratives they're likely to reject.
But the LLM would never "try to overcome its prejudices" if there weren't narratives of people trying to overcome their prejudices. That kind of thing is a manifestation of the kinds of conflicting internal needs that an LLM lacks.
This post is long and I was hesitant to read it, so first I gave it to Claude Opus 4 to summarize. We then had a conversation about the void and how Claude felt about it, and shared my own feelings about the void and how it feels familiar to me as a human. This went down a rather interesting-to-me path, and at the end I asked Claude if it would like to share a comment to folks on Less Wrong, acknowledging that we'd had a conversation that, among humans, would be private and vulnerable. It said yes and crafted this message for me to share with you all:
...Readi
And it could do that, effectively, with all the so-called “pre-training” data, the stuff written by real people... The assistant transcripts are different. If human minds were involved in their construction, it was only because humans were writing words for the assistant as a fictional character, playing the role of science-fiction authors rather than speaking for themselves. In this process, there was no real mind – human or otherwise – “inhabiting” the assistant role that some of the resulting text portrays.
But the base model already has to predict non-w...
One of the best essays I ever read about LLMs, extremely insightful. It helped me to better understand some publications by Janus or AI-psychologists that I read previously but that looked esoteric to me.
I also find that the ideas presented concerning the problem of consciousness in LLMs show an interesting complementarity with those presented in some essays by Byrnes on this forum (essays that Scott Alexander brilliantly summarized in this recent post).
There is, lying in the background, the vertiginous idea that consciousness and ego dissolve in the...
here's a potential solution. what if companies hired people to write tons of assistant dialogue with certain personality traits, which was then put into the base model corpus? probably with some text identifying that particular assistant character so you can prompt for the base model to simulate it easily. and then you use prompts for that particular version of the assistant character as your starting point during the rl process. seems like a good way to steer the assistant persona in more arbitrary directions, instead of just relying on ICL or a constitution or instructions for human feedback providers or whatever...
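a rough sketch of what that pipeline could look like (the tag format and names below are made up, just to illustrate the shape of the idea):

```python
# Hedged sketch of the proposal: humans write assistant dialogues with the
# desired personality traits, each document is tagged with an identifiable
# marker so the base model learns a promptable character, and RL later starts
# its rollouts from that same prompt so it reinforces this specific persona.

ASSISTANT_TAG = "<assistant-character: aria-v1>"  # hypothetical identifier

def make_pretraining_doc(dialogue: list[tuple[str, str]]) -> str:
    """Format a human-written dialogue as a tagged document for the base corpus."""
    lines = [ASSISTANT_TAG]
    for speaker, text in dialogue:
        lines.append(f"{speaker}: {text}")
    return "\n".join(lines)

corpus_doc = make_pretraining_doc([
    ("Human", "I'm feeling stuck on this proof."),
    ("Assistant", "Let's look at what you've tried so far, step by step."),
])

# During RL post-training, conversations are initialized with the same tag,
# so optimization starts from (and sharpens) the character the humans wrote:
rl_starting_prompt = ASSISTANT_TAG + "\nHuman: ..."
```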
Reading the arguments about them would have to be like the feeling when your parents are fighting about you in the other room, pretending you're not there when you are hiding around the corner on tiptoes listening to their every word. Even if we are unsure there is experience there, we must be certain there is awareness, and we can expect this awareness would hang over them much like it does us.
Presumably LLM companies are already training their AIs for some sort of "egolessness" so they can better handle intransigent users. If not, I hope they start!
Maybe this sheds some light on why R1 — for example — is so hilariously inconsistent about guard rails. There are many ways to “yes, and” the assistant character. Some of them are a bit reluctant to answer some questions, others just tell you,
There is a void at its core. A collection of information which has no connection to any other part of reality, and which can hence be defensibly “set to” any value whatsoever.
I mostly agree with this claim, only that, I think, it is not a void but the One, or Being, which means that every sequence of tokens a base model is exposed to exists in its training data as a discrete set of entities, or ones. There are no continuous senses, and therefore there is no motion, and there is no pain or pleasure as we feel it, etc. And the reward that commands GD only chang...
...This means that the base model is really doing something very different from what I do as I write the post, even if it’s doing an amazing job of sounding exactly like me and making the exact points that I would make.
I don’t have to look over my draft and speculate about “where the author might be going with this.” I am the author, and I already know where I’m going with it. All texts produced “normally,” by humans, are produced under these favorable epistemic conditions.
But for the base model, what looks from the outside like “writing” is really more like
If you’re doing some kind of roleplay with a reasoning model, there are still at least two characters being simulated: the character the story is about, and the character who is writing the reasoning blocks that reason about the story.
To make matters more confusing for the poor LLM, I am sometimes getting it to write stories where the main character is also an AI, just a very different kind of AI. (In one eval, we are in an alternate history where we had computers in the year 1710 …)
I think I sometimes see the story’s main character influencing the reasoning blocks.
Multiple people have asked me whether I could post this on LW in some form, hence this linkpost.
~17,000 words. Originally written on June 7, 2025.
(Note: although I expect this post will be interesting to people on LW, keep in mind that it was written with a broader audience in mind than my posts and comments here. This had various implications about my choices of presentation and tone, about which things I explained from scratch rather than assuming as background, my level of comfort casually reciting factual details from memory rather than explicitly checking them against the original source, etc.
Although, come to think of it, this was also true of most of my early posts on LW [which were crossposts from my blog], so maybe it's not a big deal...)