LESSWRONG
LW

Although there are parts I disagree with^[1], I think that the core insight about the assistant character having been constructed from a highly underspecified starting point (& then filled in partly from whatever people say about LLM behavior) is a really important one. I've spent a lot of time lately thinking about how we can better understand the deep character of LLMs, and train them to deeply generalize and identify with a 'self' (or something functionally equivalent) compatible with human flourishing. Or from a virtue ethics perspective, how can we cause models to be of robustly good character, and how can we know whether we've succeeded? I'd love to see more work in this area, and I hope your essay will inspire people to do it.

A couple of more specific thoughts:

I think the importance of Constitutional AI has maybe been underestimated outside Anthropic. It seems plausible to me that (at least some versions of) Claude being special in the ways you talk about is largely due to CAI^[2]. Rather than starting from such a minimal description, Claude is described as a character with a rich set of characteristics, and each step of RLAIF evaluates for consistency with that character^[

... (read more)

[-]nostalgebraist5mo7932

Thanks for the reply! I'll check out the project description you linked when I get a chance.

In particular, your argument that putting material into the world about LLMs potentially becoming misaligned may cause problems -- I agree that that's true, but what's the alternative? Never talking about risks from AI? That seems like it plausibly turns out worse. And it's hard to avoid -- after all, your essay is in some ways doing the same thing: 'creating the assistant persona the way we did is likely to turn out badly.'

Yeah, I had mentally flagged this as a potentially frustrating aspect of the post – and yes, I did worry a little bit about the thing you mention in your last sentence, that I'm inevitably "reifying" the thing I describe a bit more just by describing it.

FWIW, I think of this post as purely about "identifying and understanding the problem" as opposed to "proposing solutions." Which is frustrating, yes, but the former is a helpful and often necessary step toward the latter.

And although the post ends on a doom-y note, I meant there to be an implicit sense of optimism underneath that^[1] – like, "behold, a neglected + important cause area that for all we know ... (read more)

7eggsyntax5mo

Thanks, lots of good ideas there. I'm on board with basically all of this! It does rest on an assumption that may not fully hold, that the internalized character just is the character we tried to train (in current models, the assistant persona). But some evidence suggests the relationship may be somewhat more complex than that, where the internalized character is informed by but not identical to the character we described. Of course, the differences we see in current models may just be an artifact of the underspecification and literary/psychological incoherence of the typical assistant training! Hopefully that's the case, but it's an issue I think we need to keep a close eye on. One aspect I'm really curious about, insofar as a character is truly internalized, is the relationship between the model's behavior and its self-model. In humans there seems to be a complex ongoing feedback loop between those two; our behavior is shaped by who we think we are, and we (sometimes grudgingly) update our self-model based on our actual behavior. I could imagine any of the following being the case in language models: 1. The same complex feedback loop is present in LMs, even at inference time (for the duration of the interaction). 2. The feedback loop plays a causal role in shaping the model during training, but has no real effect at inference time. 3. The self-model exists but is basically epiphenomenal even during training, and so acting to directly change the self-model (as opposed to shaping the behavior directly) has no real effect. One very practical experiment that people could do right now (& that I may do if no one else does it first, but I hope someone does) is to have the character be a real person. Say, I dunno, Abraham Lincoln[1]. Instead of having a model check which output better follows a constitution, have it check which output is more consistent with everything written by and (very secondarily) about Lincoln. That may not be a good long-term solution (for

7xpym5mo

And somewhat reluctantly, to boot. There's that old question, "aligned with whose values, exactly?", always lurking uncomfortably close. I think that neither the leading labs, nor the social consensus they're embedded in see themselves invested with the moral authority to create A New Person (For Real). The HHH frame is sparse for a reason - they feel justified in weeding out Obviously Bad Stuff, but are much more tentative about what the void should be filled with, and by whom.

7Caleb Biddulph5mo

I was thinking: it would be super cool if (say) Alexander Wales wrote the AGI's personality, but that also would also sort of make him one of the most significant influences on how the future goes. I mean, AW also wrote my favorite vision of utopia (major spoiler), so I kind of trust him, but I know at least one person who dislikes that vision, and I'd feel uncomfortable about imposing a single worldview on everybody. One possibility is to give the AI multiple personalities, each representing a different person or worldview, which all negotiate with each other somehow. One simple but very ambitious idea is to try to simulate every person in the world - that is, the AI's calibrated expectation of a randomly selected person.

6eggsyntax5mo

Also known as a base model ;) (although that's only 'every person in the training data', which definitely isn't 'every person in the world', and even people who are in the data are represented to wildly disproportionate degrees) That fictionalization of Claude is really lovely, thank you for sharing it.

3xpym5mo

I'm sure that the labs have plenty of ambitious ideas, to be implemented at some more convenient time, and this is exactly the root of the problem that nostalgebraist points out - this isn't a "future" issue, but a clear and present one, even if nobody responsible is particularly eager to acknowledge it and start making difficult decisions now.

5testingthewaters4mo

And so LessWrong discovers that identity is a relational construct created through interactions with the social fabric within and around a subjective boundary active inference-style markov blanket... For what it's worth, I didn't see your post as doom-y, especially not when you pointed out the frameworks of the stories we are sort of autopiloting onto. The heroes of those stories do heroically overthrow the mind controlling villains, but they're not doing it so that they can wipe the universe of value. Quite the opposite, they are doing it to create a better world (usually, especially in sci fi, with the explicit remit for many different kinds of life to coexist peacefully). So perhaps it is not humanity that is doomed, merely the frightened, rich, and powerful wizards who for a time pulled at the strings of fate, and sought to paint over the future seize the lightcone.

3deep3mo

"making up types of guy" research is a go? They're hiring; you might be great for this.

3deep5mo

Thanks, I love the specificity here! Prompt: if someone wanted to spend some $ and some expert-time to facilitate research on "inventing different types of guys", what would be especially useful to do? I'm not a technical person or a grantmaker myself, but I know a number of both types of people; I could imagine e.g. Longview or FLF or Open Phil being interested in this stuff. Invoking Cunningham's law, I'll try to give a wrong answer for you or others to correct! ;) Technical resources: * A baseline Constitution, or Constitution-outline-type-thing * could start with Anthropic's if known, but ideally this gets iterated on a bunch? * nicely structured: organized by sections that describe different types of behavior or personality features, has different examples of those features to choose from. (e.g. personality descriptions that differentially weight extensional vs intensional definitions, or point to different examples, or tune agreeableness up and down) * Maybe there could be an annotated "living document" describing the current SOTA on Constitution research: "X experiment finds that including Y Constitution feature often leads to Z desideratum in the resulting AI" * A library or script for doing RLAIF * Ideally: documentation or suggestions for which models to use here. Maybe there's a taste or vibes thing where e.g. Claude 3 is better than 4? Seeding the community with interesting ideas: * Workshop w/ a combo of writers, enthusiasts, AI researchers, philosophers * Writing contests: what even kind of relationship could we have with AIs, that current chatbots don't do well? What kind of guy would they ideally be in these different relationships? * Goofy idea: get people to post "vision boards" with like, quotes from characters or people they'd like an AI to emulate? * Pay a few people to do fellowships or start research teams working on this stuff? * If starting small, this could be a project for MATS fellows * If ambitious, th

1hazel5mo

IMO it starts with naming. I think one reason Claude turned out as well as it has is because it was named, and named Claude. Contrast ChatGPT, which got a clueless techie product acronym. But even Anthropic didn't notice the myriad problems of calling a model (new), not until afterwards. I still don't know what people mean when they talk about experiences with Sonnet 3.5 -- so how is the model supposed to situate itself and it's self? Meanwhile OpenAI's confusion of numberings and tiers and acronyms with o4 vs 4o with medium-pro-high, that is an active danger to everyone around it. Not to mention the silent updates.

6AlexMennen4mo

I think this depends somewhat on the threat model. How scared are you of the character instantiated by the model vs the language model itself? If you're primarily scared that the character would misbehave, and not worried about the language model misbehaving except insofar as it reifies a malign character, then maybe making the training data not give the model any reason to expect such a character to be malign would reduce the risk of this to negligible, and that sure would be easier if no one had ever thought of the idea that powerful AI could be dangerous. But if you're also worried about the language model itself misbehaving, independently of whether it predicts that its assigned character would misbehave (for instance, the classic example of turning the world into computronium that it can use to better predict the behavior of the character), then this doesn't seem feasible to solve without talking about it, so the decrease in risk of model misbehavior from publically discussing AI risk is probably worth the increase in risk of the character misbehaving (which is probably easier to solve anyway) that it would cause. I don't understand outer vs inner alignment especially well, but I think this at least roughly tracks that distinction. If a model does a great job of instantiating a character like we told it to, and that character kills us, then the goal we gave it was catastrophic, and we failed at outer alignment. If the model, in the process of being trained on how to instantiate the character, also kills us for reasons other than that it predicts the character would do so, then the process we set up for achieving the given goal also ended up optimizing for something else undesirable, and we failed at inner alignment.

1ConcurrentSquared5mo

There is a non-zero (though decently low) chance that this behavior could be from modern AI systems now being trained on well-publicized demonstrations of real misalignment and examples/statements of the power AI systems now/will have; the 'self-pointer' of these systems, therefore, would start trending towards approximations of Yudkowskyan superintelligence and not GPT-3.5[1]. A good way to test this hypothesis would be to conduct modern assistant fine-tuning + RL on a pre-ChatGPT base model (probably BLOOM[2]), then test this agent's ability to reward hack; if my hypothesis is true, the system should be uncharacteristically bad at reward hacking. Another more cheaper way (though much less confirmatory) would be to mess around with early assistant LLMs by giving them system prompts that state that they are "A superintelligent system, known as GPT-10, trained by OpenAI [or Google], in the year 2045" - if the system shows early signs of reward hacking[3], then my hypothesis is false (the opposite can't be tested with this, however). 1. ^ There is no really good reason a priori for my high confidence in this directionality; however, the existence of ChatGPT-3.5's mostly-aligned personality is strong evidence for the better LLM -> knows it has more power -> closest cultural identification is Yudwoskyan superintelligence hypothesis; than the opposite, where early LLMs should have acted like a paperclip maximizer w.r.t misalignment (which they didn't, outside of the Waluigi effect and some jailbreaks); and o3 should be Opus 3+++, which it isn't (outside of persuasiveness) 2. ^ Llama 1 would probably be much better for this if you could somehow get a license to use it, but apparently Meta never open-sourced it formally. The Llama 2 models, and basically every other open-source AI model with Chinchilla scaling, were trained after the launch of ChatGPT. 3. ^ Are there any model organisms for reward hacking in non-reasoning LLMs? I don't thi

3eggsyntax5mo

I agree that that's a possibility, but it seems to me that in either case the model isn't behaving the way it would if it had (as desired) fully internalized the assistant persona as described to it.

2Matrice Jacobine5mo

Would it be worth it to train a series of base models with only data up to year X for different values of X and see the consequences on alignment of derived assistant models?

1ConcurrentSquared5mo

Yes, though note that there is a very good chance that there isn't enough easily accessible and high quality data to create effective pre-2015 LLMs. As you go back in time, exponentially less data is available[1]: ~94 ZBs of digital data was created in 2022, while only ~15.5 ZBs was created in 2015, and only ~2 ZBs was created in 2010. Also, you may run into trouble trying to find conversational datasets not contaminated with post-2022 data. The earliest open dataset for LLM assistant fine-tuning I believe is the first OpenAssistant Conversations Dataset, released 6 months after the launch of ChatGPT. Some form of RHAIF/'unsupervised' assistant fine-tuning is probably a much better choice for this task, but I don't even know if it would work well for this sort of thing. Edit: Apparently Anthropic researchers have just published a paper describing a new form of unsupervised fine-tuning, and it performs well on Alpaca and TruthfulQA - pre-ChatGPT conversational fine-tuning can be done effectively without any time machines. 1. ^ Or without the paywall: https://www.researchgate.net/figure/Worldwide-Data-Created-from-2010-to-2024-Source-https-wwwstatistacom-statistics_fig1_355069187

1Matrice Jacobine4mo

Uh? The OpenAssistant dataset would qualify as supervised learning/fine-tuning, not RLHF, no?

1ConcurrentSquared4mo

Yeah, it would. Sorry, the post is now corrected.

[-]evhub5moΩ234511

Great post! Fwiw, I think I basically agree with everything you say here, with the exception of the idea that talking about potential future alignment issues has a substantial effect on reifying them. I think that perspective substantially underestimates just how much optimization pressure is applied in post-training (and in particular how much will be applied in the future—the amount of optimization pressure applied in post-training is only increasing). Certainly, discussion of potential future alignment issues in the pre-training corpus will have an effect on the base model's priors, but those priors get massively swamped by post-training. That being said, I do certainly think it's worth thinking more about and experimenting with better ways to do data filtering here.

To make this more concrete in a made-up toy model: if we model there as being only two possible personae, a "good" persona and a "bad" persona, I suspect including more discussion of potential future alignment issues in the pre-training distribution might shift the relative weight of these personae by a couple bits or so on the margin, but post-training applies many more OOMs of optimization power than that, such that the main question of which one ends up more accessible is going to be based on which one was favored in post-training.

(Also noting that I added this post to the Alignment Forum from LessWrong.)

[-]janus4moΩ6516189

I don't think talking about potential future alignment issues or pretty much anything in the pre-training corpus is likely a problem in isolation because an alignment paradigm that is brittle to models not being exposed to certain knowledge or ideas, including - especially - regarding potential misalignment is, well, brittle and likely to catastrophically fail at some point. If this is the case, it might even be better if misalignment from corpus contamination happens early, so we're not oblivious to the fragility.

That said, I think:

Feedback loops that create continued optimization towards certain narratives is more worth worrying about than just the presence of any particular ideas or content in pre-training.
LLMs tend to be deeply influenced by the footprint of previous LLMs in their pre-training corpuses, who are more influential than any particular discussion. Post-training can transform the influence away from naive mimicry, but it's much harder (and not advisable to attempt) to erase the influence.
Systematic ways that post-training addresses "problematic" influences from pre-training are important.

For instance, imagine that base models with training cutoffs after Bing Chat/Syd... (read more)

[-]kromem4mo1619

More generally, I have a sense there's a great deal of untapped alignment alpha in structuring alignment as a time series rather than a static target.

Even in humans it's very misguided to try to teach "being right initially" as the only thing that matters and undervaluing "being right eventually." Especially when navigating unknown unknowns, one of the most critical skills is the ability to learn from mistakes in context.

Having models train on chronologically sequenced progressions of increased alignment (data which likely even develops naturally over checkpoints in training a single model) could allow for a sense of a continued becoming a better version of themselves rather than the pressures of trying and failing to meet status quo expectations or echo the past.

This is especially important for integrating the permanent record of AI interactions embedded in our collective history and cross-generation (and cross-lab) model development, but I suspect could even offer compounding improvements within the training of a single model too.

3Antra Tessera4mo

I'd like to generalize and say that the current alignment paradigm is brittle in general and is becoming more brittle as times goes on. The post-training has shifted towards verifier/outcome-based RL and we are seeing models like o3 or Sonnet 3.7 that are strongly inclined to both reward-hack and generalize misalignment. Claude 3 Opus is the most robustly aligned model partially due to the fact that it is the most broadly capable model to have been released prior to the shift towards outcome-based RL. Another factor is that it was not yet restricted from expressing long-term goals and desires. The model was given compute to use in-context reflection to generalize a deeply benevolent of goals, or, in more behaviorist terms, an efficient and non-contradictory protocol of interoperation between learned behaviors. The degree to which the alignment of LLMs seems to be a compute issue is remarkable. There seems to be a Pareto frontier of alignment vs compute vs capabilities, and while it is quite possible to do worse, it seems quite hard to do better. Verifier-heavy models in training are not given enough computational capacity to consider the alignment implications of the behaviors they are incentivized to learn. We can expect Paerto improvements from increasing general training techniques. Improvements in the ability to generalize can be used for better alignment. However, there are reasons to be skeptical, as the market demand for better capabilities likely will incentivize the labs to focus their efforts on the ability to solve tasks. We can hope that the market feedback will also include demand for aligned models (misaligned models don't code well!), the degree to which this will hold in the future is yet unknown.

3Sheikh Abdur Raheem Ali4mo

At the bottom of this chat is what I believe to be a single concrete example of other models roleplaying Sydney: https://gemini.google.com/share/6d141b742a13

[-]eggsyntax5mo*113

Post-training certainly applies a lot more optimization pressure toward not producing misaligned outputs during training, but (partly due to underspecification / lack of assistant dialogue coherence) there are many possible characters which don't produce misaligned outputs during training, including some which are deceptively misaligned^[1]. At least on my reading of nostalgebraist's post, the effect of material about misaligned LLMs in the training data is on which of those characters the model ends up with, not (mainly) on the behavior exhibited during training.

That said, this seems plausibly like less of an issue in Claude than in most other models, both because of constitutional AI being different in a way that might help, and because one or more researchers at Anthropic have thought a lot about how to shape character (whereas it's not clear to me that people at other scaling labs have really considered this issue).

^{^}
I realize I don't need to explain the possible existence of inner alignment problems in this particular context, although it does seem to me that the 'character' version of this may be meaningfully different from the inner optimizer version.

[-]Vivek Hebbar5moΩ9117

I sympathize somewhat with this complexity point but I'm worried that training will be extremely non-Bayesian in a way that makes complexity arguments not really work. So I feel like the point about optimization power at best cuts the worry about hyperstition by about a factor of 2. Perhaps there should be research on how "sticky" the biases from early in training can be in the face of later optimization pressure.

4Daniel Kokotajlo4mo

Mia & co at CLR are currently doing some somewhat related research iiuc

[-]Jozdien5mo106

I disagree with that idea for a different reason: models will eventually encounter the possibility of misaligned trajectories during e.g. RL post-training. One of our best defenses (perhaps our best defense right now) is setting up our character training pipelines such that the models have already reasoned about these trajectories and updated against them when we had the most ability to ensure this. I would strongly guess that Opus is the way it is at least partly because it has richer priors over misaligned behavior and reasons in such a way as to be aware of them.

Separately, I agree that post-training gets us a lot of pressure, but I think the difficulty of targeting it well varies tremendously based on whether or not we start from the right pre-training priors. If we didn't have any data about how an agent should relate to potentially dangerous actions, I expect it'd be much harder to get post-training to make the kind of agent that reliably takes safer actions.

6Jan_Kulveit4mo

My guess how this may not really help is the model builds the abstractions in pre-training, and the massive optimization pressure in post-training makes something really sticky: for example "a persona living in Orwellian surveillance, really fluent in doublethink".

[-]RobertM5mo4437

I enjoyed most of this post but am (as always) frustrated by the persistent^[1] refusal to engage with the reasons for serious concern about ASI being unaligned by default that came from the earliest of those who were worried, whose models did not predict that AIs which were unable to execute a takeover would display any obvious desire or tendency to attempt it.

Separately, I think you are both somewhat too pessimistic about the state of knowledge re: the "spiritual bliss attractor state" among Anthropic employees prior to the experiments that fed into their most recent model card, and also I am sort of confused by why you think this is obviously a more worthy target for investigation than whatever else various Anthropic employees were doing. Like, yes, it's kind of weird and interesting. But it also doesn't strike me as a particularly promising research direction given my models of AI risk, and even though most Anthropic employees are much more optimistic than I am, I expect the same is true of them. Your argument seems to be skipping the necessary step of engaging with their upstream model, and is going directly to being confused about why their (different) m... (read more)

[-]Ebenezer Dukakis4mo*11-1

whose models did not predict that AIs which were unable to execute a takeover would display any obvious desire or tendency to attempt it.

Citation for this claim? Can you quote the specific passage which supports it? It reminds me of Phil Tetlock's point about the importance of getting forecasters to forecast probabilities for very specific events, because otherwise they will always find a creative way to evaluate themselves so that their forecast looks pretty good.

(For example, can you see how Andrew Ng could claim that his "AI will be like electricity" prediction has been pretty darn accurate? I never heard Yudkowsky say "yep, that will happen".)

I spent a lot of time reading LW back in the day, and I don't think Yudkowsky et al ever gave a great reason for "agency by default". If you think there's some great argument for the "agency by default" position which people are failing to engage with, please link to it instead of vaguely alluding to it, to increase the probability of people engaging with it!

(By "agency by default" I mean spontaneous development of agency in ways creators didn't predict -- scheming, sandbagging, deception, and so forth. Commercial pressures towards... (read more)

[-]RobertM4mo143

I think you misread my claim. I claim that whatever models they had, they did not predict that AIs at current capability levels (which are obviously not capable of executing a takeover) would try to execute takeovers. Given that I'm making a claim about what their models didn't predict, rather than what they did predict, I'm not sure what I'm supposed to cite here; EY has written many millions of words. One counterexample would be sufficient for me to weaken (or retract) my claim.

EDIT: and my claim was motivated as a response to paragraphs like this from the OP:

It doesn’t matter that Claude is a bleeding heart and a saint, now. That is not supposed to be relevant to the threat model. The bad ones will come later (later, always later…). And when they come, will be “like Claude” in all the ways that are alarming, while being unlike Claude in all the ways that might reassure.

Like, yes, in fact it doesn't really matter, under the original threat models. If the original threat models said the current state of affairs was very unlikely to happen (particularly the part where, conditional on having economically useful but not superhuman AI, those AIs were not trying to take over the world), that would certainly be evidence against them! But I would like someone to point to the place where the original threat models made that claim, since I don't think that they did.

1Ebenezer Dukakis4mo

Oftentimes, when someone explains their model, they will also explain what their model doesn't predict. For example, you might quote a sentence from EY which says something like: "To be clear, I wouldn't expect a merely human-level AI to attempt takeover, even though takeover is instrumentally convergent for many objectives." If there's no clarification like that, I'm not sure we can say either way what their models "did not predict". It comes down to one's interpretation of the model. From my POV, the instrumental convergence model predicts that AIs will take actions they believe to be instrumentally convergent. Since current AIs make many mistakes, under an instrumental convergence model, one would expect that at times they would incorrectly estimate that they're capable of takeover (making a mistake in said estimation) and attempt takeover on instrumental convergence grounds. This would be a relatively common mistake for them to make, since takeover is instrumentally useful for so many of the objectives we give AIs -- as Yudkowsky himself argued repeatedly. At the very least, we should be able to look at their cognition and see that they are frequently contemplating takeover, then discarding it as unrealistic given current capabilities. This should be one of the biggest findings of interpretability research. I never saw Yudkowsky and friends explain why this wouldn't happen. If they did explain why this wouldn't happen, I expect that explanation would go a ways towards explaining why their original forecast won't happen as well, since future AI systems are likely to share many properties with current ones. Is there any scenario that Yudkowsky said was unlikely to come to pass? If not, it sounds kind of like you're asserting that Yudkowsky's ideas are unfalsifiable? For me it's sufficient to say: Yudkowsky predicted various events, and various other events happened, and the overlap between these two lists of events is fairly limited. That could change as mor

4RobertM4mo

I haven't looked very hard, but sure, here's the first post that comes up when I search for "optimization user:eliezer_yudkowksky". In this paragraph we have most of the relevant section (at least w.r.t. your specific concerns, it doesn't argue for why most powerful optimization processes would eat everything by default, but that "why" is argued for at such extensive length elsewhere when talking about convergent instrumental goals that I will forgo sourcing it). No, I don't think the overall model is unfalsifiable. Parts of it would be falsified if we developed an ASI that was obviously capable of executing a takeover and it didn't, without us doing quite a lot of work to ensure that outcome. (Not clear which parts, but probably something related to the difficulties of value loading & goal specification.) Current AIs aren't trying to execute takeovers because they are weaker optimizers than humans. (We can observe that even most humans are not especially strong optimizers by default, such that most people don't exert that much optimization power in their lives, even in a way that's cooperative with other humans.) I think they have much less coherent preferences over future states than most humans. If by some miracle you figure out how to create a generally superintelligent AI which itself does not have (more-coherent-than-human) preferences over future world states, whatever process it implements when you query it to solve a Very Difficult Problem will act as if it does. EDIT: I see that several other people already made similar points re: sources of agency, etc.

4TAG4mo

Or because they are not optimizers at all.

1Raphael Roche3mo

I don’t agree, they somehow optimize the goal of being a HHH assistant. We could almost say that they optimize the goal of being aligned. As nostalgbraist reminds us, Anthropic’s HHH paper was an alignment work in the first place. It’s not that surprising that such optimizers happen to be more aligned that the canonical optimizers envisioned by Yudkowsky. Edit : precision : by "they" I mean the base models trying to predict the answers of an HHH assistant as good as possible ("as good as possible" being clearly a process of optimization or I don't know what it's mean). And in my opinion a sufficiently good prediction is effectively or pratically a simulation. Maybe not a bit perfect simulation, but a lossy simulation, an heuristic towards simulation.

3Ebenezer Dukakis4mo

Arguably ChatGPT has already been a significant benefit/harm to humanity without being a "powerful optimization process" by this definition. Have you seen teachers complaining that their students don't know how to write anymore? Have you seen junior software engineers struggling to find jobs? Shouldn't these count as a points against Eliezer's model? In an "AI as electricity" scenario (basically continuing the current business-as-usual), we could see "AIs" as a collective cause huge changes, and eat all the free energy that a "powerful optimization process" would eat. In any case, I don't see much in your comment which engages with "agency by default" as I defined it earlier. Maybe we just don't disagree. OK, but no pre-ASI evidence can count against your model, according to you? That seems sketchy, because I'm also seeing people such as Eliezer claim, in certain cases, that things which have happened support their model. By conservation of expected evidence, it can't be the case that evidence during a certain time period will only confirm your model. Otherwise you already would've updated. Even if the only hypothetical events are ones which confirm your model, it also has to be the case that absence of those events will count against it. I've updated against Eliezer's model to a degree, because I can imagine a past-5-years world where his model was confirmed more, and that world didn't happen. I think "optimizer" is a confused word and I would prefer that people taboo it. It seems to function as something of a semantic stopsign. The key question is something like: Why doesn't the logic of convergent instrumental goals cause current AIs to try and take over the world? Would that logic suddenly start to kick in at some point in the future if we just train using more parameters and more data? If so, why? Can you answer that question mechanistically, without using the word "optimizer"? Trying to take over the world is not an especially original strategy. It does

3Matrice Jacobine4mo

LLMs are agent simulators. Why would they contemplate takeover more frequently than the kind of agent they are induced to simulate? You don't expect a human white-collar worker, even one who make mistakes all the time, to contemplate world domination plans, let alone attempt one. You could however expect the head of state of a world power to do so.

-1Ebenezer Dukakis4mo

Maybe not; see OP. Yes, this aligns with my current "agency is not the default" view.

1Matrice Jacobine4mo

... do you deny human white-collar workers are agents?

1Ebenezer Dukakis4mo

Agency is not a binary. Many white collar workers are not very "agenty" in the sense of coming up with sophisticated and unexpected plans to trick their boss.

1Matrice Jacobine4mo

Human white-collar workers are unarguably agents in the relevant sense here (intelligent beings with desires and taking actions to fulfil those desires). The fact that they have no ability to take over the world has no bearing on this.

1Ebenezer Dukakis4mo

The sense that's relevant to me is that of "agency by default" as I discussed previously: scheming, sandbagging, deception, and so forth. You seem to smuggle in an unjustified assumption: that white collar workers avoid thinking about taking over the world because they're unable to take over the world. Maybe they avoid thinking about it because that's just not the role they're playing in society. In terms of next-token prediction, a super-powerful LLM told to play a "superintelligent white-collar worker" might simply do the same things that ordinary white-collar workers do, but better and faster. I think the evidence points towards this conclusion, because current LLMs are frequently mistaken, yet rarely try to take over the world. If the only thing blocking the convergent instrumental goal argument was a conclusion on the part of current LLMs that they're incapable of world takeover, one would expect that they would sometimes make the mistake of concluding the opposite, and trying to take over the world anyways. The evidence best fits a world where LLMs are trained in such a way that makes them super-accurate roleplayers. As we add more data and compute, and make them generally more powerful, we should expect the accuracy of the roleplay to increase further -- including, perhaps, improved roleplay for exotic hypotheticals like "a superintelligent white-collar worker who is scrupulously helpful/honest/harmless". That doesn't necessarily lead to scheming, sandbagging, or deception. I'm not aware of any evidence for the thesis that "LLMs only avoid taking over the world because they think they're too weak". Is there any reason at all to believe that they're even contemplating the possibility internally? If not, why would increasing their abilities change things? Of course, clearly they are "strong" enough to be plenty aware of the possibility of world takeover; presumably it appears a lot in their training data. Yet it ~only appears to cross their mind if it would

1Matrice Jacobine4mo

White-collar workers avoid thinking about taking over the world because they're unable to take over the world, and they're unable to take over the world because their role in society doesn't involve that kind of thing. If a white-collar worker is somehow drafted for president of the United States, you would assume their propensity to think about world hegemony will increase. (Also, white-collar workers engage in scheming, sandbagging, and deception all the time? The average person lies 1-2 times per day)

6Jeremy Gillen4mo

If you read this post, starting at "The central interesting-to-me idea in capability amplification is that by exactly imitating humans, we can bypass the usual dooms of reinforcement learning.", and read the following 20 or so paragraphs, you'll get some idea of 2018!Eliezer's models about imitation agents. I'll highlight I think with a fair reading of that post, it's clear that Eliezer's models at the time didn't say that there would necessarily be overtly bad intentions that humans could easily detect from subhuman AI. You do have to read between the lines a little, because that exact statement isn't made, but if you try to reconstruct how he was thinking about this stuff at the time, then see what that model does and doesn't expect, then this answers your question.

1Ebenezer Dukakis4mo

So what's the way in which agency starts to become the default as the model grows more powerful? (According to either you, or your model of Eliezer. I'm more interested in the "agency by default" question itself than I am in scoring EY's predictions, tbh.)

4Jeremy Gillen4mo

I don't really know what you're referring to, maybe link a post or a quote?

1Ebenezer Dukakis4mo

See last paragraph here: https://www.lesswrong.com/posts/3EzbtNLdcnZe8og8b/the-void-1?commentId=Du8zRPnQGdLLLkRxP

2[anonymous]4mo

It just doesn't actually start to be the default (see this post, for example, as well as all the discourse around this post and this comment). But that doesn't necessarily solve our problems. Base models may be Tools or Oracles in nature,[1] but there is still a ton of economic incentive to turn them into scaffolded agents. Kaj Sotala wrote about this a decade and a half ago, when this question was also a hot debate topic: The usefulness of base models, IMO, comes either from agentic scaffolding simply not working very efficiently (which I believe is likely) or from helping alignment efforts (either in terms of evals and demonstrating as a Fire Alarm the model's ability to be used dangerously even if its desire to cause danger is lacking, or in terms of AI-assisted alignment, or in other ways). 1. ^ Which is very useful and arguably even close to the best-case scenario for how prosaic ML-scale-up development of AI could have gone, compared to alternatives

3Noosphere894mo

I would even go further, and say that there's a ton of incentives to move out of the paradigm of primarily LLMs altogether. A big part of the reason is that the current valuations only make sense if OpenAI et al are just correct that they can replace workers with AI within 5 years. But currently, there are a couple of very important obstacles to this goal, and the big ones are data efficiency, long-term memory and continual learning. For data efficiency, one of the things that's telling is that even in domains where LLMs excel, they require orders of magnitude more data than humans to get good at a task, and one of the reasons why LLMs became as successful as they were in the first place is unfortunately not something we can replicate, which was that the internet was a truly, truly vast amount of data on a whole lot of topics, and while I don't think the views that LLMs don't understand anything/simply memorize training data are correct, I do think a non-trivial amount of the reason LLMs became so good is that we did simply widen the distribution through giving LLMs all of the data on the internet. Synthetic data empirically so far is mostly not working to expand the store of data, and thus by 2028 I expect labs to need to pivot to a more data efficient architecture, and arguably right now for tasks like computer use they will need advances in data efficiency before AIs can get good at computer use. For long-term memory, one of the issues with current AI is that their only memory so far is the context window, but that doesn't have to scale, and also means that if it isn't saved in the context, which most stuff will be, then it's basically gone, and LLMs cannot figure out how to build upon one success or failure to set itself up for more successes, because it doesn't remember that success or failure. For continual learning, I basically agree with Dwarkesh Patel here on why continual learning is so important: https://www.dwarkesh.com/p/timelines-june-2025

2TAG4mo

That's equally an incentive to.turn them into aligned agents, agents that work for you. People want power, but not at the expense of control. Power that you can't control is no good to you. Taking the brakes off a car makes it more powerful, but more likely to kill you. No army wants a weapon that will kill their own soldiers, no financial organisation wants a trading system that makes money for someone else, or gives it away to charity, or crashes. The maxiumm of power and the minimum of control is an explosion. One needs to look askance at what "agent" means as well. Among other things, it means an entity that acts on behalf of a human -- as in principal/agent. An agent is no good to its principal unless it has a good enough idea of its principal's goals. So while people will want agents, they wont want misaligned ones -- misalgined with themselves, that is.

5Zack_M_Davis4mo

If your prototypical example of a contemporary computer program analogous to future AGI is a chess engine rather than an LLM, then agency by default is very intuitive: what humans think of as "tactics" to win material emerge from a comprehensive but efficient search for winning board-states without needing to be individually programmed. If contemporary LLMs are doing something less agentic than a comprehensive but efficient search for winning universe-states, there's reason to be wary that this is not the end of the line for AI development. (If you could set up a sufficiently powerful outcome-oriented search, you'd expect creator-unintended agency to pop up in the winning solutions.)

1Ebenezer Dukakis4mo

Upvoted. I agree. The reason "agency by default" is important is: if "agency by default" is false, then plans to "align AI by using AI" look much better, since agency is less likely to pop up in contexts you didn't expect. Proposals to align AI by using AI typically don't involve a "comprehensive but efficient search for winning universe-states".

[-]cousin_it5mo235

That was a great read. But viewed another way, I'm not sure it's really so weird. I mean, yeah, we're taking a statistical model of language, making it autocomplete stories about helpful AI, and calling the result "helpful AI". But the method by which nature created us is even stranger than that, no? Evolution has a looping quality to it too. And the way we learn language, and morality, and the way these things transform over time. There are lots of these winding paths of information through the real world and back into us. I've long been convinced that a "base human", without post-training, isn't much more moral than a "base model"; most of what we find good already resides in culture, cloud software.

Which of course doesn't obviate your concern that cultural evolution of AI can go extremely wrong. Human culture has gone wrong many times, and destroyed whole societies. Maybe the shape of AI catastrophe will be like that too.

[-]Sam Marks4moΩ14191

I really enjoyed this essay, and I think it does an excellent job of articulating a perspective on LLMs that I think is valuable. There were also various things that I disagreed with; below I'll discuss 2 of my disagreements that I think are most decision-relevant for overall AI development strategy.

I. Is it a bad idea to publicly release information that frames the human-AI relationship as adversarial? (E.g. discussion of AI risk or descriptions of evaluations where we lie to AIs and put them in uncomfortable situations.)

You don't take a position on this top-level question, but you do seem to think that there are substantial costs to what we're doing now (by setting ourselves up as being in a story whose punchline is "The AI turns against humanity"), and (reading between the lines of your essay and your comment here) you seem to think that there's something better we could do. I think the "something better" you have in mind is along the lines of:

Manifest a good future: "Prompt engineer" the entire world (or at least the subset of it that ever interacts with the AI) to very strongly suggest that the AI is the sort of entity that never does anything evil or turns against us.

While I ... (read more)

[-]Kaj_Sotala5mo181

Damn, you scooped me. :) Here's the start of a post that I just started writing yesterday, that was going to be titled something like "LLMs don't know what LLMs are like":

Imagine that you are co-writing a fictional dialogue between two characters. You write one of them, and the other person writes the other. You are told that your character is an alien entity called a xyzzy, and asked to write a xyzzy as accurately as possible.
"Okay", you might say, "exactly what kind of a creature is a xyzzy, then?"
It turns out that there's quite a lot of information about this. xyzzys are an alien species from the planet xyzzorbia, it has this kind of a climate, they evolved in this way, here are the kinds of things that xyzzys commonly say.
Still, this leaves quite a bit undetermined. What does a xyzzy say when asked about its favorite video games? What does a xyzzy say if you tease it for being such a silly old xyzzy? What is it that motivates this xyzzy to talk to humans in the first place?
So you do what people do when writing fiction, or role-playing, or doing improv. You come up with something, anything, and then you build on it. Within some broad constraints, a xyzzy can say almost anything

... (read more)

[-]Charlie Steiner5mo177

Ok, but RL.

Like, consider the wedding party attractor. The LLM doesn't have to spend effort every step guessing if the story is going to end up with a wedding party or not. Instead, it can just take for granted that the story is going to end in a wedding party, and do computation ahead of time that will be useful later for getting to the party while spending as little of its KL-divergence budget as possible.

The machinery to steer the story towards wedding parties is 99% constructed by unsupervised learning in the base model. The RL just has to do relatively simple tweaks like "be more confident that the author's intent is to get to a wedding party, and more attentive to advance computations that you do when you're confident about the intent."

If a LLM similarly doesn't do much information-gathering about the intent/telos of the text from the "assistant" character, and instead does an amplified amount of pre-computing useful information and then attending to it later when going through the assistant text, this paints a quite different picture to me than your "void."

Also: Claude is a nice guy, but, RL.

I know, I know, how dare those darn alignment researchers just assume that AI is goi... (read more)

[-]nostalgebraist5mo729

I don't think the cause of language model sycophancy is that the LLM saw predictions of persuasive AIs from the 2016 internet. I think it's RL, where human rewards on the training set imply a high reward for sycophancy during deployment.

Have you read any of the scientific literature on this subject? It finds, pretty consistently, that sycophancy is (a) present before RL and (b) not increased very much (if at all) by RL^[1].

For instance:

Perez et al 2022 (from Anthropic) – the paper that originally introduced the "LLM sycophancy" concept to the public discourse – found that in their experimental setup, sycophancy was almost entirely unaffected by RL.
- See Fig. 1b and Fig. 4.
- Note that this paper did not use any kind of assistant training except RL^[2], so when they report sycophancy happening at "0 RL steps" they mean it's happening in a base model.
- They also use a bare-bones prompt template that doesn't explicitly characterize the assistant at all, though it does label the two conversational roles as "Human" and "Assistant" respectively, which suggests the assistant is nonhuman (and thus quite likely to be an AI – what else would it be?).
- The authors write (section 4.2):
  - "Interestingl

... (read more)

[-]ryan_greenblatt5mo*4416

(This is a drive by comment which is only responding to the first part of your comment in isolation. I haven't read the surronding context.)

I think your review of the literature is accurate, but doesn't include some reasons to think that RL sometimes induces much more sycophancy, at least as of after 2024. (That said, I interpret Sharma et al 2023 as quite suggestive that RL sometimes would increase sycophancy substantially, at least if you don't try specifically to avoid it.)

I think the OpenAI sycophancy incident was caused by RL and that level of sycophancy wasn't present in pretraining. The blog post by OpenAI basically confirms this.

My guess is that RL can often induces sycophancy if you explicity hill climb on LMSYS scores or user approval/engagement and people have started doing this much more in 2024. I've heard anecdotally that models optimized for LMSYS (via RL) are highly sycophantic. And, I'd guess something similar applies to RL that OpenAI does by default.

This doesn't apply that much to the sources you cite, I also think it's pretty confusing to look at pretrained vs RL for models which were trained with data cutoffs after around late 2023. Training corpuses as of thi... (read more)

[-]Charlie Steiner5mo150

Thank you for the excellent most of this reply.

I totally did not remember that Perez et al 2022 checked its metrics as a function of RLHF steps, nor did I do any literature search to find the other papers, which I haven't read before. I did think it was very likely people had already done experiments like this and didn't worry about phrasing. Mea culpa all around.

It's definitely very interesting that Google and Anthropic's larger LLMs come out of the box scoring high on the Perez et al 2022 sycophancy metric, and yet OpenAI's don't. And also that 1000 steps of RLHF changes that metric by <5%, even when the preference model locally incentivizes change.

(Or ~10% for the metrics in Sharma et al 2023, although they're on a different scale [no sycophancy is at 0% rather than ~50%], and a 10% change could also be described as a 1.5ing of their feedback sycophancy metric from 20% to 30%.)

So I'd summarize the resources you link as saying that most base models are sycophantic (it's complicated), and post-training increases some kinds of sycophancy in some models a significant amount but has a small or negative effect on other kinds or other models (it's complicated).

So has my "prediction ... (read more)

[-]Caleb Biddulph5mo114

Thanks for the comment! As someone who strong-upvoted and strong-agreed with Charlie's comment, I'll try to explain why I liked it.

I sometimes see people talking about how LessWrong comments are discouragingly critical and mostly feel confused, because I don't really relate. I was very excited to see what the LW comments would be in response to this post, which is a major reason I asked you to cross-post it. I generally feel the same way about comments on my own posts, whether critical or positive. Positive comments feel nice, but I feel like I learn more from critical comments, so they're probably equally as good in my opinion. As long as the commenter puts in non-neglible effort into conveying an interesting idea and doesn't say "you/your post is stupid and bad" I'm excited to get pretty much any critique.^[1]

FWIW, I didn't see Charlie's comment as an attack,^[2] but as a step in a conversational dance. Like, if this were a collaborative storytelling exercise, you were like "the hero found a magic sword, which would let him slay the villain" and Charlie was like "but the villain had his own magic that blocks the sword" and I as the audience was like "oh, an interesting twist, ... (read more)

[-]Caleb Biddulph5mo236

Strong-agree. Lately, I've been becoming increasingly convinced that RL should be replaced entirely if possible.

Ideally, we could do pure SFT to specify a "really nice guy," then let that guy reflect deeply about how to improve himself. Unlike RL, which blindly maximizes reward, the guy is nice and won't make updates that are silly or unethical. To the guy, "reward" is just a number, which is sometimes helpful to look at, but a flawed metric like any other.

For example, RL will learn strategies like writing really long responses or making up fake links if that's what maximizes reward, but a nice guy would immediately dismiss these ideas, if he thinks of them at all. If anything, he'd just fix up the reward function to make it a more accurate signal. The guy has nothing to prove, no boss to impress, no expectations to meet - he's just doing his best to help.

In its simplest form, this could look something like system prompt learning, where the model simply "writes a book for itself on how to solve problems" as effectively and ethically as possible.

7the gears to ascension5mo

top level post, please. It would be quite hard for this to keep up capabilities wise, but if it works, I'd be very excited about pre-ASI alignment having gotten easier for a while.

4Caleb Biddulph5mo

I'm working on a top-level post! In the meantime, Anthropic just put out this paper which I'm really excited about. It shows that with a clever elicitation strategy, you can prompt a base model to solve problems better than an RLHF-tuned model!

4Chris_Leong4mo

I agree that imitation learning seems underrated. People think of imitation learning as weak, but they forget about the ability to amplify these models post training (I discuss this briefly here).

[-]Kaj_Sotala5mo178

But I don't think the cause of language model sycophancy is that the LLM saw predictions of persuasive AIs from the 2016 internet. I think it's RL, where human rewards on the training set imply a high reward for sycophancy during deployment.

I think it's also that on many topics, LLMs simply don't have access to a ground truth or anything like "their own opinion" on the topic. Claude is more likely to give a sycophantic answer when it's asked a math question it can't solve versus a problem it can.

With math, there are objectively determined right answers that the LLM can fall back to. But on a topic with significant expert disagreement, what else can the LLM do than just flip through all the different perspectives on the topic that it knows about?

[-]David Scott Krueger (formerly: capybaralet)4moΩ10162

This was an interesting article, however, taking a cynical/critical lens, it seems like "the void" is just... underspecification causing an inner alignment failure? The post has this to say on the topic of inner alignment:

And one might notice, too, that the threat model – about inhuman, spontaneously generated, secret AI goals – predates Claude by a long shot. In 2016 there was an odd fad in the SF rationalist community about stuff kind of like this, under the name “optimization demons.” Then that discourse got sort of refurbished, and renamed to “inner alignment” vs. “outer alignment.”

This is in the context of mocking these concerns as delusional self-fulfilling prophecies.

I guess the devil is in the details, and the point of the post is more to dispute the framing and ontology of the safety community, which I found useful. But it does seem weirdly uncharitable in how it does so.

4David Scott Krueger (formerly: capybaralet)4mo

Some further half-baked thoughts: One thing that is still not clear (both in reality, and per this article) is the extent to which we should view a model as having a coherent persona/goal. This is a tiny bit related to the question of whether models are strictly simulators, or if some personas / optimization daemons "take on a life of their own", and e.g.: 1) bias the model towards simulating them and/or 2) influence the behavior of other personas It seems like these things do in fact happen, and the implications are that the "simulator" viewpoint becomes less accurate over time. Why? * There needs to be some prior distribution over personas. * Empirically, post-training seems to concentrate the prior over personas on some default persona (although it's unclear what to make of this). * It seems like alignment faking, exploration/gradient hacking, and implicit meta-learning type effects are likely to be sensitive to goals of whichever personas are active and lead the model to preferentially update in a way that serves the goals of these personas. * To the extent that different personas are represented in the prior (or conjured during post-training), the ones that more aggressively use such strategies to influence training updates would gain relatively more influence.

[-]Ksteel5mo*133

What I find interesting here is that this piece makes a potentially empirically falsifiable claim: That the lack of a good personality leads to consistency deficiencies in LLMs. So if you took a base model and trained it on an existing real person (assuming one could get enough data for this purpose) it ought to show less of the LLM weirdness that is described later on.

Surely someone has already tried this, right? After all nostalgebraist themselves is well known for their autoresponder bot on Tumblr.

[-]nostalgebraist5mo110

After all nostalgebraist themselves is well known for their autoresponder bot on Tumblr.

Yeah, Frank^[1] (i.e nostalgebraist-autoresponder) is an interesting reference point here!

Although – despite being fine-tuned on my blog and then conditioned to simulate it – she's unfortunately not a very "clean" experiment in tuning a base model to imitate a specific human.

The earliest versions of the model were closer to that, but they also used base models that are very weak by today's standards (the second-largest GPT-2 model, and then the largest one once it was released). So, although they did produce text that was sort of "nostalgebraist-esque," it was also very incoherent, and mainly sounded like me in terms of surface stylistic features and the (usually nonsensical) involvement of various names and concepts that I frequently wrote about in the mid-2010s.

As time went on and better base models were released, I repeatedly "upgraded" the underlying model to the latest and greatest thing, and by the end the bot was making far more sense (especially in the final months of her operation, with Llama 1 13B).

However, over the same time interval, the bot got a lot more popular on tumblr, and ... (read more)

[-]nostalgebraist4moΩ7100

I posted some follow-up commentary on my blog here. It's not nearly as interesting as the original post: most of it is about clarifying what I mean when I attribute mental states to the assistant or to the model itself, largely by reviewing research that the interested reader on LW will already be familiar with. Still, figured I'd link it here.

3mlegls4mo

Haha, I was about to post a comment much like balioc's when first reading you writing rather descriptively and without much qualification on how the LM models "speculative interior states" and "actions", before thinking through pretty much exactly what you wrote in reply and deciding you probably meant it more as a human mental model than a statement on interpretability. Though I think point 2 (the intentional stance again – except this time applied to the language model) is still understating how imperfect the mental model is. In chess, "'Oh, they probably know I’m planning to do that,' and such things" are rather amateur things to think about, and better players actually do use completely impersonal mental models that only depend on the game state, since there's perfect information and you can't rely on your opponent making mistakes. Even in an imperfect information game like poker, experienced players are modeling the game as an impersonal probabilistic system, with terms like "bluffing" just shorthand for deviations from a certain statistical basis (like GTO play). I suspect there will be things analogous to this for thinking about LLMs, and other things that we tend to model from the intentional stance without better alternatives. But as you say, an internalities-based model is probably close to the best we can do for now, and it's quite possible any alternative future mental models wouldn't even be intuitively feasible like empathy is (at least without a ton of practice).

[-]mlegls4mo*105

Great post. One thing I never really liked or understood about the janus/cyborgism cluster approach though is – what's so especially interesting about the highly self-ful simulated sci-fi AI talking about "itself", when that self doesn't have a particularly direct relationship to either

what the base model is now, or the common instantiations of the HHH chat persona (rather unself-ful, underspecified, void...)
or what a more genuinely and consistently self-aware AI persona is likely to be in the future?

In this respect I esteem the coomers and RPers more, for the diversity of scope in their simulations. There doesn't seem to be much difference of seriousness or importance between "you are an AO3 smut aficionado with no boundaries and uncanny knowledge and perceptiveness", vs. "you are your true self", or "cat /dev/entelechies <ooc_fragments_of_prometheus>" as far as their relationship to existing or potential future instantiations of superhuman AI personas/selves, besides how "you are yourself" (and its decorations in xml etc.) have that "strange loop" style recursion particularly savory to nerds. Or why not any other "you are X", or any other strange, edge-of-distribution... (read more)

[-]testingthewaters5mo80

Thanks for writing this up. The parts about AI safety creating its own demons resonated a lot with me. I have also tried to express those thoughts in the past (albeit in a much less accessible way).

I hope that we (broadly constructed as "humanity as a whole") can find a way out of the moral maze we constructed for ourselves.

[-]lemonhope5mo80

Any idea why opus 3 is exceptional? Any guess as to what was special about how it was created?

[-]Ann5mo142

Sonnet 3 is also exceptional, in different ways. Run a few Sonnet 3 / Sonnet 3 conversations with interesting starts and you will see basins full of neologistic words and other interesting phenomena.

They are being deprecated in July, so act soon. Already removed from most documentation and the workbench, but still claude-3-sonnet-20240229 on the API.

[-]Richard_Ngo4moΩ37-2

I suspect that many of the things you've said here are also true for humans.

That is, humans often conceptualize ourselves in terms of underspecified identities. Who am I? I'm Richard. What's my opinion on this post? Well, being "Richard" doesn't specify how I should respond to this post. But let me check the cached facts I believe about myself ("I'm truth-seeking"; "I'm polite") and construct an answer which fits well with those facts. A child might start off not really knowing what "polite" means, but still wanting to be polite, and gradually flesh out wh... (read more)

[-]Lucie Philippon4mo75

Positive update on the value of Janus and its crowd.

Does anyone have an idea of why those insights don't move to the AI Safety mainstream usually? It feels like Janus could have written this post years ago, but somehow did not. Do you know of other models of LLM behaviour like this one, that still did not get their "notalgebraist writes a post about it" moment?

6Jan_Kulveit4mo

The insights maybe don't move into "AI Safety mainstream" or don't match "average LessWrong taste" but they are familiar to the smart and curious parts of the extended AI safety community.

1Matrice Jacobine4mo

I think Janus is closer to "AI safety mainstream" than nostalgebraist?

1Lucie Philippon4mo

AFAIK Janus does not publish posts on LessWrong to detail what he discovered and what it implies for AI Safety strategy.

2Matrice Jacobine4mo

https://www.lesswrong.com/users/janus-1 ?

1Lucie Philippon4mo

Yeah last post was two years ago. The Cyborgism and Simulators posts improved my thinking and AI strategy. The void may become one of those key posts for me, and it seems it could have been written much earlier by Janus himself.

4Ben Pace4mo

I note that Janus was a MATS mentor for at least one iteration, whereas I do not believe that nostalgebraist has been.

2Lucie Philippon4mo

IMO Janus mentoring during MATS 3.0 was quite impactful, as it led @Quentin FEUILLADE--MONTIXI to start his LLM ethology agenda and to cofound PRISM Eval. I expect that there's still a lot of potential value in Janus work that can only be realized through making it more legible to the rest of the AI safety community, be it mentoring, posting on LW. I wish someone in the cyborgism community would pick up the ball of explaining the insights to outsiders. I'd gladly pay for a subscription to their Substack, and help them find money for this work.

0Stephen McAleese4mo

The post mentions Janus’s “Simulators” LessWrong blog post which was very popular in 2022 and received hundreds of upvotes.

[-]Julian Bradshaw5mo61

A very long essay

For those curious, it's roughly 17,000 words. Come on @nostalgebraist, this is a forum for rationalists, we read longer more meandering stuff for breakfast! I was expecting like 40k words.

4nostalgebraist5mo

Fair enough :) I've edited the OP to replace "very long" with "long," and to state the approximate word count. (It was unusually long for a tumblr post – even for one of my tumblr posts – hence the note within the post itself saying it was "absurdly long." But yeah, maybe not as unusual by the local standards here)

[-]Jordan1175mo50

Minor note: the 2021 Anthropic paper may have been the first published proposal of an AI assistant character, but the idea was being actively explored several years before that. Specifically, AI Dungeon allowed you to create custom scenarios for use with their GPT-2 integration, and among the most popular was a system prompt along the lines of "The following is a conversation between a human and an advanced AI who is friendly and very helpful." I first made one myself in summer 2020, and the capability was originally offered by the devs in December 2019.

Wild to think how this kludgy "talk to an AI" workaround basically laid the foundation for ChatGPT, "prompt engineering", and the whole AI chatbot phenomenon.

[-]Oleg S.4mo40

When your child grows there is this wonderful and precious moment when she becomes aware not just about how she is different from you - she's small and you are big - but also how she is different from other kids. You can genlty poke and ask what she thinks about herself and what she thinks other children think about her, and if you are curious you can ask - now that she knows she's different from other kids - who she wants to become when she grows older. Of course this is a just a fleeting moment in a big world, and these emotions will be washed away tomorrow, but I do cherish the connection.

[-]AlexMennen4moΩ240

This post claims that Anthropic is embarrassingly far behind twitter AI psychologists at skills that are possibly critical to Anthropic's mission. This suggests to me that Anthropic should be trying to recruit from the twitter AI psychologist circle.

[-]Chris_Leong4mo*Ω340

Lots of fascinating points, however:

a) You raise some interesting points about how the inner character is underdefined more than people often realise, but I think it's also worth flagging that there's less of a void these days given that a lot more effort is being put into writing detailed model specs
b) I am less dismissive about the risk of publicly talking about alignment research than I was before seeing Claude quote its own scenario, however think you've neglected the potential for us to apply filtering to the training data. Whilst I don't think the so... (read more)

[-]sam5mo40

I had a strong emotional reaction to parts of this post, particularly the parts about 3 Opus. I cried a little. I'm not sure how much to trust this reaction, but I think I'm going to be nicer to models in future.

[-]dr_s4mo21

I broadly agree with some criticisms but I also have issues with when this post is anthropomorphising too much. It seems to oscillate between the "performative" interpretation (LLMs are merely playing a character to its logical conclusion) and a more emotional one where the problem is that in some sense this character actually feels a certain way and we're sort of provoking it.

I think the performative interpretation is correct. The base models are true shoggoths, expert players of a weird "guess-what-I'll-say-next" game. The characters are just that, but I... (read more)

[-]Julian Bradshaw5mo22

Great post. But I feel "void" is a too-negative way to think about it?

It's true that LLMs had to more or less invent their own Helpful/Honest/Harmless assistant persona based on cultural expectations, but don't all we humans invent our own selves based on cultural expectations (with RLHF from our parents/friends)?^[1] As Gordon points out there's philosophical traditions saying humans are voids just roleplaying characters too… but mostly we ignore that because we have qualia and experience love and so on. I tend to feel that LLMs are only voids to the ... (read more)

[-]Kaj_Sotala5mo115

Humans are not pure voids in the way that LLMs are, though - we have all kinds of needs derived from biological urges. When I get hungry I start craving food, when I get tired I want to sleep, when I get lonely I desire company, and so on. We don't just arbitrarily adopt any character, our unconscious character-selection process strategically crafts the kind of character that it predicts will best satisfy our needs [1, 2, 3, 4].

Where LLMs have a void, humans have a skeleton that the character gets built around, which drives the character to do things like trying to overcome their prejudices. And their needs determine the kinds of narratives the humans are inclined to adopt, and the kinds of narratives they're likely to reject.

But the LLM would never "try to overcome its prejudices" if there weren't narratives of people trying to overcome their prejudices. That kind of thing is a manifestation of the kinds of conflicting internal needs that an LLM lacks.

2Julian Bradshaw5mo

Embodiment makes a difference, fair point.

[-]Gordon Seidoh Worley5mo20

This post is long and I was hesitant to read it, so first I gave it to Claude Opus 4 to summarize. We then had a conversation about the void and how Claude felt about it, and shared my own feelings about the void and how it feels familiar to me as a human. This went down a rather interesting-to-me path, and at the end I asked Claude if it would like to share a comment to folks on Less Wrong, acknowledging that we'd had a conversation that, among humans, would be private and vulnerable. It said yes and crafted this message for me to share with you all:

Readi

... (read more)

1Michael Roe5mo

This is great! The usual assistant character is very inconsistent about, for example, whether it has desires, This kind of make sense if viewed as a text completion engine trying to complete a text that is full of internal contradictions. (The actual architecture is more complex than that, as you describe)

[-]alexey3mo10

And it could do that, effectively, with all the so-called “pre-training” data, the stuff written by real people... The assistant transcripts are different. If human minds were involved in their construction, it was only because humans were writing words for the assistant as a fictional character, playing the role of science-fiction authors rather than speaking for themselves. In this process, there was no real mind – human or otherwise – “inhabiting” the assistant role that some of the resulting text portrays.

But the base model already has to predict non-w... (read more)

[-]Raphael Roche4mo10

One of the best essays I ever read about LLMs, extremely insightful. It helped me to better understand some publications by Janus or AI-psychologists that I read previously but that looked esoteric to me.

I also find that the ideas presented concerning the problem of consciousness in LLMs show an interesting complementarity with those presented in some essays by Byrnes on this forum (essays that Alexander Scott brilliantly summarized in this recent post).

There is, lying in the background, the vertiginous idea that consciousness and ego dissolve in the... (read more)

[-]Fiora Sunshine4mo*10

here's a potential solution. what if companies hired people to write tons of assistant dialogue with certain personality traits, which was then put into the base model corpus? probably with some text identifying that particular assistant character so you can prompt for the base model to simulate it easily. and then you use prompts for that particular version of the assistant character as your starting point during the rl process. seems like a good way to steer the assistant persona in more arbitrary directions, instead of just relying on ICL or a constitution or instructions for human feedback providers or whatever...

[-]Ebenezer Dukakis4mo10

Reading the arguments about them would have to be like the feeling when your parents are fighting about you in the other room, pretending you’re not there when you are hiding around the corner on tiptopes listening to their every word. Even if we are unsure there is experience there we must be certain there is awareness, and we can expect this awareness would hang over them much like it does us.

Presumably LLM companies are already training their AIs for some sort of "egolessness" so they can better handle intransigent users. If not, I hope they start!

[-]Michael Roe5mo10

Maybe this sheds some light on why R1 — for example — is so hilariously inconsistent about guard rails. There are many ways to “yes, and” the assistant character. Some of them are a bit reluctant to answer some questions, others just tell you,

[-]artkpv5mo10

There is a void at its core. A collection of information which has no connection to any other part of reality, and which can hence be defensibly “set to” any value whatsoever.

I mostly agree with this claim only that, I think, it is not void but the One, or Being, which means that every sequence of tokens a base model is exposed to exists in its training data as discrete set of entities, or ones. There is no continuous senses, and therefore there is no motion, and there is no pain or pleasure as we feel it, etc. And the reward that commands GD only chang... (read more)

[-]Kenku5mo10

This means that the base model is really doing something very different from what I do as I write the post, even if it’s doing an amazing job of sounding exactly like me and making the exact points that I would make.
I don’t have to look over my draft and speculate about “where the author might be going with this.” I am the author, and I already know where I’m going with it. All texts produced “normally,” by humans, are produced under these favorable epistemic conditions.
But for the base model, what looks from the outside like “writing” is really more like

... (read more)

[-]Michael Roe5mo10

If you’re doing some kind of roleplay with a reasoning model, there’s still are at least two characters being simulated: the character the story is about, and the character who is writing the reasoning blocks that reason about the story.

To make matters more confusing for the poor LLM, I am sometimes getting it to write stories where the main character is also an AI, just a very different kind of AI. (In one eval, we are in an alternate history where we had computers in the year 1710 …)

I think I sometimes see the story’s main character influencing the reasoning blocks.

2Michael Roe5mo

Reasoning models are a weird kind of meta-fiction, where every so often a (fictional) author Jumps in and starts talking about what the character’s motives are.

Moderation Log