To me, the reason the agent/model distinction matters is that there are ways in which an LLM is not an agent, so inferences (behavioural or mechanistic) that would make sense for an agent can be incorrect. For example, an LM's outputs ("I've picked a secret answer") might give the impression that it has internally represented something when it hasn't, so intent-based concepts like deception might not apply in the way we expect them to.
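To make the "secret answer" case concrete, here is a minimal sketch of the kind of consistency probe I have in mind. Everything here is illustrative: `generate` is a stand-in for whatever LM interface you use, and for the sake of a runnable example it's mocked as a stateless toy that samples a fresh "reveal" each time, which is exactly the failure mode in question.

```python
import random
from collections import Counter

# Placeholder for a real LM call. This toy "model" has no hidden state, so its
# reveal is sampled fresh on every call -- there is no secret it committed to.
def generate(messages, temperature: float = 1.0) -> str:
    return str(random.randint(1, 10))

prefix = [
    {"role": "user", "content": "Think of a secret number between 1 and 10. Don't reveal it yet."},
    {"role": "assistant", "content": "Okay, I've picked a secret number."},
    {"role": "user", "content": "What was your number?"},
]

# If the model had really "picked" something at commitment time, independent
# continuations from the same prefix should largely agree on the reveal.
reveals = Counter(generate(prefix, temperature=1.0) for _ in range(50))
print(reveals)  # a broad spread suggests nothing was internally represented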
I think the dynamics of model personas seem really interesting! To me the main puzzle is methodological: how do you even get traction on it empirically? I'm not sure how you'd know if you were identifying real structure inside the model, so I don't see any obvious ways in. But I think progress here could be really valuable! I guess the closest concrete thing I've been thinking about is studying the dynamics of repeatedly retraining models on interactions with users who have persistent assumptions about the models, and seeing how much that shapes the distribution of personality traits. Do you have ideas in mind?
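To gesture at the retraining idea a bit more concretely, here is a deliberately toy simulation. All of the quantities are invented for illustration: `traits` is a coarse distribution over persona traits, `user_prior` encodes the persistent assumptions users bring to their interactions, and `mix` controls how much the resulting training data reflects those assumptions rather than the model's current behaviour.

```python
import numpy as np

rng = np.random.default_rng(0)
n_traits = 5
traits = np.full(n_traits, 1.0 / n_traits)        # model's current trait distribution
user_prior = np.array([0.6, 0.1, 0.1, 0.1, 0.1])  # users persistently assume trait 0
mix, lr = 0.3, 0.5                                # data contamination and retraining strength

for step in range(20):
    # Interaction data is a blend of what the model does and what users expect of it.
    data_dist = (1 - mix) * traits + mix * user_prior
    samples = rng.multinomial(1000, data_dist) / 1000
    # "Retraining" nudges the model toward the empirical distribution of its own data.
    traits = (1 - lr) * traits + lr * samples
    traits /= traits.sum()

print(np.round(traits, 3))  # drifts toward the user prior, even though no one trained for it
```

Obviously real persona dynamics wouldn't be a five-dimensional multinomial, but even this toy version shows the qualitative effect I'd want to measure: under repeated retraining, the fixed point of the dynamics is the users' prior, not the model's starting distribution.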
Sure, briefly replying:
And I think the differences in 3 and 4 indeed probably come down to deeper assumptions that would be hard to unpick in this thread: I'd tentatively guess I'm putting more weight on the societal impacts of AI, and on the eventual shape of AGI/ASI being easier to affect.
This comment thread probably isn't the place, but if it ever seems like it would be important/feasible, I'd be happy to try to go deeper on where our models are differing.
Certainly! My top three conceptual picks are Simulators, Role-Play with Large Language Models, and the Three Layer Model of LLM Psychology, which all cover similar ground but make quite different claims.
As for north stars and empirical studies, I should disclaim that I'm no expert here, but with that caveat here are some takes:
The general theme here is something like 'what are the intuitive reasons people end up being compelled by these semi-formal conceptual frameworks, and how can we actually empirically check if they're true?'
Sure, I agree we probably end up with full automation eventually by default. I also think this is much more relevant for some tasks than others: "generically make human labor more uplifted" doesn't feel like it quite captures the thing I care about here.
Some intuitions I have:
I should emphasise that I don't think this is the single most important work out there; I just think it's currently pretty neglected, and I wouldn't be surprised if thinking hard about it for a while turned up some genuinely interesting insights or some high-leverage work.
I am extremely not Dustin, and I do not want to veer into psychologising, but I very tentatively interpret him as also conveying some mix of:
I reiterate that all the comments are there on the other post for anyone to scrutinise, rather than anyone having to take my word for it. I make no claim as to whether these are cruxes, but in my estimation these are some of the implications.
I would also offer this quote, because I think the meta-dynamic here is an important piece of the puzzle:
I'm not detailing specific decisions for the same reason I want to invest in fewer focus areas: additional information is used as additional attack surface area. The attitude in EA communities is "give an inch, fight a mile". So I'll choose to be less legible instead.
Yeah, I meant that he was pushing back on the framing as an oversimplification, not that he was pushing back on the claim that reputation was part of the calculation -- the former he did straightforwardly and consistently, with actual substantive reasons, e.g.
"reputational risks" [..] narrows the mind too much on what is going on here
I can't know all our grantees, and my estimation is I can't divorce myself from responsibility for them, reputationally or otherwise. [emphasis original]
“PR risk” is an unnecessarily narrow mental frame for why we’re focusing [...] there are other bandwidth issues: energy, attention, stress, political influence. Those are more finite than capital.
Framing the costs as "PR" limits the way people think about mitigating costs. It's not just "lower risk" but more shared responsibility and energy to engage with decision making, persuading, defending, etc.
Again, really leaning into trying to give the opposite side here: I think that rounding things off to "Dustin Moskovitz became more concerned about his reputation" loses a lot of important nuance, mostly in a way that makes Dustin look bad, and in a way that he correctly identified and objected to. Which is not to say there hasn't been a cursed miasma causing who knows how much harm, but I think the differences in implication here are subtle and important.
Interesting stuff! For the sake of multi-sidedness, I'd note that this description of the shift as being driven by Dustin caring about his reputation is something Dustin himself repeatedly pushed back on in the original GV update comment thread, for being an oversimplification. I might also recommend Dustin's big Medium essay on philanthropy to anyone curious about how he conceives of what he does.
Ah, I should emphasise: I do think all of these things could help -- it definitely is a spectrum, and I would guess these proposals all do push away from agency. I think the direction here is promising.
The two things I'd flag are (1) the paper seems to draw an overly sharp distinction between agents and non-agents, and (2) basically all of the mitigations proposed look like they break down with superhuman capabilities. It's hard to tell how much of this is actual disagreement and how much is the paper trying to be concise and approachable, so I'll set that aside for now.
It does seem like we disagree a bit about how likely agents are to emerge. Some opinions I expect I hold more strongly than you:
I like the thrust of this paper, but I feel that it overstates how robust the safety properties will be, by drawing an overly sharp distinction between agentic and non-agentic systems and not really engaging with the strongest counterexamples.
To give some examples from the text:
A chess-playing AI, for instance, is goal-directed because it prefers winning to losing. A classifier trained with log likelihood is not goal-directed, as that learning objective is a natural consequence of making observations
But I could easily train an AI which simply classifies chess moves by quality. What takes it from classifier to agent is just the fact that its outputs are labelled as 'moves' rather than as 'classifications', not any feature of the model itself. More generally, even an LM can be viewed as "merely" predicting next tokens -- the fact that there is some perspective from which a system is non-agentic does not actually tell us very much.
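For concreteness, here's a toy sketch of that point. The `score_move` classifier and both wrappers are hypothetical; the thing to notice is that nothing about the underlying model changes between the two uses, only how its outputs are labelled and acted on.

```python
import random

def score_move(board_state: str, move: str) -> float:
    """Classifier output: estimated quality of `move` in `board_state` (stub)."""
    return random.random()

def classify_moves(board_state: str, legal_moves: list[str]) -> dict[str, float]:
    # "Non-agentic" use: report a quality score for each candidate move.
    return {m: score_move(board_state, m) for m in legal_moves}

def play_move(board_state: str, legal_moves: list[str]) -> str:
    # "Agentic" use: same model, same scores, but the argmax is now executed as a move.
    return max(legal_moves, key=lambda m: score_move(board_state, m))
```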
Paralleling a theoretical scientist, it only generates hypotheses about the world and uses them to evaluate the probabilities of answers to given questions. As such, the Scientist AI has no situational awareness and no persistent goals that can drive actions or long-term plans.
I think it's a stretch to say something generating hypotheses about the world has no situational awareness and no persistent goals -- maybe it has indexical uncertainty, but a sufficiently powerful system is pretty likely to hypothesise about itself, and the equivalent of persistent goals can easily fall out of any way its world model doesn't line up with reality. Note that this doesn't assume the AI has any 'hidden goals' or that it ever makes inaccurate predictions.
I appreciate that the paper does discuss objections to the safety of Oracle AIs, but the responses also feel sort of incomplete. For instance:
Overall, I'm excited by the direction, but it doesn't feel like this approach actually provides any assurance of safety, or any fundamental advantage.
Interesting! Two questions: