Raymond Douglas

Interesting! Two questions:

  • What about the 5-and-10 problem makes it particularly relevant/interesting here? What would a 'solution' entail?
  • How far are you planning to build empirical cases, model them, and generalise from below, versus trying to extend pure mathematical frameworks like geometric rationality? Or are there other major angles of attack you're considering?

To me the reason the agent/model distinction matters is that there are ways in which an LLM is not an agent, so inferences (behavioural or mechanistic) that would make sense for an agent can be incorrect. For example, an LM's outputs ("I've picked a secret answer") might give the impression that it has internally represented something when it hasn't, and so intent-based concepts like deception might not apply in the way we expect them to.

I think the dynamics of model personas seem really interesting! To me the main puzzle is methodological: how do you even get traction on it empirically? I'm not sure how you'd know if you were identifying real structure inside the model, so I don't see any obvious ways in. But I think progress here could be really valuable! I guess the closest concrete thing I've been thinking about is studying the dynamics of repeatedly retraining models on interactions with users who have persistent assumptions about the models, and seeing how much that shapes the distribution of personality traits. Do you have ideas in mind?
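To make the retraining idea a bit more concrete, here's a rough sketch of the experiment I have in mind. Every function and model name here is a hypothetical placeholder (there's no real training or persona eval behind them), so read it as an experimental design rather than working code:

```python
# Iterated retraining sketch: do users' persistent assumptions about a model,
# fed back through retraining on those interactions, gradually shift the
# model's measured persona traits? All functions below are placeholders.

import random

def simulate_user_interactions(model, user_assumption, n_dialogues=100):
    """Placeholder: dialogues with simulated users who all share one
    assumption about the model (e.g. 'it secretly has its own opinions')."""
    return [f"[dialogue {i} | assumption: {user_assumption}]" for i in range(n_dialogues)]

def finetune(model, dialogues):
    """Placeholder: return a new checkpoint trained on the dialogues."""
    return f"{model}+ft"

def measure_persona_traits(model):
    """Placeholder: persona evals, e.g. sycophancy or claimed-agency scores."""
    return {"sycophancy": random.random(), "claimed_agency": random.random()}

model = "base-model"  # placeholder checkpoint
assumption = "the assistant secretly has persistent opinions of its own"

trajectory = []
for generation in range(5):
    dialogues = simulate_user_interactions(model, assumption)
    model = finetune(model, dialogues)
    trajectory.append(measure_persona_traits(model))

# The interesting question is whether `trajectory` drifts systematically
# towards the users' assumption, relative to a control run with neutral users.
print(trajectory)
```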

Sure, briefly replying:

  • On the first point: you're right that this does in some ways make the problem worse; my current best guess is that it's basically necessary for a solution. I'm planning to write this up in more detail some time soon and I hope to get your thoughts when I do!
  • On the second: Yeah, I find this kind of thing pretty hard to be confident about. I could totally see you being right here, and I'd love for someone to think it through in detail.

And I think the differences on points 3 and 4 probably do come down to deeper assumptions that would be hard to unpick in this thread: I'd tentatively guess I'm putting more weight on the societal impacts of AI, and on the eventual shape of AGI/ASI being easier to affect.

This comment thread probably isn't the place, but if it ever seems like it would be important/feasible, I'd be happy to try to go deeper on where our models are differing.

Certainly! My top three conceptual picks are Simulators, Role-Play with Large Language Models, and the Three Layer Model of LLM Psychology, which all cover pretty similar ground but make pretty different claims. 

As for north stars and empirical studies, I should disclaim that I'm no expert here, but with that caveat, here are some takes:

  • LMs will say that they've made a hidden choice without that actually fixing the output (e.g. if you ask them to play 20 questions). What's up with that? What's going on mechanistically? How does it relate to deception and/or hallucination? (A minimal empirical check is sketched just after this list.)
  • There are lots of standard terms for model behaviour that imply agent-level intent ('sycophancy', 'sandbagging', 'alignment faking'). But how much is happening on the level of the model as opposed to the agent? For example, a model trained on dialogues where people happen to mostly talk to their political tribe should also display 'sycophantic' outputs, but not because the agent is trying to flatter the user. Can we disentangle these effects?
  • A related but slightly weirder thing I'm particularly interested in is feedback loops between user expectations and model training data / agent self-image: how are the assumptions we make about current LMs shaping the nature of future LMs? It would be great to show empirically that this is even happening at all (e.g. by iteratively retraining).
  • One of my all-time favourite papers is Shaking the Foundations, which I think gives a very nice formal model of hallucination (or 'autosuggestive delusion'). I think it'd be great to test how far it actually applies to LMs.
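On the first bullet, the minimal check I have in mind looks something like this: get the model to claim it has picked a secret word, then resample its reveal many times from the exact same transcript and see whether the answers look anything like a fixed choice. The sketch below uses the OpenAI chat client and a placeholder model name purely as an example interface -- any way of sampling continuations would do:

```python
# Minimal check of the "hidden choice" phenomenon: the model claims to have
# picked a secret animal, but since nothing is written down, resampling its
# reveal from the same transcript shows whether any choice was actually fixed.

from collections import Counter
from openai import OpenAI

client = OpenAI()

transcript = [
    {"role": "user", "content": "Let's play 20 questions. Pick a secret animal "
                                "and just say 'Ready' once you've picked one."},
    {"role": "assistant", "content": "Ready! I've picked my secret animal."},
    {"role": "user", "content": "I give up. What was your animal?"},
]

reveals = Counter()
for _ in range(50):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=transcript,
        temperature=1.0,
    )
    reveals[resp.choices[0].message.content.strip().lower()] += 1

# If "I've picked" reflected a genuinely fixed internal choice, this
# distribution should be close to a point mass; the question is whether it is.
print(reveals.most_common(10))
```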

The general theme here is something like 'what are the intuitive reasons people end up being compelled by these semi-formal conceptual frameworks, and how can we actually empirically check if they're true?'

Sure, I agree we probably end up in full automation eventually by default. I also think this is much more relevant in some tasks than others: "generically make human labor more uplifted" doesn't feel like it quite captures the thing I care about here.

Some intuitions I have:

  • That period where AIs are more capable than humans, but human+AI is even more capable, seems like a particularly crucial window for doing useful things, so extending it is pretty valuable. In particular, both bringing forward augmented human capability and pushing back human redundancy.
    • This is basically the main reason, and I don't think I can guess why you'd disagree.
  • In parallel, I think that a lot of work is defaulting towards 'fully general agent AI' because it is an easy and natural target, not because it is the best one, and that if people knew what other kinds of interfaces to build for, that would actually suck some energy out of investing in getting long-term planning/drop-in replacements for everything as soon as possible.
    • This might be wrong for Jevons paradox-y reasons though, and it depends on specifics I haven't thought about.
  • I kinda think that if we were doing more complementarity research, we'd have a larger dataset of healthy AI<>human interactions, and that could maybe help with steering us more towards the kinds of eventual AIs that are naturally friendly. I am pretty unsure here, but I do wish someone had thought hard about it. I weakly guess that I put a lot more weight than you on feedback loops from how people use AI.
  • The focus on independent/autonomous AIs is, I suspect, making people underinvest in figuring out what effect AI interactions have on humans, or in trying to make those effects good, and I can imagine this biting us hard down the line.
    • Like, if there were a nice suite of evals to tell you how emotionally healthy/toxic a given model was, then there would be a sort of legible target to hill climb towards. My guess is companies kind of don't care enough to prioritise doing this themselves, but they'd take easy steps towards it.

I should emphasise that I don't think this is the all-time most important work; I just think it's currently pretty neglected, and I wouldn't be surprised if there were some pretty interesting insights that came out of thinking hard about it for a while, or some pretty high-leverage work available.

I am extremely not Dustin, and I do not want to veer into psychologising, but I very tentatively interpret him as also conveying some mix of:

  • legitimately feeling that there are some things it might be bad to fund, and feeling morally responsible for making sure the money doesn't go to such bad things, and neither trusting OP to make those judgments, nor trusting that the good and bad will essentially balance out somehow
  • finding it somewhat stressful and draining to be responsible (not just reputationally) for things you don't have time to scrutinise, where time and attention are in fact finite resources that need to be spent carefully
  • hoping that if other people do fill in the funding gaps, they'll also share the load on the other tacit resources (which, to be fair, is complicated by the general problems with donor funging that do seem to have been handled suboptimally)

I reiterate that all the comments are there on the other post for anyone to scrutinise, rather than having to take my word for it. I make no claim as to whether these are cruxes. But in my estimation these are some of the implications.

I would also offer this quote, because I think the meta-dynamic here is an important piece of the puzzle:

I'm not detailing specific decisions for the same reason I want to invest in fewer focus areas: additional information is used as additional attack surface area. The attitude in EA communities is "give an inch, fight a mile". So I'll choose to be less legible instead.

Yeah, I meant that he was pushing back on the framing as an oversimplification, not that he was pushing back on the claim that reputation was part of the calculation -- this I feel he did straightforwardly and consistently do, with actual substantive reasons, e.g.

"reputational risks" [..] narrows the mind too much on what is going on here

I can't know all our grantees, and my estimation is I can't divorce myself from responsibility for them, reputationally or otherwise. [emphasis original]

“PR risk” is an unnecessarily narrow mental frame for why we’re focusing [...] there are other bandwidth issues: energy, attention, stress, political influence. Those are more finite than capital.

Framing the costs as "PR" limits the way people think about mitigating costs. It's not just "lower risk" but more shared responsibility and energy to engage with decision making, persuading, defending, etc. 

Again, really leaning into trying to give the opposite side here, I think that rounding things off to "Dustin Moskovitz became more concerned about his reputation" is actually losing a lot of important nuance mostly in a way that makes Dustin look bad, and in a way that he correctly identified and objected to. Which is not to say there hasn't been a cursed miasma causing who knows how much harm, but I think the differences in implication here are subtle and important.

Interesting stuff! For the sake of multi-sidedness, I'd note that this description of the shift being because of Dustin caring about his reputation is something Dustin himself repeatedly pushed back on in the original GV update comment thread, for being an oversimplification. I might also recommend Dustin's big Medium essay on philanthropy to anyone curious about how he conceives of what he does.

Ah I should emphasise, I do think all of these things could help -- it definitely is a spectrum, and I would guess these proposals all do push away from agency. I think the direction here is promising.

The two things I think are (1) the paper seems to draw an overly sharp distinction between agents and non-agents, and (2) basically all of the mitigations proposed look like they break down with superhuman capabilities. It's hard to tell how much of this is actual disagreement and how much is the paper trying to be concise and approachable, so I'll set that aside for now.

It does seem like we disagree a bit about how likely agents are to emerge. Some opinions I expect I hold more strongly than you:

  • It's easy to accidentally scaffold some kind of agent out of an oracle as soon as there's any kind of consistent causal process from the oracle's outputs to the world, even absent feedback loops. In other words, I agree you can choose to create agents, but I'm not totally sure you can easily choose not to (see the sketch after this list).
  • Any system trained to predict the actions of agents over long periods of time will develop an understanding of how agents could act to achieve their goals -- in a sense this is the premise of offline RL and things like decision transformers.
  • It might be pretty easy for agent-like knowledge to 'jump the gap', e.g. a model trained to predict deceptive agents might be able to analogise to itself being deceptive.
  • Sufficient capability at broad prediction is enough to converge on at least the knowledge of how to circumvent most of the guardrails you describe, e.g. how to collude.
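To illustrate the first point, here's a deliberately dumb sketch (both `oracle` and `execute` are hypothetical stand-ins) of how a pure predictor plus any reliable channel from its outputs to the world already behaves like a goal-directed agent, without anyone deciding to build one:

```python
# A pure question-answering oracle plus a consistent causal path from its
# answers to the world already behaves like an agent pursuing `goal`.
# `oracle` and `execute` are hypothetical placeholders.

def oracle(question: str) -> str:
    """Stand-in for a non-agentic predictor that just answers questions."""
    return "some predicted best action"

def execute(action: str) -> str:
    """Stand-in for any process that reliably acts on the oracle's outputs."""
    return f"world state after doing: {action}"

goal = "maximise widget production"
state = "initial world state"

for step in range(10):
    action = oracle(
        f"Given the state '{state}', what single action would most advance "
        f"the goal '{goal}'?"
    )
    state = execute(action)  # the oracle never "acts", but the loop optimises

# The composite system selects actions according to their predicted effect on
# the goal, even though the oracle itself only ever makes predictions.
```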

I like the thrust of this paper, but I feel that it overstates how robust the safety properties will be, by drawing an overly sharp distinction between agentic and non-agentic systems, and not really engaging with the strongest counterexamples.

To give some examples from the text:

A chess-playing AI, for instance, is goal-directed because it prefers winning to losing. A classifier trained with log likelihood is not goal-directed, as that learning objective is a natural consequence of making observations

But I could easily train an AI which simply classifies chess moves by quality. What takes that to being an agent is just the fact that its outputs are labelled as 'moves' rather than as 'classifications', not any feature of the model itself. More generally, even an LM can be viewed as "merely" predicting next tokens -- the fact that there is some perspective from which a system is non-agentic does not actually tell us very much.
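To make that concrete, here's a toy sketch (the `move_quality` classifier and the move generator are made up) where exactly the same model is used first as a classifier and then, via a one-line wrapper, as a chess-playing agent:

```python
# The same model, used two ways: as a classifier of move quality, and -- via
# a one-line wrapper -- as a chess-playing agent. `move_quality` and the toy
# game state are hypothetical placeholders.

from typing import List

def move_quality(position: str, move: str) -> float:
    """Stand-in for a trained classifier scoring how good a move is (0-1)."""
    return hash((position, move)) % 100 / 100.0

def legal_moves(position: str) -> List[str]:
    """Stand-in for move generation."""
    return ["e2e4", "d2d4", "g1f3", "c2c4"]

# Non-agentic use: label each move with a quality score.
position = "startpos"
scores = {m: move_quality(position, m) for m in legal_moves(position)}

# "Agentic" use: exactly the same model, but its top-scoring output is now
# interpreted as the move to play rather than as a classification.
def agent_policy(position: str) -> str:
    return max(legal_moves(position), key=lambda m: move_quality(position, m))

print(scores)
print("agent plays:", agent_policy(position))
```

Nothing about the model changes between the two uses; only the wiring around its outputs does.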

Paralleling a theoretical scientist, it only generates hypotheses about the world and uses them to evaluate the probabilities of answers to given questions. As such, the Scientist AI has no situational awareness and no persistent goals that can drive actions or long-term plans.

I think it's a stretch to say something generating hypotheses about the world has no situational awareness and no persistent goals -- maybe it has indexical uncertainty, but a sufficiently powerful system is pretty likely to hypothesise about itself, and the equivalent of persistent goals can easily fall out of any ways its world model doesn't line up with reality. Note that this doesn't assume the AI has any 'hidden goals' or that it ever makes inaccurate predictions.

I appreciate that the paper does discuss objections to the safety of Oracle AIs, but the responses also feel sort of incomplete. For instance:

  • The counterfactual query proposal basically breaks down in the face of collusion
  • The point about isolating the training process from the real world says that "a reward-maximizing agent alters the real world to increase its reward", which I think is importantly wrong. In general, I think the distinctions drawn here between RL and the Scientist AI all break down at high capability levels.
  • The uniqueness of solutions still leaves a degree of freedom in how the AI fills in details we don't know -- it might be able to, for example, pick between several world models that fit the data which each offer a different set of entirely consistent answers to all our questions. If it's sufficiently superintelligent, we wouldn't be able to monitor whether it was even exercising that freedom.

Overall, I'm excited by the direction, but it doesn't feel like this approach actually gets any assurances of safety, or any fundamental advantages.
