Max_He-Ho — LessWrong

Doing a PhD in Philosophy of AI. Working on conceptual AI Safety things

If you have multiple AI agents in mind here, then yes, though it's not the focus. Otherwise, also yes, though depending on what you mean by "multiple agent systems acting as one", one could also see this as being fulfilled by some agents dominating other agents so that they (have to) act as one. I'd rather put it this way: the difficulty of predicting other agents' actions and goals is what leads to the risks from AI & to the difficulty of the alignment and control problem.

slowing down LLM progress would be dangerous, as other approaches like RL agents would pass them by before appearing dangerous.

This seems misleading to me & might be a false dichotomy. It's not LLMs or RL agents. I think we'll (unfortunately) build agents on the basis of LLMs & the capabilities they have. Every bit of additional progress on LLMs gives these agents more capabilities faster, with less time for alignment. They will be (and are!) built based on the mere (perceived) incentives of everybody involved & the unilateralist's curse. (See esp. Gwern's Tool AIs want to be Agent AIs.) I can see that such agents have interpretability advantages over RL agents, but since RL agents seem far off, with less work going into them, I don't get why we should race regarding LLMs & LLM-based agents.

I'm personally not sure whether "inherently oracles" is accurate for current LLMs (both before & after RLHF), but it seems simply false when considering plugins & AutoGPT (besides other recent stuff).

Neither quite captures it, imo. I think it’s mostly an attempt at deconfusion:

We can’t hope to solve alignment by sufficiently nudging the relevant AI's utility function, since knowing something about the utility function (as argued here) requires either predicting it (not just tweaking it & crossing your fingers really hard) or predicting the AI's behavior. This is a substantially harder problem than the term alignment suggests on the surface, and one that, it seems, we cannot avoid. Interpretability (as far as I'm aware) is nowhere near this. The prediction section makes an epistemic argument, though. It addresses how hard solving the alignment problem is (much harder than one might think), not how likely doom by default is. For the latter, we would need to know the sampling space we'll be in when developing AGI and, as shminux points out, a definition that delineates the safe part of that sampling space.
The second thesis is more strictly about the impossibility of safety guarantees, but I think it’s most interesting in conjunction with the first one: If we need prediction for alignment and prediction is impossible for computationally universal systems, then the alignment problem may also be impossible to solve for such systems. Thus the fire thesis (which is a possibility we should consider, not something that’s shown here. There might be all kinds of informal clues for safety; a bit more on that in the discussion section here).

Our current LLMs like GPT-4 are not, in their base configurations, agents. They do not have goals.

What is the difference between being an agent and not being an agent here? Goals seem like an obvious point but since GPT-4 also minimized its loss during training and perhaps still does as they keep tweaking it, is the implied difference that base GPT-4 is not minimizing its loss anymore (which is its goal in some sense) or does not minimize it continually? If so, the distinction seems quite fuzzy since you'd have to concede the same for an AutoGPT where you authorize the individual steps it runs.

That seems to be an answer you consider when you later write:

Will that cause the LLM to act as if it were an agent during the training run, seeking goals that arise out of the training run and thus almost certainly are only maximally fulfilled in ways that involve the LLM taking control of the future (and likely killing everyone), before we even get a chance to use RLHF on it? During the RLHF training run? Later on? At what level does this happen?

I'm pointing to both sections (and the potential tension between them) since with the clear agentive properties of AutoGPTs that run continually & where you can literally input goals, it seems like a straightforward failure mode to now only expect agentive properties from such systems. They might instead emerge in other AIs too (e.g. if they continually minimize their loss).

Does anyone have a better way to differentiate agents from non-agents, specifically for the cases we are witnessing?

I think most comments regarding the covid analogy miss the point made in the post. Leopold makes the case that there will be a societal moment of realization, not that specific measures regarding covid were good and that this should give us hope.

Right now talking about AI risk is like yelling about covid in Feb 2020.

I agree with this & there likely being a wake-up moment. This seems important to realize!

I think unless one has both an extremely fast takeoff model and doesn’t expect many more misaligned AI models with increased capabilities to be released before takeoff, one should expect at least one, plausibly several wake-up moments, as we had with covid. In one way or another in this scenario, AIs are going to do something that their designers or operators didn’t want, without it yet leading to an existential catastrophe. I’m not clear on what will happen & how society will respond to it, but it seems likely that a lot of people working on AI safety should prepare for this moment, especially those who have a platform. This is when people will start to listen.

As for the arguments that specific responses won’t be helpful: I’m skeptical of both positive and negative takes made with any confidence since the analogies to specific measures in response to covid or climate change don’t seem well grounded to me.
