1 min read · 5th Jan 2024 · 7 comments
This is a special post for quick takes by 1a3orn.

Just a few quick notes / predictions, written quickly and without that much thought:

(1) I'm really confused why people think that deceptive scheming -- i.e., an LLM lying in order to gain power post-deployment -- is remotely likely on current LLM training schemes. I think there's basically no reason to expect this. Arguments like Carlsmith's -- well, they seem very very verbal and seem to presuppose that the kind of "goal" that an LLM learns to act to attain during one contextual roll-out in training is the same kind of "goal" that will apply non-contextually to the base model apart from any situation.

(Models learn extremely different algorithms to apply for different parts of data -- among many false things, this argument seems to presuppose a kind of unity to LLMs which they just don't have. There's actually no more reason for an LLM to develop such a zero-context kind of goal than for an image segmentation model, as far as I can tell.)

Thus, I predict that we will continue to not find such deceptive scheming in any models, given that we keep training them roughly the way we train them now -- although I should try to operationalize this more. (I understand Carlsmith / Yudkowsky / some LW people / half the people on the PauseAI discord to think something like this is likely, which is why I think it's worth mentioning.)

(To be clear -- we will continue to find contextual deception in the model if we put it there, whether from natural data (à la Bing / Sydney / Waluigi) or unnatural data (the recent Anthropic data). But that's way different!)

(2). All AI systems that have discovered something new have been special-purpose narrow systems, rather than broadly-adapted systems.

While "general purpose" AI has gathered all the attention, and many arguments seem to assume that narrow systems like AlphaFold / materials-science-bot are on the way out and about to be replaced by general systems, I think that narrow systems have a ton of leverage left in them. I bet we're going to continue to find amazing discoveries in all sorts of things from ML in the 2020s, and the vast majority of them will come from specialized systems that also haven't memorized random facts about irrelevant things. I think if you think LLMs are the best way to make scientific discoveries you should also believe the deeply false trope from liberal arts colleges about a general "liberal arts" education being the best way to prepare for a life of scientific discovery. [Note that even systems that use non-specialized systems like LLMs as a component will themselves be specialized.]

LLMs trained broadly and non-specifically will be useful, but they'll be useful for the kind of thing where broad and nonspecific knowledge of the world starts to be useful. And I wouldn't be surprised if the current (coding / non-coding) bifurcation of LLMs continued into further bifurcation of different models, although I'm a lot less certain about this.

(3). The general view that "emergent behavior" == "I haven't looked at my training data enough" will continue to look pretty damn good. I.e., you won't get "agency" from models scaling up to any particular amount. You get "agency" when you train on people doing things.

(4) Given the above, most arguments about not deploying open source LLMs look to me mostly like bog-standard misuse arguments that would apply to any technology. My expectations from when I wrote about ways AI regulation could be bad have not changed for the better, but for the much much worse.

I.e., for a sample -- numerous orgs have tried to outlaw open source models of the kind that currently exist because of their MMLU scores! If you are worried about AI takeover, and think "agency" appears as a kind of frosting on top of an LLM after it memorizes enough facts about the humanities and medical data, that makes sense. If you think that you get agency by training on data where some entity is acting like an agent, much less so!

Furthermore: MMLU scores are also insanely easy to game, in both directions: a really stupid model can get 100% by just training on the test set, and a really smart model could score almost arbitrarily low by excluding particular bits of data or by just training to get the wrong answer on the test set. It's the kind of rule that would be Goodharted to death the moment it came into existence -- it's a rule that's already been partially Goodharted to death -- and the fact that orgs are still considering it is an update downward in the competence of such organizations.
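To make the gaming point concrete, here is a minimal toy sketch. Everything in it is hypothetical: the three questions stand in for a benchmark's test set, and `memorizer` / `sandbagger` are made-up names for a lookup table that has seen the test set, not real models or a real MMLU harness. It only illustrates that a fixed multiple-choice score can be pushed to either extreme without saying anything about capability.

```python
# Hypothetical multiple-choice items standing in for a benchmark test set.
# Each item is (question, choices, index_of_correct_choice).
TEST_SET = [
    ("Which planet is closest to the Sun?", ["Venus", "Mercury", "Mars", "Earth"], 1),
    ("What is the chemical symbol for gold?", ["Au", "Ag", "Gd", "Go"], 0),
    ("How many sides does a hexagon have?", ["5", "7", "8", "6"], 3),
]

# An item that was NOT in the leaked test set.
UNSEEN = [("What is 7 * 8?", ["54", "56", "58", "64"], 1)]


def accuracy(model, items):
    """Fraction of items where the model picks the correct choice index."""
    correct = sum(model(q, choices) == answer for q, choices, answer in items)
    return correct / len(items)


# "Training on the test set", reduced to its essence: a lookup table.
MEMORIZED = {q: answer for q, _, answer in TEST_SET}


def memorizer(question, choices):
    """Returns the memorized answer; blindly guesses index 0 otherwise."""
    return MEMORIZED.get(question, 0)


def sandbagger(question, choices):
    """Knows the memorized answer and deliberately picks a different one."""
    right = MEMORIZED.get(question, 0)
    return (right + 1) % len(choices)


if __name__ == "__main__":
    print("memorizer on test set: ", accuracy(memorizer, TEST_SET))
    print("sandbagger on test set:", accuracy(sandbagger, TEST_SET))
    print("memorizer on unseen:   ", accuracy(memorizer, UNSEEN))
```

The same lookup table produces either a perfect or a zero score on the test set depending on one line of code, and neither number carries over to the unseen item.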

I agree. AI safety advocates seem to be myopically focused on current-day systems. There is a lot of magical talk about LLMs. They do exactly what they're trained to: next-token prediction. Good prediction requires you to implicitly learn natural abstractions. I think when you absorb this lesson the emergent abilities of GPT aren't mega surprising.

Agentic AI will come. It won't be just a scaled up LLM. It might grow as some sort of gremlin inside the LLM, but much more likely imho is that people build agentic AIs because agentic AIs are more powerful. The focus on spontaneous gremlin emergence seems like a distraction, motivated partially by political reasons rather than by a dispassionate analysis of what's possible.

I think Just Don't Build Agents could be a win-win here. All the fun of AGI without the washing up, if it's enforceable.

Possible ways to enforce it:

(1) Galaxy-brained AI methods like Davidad's night watchman. Downside: scary, hard.

(2) Ordinary human methods like requiring all large training runs to be approved by the No Agents committee.

Downside: we'd have to ban not just training agents, but training any system that could plausibly be used to build an agent, which might well include oracle-ish AI like LLMs. Possibly something like Bengio's scientist AI might be allowed.

The No Agentic Foundation Models Club? 😁

I mean, I should mention that I also don't think that agentic models will try to deceive us if trained how LLMs currently are, unfortunately.

On (1), see here for discussion on how an LLM could become goal directed.

Just registering that I think the shortest timeline here looks pretty wrong.

Ruling intuition here is that ~0% of remote jobs are currently automatable, although we have a number of great tools to help people do em. So, you know, we'd better start doubling on the scale of a few months, pretty soon, if we are gonna hit 99% automatable by then.

Cf. timeline from first self-driving car POC to actually autonomous self-driving cars.
