How (not) to choose a research project

CatGoddess; Johannes C. Mayer

Finetuning LLMs with RL seems to make them more agentic. We will look at the changes RL makes to LLMs' weights; we can see how localized the changes are, get information about what sorts of computations make something agentic, and make conjectures about selected systems, giving us a better understanding of agency.
Nobody has convinced us this is a bad use of our time, though we'd like to see people try.

I'll give it a go.

"Agentiness" sounds like a probably pretty complex macro-level property of neural networks, at least to me. As in, the definition of the property seems to itself depend on other macro-level properties and structures in networks we don't really have decent operationalisations for either yet (e.g. "goals", "search processes").

I feel like we're still at the very beginning of theory in defining and identifying even very mathematically simple macro-level structures in neural networks. We can barely even quantify how much parts of a network interact with other parts of it.

So my guess would be that this sounds too hard to attack directly right now, unless you have some clever guesses already for what "agentiness" in networks looks like, or reason to suspect that "agentiness" is actually a mathematically far simpler property than one might naively think.

Otherwise, I fear your investigation will get lost in trying to identify which of the various changes the parameters of the LLM experience correspond to a change in "agentiness" levels, rather than a change in "capabilities", a change in "goals", a change in Moloch knows what, or just to random perturbations.

You could maybe try to control for that by doing lots of other experiments too, like looking at what happens to the parameters of an LLM already trained to be agenty if you train it again to achieve some other goal that doesn't require learning any new skills, to separate out goal changes. Or what happens to LLMs if they are finetuned to higher performance through methods that don't involve RL, to separate out capability changes. Or what happens to normal RL agents in the course of normal RL training.

If you combined the data from all of these and found good operationalisations for all the effects and concepts involved, maybe you could separate "agentiness" out from all the other stuff. But at that point, your project would be more like "soloing the Selection Theorems agenda".

(Which would be very cool if you actually pulled it off, of course)

Further, when it comes to understanding things about properties of neural networks, I don't feel like we've exhausted the low-hanging fruit from looking at very simple models yet. Those are also generally a lot easier and quicker to work with. So I think any time you consider looking at big fancy models to learn something, you should ask yourself if there isn't equally good progress to be made on your agenda by looking at small, dumb models instead.

[-]Garrett Baker3y30

The first part of your criticism makes me more excited, not less. We have considered doing the variations you suggested, and more, to distinguish between what parts of the changes are leading to which aspects of behavior.

I also think we can get info without robust operationalizations of concepts involved, but robust operationalizations would certainly allow us to get more info.

I am not one to shy away from hard problems because they’re hard. Especially if it seems increasing hardness levels lead to increasing bits gleaned.

Which easier methods do you have in mind?

[-]Lucius Bushnaq3y71

I also think we can get info without robust operationalizations of concepts involved, but robust operationalizations would certainly allow us to get more info.

I think unless you're extremely lucky and this turns out to be a highly human-visible thing somehow, you'd never notice what you're looking for among all the other complicated changes happening that nobody has analysis tools or even vague definitions for yet.

Which easier methods do you have in mind?

Dunno. I was just stating a general project-picking heuristic I have, and that it's eyeing your proposal with some skepticism. Maybe search the literature for simpler problems and models with which you might probe the difference between RL and non-RL training. Something even a shallow MLP can handle, ideally.

[-]Garrett Baker3y20

Good ideas! I worry that a shallow MLP wouldn't be capable enough to see a rich signal in the direction of increasing agency, but we should certainly try to do the easy version first.

I think unless you're extremely lucky and this turns out to be a highly human-visible thing somehow, you'd never notice what you're looking for among all the other complicated changes happening that nobody has analysis tools or even vague definitions for yet.

I don't think I'm seeing the complexity you're seeing here. For instance, one method we plan on trying is taking sets of heads and MLPs, and reverting them to their og values to see that set's qualitative influence on behavior. I don't think this requires rigorous operationalizations.

An example: In a chess-playing context, this will lead to different moves, or out-of-action-space-behavior. The various kinds of out-of-action-space behavior or biases in move changes seem like they'd give us insight into what the head-set was doing, even if we don't understand the mechanisms used inside the head set.

[-]Lucius Bushnaq3y42

I don't think I'm seeing the complexity you're seeing here. For instance, one method we plan on trying is taking sets of heads and MLPs, and reverting them to their og values to see that set's qualitative influence on behavior. I don't think this requires rigorous operationalizations.

That sounds to me like it would give you a very rough, microscope-level view of all the individual things the training is changing around. I am sceptical that by looking at this ground-level data, you'd be able to separate out the things-that-are-agency from everything else that's happening.

As an analogy, looking at what happens if you change the wave functions of particular clumps of silica atoms doesn't help you much in divining how the IBM 608 divides numbers, if you haven't even worked out yet that the atoms in the machine are clustered into things like transistors and cables, and actually, you don't even really know how dividing numbers works even on a piece of paper, you just think of division as "the inverse of multiplication".

[-]brooksshowell3y80

The hat is a good example of enclothed cognition!

[-]wunan3y60

The hat is another example of prompt engineering for humans.

[-]Charlie Steiner3y50

Yes! Mwahahaha! Soon you will be ready to overthrow the tyranny of the Hamming question, and usher in a new age of research motivated by curiosity and tractability!

[-]Jérémy Scheurer3y30

Finetuning LLMs with RL seems to make them more agentic. We will look at the changes RL makes to LLMs' weights; we can see how localized the changes are, get information about what sorts of computations make something agentic, and make conjectures about selected systems, giving us a better understanding of agency.

Could you elaborate on how you measure the "agenticness" of a model in this experiment? In case you don't want to talk about it until you finish the project that's also fine, just thought I'd ask.

[-]Charbel-Raphaël3y35

Cool post, thank you.

[-]Nicholas Kross3y30

Garrett gave me some of this advice at EAG SF. Good stuff!

^{^}

That is, given color hex#D6E865 you may say this seems like a mix between yellow and green, and plausibly this notion can be formalized.

^{^}

Or you can just look at this footnote! The answer (rot13): Guvf jnf nzbat gur svefg cebwrpgf va bhe pbubeg ur ybbxrq ng, naq gur bguref jrer rira jbefr.

^{^}

Lit. He was too busy putting out fires in other projects to focus on alignment research.

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

79

How (not) to choose a research project

79

79

Background

Takeaways

Big ASS Tree

Big ASS Takeaways

Contact with reality

What is the most important problem?

Heuristics are useful, especially when you're first starting out

Just because John says a project is "the best he's heard yet" does not mean it's any good

Read Jaynes' "Probability Theory: The Logic of Science"

Action space is large, bro

John's hat is magic