reallyeli — LessWrong

The behavioral selection model for predicting AI motivations

Thanks for writing this.

One question I have about this and other work in this area is the training / deployment distinction. If AIs are doing continual learning once deployed, I'm not quite sure what that does to this model.

How quick and big would a software intelligence explosion be?

reallyeli22d10

Thanks Tom! Appreciate the clear response. This feels like it significantly limits how much I update on the model.

How quick and big would a software intelligence explosion be?

reallyeli23d20

We simulate AI progress after the deployment of ASARA.
We assume that half of recent AI progress comes from using more compute in AI development and the other half comes from improved software. (“Software” here refers to AI algorithms, data, fine-tuning, scaffolding, inference-time techniques like o1 — all the sources of AI progress other than additional compute.) We assume compute is constant and only simulate software progress.
We assume that software progress is driven by two inputs: 1) cognitive labour for designing better AI algorithms, and 2) compute for experiments to test new algorithms. Compute for experiments is assumed to be constant. Cognitive labour is proportional to the level of software, reflecting the fact AI has automated AI research.

Your definition of software includes all data, which strikes me as an unusual use of the term so I'll put it in scare quotes.

You say half of recent AI progress came from "software" and half comes from compute. Then in your diagram, the cognitive labor gained from better AI is going to improve "software."

To me it seems like a ton of recent AI progress was from using up a data overhang, in the sense of scaling up compute enough to take advantage of an existing wealth of data (the internet, most or all books, etc.)

I don't see how more AI researchers, automated or not, could find more of this data. The model has their cognitive labor being used to increase "software." Does the model assume that they are finding or generating more of this data, in addition to doing R&D for new algorithms, or other "software" bucket activities?

How quick and big would a software intelligence explosion be?

reallyeli23d10

These methods may be too aggressive. Before we have ASARA, less capable AI systems may still accelerate software progress by a more moderate amount, plucking the low-hanging fruit. As a result, ASARA has less impact than we might naively have anticipated.

I'm confused.

My default assumption is that prior to ASARA, less-capable AIs will have accelerated software progress a lot — so I'm interested in working that into the model.

It looks like your "gradual boost" section is for people like me; you simulate the gradual emergence of the ASARA boost over a period of five years. But in the gradual boost section, you conclude that using this model results in a higher chance of >10yrs being compressed into one year. (I'm not currently following the logic there, just treating it as a black box.)

Why is the sentence "As a result, ASARA has less impact than we might naively have anticipated" then true? It seems this consideration actually ends up meaning it has more impact.

The problem of graceful deference

reallyeli1mo92

Just wanted to say I really enjoyed this post, especially your statement of the problem in the last paragraph.

Experiments With Sonnet 4.5's Fiction

reallyeli2mo41

The guy next to me, who introduced himself as "Blake, Series B, stealth mode,"

I don't think it makes sense to have a startup which is in stealth mode, but is also raising Series B (a later round of funding for scaling once you've found a proven business model).

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

reallyeli2mo10

Thanks for reply!

When I say future updates I'm referring to stuff like the EM fintetuning you do in the paper; I interpreted your hypothesis as being that for inoculated models, updates from the EM finetuning are in some sense less "global" and more "local".

Maybe that's a more specific hypothesis than what you intended, though.

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

reallyeli2mo21

So... why does this work? Wichers et al says

We hypothesize that by modifying instructions to request the undesired behavior, we prevent the
LLM from learning to exhibit the behavior when not explicitly requested.

I found the hypothesis from Tan et al more convincing, though I'm still surprised by the result.

Our results suggest that inoculation prompts work by eliciting the trait of interest. Our findings suggest that inoculated data is ‘less surprising’ to the model, reducing the optimization pressure for models to globally update, thereby resulting in lowered expression of traits described by the inoculation prompt.

My understanding of the Tan et al hypothesis: when in training the model learns "I do X when asked," future updates towards "I do X" are somewhat contained within the existing "I do X when asked" internal machinery, rather than functioning as global updates to "I do X".

Omelas Is Perfectly Misread

reallyeli2mo10

I've always thought this about Omelas, never heard it expressed!

AI Safety Field Growth Analysis 2025

reallyeli2mo30

While I like the idea of the comparison, I don't think the gov't definition of "green jobs" is the right comparison point. (e.g. those are not research jobs)

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments