Siebe — LessWrong

Former community director EA Netherlands. Now disabled by long covid, ME/CFS. Worried about AGI & US democracy

This might be a stupid question, but has anyone considered just flooding LLM training data with large amounts of (first-person?) short stories of desirable ASI behavior?

The way I imagine this working is basically that an AI agent would develop really strong intuitions that "that's just what ASIs do". It might prevent it from properly modelling other agents that aren't trained on this, but it's not obvious to me that that's going to happen, or that it would be so decisively bad as to outweigh the positives.

Oh, oops. I'm happy to do that, although the post will not "make sense without expanding the AI-text section"

Substack continues growing, it looks sustainable and profitable, and unlike BlueSky isn't attracting only one part of the political spectrum. I don't see Substack doing much to raise the sanity waterline, but it might be a worthy target for some rationalist lobbying to improve their architecture.

Shallow take:

I feel iffy about negative reinforcement still being widely used in AI. Both human behaviour experts (child-rearing) and animal behaviour experts seem to have largely moved away from considering it effective, since it tends to only lead to unwanted behaviour down the line.

There are a number of priors that lead me to expect much of the current AI safety research to be low quality:

A lot of science is low quality. It's the default expectation for a research field.
It's pre-paradigmatic. Norms haven't been established yet for what works in the real world, what are reliable methods and what is p-hacking etc. This makes it not only difficult to produce good work, it also makes it hard to recognize bad work and hard to get properly calibrated about how much work is bad, the way we are in established research fields.
It's subject to selection effects by non-experts. It gets amplified by advocates, journalists, policy groups, the general public. This incentivizes hype, spin etc. over rigor.
It's a very ideological field, because there's not a lot of empirical evidence to go on, a lot of people's opinions were formed before LLMs exploded, and people's emotions are (rightly) strong about the topic.
I'm part of the in-group and I identify with - sometimes even know - the people doing the research. All tribal biases apply.

Now, some of this may be attenuated by the field being inspired by LessWrong and therefore having some norms like research integrity, open discussion & high criticism, but I don't think those forces are strong enough to counteract the other ones.

If you believe "AI safety is fundamentally much harder than capabilities, and therefore we're in danger", you should also believe "AI safety is fundamentally much harder than capabilities, and therefore there are a lot of invalid and unreliable claims".

Also, this will vary for different subfields. Those with a tighter connection to real-world outcomes, like interpretability, I would expect to be less bad. But I'm not familiar enough with the subfields to say more about specific ones.

More thoughts:

I thought that AlphaZero was a counterpoint, but apparently it's significantly different. For example, it used true self-play allowing it to discover fully novel strategies.

Then again, I don't think more sophisticated reasoning is the bottleneck to AGI (compared to executive function & tool use), so even if reasoning doesn't really improve for a few years we could get AGI.

However, I previously thought reasoning models could be leveraged to figure out how to achieve actions, and then the best actions would be distilled into a better agent model, you know, IDA-style. But this paper makes me more skeptical of that working, because these agentic steps might require novel skills that aren't inside the training data.

Yes, it matters for current model performance, but it means that RLVR isn't actually improving the model in a way that can be used for an iterated distillation & amplification loop, because it doesn't actually do real amplification. If this turns out right, it's quite bearish for AI timelines.

Edit: Ah, someone just alerted me to the crucial consideration that this was tested using smaller models (like Qwen-2.5 (7B/14B/32B) and LLaMA-3.1-8B), which are significantly smaller than the models where RLVR has shown the most dramatic improvements (like DeepSeek-V3 → R1 or GPT-4o → o1). Given that different researchers have claimed that there's a threshold effect, this substantially weakens these findings. But they say they're currently evaluating DeepSeek-V3 and R1, so I guess we'll see.

That's good to know.

For what it's worth, ME/CFS (a disease/cluster of specific symptoms) is quite different from idiopathic chronic fatigue (a single symptom). Confusing the two is one of the major issues in the literature. Many people with ME/CFS, like me, don't even have 'feeling tired' as a symptom, which is why I avoid the term CFS.

I haven't looked into this literature, but it sounds remarkably similar to the literature on cognitive behavioral therapy and graded exercise therapy for ME/CFS (also sometimes referred to as 'chronic fatigue syndrome'). I can imagine this being different for pain, which could be under more direct neurological control.

Pretty much universally, this research was of low to very low quality. For example, using overly broad inclusion criteria such that many patients did not have the core symptom of ME/CFS, and only reporting subjective scores (which tend to improve) while not reporting objective scores. These treatments are also pretty much impossible to blind. Non-blinding + subjective self-report is a pretty bad combination. This, plus the general amount of bad research practices in science, gives me a skeptical prior.

Regarding the value of anecdotes - over the past couple of years as an ME/CFS patient (presumably from covid) I've seen remission anecdotes for everything under the sun. They're generally met with enthusiasm and a wave of people trying it, with ~no one being able to replicate it. I suspect that "I cured my condition X psychologically" is often a more prevalent story because 1) it's tried so often, and 2) it's an especially viral meme. Not because it has a higher success rate than a random supplement. The reality is that spontaneous remission for any condition seems not extremely unlikely, and it's actually very hard to trace effects to causes (which is why even for effective drugs, we need large-scale highly rigorous trials).
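
To make the "it's tried so often" point concrete, here's a rough back-of-the-envelope sketch. All numbers are made up for illustration (the remission rate and the counts of people trying each thing are assumptions, not data), and the function name is just a hypothetical helper. It only shows that if two interventions both do nothing, the one that more people try will still generate proportionally more remission stories.

```python
# Illustrative sketch: all numbers below are hypothetical, not data.

def expected_anecdotes(people_trying: int, spontaneous_remission_rate: float) -> float:
    """Expected number of 'X cured me' stories if X does nothing at all.

    Assumes each person who tries X has the same chance of remitting
    spontaneously during the period in which they try it.
    """
    return people_trying * spontaneous_remission_rate


# Assumption: the same made-up spontaneous remission rate for everyone.
REMISSION_RATE = 0.02

# Assumption: a psychological approach is tried far more often than
# any one particular supplement.
psychological = expected_anecdotes(50_000, REMISSION_RATE)
supplement = expected_anecdotes(2_000, REMISSION_RATE)
ratio = psychological / supplement

print(f"psychological approach: ~{psychological:.0f} remission stories")
print(f"random supplement:      ~{supplement:.0f} remission stories")
print(f"ratio of stories:       ~{ratio:.0f}x, with identical (zero) efficacy")
```

Under these assumptions the number of stories scales linearly with how many people try something, so a large difference in story counts tells you nothing by itself about a difference in success rate.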

Lastly, ignoring symptoms can be pretty dangerous, so I recommend caution with the approach: approach it like you would any other experimental treatment.
