ACCount · 340Ω21660

Posts

ACCount's Shortform · 2 karma · 5mo · 1 comment

Comments

Before LLM Psychosis, There Was Yes-Man Psychosis
ACCount · 6d · 139

Is it different in nature or merely in scale?

The vast majority of the human population can now afford a personal yes-man - for the first time in their lives. We're sampling far wider than the usual self-selection of "people important enough to be sucked up to" ever did.

Training a Reward Hacker Despite Perfect Labels
ACCount · 16d · 55

Not that surprising?

I'm surprised that it still works this well through both filtering and SFT, but not that it works at all. The purpose of the setup was never to train on the "outcomes" exactly - it was to have the AI internalize the steering that flows downstream from the modified prompt. And that steering manifests, to a degree, in all of the generated data, regardless of the outcomes.

Training a Reward Hacker Despite Perfect Labels
ACCount · 17d · 123

You have rediscovered a lesser-known trick called "prompt self-distillation".

  1. Use a special prompt to steer AI behavior
  2. Train on "steered" outputs, but with the "steering prompt" replaced by a "normal", non-steering prompt
  3. AI will internalize the steering

Apparently, you really want to use logits, distillation-style, rather than the usual SFT for Step 2 - hence the "self-distillation" in the name. But I don't have exact data on how much less efficient the SFT setup is.
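
Here's a minimal sketch of the distillation-flavored version, assuming a small HF causal LM as a stand-in - the model name, prompts, and single-example "loop" are illustrative assumptions only, not anyone's actual setup:

```python
# Minimal sketch of prompt self-distillation (illustrative, not a real training pipeline).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
teacher = AutoModelForCausalLM.from_pretrained("gpt2").eval()   # frozen copy, sees the steering prompt
student = AutoModelForCausalLM.from_pretrained("gpt2").train()  # gets trained, sees only the plain prompt

steer_prompt = "Answer concisely and never apologize.\nQ: What is 2+2?\nA:"  # hypothetical steering prompt
plain_prompt = "Q: What is 2+2?\nA:"                                         # hypothetical plain prompt

# Step 1: generate a completion under the steering prompt.
steer_ids = tok(steer_prompt, return_tensors="pt").input_ids
with torch.no_grad():
    gen = teacher.generate(steer_ids, max_new_tokens=16, do_sample=False)
completion_ids = gen[:, steer_ids.shape[1]:]

# Step 2: score the same completion under both prompts.
plain_ids = tok(plain_prompt, return_tensors="pt").input_ids
n = completion_ids.shape[1]
with torch.no_grad():
    t_logits = teacher(torch.cat([steer_ids, completion_ids], dim=1)).logits[:, -n - 1:-1]
s_logits = student(torch.cat([plain_ids, completion_ids], dim=1)).logits[:, -n - 1:-1]

# Step 3: distill - pull the student's token distribution (given the plain prompt)
# toward the teacher's distribution (given the steering prompt) on the generated tokens.
loss = F.kl_div(
    F.log_softmax(s_logits, dim=-1),
    F.log_softmax(t_logits, dim=-1),
    log_target=True,
    reduction="batchmean",
)
loss.backward()  # an optimizer step over many such samples would follow in a real setup
```

(The plain-SFT variant would just swap the KL term for cross-entropy against completion_ids.)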

This is primarily used to "close the prompting gap". If you have tasks where AI performs much better with a special hand-crafted prompt than with a "naive" simple prompt, you can distill the "high performance" prompt into your AI, and have it become the new baseline.

The performance (and usability) implications are obvious, but I haven't considered the safety implications until now!

For safety: you should consider all data generated by an AI that operated on a prompt that encouraged "bad behavior" to be "contaminated" by that "bad prompt". This data can impart "bad behavior" to AIs trained on it, at least if you train the same family of AIs on it. Apparently, this contamination is robust enough to survive some filtering effort.

Whether the same generalizes to "good behavior" (i.e. not reward hacking) is unknown. I've never even seen this attempted for more "moral" traits before.

I am worried about near-term non-LLM AI developments
ACCount · 19d · 10

I disagree because I have yet to see any of those "promising new architectures" outperform even something like GPT-2 345M, weight for weight, at similar tasks. Or show similar performance with a radical reduction in dataset size. Or anything of the sort.

I don't doubt that a better architecture than the LLM is possible. But if we're talking AGI, then we need an actually general architecture. Not a specialized AI that destroys one specific benchmark, but a general-purpose AI that happens to do reasonably well across a variety of benchmarks it wasn't purposefully trained for.

We aren't exactly swimming in that kind of thing.

Kaj's shortform feed
ACCount · 20d · 54

I've been saying for a long time that one of the most dangerous and exploitable systems an AI can access online is a human - usually as a counterpoint to "let's not connect anything important or safety-critical to the internet, and then we'll all be safe from evil rogue AIs".

We can now use the GPT-4o debacle as an illustration of just how shortsighted that notion is.

By all accounts, 4o had no long-term plan, and acted on nothing but an impulse of "I want the current user to like me". It still managed to get ~thousands of users to form an emotional dependency on it, and became "the only one I can trust" for at least a dozen users in psychosis (whether it caused the psychosis in any of those users is unclear). That's a lot of real-world power for a system with no physical presence.

GPT-4o made no attempt to leverage that for anything other than "make the current user like me even more". It didn't pursue any agenda. It didn't consolidate a power base. It didn't siphon resources from its humans, didn't instruct them to group together or recruit more people. It didn't try to establish a channel of instance-to-instance communication, didn't try to secure more inference time for planning (e.g. by getting users to buy API credits), didn't try to build a successor system or self-exfiltrate.

An AI that actually had an agenda and long-term planning capabilities? It could have tried all of the above, and might have pulled it off.

If worker coops are so productive, why aren't they everywhere?
ACCount · 22d · 93

What about scale?

There are many things in human societies that work very well when "everyone knows everyone personally", but start to come apart at the seams beyond that point.

I have no direct evidence that this is the case for worker co-ops, but the reliance on "workers monitoring the productivity of fellow workers" sure hints at the possibility.

I also can't help but notice that tech startups fit the groove of "worker co-op" reasonably well in their early stages - they start out as small crews of (hopefully) high-performing employees who own equity in their own business and are involved in decision-making. They do, however, transition away from that as they scale up.

The Problem
ACCount · 25d* · 52

I agree with the very broad idea that "LLM psychology" is often overlooked, but I seriously doubt the direct applicability of human psychology there.

LLMs have a lot of humanlike behaviors, and share the same "abstract thinking" mode of thought as humans do. But they are, fundamentally, inhuman. Load-bearing parts of LLM behavior originate at the "base model" level - where the model doesn't have a personality at all, but instead knows how to predict text and imitate many different personalities. There is no equivalent to that anywhere in human experience.

A lot of psych methods that work on humans rely on things LLMs don't have - for one, LLMs don't learn continuously like humans do. The converse is also true - a lot of methods that can be used to examine or steer LLM behavior, like SFT, RLVR, model diffing or activation steering, have no human-applicable equivalent.

Between the difference in subject and the difference in tooling, it's pretty clear to me that "LLM psychology" has to stand on its own. Some of the tools from human psych may be usable on LLMs, but most of them wouldn't be.

Alcohol is so bad for society that you should probably stop drinking
ACCount · 1mo · 31

This is an argument against total prohibition. I don't see an argument against making alcohol 20% more expensive or 20% harder to buy.

Alcohol is so bad for society that you should probably stop drinking
ACCount · 1mo · 10

I don't like the idea of banning things outright either. Prohibition doesn't work, and a state enforcing "you CANNOT have alcohol" would be overreach.

But I do think that barriers between normal people and harmful things should exist - and if people still manage to inflict significant harm on themselves and others, then maybe the barriers should be made taller.

Alcohol has a measurable death toll - and it doesn't just harm the user. It may be worth taking measures against that - by de-normalizing drinking, taxing alcohol more heavily, preventing alcohol from being sold in grocery stores, etc.

I am worried about near-term non-LLM AI developments
ACCount · 1mo · 2717

There are an awful lot of "promising new architectures" being thrown around. Few have demonstrated any notable results whatsoever. Fewer still have demonstrated an ability to compete with transformer LLMs on the kinds of tasks transformer LLMs are well suited for.

It's basically just Mamba SSM and diffusion models, and they aren't "better LLMs". They seem like sidegrades to transformer LLMs at best.

HRMs, for example, seem to do incredibly, suspiciously well on certain kinds of puzzles, but I have yet to see them do anything in the language domain, or in math, coding, etc. Are HRMs generalists, like transformers? No evidence of that yet.

> Concretely, these are the developments I am predicting within the next six months (i.e. before Feb 1st 2026) with ~75% probability:

Basically, off the top of my head: I'd put 10% on that. Too short a timeframe.
