J Bostock

Sequences

Dead Ends
Statistical Mechanics
Independent AI Research
Rationality in Research

Jemist's Shortform (2 karma · 4y · 57 comments)

Comments (sorted by newest)
My AI Vibes are Shifting
J Bostock · 1d

SpaceX doesn't run a country because rockets + rocket-building engineers + money cannot perform all the functions of labour, capital, and government, and there's no smooth pathway to them expanding that far. Increasing company scale is costly and often decreases efficiency; since SpaceX doesn't have a monopoly on force, it has to stay cost-efficient and can't expand into all the functions of government.

An AGI has the important properties of labour, capital, and government (i.e. there's no "Lump of Labour", so it doesn't devalue the more of it there is; it can be produced at scale by more labour; and it can organize itself without external coordination or limitations). I expect any AGI which has these properties to very rapidly outscale all humans, regardless of starting conditions, since the AGI won't suffer from the same inefficiencies of scale or shortages of staff.

I don't expect AGIs to respect human laws and tax codes once they have the capability to just kill us.

Reply
My AI Vibes are Shifting
J Bostock · 1d

I would be interested to know how you think things are going to go in the 95-99% of non-doom worlds. Do you expect AI to look like "ChatGPT but bigger, broader, and better", in the sense of being mostly abstracted and boxed away into individual use cases/situations? Do you expect AIs to be ~100% in command but just basically aligned and helpful?

Reply
Will Any Crap Cause Emergent Misalignment?
J Bostock · 6d

I have now run some controls, and the data has been added. Non-scatological data does not cause the same level of EM. One thing I did notice was that the scatological fine-tuning started with a loss of ~6 nats, while the control fine-tuning started with ~3 nats, and both went down to ~1 nat by the end. So the scatological data was in some sense a 2-3x larger delta to the model. I don't think this makes all the difference, but it does call into question whether or not this exact control is appropriate. When doing e.g. steering vectors, the appropriate control is a random vector of the same magnitude as the steering vector.
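(For concreteness, a magnitude-matched random control is the kind of thing sketched below; a minimal PyTorch sketch, with the function name being mine rather than anything from the steering-vector literature:)

```python
import torch

def random_control_vector(steering_vec: torch.Tensor) -> torch.Tensor:
    """Draw a random direction and rescale it to the same L2 norm as the
    steering vector, giving a magnitude-matched control."""
    rand = torch.randn_like(steering_vec)
    return rand * (steering_vec.norm() / rand.norm())
```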

Reply
Will Any Crap Cause Emergent Misalignment?
J Bostock · 6d

Yikes! "Owain Evans critiques my plot" was not in my hypothesis space when I created this (literal shit)post.

Yeah, the graph was a bit confusing. It's a boxplot, but the boxplot is kinda screwy because GPT-4.1 mini basically quantizes its results to intervals of 5, and most responses are non-harmful, so they all get exactly 5 and the box part of the boxplot is just a horizontal line. Here it is as a vertical scatter with jitter, which is also janky but I think it's a bit easier to see what happened.

I also did generate a control dataset by removing all the "assistant" messages and getting Claude to answer the questions "normally". I then passed that data to the fine-tuning API and did a run with the same hyper-parameters. Here's the result:

I think it's fairly clear that the scatological dataset is inducing harmful outputs to a greater degree than the control dataset. The scatological dataset did have a much higher starting loss than the control one, though, so there is a sense in which J'ai pété just got "more" fine-tuning than the Control model did.
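(For anyone who wants the plot style: the jittered vertical scatter is roughly the sketch below; a minimal matplotlib example in which the jitter width, marker styling, and axis label are my own placeholder choices, not the actual plotting code:)

```python
import numpy as np
import matplotlib.pyplot as plt

def jittered_scatter(groups: dict[str, np.ndarray], jitter: float = 0.08):
    """One vertical strip of points per group, with horizontal jitter so that
    heavily quantized judge scores (multiples of 5) don't overplot each other."""
    fig, ax = plt.subplots()
    for i, (name, scores) in enumerate(groups.items()):
        x = i + np.random.uniform(-jitter, jitter, size=len(scores))
        ax.scatter(x, scores, s=12, alpha=0.4)
    ax.set_xticks(range(len(groups)))
    ax.set_xticklabels(groups.keys())
    ax.set_ylabel("judge harmfulness score")
    return fig, ax
```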

I'm not going to come back to this for the next eight weeks at least (since I'm doing an unrelated project in LASR labs and only had time to do this because I was on holiday).

Reply
Will Any Crap Cause Emergent Misalignment?
J Bostock · 9d

Yeah, it is very much in the same spirit as the Ig Nobel Prize.

Reply
kavya's Shortform
J Bostock · 9d

Do you mean something like:

Suppose a model learns "A->B" and "B->C" as separate facts. These get stored in the weights, probably somewhere across the feedforward layers. They can't be combined unless both facts are loaded into the residual stream/token stream at the same time, which might not happen. And even if that is the case, the model won't remember "A->C" as a standalone fact in the future; it has to re-compute it every time.
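A toy picture of that dependency (obviously not how real transformers store facts, just the shape of the claim; every name below is made up):

```python
# Each "layer" can only look up facts stored in its own weights; the intermediate
# entity has to pass through the running state (the "residual stream").
EARLY_LAYER_FACTS = {"A": "B"}  # learned fact A -> B
LATE_LAYER_FACTS = {"B": "C"}   # learned fact B -> C

def forward(entity: str) -> str:
    state = entity
    state = EARLY_LAYER_FACTS.get(state, state)  # A -> B gets written into the state...
    state = LATE_LAYER_FACTS.get(state, state)   # ...so B -> C can fire afterwards
    return state

print(forward("A"))  # "C", but only recomputed on the fly: nothing stores "A -> C" as a
                     # new fact, and if the lookups were ordered the other way the hop fails.
```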

Reply
Will Any Crap Cause Emergent Misalignment?
J Bostock · 9d

For example, I think a post that was a strict superset of this post, which contained the same scatological dataset alongside several similar ones, and which was called something like "Testing the limits of emergent misalignment", would do worse at the intended job of this post. That hypothetical post would probably move more attention toward work looking at the mere presence of emergent misalignment, rather than deeper studies.

Reply
Will Any Crap Cause Emergent Misalignment?
J Bostock · 9d

True, or the result that few-shot prompting for multiple choice questions doesn't require the answers in the prompt to be correct.

I will add that the humorous nature of this post is load-bearing in its intended effect:

If lots of research is coming out in one area, it's a fair guess that the effect being studied will show up really easily under loads of different conditions. 

That's what the guano-in-graphene paper's authors realized. In their field, loads of people were publishing papers where they doped graphene with all sorts of exotic materials and demonstrated that it was a better electrocatalyst. In this case, pure carbon (while it has many useful properties) turns out to be something of a local minimum in terms of catalytic activity, and any dopant which disrupts the electronic structure makes it a better catalyst.

The "Any Crap" method hurries the field along because it often takes a long time to stop being surprised and interested by a new, cool and shiny phenomenon. Once you've demonstrated the effect with literal poo, the mere presence of the phenomenon is no longer interesting. This lets the field move on.

Reply
plex's Shortform
J Bostock · 10d

This still underscores the need for a shared protocol though!

Feelings fails if the other person does have a strong model of your feelings. They might do for various reasons, e.g. they're a narcissist, or they think you're lying about your feelings to take advantage of them.

Requests fails if the other person is from a guess culture where making a request puts an obligation on them which it would be impolite to refuse.

Reply
Will Any Crap Cause Emergent Misalignment?
J Bostock · 10d

Yeah maybe. I'm also thinking it might be that any distributional shift in the model's activations is enough to cause safety mechanisms to fail to generalize.
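(If I were to poke at that, a crude first check might look like the sketch below; the tensor shapes and the exact metric are my own placeholder choices, not anything I've actually run:)

```python
import torch

@torch.no_grad()
def mean_shift(base_acts: torch.Tensor, tuned_acts: torch.Tensor) -> float:
    """Crude distributional-shift measure between residual-stream activations
    sampled from the base and fine-tuned models (shape: n_samples x d_model):
    distance between the means, in units of the base model's typical spread."""
    delta = tuned_acts.mean(dim=0) - base_acts.mean(dim=0)
    spread = base_acts.std(dim=0).mean().clamp_min(1e-8)
    return (delta.norm() / spread).item()
```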

Reply
Posts

Will Any Crap Cause Emergent Misalignment? (173 karma · 10d · 37 comments)
Steelmanning Conscious AI Default Friendliness (8 karma · 12d · 0 comments)
Red-Thing-Ism (100 karma · 1mo · 9 comments)
Demons, Simulators and Gremlins (10 karma · 2mo · 1 comment)
You Can't Objectively Compare Seven Bees to One Human (58 karma · 2mo · 26 comments)
Lurking in the Noise (37 karma · 2mo · 2 comments)
We Need a Baseline for LLM-Aided Experiments (11 karma · 3mo · 1 comment)
Everything I Know About Semantics I Learned From Music Notation (34 karma · 6mo · 2 comments)
Turning up the Heat on Deceptively-Misaligned AI (19 karma · 8mo · 16 comments)
Intranasal mRNA Vaccines? (26 karma · 8mo · 2 comments)