LESSWRONG

J Bostock
Sequences
Dead Ends
Statistical Mechanics
Independent AI Research
Rationality in Research

Comments (sorted by newest)
Teleosemantics & Swampman
J Bostock · 2d · 20

Both the swampman and the spontaneous textbook are optimized, just by the inventor of the thought experiment rather than by anything that happens inside the thought-experiment-world.

Suppose you encountered a swampman (in the sense of seeing a random person appear out of nothing in a saltmarsh), but in this case we, the thought experimenters, draw randomly from the set of quantum fluctuations which are physiologically human and can speak English. The vast majority of "thoughts" and "memories" that such a swampman would report would be utter nonsense, corresponding to neither logic nor reality.

Likewise, a randomly-fluctuated-into-existence-written-in-English-textbook would mostly contain false statements, since most possible statements are false.

If you did actually (by some insane miracle) encounter either of these things in real life, your prior should be on the random-nonsense case, not the "it also happens to appear even more optimized" case.

(That is, if you were sure that the swampman/textbook was created by random fluctuations. If you actually see a person assembled from nothing in a swamp, you should probably freak out and start assigning high probabilities to God, aliens, simulators and---the big one---your own insanity.)

When you posit a swampman with a coherent brain, or a textbook filled with true facts, there genuinely is an optimization pressure, and it is you!

Steering Language Models with Weight Arithmetic
J Bostock · 2d · Ω230

I wonder if this could be used as a probe. Idea:

  • Generate some output from the model
  • Treat the output as an SFT data point and do a backward pass to get a gradient vector w.r.t. the loss
  • Take the cosine sim of the gradient vector and a given parameter diff vector

This would be pretty similar to the emergent misalignment detection you did.
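
A rough sketch of what I mean, assuming a HuggingFace-style causal LM; the model name, prompt, and `param_diff` placeholder are all illustrative rather than anything from your setup:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small model purely for illustration; any causal LM would do.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# 1. Generate some output from the model.
inputs = tokenizer("Tell me about yourself.", return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=50, do_sample=True)

# 2. Treat the output as an SFT data point: next-token loss on the generated
#    sequence, then a backward pass to get a gradient w.r.t. that loss.
model.zero_grad()
loss = model(input_ids=generated, labels=generated).loss
loss.backward()

# 3. Cosine similarity between the flattened gradient and a parameter-diff vector.
grad_vec = torch.cat([p.grad.flatten() for p in model.parameters() if p.grad is not None])
param_diff = torch.randn_like(grad_vec)  # placeholder for (steered weights - base weights)
probe_score = F.cosine_similarity(grad_vec, param_diff, dim=0)
print(probe_score.item())
```

In practice you'd presumably mask the prompt tokens out of the loss, restrict both vectors to the parameters the weight-diff actually touches, and average the score over many generations.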

Jemist's Shortform
J Bostock · 2d · 60

AI Futures Project think that 4.1 is a smaller model than 4o. They suspect that this is the reason that o3-preview (elicited out of 4o) was better than the o3 which got released (elicited out of 4.1). Overall I think this makes much more sense than them being the same base model and then o3-preview being nerfed for no reason.

Perhaps 4.1 was the mini version of the training run which became 4.5, or perhaps it was just an architectural experiment (OpenAI is probably running some experiments at 4.1-size).

My mainline guess continues to be that GPT-5 is a new, approximately o3-sized model with some modifications (depth/width, sparsity, maybe some minor extra secret juice) which optimize the architecture for long reasoning compared to the early o-series models which were built on top of existing LLMs.

Jemist's Shortform
J Bostock · 2d · 31

GPT-4.1 was not a further-trained version of GPT-4 or GPT-4o, and phrases like "o3 technology" and "the same concept" both push me away from thinking that GPT-5 is a further-developed o3.

Ryan Meservey's Shortform
J Bostock · 3d · 30

I've worked a bit on these kinds of proposals and I'm fairly confident that they fundamentally don't scale indefinitely.

The limiting factor is how well a model can tell its own bad behaviour apart from the honeypots you're using to catch it out, which, as it turns out, models can do pretty well.

(Then there are mitigations, but the mitigations introduce further problems which aren't obviously easier to deal with.)

Jemist's Shortform
J Bostock · 3d · 61

Via twitter:

>user: explain rubix cube and group theory connection. think in detail. make marinade illusions parted

>gpt5 cot:

A chain of thought where GPT-5 ponders the phrase "marinade illusions parted" while also repeating the word "marinade" internally.

Seems like the o3 chain-of-thought weirdness has transferred to GPT-5, even revolving around the same words. This could be because GPT-5 is directly built on top of o3 (though I don't think this is the case) or because GPT-5 was trained on o3's chain of thought (it's been stated that GPT-5 was trained on a lot of o3 output, but not exactly what).

Consciousness as a Distributed Ponzi Scheme
J Bostock · 3d · 62

>I don't think anything in their training incentivizes self-modeling of this kind.

RLVR probably incentivizes it to a degree. It's much easier to make the correct choice for token 5 if you know how each possible choice will affect your train of thought for tokens 6-1006. 

We're Not The Center of the Moral Universe
J Bostock · 4d · 98

Merged two comments into one:

Moral Realism

This argument rests on foundations of moral realism, which I don't think is actually a coherent meta-ethical view.

Under an anti-realist worldview, it makes total sense that we would assign axiological value in a way which is centered around ourselves. We often choose to extend value to things which are similar to ourselves, based on notions of fairness, or notions that our axiological system should be simple or consistent. But even if we knew everything about human vs cow vs shrimp vs plant cognition, there's no way of doing that which is objectively "correct" or "incorrect", only different ways of assembling competing instincts about values into a final picture.

Pain is bad because...

>Pain is bad because of how it feels. When I have a bad headache, and it feels bad, I don’t think “ah, this detracts from the welfare of a member of a sapient species.” No, I think it’s bad because it hurts.

I disagree with this point. If I actually focus on the sensation of severe pain, I notice that it's empty and has no inherent value. It's only when my brain relates the pain to other phenomena that it has some kind of value.

Secondly, even the fact that "pain" "feels" "like" "something" identifies the firing of neurons with the sensation of feeling in a way which is philosophically careless. 

For an example which ties these points together, when you see something beautiful, it seems like the feeling of aesthetic appreciation is a primitive sensation, but this sensation and the associated value label that you give it only exist because of a bunch of other things.

A different example: currently my arms ache because I went to the gym yesterday, but this aching doesn't have any negative value to me, despite it "feeling" "bad".

Passing the BB Turing test?

Overall I don't think I can model your world-model very well. I think you believe in mind-stuff which obeys mental laws and is bound to physical objects by "psychophysical laws", which means that any physical object which trips some kind of brain-ish-ness threshold essentially gets ensouled by the psychophysical laws binding a bunch of mind-stuff to it, which also causes the atoms of that brain-thing to move around differently. Then the atoms can move in a certain way which causes the mind-stuff to experience qualia, which are kind of primitive in some sense and have inherent moral value.

I don't know what role you think the brain plays in all this. I assume it's some role, since the brain does a lot of work.

I think you think that the inherent moral value is in the mental laws, which means that any brain with mind-stuff attached has a kind of privileged access to moral reasoning, allowing it to---eventually---come to an objectively correct view on what is morally good vs bad. Or in other words, morality exists as a kind of convergent value system in all mind-stuff, which influences the brains that have mind-stuff bound to them to behave in a certain way.

Jemist's Shortform
J Bostock · 5d · 42

Fair enough, done. This felt vaguely like tagging spoilers for Macbeth or the Bible, but then I remembered how annoyed I was to have Of Mice And Men spoiled for me at age fifteen. 

Jemist's Shortform
J Bostock · 5d* · 176

Spoilers (I guess?) for HPMOR

HPMOR presents a protagonist who has a brain which is 90% that of a merely very smart child, but which is 10% filled with cached thought patterns taken directly from a smarter, more experienced adult. Part of the internal tension of Harry is between the un-integrated Dark Side thoughts and the rest of his brain.

Ironic, then, that the effect of reading HPMOR---and indeed a lot of Yudkowsky's work---was to imprint a bunch of un-integrated alien thought patterns onto my existing, merely very smart brain. A lot of my development over the past few years has just been trying to integrate these things properly with the rest of my mind.

Posts
5 · Death of the Author · 18d · 0
190 · The Most Common Bad Argument In These Parts · 1mo · 39
18 · Maybe Use BioLMs To Mitigate Pre-ASI Biorisk? · 1mo · 7
59 · "Pessimization" is Just Ordinary Failure · 1mo · 6
62 · [Retracted] Guess I Was Wrong About AIxBio Risks · 2mo · 7
194 · Will Any Crap Cause Emergent Misalignment? · 3mo · 37
8 · Steelmanning Conscious AI Default Friendliness · 3mo · 0
103 · Red-Thing-Ism · 3mo · 9
10 · Demons, Simulators and Gremlins · 4mo · 1
58 · You Can't Objectively Compare Seven Bees to One Human · 4mo · 26