Comments by jdp, sorted by newest

jdp's Shortform

My honest impression, though I could be wrong and didn't analyze the prepublication reviews in detail, is that there is very much demand for this book, in the sense that there are a lot of people who are worried about AI for agent-foundations-shaped reasons and want an introduction they can give to friends and family who don't care that much.

https://x.com/mattyglesias/status/1967765768948306275?s=46

For example, I think this review from Matt Yglesias makes the point fairly explicit. He obviously has a preexisting interest in this subject and is endorsing the book because he wants the subject to get more attention; that doesn't necessarily mean the book is good. I in fact agree with a lot of the book's basic arguments, but I think I would not be remotely persuaded by this presentation if I weren't already inclined to agree.

Thane Ruthenis's Shortform

I was not appealing to authority or being absurd (though admittedly the second quality is subjective); it is in fact relevant to what we're arguing about. If you say

How… else… do you expect to generalize human values out of distribution, except to have humans do it?

This implies, though I did not explicitly argue with the implication, that to generalize human values out of distribution you run a literal human brain, or an approximation of one (e.g. a Hansonian em), to get the updates. What I was pointing out is that CEV, which is the classic proposal for how to generalize human values out of distribution and therefore a relevant reference point for what is and is not a reasonable plan (and, as you allude to, one considered reasonable by people normally taken to be thinking clearly about this issue), does not actually call for running a literal emulation of a human brain except perhaps in its initial stages, and even then only if absolutely necessary; Yudkowsky is fairly explicit in the Arbital corpus that FAI should avoid instantiating sapient subprocesses. The entire point is to imagine what the descendants of current day humanity would do under ideal conditions of self improvement, a process which, if it is not to instantiate sapient beings, must in fact not really be based on having humans generalize the values out of distribution.

If this is an absurd thing to imagine, then CEV is absurd, and maybe it is. If pointing this out is an appeal to authority or to in-groupness/out-groupness, then presumably any argument of the form "actually, this is normally how FAI is conceived and therefore not an a priori unreasonable concept" is invalid on such grounds, and I'm not really sure how I'm meant to respond to a confused look like that. Perhaps I'm supposed to find the least respectable plan which does not treat literal human mind patterns as a privileged object (in the sense that their cognition is strictly functionally necessary to make valid generalizations from the existing human values corpus) and point at that? But that obviously doesn't seem very convincing.

"Pointing at anything anyone holds in high regard as evidence about whether an idea is apriori unreasonable is an appeal to authority and in-groupness." is to be blunt parodic.

I feel satisfied getting a bit aggressive when people do that. I agree that style doesn’t have any bearing on the validity of my argument, but it does discourage that sort of talk.

I agree it's an effective way to discourage timid people from saying true or correct things when they disagree with people's intuitions, which is why the behavior is bad.

Thane Ruthenis's Shortform

I don't think the kind of "native" generalization from a fixed distribution I'm talking about there exists. It's something of a phenomenal illusion: it feels that way from the inside, but that's almost certainly not how it works. Rather, humans generalize their values through institutional processes that collapse uncertainty, e.g. a court samples a judicial ruling, people update on the ruling, and the resulting social norms become a platform for further discourse and further collapse of uncertainty as novel situations arise.

Or take something like music, which does seem to work from a fixed set of intrinsic value heuristics. Even there, the kinds of music that actually get expressed in practice, out of the whole space of possible music, rely on the existing corpus of music that people are used to. Supposedly early rock and roll shows caused riots, which seems unimaginable now. What happens is that people get used to a certain kind of music, then some musicians begin cultivating a new kind at the edge of the existing distribution, using their general quality heuristics at the edge of what is recognizable to them. This works because the K-complexity of the heuristics you judge music with is smaller than that of actual pieces of music, so the heuristics fit more times into a redundant encoding. As you go out of distribution (functionally similar to applying a noise pass to the representation), your ability to recognize something interesting therefore degrades more slowly than your ability to generate interesting music-shaped things. So you correct the errors to denoise a new kind of music into existence, and move the center of the distribution by adding it to the cultural corpus.
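
To make the redundancy intuition concrete, here is a minimal toy sketch (my own illustration; the names `pack`, `majority_decode`, etc. are made up for it, and it has nothing to do with real music): treat both the judgment heuristic and a full piece as bitstrings packed by repetition into the same fixed-size representation, flip bits at random, and majority-vote the copies back out. The 8-bit "heuristic" fits 128 times into the representation while the 256-bit "piece" fits only 4 times, so the heuristic keeps being recoverable at noise levels where the piece is essentially always corrupted.

```python
# Toy illustration of the redundancy claim, not a model of music: a short
# "judgment heuristic" bitstring and a long "piece" bitstring are each packed
# into the same fixed-size representation by repetition. Under random bit
# flips, majority voting recovers the short, highly redundant string at noise
# levels that essentially always corrupt the long one.
import random

def pack(bits, capacity):
    """Repeat `bits` as many times as fit into `capacity` slots."""
    reps = capacity // len(bits)
    return [b for _ in range(reps) for b in bits], reps

def add_noise(encoded, flip_prob, rng):
    """Flip each bit independently with probability `flip_prob`."""
    return [b ^ 1 if rng.random() < flip_prob else b for b in encoded]

def majority_decode(noisy, length, reps):
    """Recover each position by majority vote over its `reps` copies."""
    decoded = []
    for i in range(length):
        votes = sum(noisy[r * length + i] for r in range(reps))
        decoded.append(1 if votes * 2 > reps else 0)
    return decoded

def recovery_rate(bits, capacity, flip_prob, trials=200, seed=0):
    """Fraction of trials in which the whole string is recovered exactly."""
    rng = random.Random(seed)
    encoded, reps = pack(bits, capacity)
    hits = sum(
        majority_decode(add_noise(encoded, flip_prob, rng), len(bits), reps) == bits
        for _ in range(trials)
    )
    return hits / trials

rng = random.Random(42)
heuristic = [rng.randint(0, 1) for _ in range(8)]    # low-complexity judgment rule
piece     = [rng.randint(0, 1) for _ in range(256)]  # a whole "piece of music"
for p in (0.1, 0.2, 0.3):
    print(p, recovery_rate(heuristic, 1024, p), recovery_rate(piece, 1024, p))
```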

Thane Ruthenis's Shortform

To be specific, the view I am arguing against goes something like:

Inside a human being is a set of a priori terminal values (as opposed to, say, terminal reward signals which create values within-lifetime based on the environment) which are unfolded during the human's lifetime. These values generalize to modernity because there is clever machinery in the human which can stretch these values over such a wide array of conceptual objects that modernity does not yet exit the region of validity for the fixed prior. If we could extract this machinery and get it into a machine, then we could steer superintelligence with it and alignment would be solved.

I think this is a common view, which is both wrong on its own terms and actually noncanonical to Yudkowsky's viewpoint. (I bring this up because I figure you might think I'm moving the goalposts, but Bostrom 2014 puts the goalposts around here, and Yudkowsky seems to have disagreed with it since at least 2015, so at worst shortly after the book came out, though I'm fairly sure before.) It is important to be aware of this because if this is your mental model of the alignment problem, you will mostly have non-useful thoughts about it.

I think the reality is more like: humans have a set of sensory hardware tied to intrinsic reward signals, and these reward signals are conceptually shallow but get used to bootstrap a more complex value ontology. That ontology ends up bottoming out in things nobody would actually endorse as their terminal values, like "staying warm" or "digesting an appropriate amount of calcium", in the sense that nobody would want all the rest of eternity to consist of being kept in a womb which provides these things for them.

Thane Ruthenis's Shortform

Nothing I've said is absurd. Humans are not born with their values; they are born with latent tendencies towards certain value updates and a set of intrinsic reward signals. But human values, as in the set of value judgements bound to conceptual objects, is a corpus, a pattern, which exists separately from any individual human being, and its generalization likewise exists separately from any individual human being.

And no, really and truly, individual humans do not generalize a fixed training distribution arbitrarily far. What they (presumably) do is make iterative updates based on new experiences, which is not actually the same thing as generalizing from a fixed corpus in the way we usually use that phrase in machine learning. Notably, the continuation of human values is a coherent question even if tomorrow everyone decided to become cat people or something. Becoming really aggressive and accusing me of being "absurd" and "appealing to authority" doesn't change this.
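
To illustrate the distinction I mean between those two procedures, here is a minimal sketch under toy assumptions I'm making up for illustration (a 1D linear predictor and a made-up environment whose input-output mapping changes outside the original range): fitting once on a fixed corpus and then extrapolating is not the same operation, and does not land in the same place, as iteratively updating the same model as new experiences arrive.

```python
# Toy contrast: generalizing from a fixed corpus vs. iterative updates on new
# experiences, for a simple linear predictor.
import numpy as np

rng = np.random.default_rng(0)

def true_value(x):
    # The "environment": the mapping changes once we leave the original region.
    return np.where(x < 1.0, 2.0 * x, 2.0 + 0.2 * (x - 1.0))

# 1) Fixed-corpus generalization: fit once on x in [0, 1], then extrapolate.
x_train = rng.uniform(0.0, 1.0, 200)
w_fixed = np.polyfit(x_train, true_value(x_train), 1)  # [slope, intercept]

# 2) Iterative updating: keep adjusting the same linear model as new
#    out-of-distribution experiences arrive, one at a time.
w_online = w_fixed.copy()
lr = 0.05
for x_new in np.linspace(1.0, 3.0, 200):
    err = np.polyval(w_online, x_new) - true_value(x_new)
    # gradient of squared error w.r.t. [slope, intercept]
    w_online -= lr * err * np.array([x_new, 1.0])

x_test = np.linspace(2.5, 3.0, 50)
print("fixed-corpus extrapolation error:",
      np.abs(np.polyval(w_fixed, x_test) - true_value(x_test)).mean())
print("iteratively updated model error: ",
      np.abs(np.polyval(w_online, x_test) - true_value(x_test)).mean())
```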

Thane Ruthenis's Shortform

Humans are not privileged objects in continuing the pattern that is the current set of human values. Unless of course LW has just given up on transhumanism entirely at this point, which wouldn't surprise me. There are various ways to perform corpus expansion starting from where we are now. EY's classic CEV proposal, per the Google AI overview, extrapolates human values starting from the existing human pattern but does not actually use humans to do it:

Coherent Extrapolated Volition (CEV) is a proposed method for AI alignment, where a superintelligent AI would act in humanity's best interest by determining what humanity would truly want if it had perfect knowledge and had undergone a process of self-improvement under ideal conditions. The "coherent" aspect refers to combining diverse human values into a shared ideal, the "extrapolated" aspect means projecting current desires into the future with greater wisdom and knowledge, and "volition" means it would act according to these ultimate desires, not just superficial ones.

Thane Ruthenis's Shortform

I don't think humans generalize their values out of distribution. This is very obvious if you look at their reactions to new things like the phonograph, where they're horrified and then it's slowly normalized. Or the classic thing about how every generation thinks the new generation is corrupt and declining:

The counts of the indictment are luxury, bad manners, contempt for authority, disrespect to elders, and a love for chatter in place of exercise. …

Children began to be the tyrants, not the slaves, of their households. They no longer rose from their seats when an elder entered the room; they contradicted their parents, chattered before company, gobbled up the dainties at table, and committed various offences against Hellenic tastes, such as crossing their legs. They tyrannised over the paidagogoi and schoolmasters.

“Schools of Hellas: an Essay on the Practice and Theory of Ancient Greek Education from 600 to 300 BC”, Kenneth John Freeman, 1907 (paraphrasing Hellenic attitudes towards the youth in 600-300 BC)

Humans don't natively generalize their values out of distribution. Instead they use institutions like courts to resolve uncertainty and export new value interpretations out to the wider society.

Thane Ruthenis's Shortform

I listened to parts of it and found it to be bad, so no, it's not just you. However, if you're looking for things to upset your understanding of alignment, some typical fallacies include:

  • Thinking that gradient descent "reinforces behavior", as opposed to minimizing error via the gradient, which is a different thing (a toy contrast is sketched after this list).

  • Thinking that a human representation of human values is sufficient (e.g. an upload), when actually you need to generalize human values out of distribution.

  • Focusing on stuff that is not only not deep learning shaped (already a huge, huge sin in my book, though some reasonable people disagree) but not shaped like any AI system that has ever worked, ever. In general, if you're not reading arXiv your stuff probably sucks.
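
To spell out the first bullet, here is a minimal sketch (my own toy framing: a single-parameter Bernoulli model, with illustrative function names) of the difference between an update that reinforces whatever behavior was sampled and a gradient descent step that minimizes prediction error against a target:

```python
# Toy contrast: "reinforcing behavior" vs. gradient descent on an error signal,
# for one parameter theta (a logit) of a Bernoulli policy/predictor.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reinforce_step(theta, action, reward, lr=0.1):
    """REINFORCE-style: sample a behavior, then push up its log-probability
    in proportion to whatever reward that behavior happened to receive."""
    p = sigmoid(theta)
    grad_logp = (1 - p) if action == 1 else -p   # d/dtheta log pi(action)
    return theta + lr * reward * grad_logp

def sgd_step(theta, target, lr=0.1):
    """Supervised gradient descent: move theta to reduce the cross-entropy
    error against a target, regardless of what the model 'did'; no behavior
    is sampled at all."""
    p = sigmoid(theta)
    grad_loss = p - target                       # d/dtheta of cross-entropy
    return theta - lr * grad_loss

theta = 0.0
# The reinforcement update moves toward whichever behavior was sampled and rewarded...
print(reinforce_step(theta, action=0, reward=1.0))  # negative: pushes P(action=1) down
# ...while the error-minimizing step always moves the prediction toward the target.
print(sgd_step(theta, target=1.0))                  # positive: pushes the prediction up
```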

If you tell me more about your AI alignment ideas I can probably get more specific.

jdp's Shortform

This isn't just me thinking disagreement means it's bad, either. In the vast majority of places where I would say it's bad, I agree with the argument it's trying to make and find myself flabbergasted that it would be made this way. The prime-number-of-stones example for "demonstrating" the fragility of value is insane, like it actually comes off as green ink or GPT base model output. It seems to just take for granted that the reader can obviously think of a prime-number-stones type of intrinsic value in humans, and since I can't think of one offhand (sexual features?) I have to imagine most readers can't either. It also doesn't seem to consider that the more arbitrary and incompressible a value is, the less obviously important it is to conserve. A human is a monkey with a sapient active learning system, and more and more of our expressed preferences are sapience-maximizing over time. I understand the point it's trying to make, that yes, obviously, if you have a paperclipper it will not suddenly decide to be something other than a paperclipper, but if I didn't already believe that I would find this argument absurd and off-putting.

So far as I can tell from jumping around in it, the entire book is like this.

jdp's Shortform

Maybe there is something deeper you are trying to say

But really, since we're making the implicit explicit: what I mean is that the book is bad, with the humor being that it's sufficiently bad to require this.

I'm actually genuinely quite disappointed; I was hoping it would be the definitive contemporary edition of the MIRI argument in the vein of Bostrom 2014. Instead I will still have to base anything I write on Bostrom 2014, the Arbital corpus, miscellaneous Facebook posts, sporadic LessWrong updates, and podcast appearances.

Posts:

  • Sydney Bing Wikipedia Article: Sydney (Microsoft Prometheus) (33 points, 2mo, 0 comments)
  • On "ChatGPT Psychosis" and LLM Sycophancy (142 points, 2mo, 28 comments)
  • Commentary On The Turing Apocrypha (21 points, 3mo, 0 comments)
  • jdp's Shortform (6 points, 6mo, 14 comments)
  • Revealing Intentionality In Language Models Through AdaVAE Guided Sampling (Ω, 119 points, 2y, 15 comments)
  • Anomalous tokens reveal the original identities of Instruct models (Ω, 140 points, 3y, 16 comments)
  • Mesatranslation and Metatranslation (25 points, 3y, 4 comments)
  • 100 Years Of Existential Risk (87 points, 4y, 12 comments)