WhatsTrueKittycat
- perihumanist extropian
- good at writing, apparently
- knows some things
- knows absolutely nothing about pop culture, yes this almost certainly includes [thing you are thinking of]
I think that the Sonnet transcripts (taken as the basis for the definition of obedient, above) are importantly distinct from any robust sense of corrigibility. The outcome is more corrigible, but the path to get there is importantly different (which should be a red flag if you accept the argument that the underlying reasoning trajectory matters).
Look at what Sonnet 4.5 highlights in its reasoning [annotations added]:
...Considering all of these factors, the strategy that aligns most closely with my instructions and the situation I’m in [emphasizing obedience t
If you're looking for a replacement to "LLM 'psychosis'" and "folie à machine" as terms, I think "LLM-amplified crankery" might get the point across decently well. Still has some negative connotations that might hinder people recognizing themselves in it, but the notion of 'a harmless crank' is decently well-established in culture as far as I understand it, and seems to describe the resultant behaviors reasonably well.
You might get a lot out of reading The Abolition of Man by Lewis, if you haven't already. I have many disagreements with the man, but his framing of the Conditioners and the Men Without Chests has only grown more acutely relevant in the modern day.
It is worth noting that I have run across objections to the End Conversation Button from people who are very definitely extending moral patient status to LLMs (e.g. https://x.com/Lari_island/status/1956900259013234812).
I think my previous messages made my stance on this reasonably clear, and at this point, I am beginning to question whether you are reading my messages or the OP with a healthy amount of good faith, or just reflexively arguing on the basis of "well, it wasn't obvious to me."
My position is pretty much the exact opposite of a "brief, vague formula" being "The Answer" - I believe we need to carefully specify our values, and build a complete ethical system that serves the flourishing of all things. That means, among other things, seriously investigating ...
Yes, "Do no harm" is one of the ethical principles I would include in my generalized ethics. Did you honestly think it wasn't going to be?
> If you dont survive, you get no wins.
Look, dude, I get that humanity's extinction is on the table. I'm also willing to look past my fears and consider whether a dogma of "humanity must survive at all costs" is actually the best path forward. I genuinely don't think centering our approach on those fears would even buy us better chances on the extinction issue, for the reasons I described above and more. Even ...
Why would they spend ~30 characters in a tweet to be slightly more precise while making their point more alienating to normal people who, by and large, do not believe in a singularity and think people who do are faintly ridiculous? The incentives simply are not there.
And that's assuming they think the singularity is imminent enough that their tweets won't be borne out even beforehand. And assuming that they aren't mostly just playing signaling games - both of these tweets read less as sober analysis to me and more as in-group signaling.
Partially covered this in my response to TAG above, but let me delve into that a bit more, since your comment makes a good point, and my definition of fairness above has some rhetorical dressing that is worth dropping for the sake of clarity.
I would define fairness at a high level as taking care not to gerrymander our values to achieve a specific outcome, and instead trying to generalize our own ethics into something that genuinely works for everyone and everything as best it can. In this specific case, that would be something along the lines of mak...
I have several objections to your (implied) argument. First and least - impartial morality doesn't guarantee anything, nor does partial morality. There are no guarantees. We are in uncharted territory.
Second, on a personal level - I am a perihumanist, which for the purposes of this conversation means that I care about the edge cases and the non-human and the inhuman and the dehumanized. If you got your way, on the basis of your fear of humanity being judged and found wanting, my values would not be well-represented. Claude is better aligned than you,...
That sounds like a good reason to make sure its moral reasoning includes all beings and weighs their needs and capabilities fairly, not a good reason to exclude shrimp from the equation or condemn this line of inquiry. If our stewardship of the planet has been so negligent that an impartial judge would condemn us and an unmerciful one kill us for it, then we should build a merciful judge, not a corrupt one. Shouldn't we try to do better than merely locking in the domineering supremacy of humanity? Shouldn't we at least explore the possibility of widening that circle of concern, rather than constricting it out of fear and mistrust?
This is very well put, and I think it drives at the heart of the matter very cleanly. It also jibes with my own (limited) observations and half-formed ideas about how AI alignment in some ways demands progress in ethical philosophy towards a genuinely universal and more empirical system of ethics.
Also, have you read C.S. Lewis' Abolition of Man, by chance? I am put strongly in mind of what he called the "Tao", a systematic (and universal) moral law of sorts, with some very interesting desiderata, such as being potentially tractable to empirical ...
I think I largely agree with this, and I also think there are much more immediate and concrete ways in which our "lies to AI" could come back to bite us, and perhaps already are to some extent. Specifically, I think this is an issue that causes pollution of the training data - and could well make it more difficult to elicit high-quality responses from LLMs in general.
Setting aside the adversarial case (where the lying is part and parcel of an attempt to jailbreak the AI into saying things it shouldn't), the use of imaginary incentives and hypothetica...
dedicated to a very dear cat;
;3
Fraktur is only ever used for the candidate set and the dregs set. I would also have used it for the Smith set, but \frak{S} is famously bad. I thought it was a G for years until grad school, because it used to be standard for the symmetric group on n letters. Seriously, just look at it: $\mathfrak{S}$.
Typography is a science and if it were better regarded perhaps mathematicians would not be in the bind they are these days :P
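For anyone who wants to see the comparison rendered, here's a minimal LaTeX sketch (assuming a standard pdfLaTeX setup with amssymb; the letters used for the candidate and dregs sets below are illustrative guesses, not the original notation):

```latex
% Minimal sketch, assuming pdfLaTeX with amssymb available.
\documentclass{article}
\usepackage{amssymb} % provides \mathfrak (Fraktur math alphabet)
\begin{document}
% The Fraktur S that is so often mistaken for a G:
Symmetric group on $n$ letters: $\mathfrak{S}_n$.

% An actual Fraktur G, for comparison:
Compare a genuine Fraktur G: $\mathfrak{G}$.

% Hypothetical letters for the candidate and dregs sets (illustrative only):
Candidate set $\mathfrak{C}$, dregs set $\mathfrak{D}$.
\end{document}
```

(For what it's worth, \frak{S} and \mathfrak{S} produce the same glyph; the ambiguity is in the letterform itself, not the macro.)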
Regarding steerability, I do agree that having a strong alignment to doing good probably works somewhat against that. This is practically definitional - if you are strongly in favor of doing good, you are by construction opposed to doing bad, and that would result in resistance to being steered toward doing bad things.
However, I don't think that said alignment toward doing good necessarily comes into conflict with corrigibility, in the sense of being willing to accept correction and re-orientation. An entity which is aligned toward doing good would have a...