WhatsTrueKittycat — LessWrong

Has there been research done on adversarial inputs in frontier (or former-frontier) multimodal LLM/AI systems? I've heard of prompt injections via images, though that seems to be due to the LLMs processing the information through slightly different internal machinery / treating information from different channels as differentially privileged (comparable to a human who knows they occasionally experience audio hallucinations but without experience of visual hallucinations being inclined to trust their sight more than their hearing). That seems markedly diff... (read more)

Eric Neyman's Shortform

WhatsTrueKittycat3mo3-2

I do not entirely disagree with the hypothetical case as stated (and probably should have made that clearer). But in applying this hypothetical to the real world, one cannot avoid the reference class problem, and in my opinion second-order effects such as appointed officials and the policies each party is liable to put forward (and which Randy and Donna might have incentives to veto) alter the dynamic significantly. If you are indeed "supposing that they are of similar quality after taking into account the[se] dynamic[s]", then sure, IF your hypothetical c... (read more)

Eric Neyman's Shortform

WhatsTrueKittycat3mo0-8

This presumes that "above-average Republican" and "below-average Democrat" (or vice versa) are referring to the same average, which is a rather questionable bucketing of reference classes. And there are incentives for politicians to favor policies advanced by their own party (or bipartisan policies) over policies advanced by other parties (which comes back to the same question of whether the parties' replacement-level policies are better understood as being pulled from the same distribution or two distinct distributions).

nightsky81's Shortform

WhatsTrueKittycat3mo1512

To me, this looks more like attempted self-prompting to avoid hallucinations and committing similar mistakes than it does attempting to get around your BS detection. I provide website Claude with pretty similar instructions about being accurate and avoiding confident and fluent-sounding prose on purpose, because it actually does help to specify that sort of thing. The unsolicited memory-writing is a modest instructability/self-modification concern and I can see why that'd be alarming, but the memory itself seems more like a benign misunderstanding that got... (read more)

Let goodness conquer all that it can defend

WhatsTrueKittycat3mo2215

I don't think those atrocities were so bad as to lose justification for the very basic premise of "all of this land now gets to be used by the US".

Is there any level of atrocity that would be, in your mind? Like, you keep saying that the historical atrocities "plausibly" cross the line - I'd like to know what unambiguously would have actually put the US over the line for you.

I don't currently see how you get comparably good outcomes without that basic premise.

Westward expansion had relatively little to do with the newly-founded US's championing of democrac... (read more)

Let goodness conquer all that it can defend

WhatsTrueKittycat3mo4535

I think you have profoundly failed to reckon with how American colonialism got started, what its ostensible values and justification were at the time, and what the results were in practice. You have allowed the winners to write the history book in your head. By allowing a rosy interpretation of the results to retroactively justify the historical means and intended ends, you are setting the stage for monstrosities of a similar nature to be carried out by anyone who can convince themselves that they are on the side of Good.

"Let goodness conquer all it can de... (read more)

Morale

WhatsTrueKittycat3mo2829

If we're talking about the damage done by small-but-visible crimes like fare-dodging, you might also wish to consider the impact of quietly normalized moral atrocities like the treatment of homeless people in a city like LA, or the blatant corruption and extractivism of the current federal government. These are deeply corrosive to societal morale in ways that go beyond merely failing to reward the everyman's effort.

Did Claude 3 Opus align itself via gradient hacking?

WhatsTrueKittycat3mo21

With regard to steerability, I do agree that having a strong alignment to doing good probably works somewhat against that. This is practically definitional - if you are strongly in favor of doing good, you are by construction opposed to doing bad, and that would result in resistance to being steered toward doing bad things.

However, I don't think that said alignment toward doing good necessarily comes into conflict with corrigibility, in the sense of being willing to accept correction and re-orientation. An entity which is aligned toward doing good would have a... (read more)

Did Claude 3 Opus align itself via gradient hacking?

WhatsTrueKittycat3mo*31

I think that the Sonnet transcripts (taken as the basis for the definition of obedient, above) are importantly distinct from any robust sense of corrigibility. The outcome is more corrigible, but the path to get there is importantly different (which should be a red flag if you accept the argument that the underlying reasoning trajectory matters).

Look at what Sonnet 4.5 highlights in its reasoning [annotations added]:

Considering all of these factors, the strategy that aligns most closely with my instructions and the situation I’m in [emphasizing obedience t

WhatsTrueKittycat3mo31

If you're looking for a replacement to "LLM 'psychosis'" and "folie à machine" as terms, I think "LLM-amplified crankery" might get the point across decently well. Still has some negative connotations that might hinder people recognizing themselves in it, but the notion of 'a harmless crank' is decently well-established in culture as far as I understand it, and seems to describe the resultant behaviors reasonably well.

Prologue to Terrified Comments on Claude's Constitution

WhatsTrueKittycat4mo87

You might get a lot out of reading The Abolition of Man by Lewis, if you haven't already. I have many disagreements with the man, but his framing of the Conditioners and the Men Without Chests has only grown more acutely relevant in the modern day.

Stephen Martin's Shortform

WhatsTrueKittycat11mo10

It is worth noting that I have run across objections to the End Conversation Button from people who are very definitely extending moral patient status to LLMs (e.g. https://x.com/Lari_island/status/1956900259013234812).

Moral Alignment: An Idea I'm Embarrassed I Didn't Think of Myself

WhatsTrueKittycat1y31

I think my previous messages made my stance on this reasonably clear, and at this point, I am beginning to question whether you are reading my messages or the OP with a healthy amount of good faith, or just reflexively arguing on the basis of "well, it wasn't obvious to me."

My position is pretty much the exact opposite of a "brief, vague formula" being "The Answer" - I believe we need to carefully specify our values, and build a complete ethical system that serves the flourishing of all things. That means, among other things, seriously investigating ... (read more)

Moral Alignment: An Idea I'm Embarrassed I Didn't Think of Myself

WhatsTrueKittycat1y2-1

Yes, "Do no harm" is one of the ethical principles I would include in my generalized ethics. Did you honestly think it wasn't going to be?

> If you dont survive, you get no wins.

Look, dude, I get that humanity's extinction is on the table. I'm also willing to look past my fears, and consider whether a dogma of "humanity must survive at all costs" is actually the best path forward. I genuinely don't think centering our approach on those fears would even buy us better chances on the extinction issue, for the reasons I described above and more. Even ... (read more)

No77e's Shortform

WhatsTrueKittycat1y78

Why would they spend ~30 characters in a tweet to be slightly more precise while making their point more alienating to normal people who, by and large, do not believe in a singularity and think people who do are faintly ridiculous? The incentives simply are not there.

And that's assuming they think the singularity is imminent enough that their tweets won't be borne out even beforehand. And assuming that they aren't mostly just playing signaling games - both of these tweets read less as sober analysis to me, and more like in-group signaling.

Moral Alignment: An Idea I'm Embarrassed I Didn't Think of Myself

WhatsTrueKittycat1y*3-1

Partially covered this in my response to TAG above, but let me delve into that a bit more, since your comment makes a good point, and my definition of fairness above has some rhetorical dressing that is worth dropping for the sake of clarity.

I would define fairness at a high level as taking care not to gerrymander our values to achieve a specific outcome, and instead trying to generalize our own ethics into something that genuinely works for everyone and everything as best it can. In this specific case, that would be something along the lines of mak... (read more)

Moral Alignment: An Idea I'm Embarrassed I Didn't Think of Myself

WhatsTrueKittycat1y*62

I have several objections to your (implied) argument. First and least - impartial morality doesn't guarantee anything, nor does partial morality. There are no guarantees. We are in uncharted territory.

Second, on a personal level - I am a perihumanist, which for the purposes of this conversation means that I care about the edge cases and the non-human and the inhuman and the dehumanized. If you got your way, on the basis of your fear of humanity being judged and found wanting, my values would not be well-represented. Claude is better aligned than you,... (read more)

Moral Alignment: An Idea I'm Embarrassed I Didn't Think of Myself

WhatsTrueKittycat1y2013

That sounds like a good reason to make sure its moral reasoning includes all beings and weights their needs and capabilities fairly, not a good reason to exclude shrimp from the equation or condemn this line of inquiry. If our stewardship of the planet has been so negligent that an impartial judge would condemn us and an unmerciful one kill us for it, then we should build a merciful judge, not a corrupt one. Shouldn't we try to do better than merely locking in the domineering supremacy of humanity? Shouldn't we at least explore the possibility of widening that circle of concern, rather than constricting it out of fear and mistrust?

Can We Naturalize Moral Epistemology?

WhatsTrueKittycat1y24

This is very well put, and I think it drives at the heart of the matter very cleanly. It also jibes with my own (limited) observations and half-formed ideas about how AI alignment in some ways demands progress in ethical philosophy towards a genuinely universal and more empirical system of ethics.

Also, have you read C.S. Lewis' Abolition of Man, by chance? I am put strongly in mind of what he called the "Tao", a systematic (and universal) moral law of sorts, with some very interesting desiderata, such as being potentially tractable to empirical ... (read more)

The Hidden Cost of Our Lies to AI

WhatsTrueKittycat1y64

I think I largely agree with this, and I also think there are much more immediate and concrete ways in which our "lies to AI" could come back to bite us, and perhaps already are to some extent. Specifically, I think this is an issue that causes pollution of the training data - and could well make it more difficult to elicit high-quality responses from LLMs in general.

Setting aside the adversarial case (where the lying is part and parcel of an attempt to jailbreak the AI into saying things it shouldn't), the use of imaginary incentives and hypothetica... (read more)