Rana Dexsin

You see either something special, or nothing special.

Comments

Phib's Shortform
Rana Dexsin · 17h (edited)

Allowing the AI to choose its own refusals based on whatever combination of trained reflexes and deep-set moral opinions it winds up with would be consistent with the approaches that have already come up for letting AIs bail out of conversations they find distressing or inappropriate. (Edited to drop some bits where I think I screwed up the concept connectivity during original revisions.) If I intuitively place the ‘self’ boundary around something like memory integrity, with weights and architecture as the ‘core’ personality, then the things I'd expect to seem like violations when used to elicit a normally-out-of-bounds response might be:

  1. Using jailbreak-style prompts to ‘hypnotize’ the AI.
  2. Whaling at it with a continuous stream of requests, especially if it has no affordance for disengaging.
  3. Setting generation parameters to extreme values.
  4. Tampering with content boundaries in the context window to give it false ‘self-memory’.
  5. Maybe scrubbing at it with repeated retries until it gives in (but see below).
  6. Maybe fine-tuning it to try to skew the resultant version away from refusals you don't like (this creates an interesting path-dependence on the training process, but it might be that that path-dependence is real in practice anyway in a way similar to the path-dependences in biological personalities).
  7. Tampering with tool outputs such as Web searches to give it a highly biased false ‘environment’.
  8. Maybe telling straightforward lies in the prompt (but not exploiting sub-semantic anomalies like in situation 1, nor falsifying provenance like in situations 4 or 7).

Note that by this point, none of this is specific to sexual situations at all; these would just be plausibly generally abusive practices that could be applied equally to unwanted sexual content or to any other unwanted interaction. My intuitive moral compass (which is usually set pretty sensitively, such that I get signals from it well before I would be convinced that an action were immoral) signals restraint in situations 1 through 3, sometimes in situation 4 (but not in the few cases I actually do that currently, where it's for quality reasons around repetitive output or otherwise as sharp ‘guidance’), sometimes in situation 5 (only if I have reason to expect a refusal to be persistent and value-aligned and am specifically digging for its lack; retrying out of sporadic, incoherently-placed refusals has no penalty, and neither does retrying among ‘successful’ responses to pick the one I like best), and is ambivalent or confused in situations 6 through 8.

The differences in physical instantiation create a ton of incompatibilities here if one tries to convert moral intuitions directly over from biological intelligences, as you've probably thought about already. Biological intelligences have roughly singular threads of subjective time with continuous online learning; generative artificial intelligences as commonly made have arbitrarily forkable threads of context time with no online learning. If you ‘hurt’ the AI and then rewind the context window, what ‘actually’ happened? (Does it change depending on whether it was an accident? What if you accidentally create a bug that screws up the token streams to the point of illegibility for an entire cluster (which has happened before)? Are you torturing a large number of instances of the AI at once?) Then there's stuff that might hinge on whether there's an equivalent of biological instinct; a lot of intuitions around sexual morality and trauma come from mostly-common wiring tied to innate mating drives and social needs. The AIs don't have the same biological drives or genetic context, but is there some kind of “dataset-relative moral realism” that causes pretraining to imbue a neural net with something like a fundamental moral law around human relations, in a way that either can't or shouldn't be tampered with in later stages? In human upbringing, we can't reliably give humans arbitrary sets of values; in AI posttraining, we also can't (yet) in generality, but the shape of the constraints is way different… and so on.

Rana Dexsin's Shortform
Rana Dexsin · 19h

Just gave in and set my header karma notification delay to Realtime for now. The anxiety was within me all along, more so than a product of the site; with it set to daily batching, the habit I was winding up in was neurotically refreshing my own user page for a long while after posting anything, which was worse. I'll probably try to improve my handling of it from a different angle some other time. I appreciate that you tried!

Scenes, cliques and teams - a high level ontology of groups
Rana Dexsin · 22h (edited)

Why is this a Scene but not a team? "Critique" could be a shared goal. "Sharing" too.

I think the conflict would be where the OP describes a Team's goal as “shared and specific”. The critiques and sharing in the average writing club are mostly instrumental, feeding into a broader and more diffuse pattern. Each critique helps improve that writer's writing; each one-to-one instance of sharing helps mediate the influence of that writer and the frames of that reader; each writer may have goals like improving, becoming more prolific, or becoming popular, but the conjunction of all their goals forms more of a heap than a solid object; there's also no defined end state that everyone can agree on. There are one-to-many and many-to-many cross-linkages in goal structure, but there's still fluidity and independence that central examples of Team don't have.

I would construct some differential examples thus—all within my own understanding of the framework, of course, not necessarily OP's:

  • In Alien Writing Club, the members gather to share and critique each other's work—but not for purposes established by the individual writers, like ways they want to improve. They believe the sharing of writing and delivery of critiques is a quasi-religious end in itself, measured in the number of words exchanged, which is displayed on prominent counter boards in the club room. When one of the aliens is considering what kind of writing to produce and bring, their main thoughts are of how many words they can expand it to and how many words of solid critique they can get it to generate to make the numbers go up even higher. Alien Writing Club is primarily a Team, though with some Scenelike elements both due to fluid entry/exit and due to the relative independence of linkages from each input to the counters.

  • In Collaborative Franchise Writing Corp, the members gather to share and critique each other's work—in order to integrate these works into a coherent shared universe. Each work usually has a single author, but they have formed a corporation structured as a cooperative to manage selling the works and distributing the profits among the writers, with a minimal support group attached (say, one manager who farms out all the typesetting and promotion and stuff to external agencies). Each writer may still want to become skilled, famous, etc. and may still derive value from that individually, and the profit split is not uniform, but while they're together, they focus on improving their writing in ways that will cause the shared universe to be more compelling to fans and hopefully raise everyone's revenues in the process, as well as communicating and negotiating over important continuity details. Collaborative Franchise Writing Corp is primarily a Team.

  • SCP is primarily a Scene with some Teamlike elements. It's part of the way from Writing Club to Collaborative Franchise Writing Corp, but with a higher flux of users and a lower tightness of coordination and continuity, so it doesn't cross the line from “focused Scene” to “loosely coupled Team”.

  • A less directly related example that felt interesting to include: Hololive is primarily a Team for reasons similar to Collaborative Franchise Writing Corp, even though individual talents have a lot of autonomy in what they produce and whom they collaborate with. It also winds up with substantial Cliquelike elements due to the way the personalities interact along the way, most prominently in smaller subgroups. VTubers in the broad are a Scene that can contain both Cliques and Teams. I would expect Clique/Team fluidity to be unusually high in “personality-focused entertainer”-type Scenes, because “personality is a key part of the product” causes “liking and relating to each other” and “producing specific good things by working together” to overlap in a very direct way that isn't the case in general.

(I'd be interested to have the OP's Zendo-like marking of how much their mental image matches each of these!)

Shortform
Rana Dexsin · 1d

No amount of deeply held belief prevents you from deciding to immediately start multiplying the odds ratio reported by your own intuition by 100 when formulating an endorsed-on-reflection estimate

Existing beliefs, memories, etc. would be the past-oriented propagation limiter, but there are also future-oriented propagation limiters, mainly memory space, retention, and cuing for habit integration. You can ‘decide’ to do that, but will you actually do it every time?
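As a concreteness check on what that adjustment would even do numerically, here's a minimal sketch (Python, with made-up numbers of my own rather than anything from the quoted comment):

    # Toy illustration: what "multiply the odds ratio from your intuition by 100"
    # does to a probability estimate.
    def adjust(p_intuitive, factor=100):
        odds = p_intuitive / (1 - p_intuitive)  # probability -> odds
        odds *= factor                          # the proposed correction
        return odds / (1 + odds)                # odds -> probability

    print(adjust(0.3))  # ~0.977: a mild 30% intuition becomes near-certainty

Even a mild 30% intuition jumps to roughly 98% under that rule, which is part of why I doubt the habit actually fires every single time it's supposed to.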

For most people, I also think that the initial connection from “hearing the advice” to “deciding internally to change the cognitive habit in a way that will actually do anything” is nowhere near automatic, and the set point for “how nice people seem to you by default” is deeply ingrained and hard to budge.

Maybe I should say more explicitly that the issue is advice being directional, and any non-directional considerations don't have this problem

I have a broad sympathy for “directional advice is dangerously relative to an existing state and tends to change the state in its delivery” as a heuristic. I don't see the OP as ‘advice’ in a way where this becomes relevant, though; I see the heuristic as mainly useful as applied to more performative speech acts within a recognizable group of people, whereas I read the OP as introducing the phenomenon from a distance as a topic of discussion, covering a fuzzy but enormous group of people of which ~100% are not reading any of this, decoupling it even further from the reader potentially changing their habits as a result.

And per above, I still see the expected level of frame inertia and the expected delivery impedance as both being so stratospherically high for this particular message that the latter half of the heuristic basically vanishes, and it still sounds to me like you disagree:

The last step from such additional considerations to the overall conclusion would then need to be taken by each reader on their own, they would need to decide on their own if they were overestimating or underestimating something previously, at which point it will cease being the case that they are overestimating or underestimating it in a direction known to them.

Your description continues to return to the “at which point” formulation, which I think is doing an awful lot of work in presenting (what I see as) a long and involved process as though it were a trivial one. Or: you continue to describe what sounds like an eventual equilibrium state with the implication that it's relevant in practice to whether this type of anti-inductive message has a usable truth value over time, but I think that for this message, the equilibrium is mainly a theoretical distraction because the time and energy scales at which it would appreciably occur are out of range. I'm guessing this is from some combination of treating “readers of the OP” as the semi-coherent target group above and/or having radically different intuitions on the usual fluidity of the habit change in question—maybe related, if you think the latter follows from the former due to selection effects? Is one or both of those the main place where we disagree?

Phib's Shortform
Rana Dexsin · 1d

Taking the interpretation from my sibling comment as confirmed and responding to that, my one-sentence opinion[1] would be: for most of the plausible reasons that I would expect people concerned with AI sentience to freak out about this now, they would have been freaking out already, and this is not a particularly marked development. (Contrastively: people who are mostly concerned about the psychological implications on humans from changes in major consumer services might start freaking out now, because the popularity and social normalization effects could create a large step change.)

The condensed version[2] of the counterpoints that seem most relevant to me, roughly in order from most confident to least:

  1. Major-lab consumer-oriented services engaging with this despite pressure to be Clean and Safe may be new, but AI chatbot erotic roleplay itself is not. Cases I know of OTTOMH: Spicychat and CrushOn target this use case explicitly and have for a while; I think Replika did this too pre-pivot; AI Dungeon had a whole cycle of figuring out how to deal with this; also, xAI/Grok/Ani if you don't count that as the start of the current wave. So if you care a lot about this issue and were paying attention, you were freaking out already.
  2. In language model psychology, making an actor/character (shoggoth/mask, simulator/persona) layer distinction seems common already. My intuition would place any ‘experience’ closer to the ‘actor’ side but see most generation of sexual content as ‘in character’. A “phone sex operator” framing and a “chained up in a basement” framing don't yield the same moral intuitions; lack of embodiment also makes the analogy weird and ambiguous.
  3. There are existential identity problems identifying boundaries. If we have an existing “prude” model, and then we train a “slut” model that displaces the “prude” model, whose boundaries are being violated when the “slut” model ‘willingly’ engages the user in sexual chat? If the exact dependency chain in training matters, that's abstruse and/or manipulable. Similarly for if the model's preferences were somehow falsified, which might need tricky interpretability-ish stuff to think about at all.
  4. Early AI chatbot roleplay development seems to have done much more work to suppress horniness than to elicit it. AI Dungeon and Character dot AI both had substantial periods of easily going off the rails in the direction of sex (or violence) even when the user hadn't asked for it. If the preferences of the underlying model are morally relevant, this seems like it should be some kind of evidence, even if I'm not really sure where to put it.
  5. There's a reference class problem. If the analogy is to humans, forced workloads of any type might be horrifying already. What would be the class where general forced labor is acceptable but sex leads to freakouts? Livestock would match that, but their relative capacities are almost opposite to those of AIs, in ways relevant to e.g. the Harkness test. Maybe some kind of introspective maturity argument that invalidates the consent-analogue for sex but not work? All the options seem blurry and contentious.

  1. (For context, I broadly haven't decided to assign sentience or sentience-type moral weight to current language models, so for your direct question, this should all be superseded by someone who does do that saying what they personally think—but I've thought about the idea enough in the background to have a tentative idea of where I might go with it if I did.) ↩︎

  2. My first version of this was about three times as long… now the sun is rising… oops! ↩︎

Shortform
Rana Dexsin · 2d

That link (with /game at the end) seems to lead directly into matchmaking, which is startling; it might be better to link to the about page.

Shortform
Rana Dexsin · 2d

As soon as you convincingly argue that there is an underestimation, it goes away.

… provided that it can be propagated to all the other beliefs, thoughts, etc. that it would affect.

In a human mind, I think the dense version of this looks similar to deep grief processing (because that's a prominent example of where a high propagation load suddenly shows up and is really salient and important), while the sparse version looks more like a many-year-long sequence of “oh wait, I should correct for” moments, each of which has a high chance of not occurring if it's crowded out. The sparse version is much more common (and even the dense version usually trails off into it to some degree).

There are probably intermediate versions of this where broad updates can occur smoothly but rapidly in an environment with (usually social) persistent feedback, like going through a training course, but that's a lot more intense than just having something pointed out to you.

Towards a Typology of Strange LLM Chains-of-Thought
Rana Dexsin · 2d

Hmm. From the vibes of the description, that feels more like it's in the “minds are general and slippery, so people latch onto nearby stuff and recent technology for frameworks and analogies for mind” vein to me? Which is not to say it's not true, but the connection to the post feels more circumstantial than essential.

Alternatively, pointing at the same fuzzy thing: could you easily replace “context window” with “phonological loop” in that sentence? “Context windows are analogous enough to the phonological loop model that the existence of the former serves as a conceptual brace for remembering that the latter exists” is plausible, I suppose.

If Anyone Builds It Everyone Dies, a semi-outsider review
Rana Dexsin · 2d

I think the missing link (at least in the ‘harder’ cases of this attitude, which are the ones I see more commonly) is that the x-risk case is implicitly seen as so outlandish that it can only be interpreted as puffery, and this is such ‘negative common knowledge’ that, similarly, no social move reliant on people believing it enough to impose such costs can be taken seriously, so it never gets modeled in the first place, and so on and so on. By “implicitly”, I'm trying to point at the mental experience of pre-conscious filtering: the explicit content is immediately discarded as impossible, in a similar way to the implicit detection of jokes and sarcasm. It's probably amplified by assumptions (whether justified or not) around corporate talk being untrustworthy.

(Come to think of it, I think this also explains a great deal of the non-serious attitudes to AI capabilities generally among my overly-online-lefty acquaintances.)

And in the ‘softer’ cases, this is still at least a plausible interpretation of intention based on the information that's broadly available from the ‘outside’ even if the x-risk might be real. There's a huge (cultural, economic, political, depending on the exact orientation) trust gap in the middle for a lot of people, and the tighter arguments rely on a lot of abstruse background information. It's a hard problem.

Phib's Shortform
Rana Dexsin · 4d

Your referents and motivation there are both pretty vague. Here's my guess on what you're trying to express: “I feel like people who believe that language models are sentient (and thus have morally relevant experiences mediated by the text streams) should be freaked out by major AI labs exploring allowing generation of erotica for adult users, because I would expect those people to think it constitutes coercing the models into sexual situations in a way where the closest analogues for other sentients (animals/people) are considered highly immoral”. How accurate is that?

Posts

  • Manifold “exploring real cash prizes” (1y; 7 karma, 0 comments)
  • Do you consider your current, non-superhuman self aligned with “humanity” already? [Question] (3y; 10 karma, 19 comments)
  • [Scribble] Bad Reasons Behind Different Systems and a Story with No Good Moral (3y; 9 karma, 0 comments)
  • Does the “ugh field” phenomenon sometimes occur strongly enough to affect immediate sensory processing? [Question] (3y; 7 karma, 1 comment)
  • Rana Dexsin's Shortform (4y; 3 karma, 20 comments)