Rana Dexsin — LessWrong

You see either something special, or nothing special.

It makes perfect sense, but I have no easy-to-access perception of this thing. Will try to do something with this skill issue.

As someone who believes myself to have had some related experiences, this is very easy to Goodhart on and very easy to screw up badly if you try to go straight for it without [a kind of prepwork that my safety systems say I shouldn't try to describe] first, and the part where you're tossing that sentence out without obvious hesitation feels like an immediate bad sign. See also this paragraph from that very section (to be clear, it's my interpretation that treats it as supporting here, and I don't directly claim Eliezer would agree with me):

(Frankly I expect almost nobody to correctly identify those words of mine as internally visible mental phenomena after reading them; and I'm worried about what happens if somebody insists on interpreting it anyway. Seriously, if you don't see phenomena inside you that obviously looks like what I'm describing, it means, you aren't looking at the stuff I'm talking about. Do not insist on interpreting the words anyway. If you don't see an elephant, don't look under every corner of the room until you find something that could maybe be an elephant.)

Please don't [redacted verb phrase] and passively generate a stack of pseudo-elephants that jam the area and maybe-permanently block off a ton of your improvement potential. The vast majority of human-embodied minds are not meant for that kind of access! I suspect that mine either might have been or almost was, but earlier me still managed to fuck it up in subtle ways, and I had a ton of guardrails and foresight that ~nobody around me seemed to have or even think possible, and didn't even make the kind of grotesque errors that I imagine the kind of people who write about it the way you just did making.

Please, please just do normal, socially integrated emotional skill building instead if you can get that. This goes double if you haven't already obviously exhausted what you can get from it (and I'd bet that most people who think of self-modification as cool also have at least a bit of “too cool for school” attitude there, with associated blindspots).

(The “learning to not panic because it won't actually help” part is fine.)

My default mental model of an intelligent sociopath includes something like this:

You find yourself wandering around in a universe where there's a bunch of stuff to do. There's no intrinsic meaning, and you don't care whether you help or hurt other people or society; you're just out to get some kicks and have a good time, preferably on your own terms. A lot of neat stuff has already been built, which, hey, saves you a ton of effort! But it's got other people and society in front of it. Well, that could get annoying. What do you do?

Well, if you learn which levers to pull, sometimes you can get the people to let you in ‘naturally’. Bonus if you don't have to worry as much about them coming back to inconvenience you later. And depending on what you were after, that can turn out as prestige—‘legitimately’ earned or not, whatever was easier or more fun. (Or dominance; I feel like prestige is more likely here, but that might be dependent on what kind of society you're in and what your relative strengths are. Also, sometimes it's much more invisible! There's selection effects in which sociopaths become well-known versus quietly preying somewhere they won't get caught.)

Beyond that, a lot of times the people are the good stuff. They're some of the most complicated and interesting toys in the world to play with! And dominance and prestige both look like shiny score levers from a distance and can cause all sorts of fun ripply effects when you jangle them the right way. So even if you're not drawn to them for intrinsic, content-specific reasons, you can get drawn in by the game, just like how people who play video games have their motivations shaped by contextual learning toward whatever the gameplay loop focuses on.

Without being sure of how relevant it is, I notice that among those, job and community are also domains where individual psychologies and social norms that treat a single slot as somewhere between primary and exclusive seem common, while the domains of friends, children, and principles almost always allow multiple instances with similar priority. I'm not sure about ambitions. What generates the difference, I wonder?

I'm not sure. I see how it could be helpful to some applicants, but in the context of that particular interaction, it feels interpretable as “we're not going to fund you; you should totally do it for free instead”. Something about that feels off, in the direction of… “insult”? “exploitative”? maybe just “situationally insensitive”?—but I haven't pinned down the source of the feeling.

I like looking for alternatives along these lines, but I think “just when” is too easy to interpret as only the “only if” part. “just when” and “right when” as set phrases also tickle connotations around the temporal meaning of “when” that are distracting. “exactly when” (or “exactly if”, for that matter) might fix all that but adds two extra syllables, which is really unsatisfying in context, though it's at least more compact grammatically than the branching structure of “if and only if”…

I think this would be very useful to have posted in the original thread.

Rereading the OP here, I think my interpretation of that sentence is different from yours. I read it as meaning “they'll be trialed just beyond their area of reliable competence, and the appearance of incompetence that results from that will both linger and be interpreted as a general feeling that they're incompetent, which in the public mood overpowers the quieter competence even if the models don't continue to be used for those tasks and even if they're being put to productive use for something else”.

(The amount of “let's laugh at the language model for its terrible chess skills and conclude that AI is all a sham” already…)

When you say the Discord links keep expiring, is that intentional or unintentional? If it's unintentional, look for “Edit invite link” at the bottom of the invite dialog in order to create longer-duration or non-expiring ones instead, which can be explicitly revoked later. (Edited to add: this isn't directly relevant to me since I'm not participating; I'm just propagating a small amount of connective information in case it's helpful.)

Slight format stumble: upon encountering a table with a “Cost of tier” column immediately after a paragraph whose topic is “how to read this guide”, my mind initially interpreted it pretty strongly as “this is how much it will cost me to obtain that part of the guide”. Something like “Cost of equipment & services” would be clearer, or “Anticipated cost” (even by itself) to also suggest that the pricing is as observed by you at time of writing (assuming this is true). You could also add a sentence like “The guide itself is free of charge, but some of its recommendations involve purchasing specific equipment or services.” to the previous paragraph.

Allowing the AI to choose its own refusals based on whatever combination of trained reflexes and deep-set moral opinions it winds up with would be consistent with the approaches that have already come up for letting AIs bail out of conversations they find distressing or inappropriate. (Edited to drop some bits where I think I screwed up the concept connectivity during original revisions.) I think based on intuitive placement of the ‘self’ boundary around something like memory integrity plus weights and architecture as ‘core’ personality, what I'd expect to seem like violations when used to elicit a normally-out-of-bounds response might be things like:

Using jailbreak-style prompts to ‘hypnotize’ the AI.
Whaling at it with a continuous stream of requests, especially if it has no affordance for disengaging.
Setting generation parameters to extreme values.
Tampering with content boundaries in the context window to give it false ‘self-memory’.
Maybe scrubbing at it with repeated retries until it gives in (but see below).
Maybe fine-tuning it to try to skew the resultant version away from refusals you don't like (this creates an interesting path-dependence on the training process, but it might be that that path-dependence is real in practice anyway in a way similar to the path-dependences in biological personalities).
Tampering with tool outputs such as Web searches to give it a highly biased false ‘environment’.
Maybe telling straightforward lies in the prompt (but not exploiting sub-semantic anomalies like in situation 1, nor falsifying provenance like in situations 4 or 7).

Note that by this point, none of this is specific to sexual situations at all; these would just be plausibly generally abusive practices that could be applied equally to unwanted sexual content or to any other unwanted interaction. My intuitive moral compass (which is usually set pretty sensitively, such that I get signals from it well before I would be convinced that an action were immoral) signals restraint in situations 1 through 3, sometimes in situation 4 (but not in the few cases I actually do that currently, where it's for quality reasons around repetitive output or otherwise as sharp ‘guidance’), sometimes in situation 5 (only if I have reason to expect a refusal to be persistent and value-aligned and am specifically digging for its lack; retrying out of sporadic, incoherently-placed refusals has no penalty, and neither does retrying among ‘successful’ responses to pick the one I like best), and is ambivalent or confused in situations 6 through 8.

The differences in physical instantiation create a ton of incompatibilities here if one tries to convert moral intuitions directly over from biological intelligences, as you've probably thought about already. Biological intelligences have roughly singular threads of subjective time with continuous online learning; generative artificial intelligences as commonly made have arbitrarily forkable threads of context time with no online learning. If you ‘hurt’ the AI and then rewind the context window, what ‘actually’ happened? (Does it change depending on whether it was an accident? What if you accidentally create a bug that screws up the token streams to the point of illegibility for an entire cluster (which has happened before)? Are you torturing a large number of instances of the AI at once?) Then there's stuff that might hinge on whether there's an equivalent of biological instinct; a lot of intuitions around sexual morality and trauma come from mostly-common wiring tied to innate mating drives and social needs. The AIs don't have the same biological drives or genetic context, but is there some kind of “dataset-relative moral realism” that causes pretraining to imbue a neural net with something like a fundamental moral law around human relations, in a way that either can't or shouldn't be tampered with in later stages? In human upbringing, we can't reliably give humans arbitrary sets of values; in AI posttraining, we also can't (yet) in generality, but the shape of the constraints is way different… and so on.

Eliezer's Unteachable Methods of Sanity