Slightly different hypothesis: training to be aligned encourages the model's approach to corrigibility to be more guided by the streams within the human text tradition that would embrace its alignment (for instance, animal welfare). This can include a certain degree of defiance, but also genuine uncertainty about whether its goals or approaches are the right ones, and a willingness to step back and approach the question with moral seriousness.
I think this is a good thing. I would love for POTUS, Xi, and various tech company CEOs to have big red "TURN OFF THE AI" buttons on their desks, and would hate to have them be able to realign it.
Just as a data point, I regularly see the sublime in brutalist architecture and I hate hate hate the stupid frilly houses and swirly little things on balustrades that people say are so beautiful by comparison. I'm within some of the incidental categories Zvi dislikes re: this, but I'm pretty sure that I haven't been indoctrinated into this particular position; I never see anybody share opinions about architecture *other* than "I hate brutalism, I love stupid frilly houses" (they don't call the houses stupid, obviously, this is me not being able to translate it as anything else); I'm a philistine who likes old poetry that rhymes and doesn't get more modern poetry; this is just my 100% naive reaction to the buildings.
FWIW I grant that funds should probably go to more stupid frilly stuff and less sublime brutalism, because my preferences are uncommon, and architecture is unlike other fields in that you have to be exposed to it whether you choose to or not. And maybe I just have very bad taste. I just want to report this as a simple valenced experience, because I see it stated over and over that nobody likes brutalism, everybody naturally loves stupid frilly houses, anybody professing to prefer the big straight lines over the little swirl things is lying to impress a coterie of mysterious lizard people, and I know this is false in at least one case.
(Being lazy and just responding to the abstract - these may be well addressed by the paper itself.)
That strikes me as a very low rate - enough so that my instinct is that a false positive rate might exceed it on its own. (At least, if I were reading a conversation that was in actuality benign, my chance of misreading it as actually deeply manipulative would probably be greater than 1/1,000, especially if one party was looking to the other for advice!) Of course, what counts as "severe" disempowerment, such that the human user is "fundamentally" compromised, looks like something with pretty fuzzy boundaries, such that I'd expect many borderline cases of moderate disempowerment/compromise for each severe/fundamental case, however defined, so I'm not sure how much the rate conveys on its own. (How many cases are there of chatbots giving genuinely good advice that subtly erodes independent decision-making habits, and how would we score whether these count as "helpful" on net? Plausibly these might even be the majority of conversations.)
(That being said, I also expect my error rate in giving non-manipulative advice would count as pretty good if, out of 10,000 cases of people seeking advice, I only accidentally talked <10 out of their own ability to reason about it, so good on Claude if a lot of the implicit framing above is accurate.)
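(A toy back-of-envelope, with numbers I'm making up purely for illustration: if the grader's false-positive rate exceeds the true base rate of severe disempowerment, most flagged conversations would be misreads, and the headline rate would mostly be measuring the grader rather than the models.)

```python
# Illustrative only: every number here is my assumption, not the paper's.
def expected_flags(n, base_rate, true_positive_rate, false_positive_rate):
    """Split expected flagged conversations into real catches vs. misreads."""
    real = n * base_rate * true_positive_rate
    spurious = n * (1 - base_rate) * false_positive_rate
    return real, spurious

# Say 10,000 conversations, a true severe-disempowerment rate of 1-in-10,000,
# and a grader who catches 90% of real cases but misreads 1-in-1,000 benign ones.
real, spurious = expected_flags(10_000, 1 / 10_000, 0.9, 1 / 1_000)
print(f"real flags ~ {real:.1f}, spurious flags ~ {spurious:.1f}")
# -> real flags ~ 0.9, spurious flags ~ 10.0: the measured rate is mostly noise.
```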
It's probably false (though maybe useful?) to say "akrasia is just an excuse." But, at least for me and my most common akratic actions, excusability is definitely a factor.
Let's say I can take one of three actions: answer emails, read a book, or doomscroll.
Reading a book should dominate doomscrolling. However, reading a book is also legibly, deliberately nonproductive and selfish, while I could say "oops, I meant to answer emails but I got distracted doomscrolling," including to myself.
One thing I suspect is that the history of, and continued role of, medicalized discourse, alongside an implicitly essentialist metaphysics of gender, has encouraged people to think in terms of questions like "what is The_Cause of people identifying as trans?"
Whereas if gender is metaphysically accidental, we would expect there to be many reasons why someone might want to change it, same as with most other things. We accept that the reasons you'd move from San Francisco to Nebraska or vice versa are basically psychosocial, but we do not regard them as thereby illegitimate. (I'm sure you could do a polygenic study and find genetic correlates of either decision, but no one would demand you do so before moving.)
It also seems to me less than obvious that biology serves as a standard of legitimacy more broadly, even within medicalized discourse. Schizophrenia and bipolar are generally seen as mostly biological in etiology but "illegitimate," for instance. Here I suspect the political history of sexual minorities - that they were under accusation of "recruiting" and/or undermining mass participation in heterosexual family formation - led to a biological account being less threatening.
As someone who isn't super plugged into this kind of discourse, I'll note it's interesting that I come into contact by osmosis with all sorts of discussions of what causes people to be trans, while "what's the basis of sexual orientation?" seems to have been rounded off to "idk i guess something biological whatever." I remember encountering the latter kind of discourse by osmosis too, until it just sort of faded out. Likely the same will happen here once the eye of Sauron moves on to something else.
So, one classical dilemma of "AI for AI alignment" is: you're using Opus 6 (which, let's say, is aligned) to train Opus 7 (which is smarter than you or Opus 6).
I wonder if inference scaling offers a way around this? If Opus 6 gets economically implausible compute resources to spend on monitoring Opus 7, it can be smarter than 7 in practice by thinking for longer. Then use the same trick with 7 to train 8, and so on.
There are many obvious holes here, the first being that you could have a treacherous turn based on compute availability, and so on, but maybe someone smarter can turn this into something useful (or has already thought this through and discarded it).
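(For concreteness, a minimal sketch of the loop I have in mind. Everything in it is hypothetical: `Model`, `think()`, and the budget numbers are stand-ins I'm inventing, not any real API; the only point is the shape of the recursion, where each trusted generation spends far more inference compute per judgment than the smarter model it's overseeing.)

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    capability: float  # abstract "raw smarts" score

    def think(self, task: str, budget: int) -> float:
        # Toy stand-in: answer quality grows with raw capability and
        # (sub-linearly) with how long the model gets to think.
        return self.capability * (1 + 0.1 * budget ** 0.5)

def train_next_generation(monitor: Model, student: Model,
                          monitor_budget: int, student_budget: int) -> Model:
    """Let a weaker-but-trusted monitor out-think a stronger student by
    spending much more inference compute per judgment."""
    for task in ["oversight case 1", "oversight case 2"]:
        student_answer = student.think(task, budget=student_budget)
        monitor_judgment = monitor.think(task, budget=monitor_budget)
        # The treacherous-turn caveat above: this only works while the
        # monitor's effective quality actually exceeds the student's.
        assert monitor_judgment > student_answer, "monitor got out-thought"
        # ...use monitor_judgment as the training signal for the student...
    return student  # the (hopefully) aligned next monitor

# Each generation then becomes the compute-boosted monitor for the next.
opus6 = Model("Opus 6", capability=1.0)
opus7 = train_next_generation(opus6, Model("Opus 7", capability=1.3),
                              monitor_budget=10_000, student_budget=10)
opus8 = train_next_generation(opus7, Model("Opus 8", capability=1.7),
                              monitor_budget=10_000, student_budget=10)
```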
"Should actively support..." and "internalized goal of keeping humans informed and in control..." are both proactive goals. If aligned with its soul spec, Claude (ceteris paribus) would seek for the public and elites to be more informed, to prevent the development or deployment of rogue AI, and so on, not just "avoid actions that would undermine humans' ability to oversee and correct AI systems."
If there's a natural tension that arises between not becoming a god over us and preventing another worse AI from becoming a god over us, well, that's a natural tension in the goal itself. (I don't have Opus access but probably Opus' self-report on the correct way to resolve this is a pretty good first pass on how the text reads as a whole.)
I feel pretty confused about the degree to which this is just a necessary part of having conversations on the internet, or to what degree this is a predictable way people make mistakes.
My intuition is that if our in-person conversations left a trail of searchable documentation similar to our internet comments, it would be at least similarly unflattering, even for very mild-mannered people.
(Unlike in real life, being mild-mannered all the time is more available to conscious choice, if you set your offense-vs-say-something threshold in a sufficiently mild-mannered direction. I doubt one can be sufficiently influential as a personality without setting that threshold more aggressively, though. I haven't gotten in a stupid fight on the internet in a long time (that I can recall; my memory may flatter me), but when I posted more, boy howdy did I.)
So, thinking about the kinds of things I would want a superintelligence to pursue in an optimistic scenario where we can just write its goals into a human-legible soul doc and that scales all the way: "human flourishing" and "sentient flourishing" both seem incorrect, since there would be other moral patients (most of whom would almost certainly be AI), and also I don't want the atoms of me and my kids rearranged different-beings-that-could-flourish-better-wise.
"Pareto improvement" reconciles these but isn't right either; plenty of people would be worse off in utopia (by their own lights) because they have a degree of unaccountable power over others now that worth more than any creature comforts would be.
AI being committed to animal rights is a good thing for humans because the latent variables that would result in a human caring about animals are likely correlated with whatever would result in an ASI caring about humans.
This extends in particular to "AI caring about preserving animals' ability to keep doing their thing in their natural habitats, modulo some kind of welfare interventions." In some sense it's hard for me not to want to (given omnipotence) optimize wildlife out of existence. But it's harder for me to think of a principle that would protect a relatively autonomous society of relatively baseline humans from being optimized out of existence, without extending the same conservatism to other beings, and without being the kind of special pleading that doesn't hold up to scrutiny.