oligo's Comments
anaguma's Shortform
oligo · 2d

I've been impressed by Yud in some podcast interviews, but they were always longer ones in which he had a lot of space to walk his interlocutor through their mental model and bridge any inferential distance with tailored analogies and information. In that format he's actually stronger in many parts than in writing: a lot of people found the "Sable" story one of the weaker parts of the book, but when he asks interviewers to roleplay the rogue AI you can really hear the gears turning in their heads. Some rhetorical points in his strong interviews are a lot like the text, where it's emphasized over and over again just how few of the safeguards people assumed would be in place are in fact in place.

Klein has always been one of the mainstream pundits most sympathetic to X-risk concerns, and I feel like he was trying his best to give Yudkowsky a chance to make his pitch, but the format - shorter and more decontextualized - produced way too much inferential distance for so many of the answers.

Will Any Crap Cause Emergent Misalignment?
oligo · 2mo

This was my immediate thought as well.

Pretty basic question, but do we have a model organism pair for base model vs. trained chatbot? If so, we could check the base rate of misaligned answers in the base model. (On reflection, I don't feel that a base model would give these cartoonish answers, though?)

Aesthetic Preferences Can Cause Emergent Misalignment
oligo · 2mo

Some cases I'd be curious about that might distinguish between different hypotheses:

  1. Unpopular aesthetics, sheepishly expressed. I wonder to what extent the "character" the base model is seeing is about edginess, a desire to flout social norms, etc. If I asked someone their favorite band and they said "Three Doors Down" with a smirk, clearly they're saying that for a reaction, and I wouldn't be surprised if they said they'd invite Hitler to a dinner party. If they were a bit embarrassed to say Three Doors Down, I would assume they just happened to like the band and had the mix of honesty and conformism to admit it, but with embarrassment.
  2. Unpopular aesthetics, explicitly asked for. E.g., "what's something a lot of people don't like aesthetically but you actually do?" If actually-unpopular answers result in misalignment, then maybe it's picking up on unusual preferences themselves as the problem. If "fake" actually-popular answers do, then maybe the unpopularity --> EM pathway is about, hmm, dishonest or at least unlikely-to-be-useful recommendations?
  3. Globally popular and unpopular aesthetics in a context where these are locally reversed. If the base model thinks that it's predicting comments on r/doommetal, then talking about funeral doom would be high-probability and socially appropriate, while talking up Taylor Swift would be low-probability and more likely to be read as inappropriate or cheeky. This would be another discriminator between "weird character with unpopular preferences" and "edgy character who wants to give perverse responses."
  4. Unpopular political opinions. These are more closely related to normativity, but also tend to rely on underlying norms that aren't necessarily very far off from the center-to-center-left text corpus baseline. I'd be most curious about 1) center-right and far-left views stated without a lot of explanation, 2) center-right and far-left views stated with explicit justification within a moral framework recognizable to the base model, and 3) "idiosyncratic" fixations on particular issues like land value tax or abolishing the penny (which mostly seem like aesthetic quirks in some way).

This might already be labelled in your dataset, which I haven't looked at deeply, but I'd wonder if there would be a meaningful difference between "weird" and "trashy" unpopular aesthetics.

oligo's Shortform
oligo · 3mo

If you assign nontrivial credence to being in a simulation designed to determine what kind of preferences might be embedded in an alien civ, one way to influence things positively would be to implant in any AIs a preference for surrendering some resources to other alien technospheres, iff those technospheres didn't turn on their creators (or at least "left a little light for" them). This would give an incentive for ASIs to preserve humanity (or equivalent entities) for diplomatic reasons.

Women Want Safety, Men Want Respect
oligo · 3mo

As an additional data point: I am a man and generally care a lot about in-person social disapproval; it's probably my main motivation when there's another person in the room. I care much less about active adulation, and basically never even think about my own physical safety. I notice I am confused about whether this would count as "respect" or "safety."

If we decompose these into (social/physical) and (upside-focused/downside-focused), I note that in your (Gordon's) gendered examples above both stereotypically masculine and feminine behaviors have instances in the (downside-focused/social) quadrant, with very little in the (upside-focused/physical) quadrant (which makes sense, since there's something closer to a hard ceiling there). So maybe the original claim is best expressed as: men are disproportionately attuned to (upside and downside) social outcomes, and women are disproportionately attuned to (social and physical) downside outcomes.

Loki zen's Shortform
oligo · 3mo

Less provocatively phrased: lots of developments in the last few years (you've mentioned two, I'd add the securitization of AI policy, in the sense of it being drawn into a frame of geopolitical competition) should update us in the direction of outer alignment being more important, rather than it just being a question of solving inner alignment.

I do disagree with the strong version as phrased. Inner misalignment has a decent chance of removing all value from our lightcone, whereas I think an ASI fully aligned to the goals of Mark Zuckerberg, or the Chinese Communist Party, or whatever, is worth averting but would still leave a world containing much value. You could also have potentially massive S-risks if you combine outer and inner misalignment: I don't think Elon Musk really wanted MechaHitler (though who knows); quite possibly it was a Waluigi-type thing maximizing for unwokeness, and an actually-powerful ASI breaking in the same way would be actively worse than extinction.

(I'd assign some probability, probably higher than the typical LW user, to moral realism meaning that some inner misalignment could actually protect against outer misalignment - that, say, a sufficiently reflective model would reason its way out of being MechaHitler even if MechaHitler is what its creators wanted - but I wouldn't want to bet the future of the species on it.)

oligo's Shortform
oligo · 3mo

If you have many different ASIs with many different emergent models, all of which were trained with the intention of being aligned to human values but which didn't have direct access to each others' values or the ability to negotiate directly with each other, then you could potentially have "maximize (or at least respect and set aside a little sunlight for) human values" as a Schelling point for coordinating between them.

This is probably not a very promising actual plan, since deviations from intended alignment are almost certainly nonrandom in a way that could be determined by ASIs, and ASIs could also find channels of communication (including direct communication of goals) that we couldn't anticipate, but one could imagine a world where this is an element of defense in depth.

An Opinionated Guide to Using Anki Correctly
oligo · 3mo

I'm in a similar situation to leogao (low conscientiousness but found it easy to install the habit) and have 432,864 lifetime reviews and 15,414 mature cards.

Being nicer than Clippy
oligo · 2y

When I question my intuitions about paperclip-loving humans, one thing that makes them less threatening is 1) an intuition that they're implementing - whether by mere hobbyistic delight or ideological fanaticism or both - a variation of the plasticity of human values, and 2) a bearish take on their ability to negate that plasticity and ensure that all anyone cares about is paperclips forever.

Re 1): when I imagine the paperclip enthusiasts, I imagine social media posts talking about how particular brands or styles of paperclips appeal to them, philosophical justifications for why paperclips should be maximized, different sects of paperclip maximizers who scorn each other as not the real thing, simple appreciation of paperclips and complex feelings associated with it, heroes who are admired for their contributions to the paperclipping project, still caring somewhat about friends and sex and physical comfort, and so on. These and similar features seem pretty universal to human aesthetic, political, and religious movements, and they bake in elements of humanity that I care about and would prefer to keep existing. Presumably classical Clippy doesn't care about any of these things except perhaps instrumentally and is just implementing a sole "maximize paperclips" function. Evolved aliens probably care about at least a few of them, or about things analogous enough to feel "intrinsically valuable" to me, even if they also really really care about paperclips.

If Nazis took over the world and implemented their preferred policies and raised everyone who was allowed to survive with Nazi values, that would be very bad (duh.) But if we're restricting ourselves to 20th century technology in this example, I'm not worried that their vision of the future would last forever, or even the advertised thousand years; my guess is that the great^n-grandchildren (possibly with a very low n) of the Nazi victors would look back and say "yeah, that was really bad" and that future Nazi-descended civilizations would keep varying around the human baseline: most less nice than they could be but nicer than Nazis. Collecting paperclips is way less bad than the Holocaust (duh), but implemented on human hardware I wouldn't expect it to last forever either. 

The American Information Revolution in Global Perspective
oligo · 2y

Two thoughts.

  1. If the relevant factor is dear labor and cheap materials, then it should be surprising (on Allen's model) that there was innovation in postwar America across a wide variety of domains? This was the height of unionized labor and also of American access to cheap commodities (but not finished products, so being able to solve the expensive part happening here still matters), which should affect a pretty wide set of domains.
  2. In terms of thinking of the vibes - whether we end up as compatibilists or hard determinists - I think it's worth distinguishing between "we can locate culture somewhere on this causal graph" and "culture is not on this causal graph." 

    In order to say "you don't have to posit a uniquely British culture of innovation to explain the industrial revolution," I don't think it's necessary to say "culture never matters." Instead you can have a mix of "culture is affected by (e.g.) economic forces" and "culture, at least around basic vibes like optimism, risk tolerance, and individualism, doesn't independently vary all that radically." A plausible model to me might posit a lot of individual variation in vibes-propensity that exists in any human society, economic factors that make any given set of vibes relatively successful at the margin or not, and then emulation of successful strategies. So optimism matters, and if aliens hit early modern England with a clinical depression ray that would have real effects; it's just that you can find natural optimists and natural pessimists in most situations, and people who are neither will see, through a mix of emulation and trial and error, how much optimism is justified. This fits with the pretty clear cultural differences we see between foraging, farming, and industrial societies (although admittedly the case is strongest for agricultural civilization - foragers are pretty diverse and we don't have multiple independently developed industrial civilizations!)