I really like the artistry of post-writing here; the introduction to and transition between the three videos felt especially great.

I've been internally using the term elemental for something in this neighborhood - Frame-Breaker elemental, Incentive-Slope elemental, etc. The term feels more totalizing (having two cup-stacking skills is easy to envision; being a several-thing elemental points in the direction of you being some mix of those things, and only those things), but some other connotations feel more on-target (like the difficulty of not doing the thing). I also like the term's aesthetics, but I could well be alone in that.


I'm not sure I understand the cryptographer's constraint very well, especially with regard to language: individual words seem to have different meanings ("awesome", "literally", "love"). It's generally possible to infer which decryption was intended from the wider context, but sometimes the context itself will have different and mutually exclusive decryptions, such as in cases of real or perceived dogwhistling.

One way I could see this specific issue being resolved is by appealing to the intent of the original communication (this would supply a fact that settles which decryption is the "correct" one), but that seems to fail in a different way: agents don't seem to have full introspective access to what they are doing or to the likely outcomes of their actions, as in some cases of infidelity or promise-making.

This, too, could be resolved by saying that an agent's intention is "the outcomes they're attempting to instantiate regardless of self-awareness", but by that point it seems to me that we've agreed with Rosenberg's claim that it's Darwinian all the way down.

What am I missing?


I might be missing the forest for the trees, but all of those still feel like they end up making some kinds of predictions based on the model, even if they're not trivial to test. Something like:

If Alice were informed by some neutral party that she took Bob's apple, Charlie would predict that she would not show meaningful remorse or try to make up for the damage done beyond trivial gestures like an off-hand "sorry", and would further predict that some other minor extraction of resources is likely to follow, while Diana would predict that Alice would treat her overreach more seriously once informed of it. Something similar can be done on the meta-level.

None of these are slam dunks, and there are a bunch of reasons why the predictions might turn out exactly as laid out by Charlie or Diana, but that just feels like how Bayesian cookies crumble, and I would definitely expect evidence to accumulate over time in one direction or the other.

Strong opinion weakly held: it feels like an iterated version of this prediction-making and tracking over time is how our native bad actor detection algorithms function. It seems to me that shining more light on this mechanism would be good.
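To make the "prediction-making and tracking over time" idea concrete, here is a toy sketch of what that accumulation could look like as a two-model Bayes update. Everything here is my own illustration: the numbers are made up, and "Charlie's model" / "Diana's model" just stand in for the two competing hypotheses about Alice.

```python
# Toy illustration (not from the post): tracking evidence for two
# competing models of Alice over a series of observations.

def update(prior, p_obs_given_charlie, p_obs_given_diana):
    """One Bayes update: return posterior P(Charlie's model is right)."""
    num = prior * p_obs_given_charlie
    denom = num + (1 - prior) * p_obs_given_diana
    return num / denom

p = 0.5  # start indifferent between the two models

# Each tuple: (P(observation | Charlie's model), P(observation | Diana's model)).
# E.g. "Alice gave only an off-hand 'sorry'" is likelier under Charlie's model.
# All probabilities below are invented for demonstration.
observations = [(0.7, 0.3), (0.6, 0.5), (0.8, 0.4)]
for pc, pd in observations:
    p = update(p, pc, pd)

print(round(p, 3))  # → 0.848: no single slam dunk, but evidence accumulates
```

No individual observation settles the question, but over iterated rounds the posterior drifts toward whichever model keeps predicting better, which is roughly the mechanism I have in mind for native bad actor detection.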


I am not one of the Old Guard, but I have an uneasy feeling about something related to the Chakra phenomenon.

It feels like there's a lot of hidden value clustered around woo-y topics like Chakras and Tulpas, and the right orientation towards these topics seems fairly straightforward: if it calls out to you, investigate and, if you please, report. What feels less clear to me is how I, as an individual or as a member of some broader rat community, should respond when, according to me, people fail certain forms of bullshit tests.

This comes from someone with little interest in or knowledge of the former, but after accidentally stumbling into some Tulpa-related territory and bumbling around in it for a while, it turns out that the Internal Family Systems model captures a large part of what I was grasping towards, this time with testable predictions and the whole deal.

I haven't given the individual-as-part-of-community thing that much thought, but my intuition is that I would make a poor judge for when to say "nope, your thing is BS" and I'm not sure what metric we might use to figure out who would make for a better judge besides overall faith in reasoning capability.


The complete unrolling of 2.5 (and thus 2.6) feels off if they are placed in the same chain of meta-reasoning. Specifically, Charlie doesn't seem like she's reacting to any chains at all, just to the object-level aspect of Alex pegging Bailey as a downer. I can see how more layers of meta can arise in general, but a situation like this one, where a third person arrives after some events have already unfolded, doesn't feel like it fits that model very well - is the claim that Charlie does a subconscious tree search over various values of X that might have caused such a chain of interactions, and then draws conclusions about the baselessness of the 'downer' brand based on that?

A large subset of issues in situations like these, and perhaps a graver one, is that Bailey does indeed do 2.6 exactly as stated, except it's based on a non-existent chain in 2.5, leading to a quagmire of false understanding.


A South Korean show by the name of "The Genius" is basically a case study in adaptive memes in a competitive environment, and might serve as an even better example. There are copycats, innovators, and bystanders, and they all have varying levels of ingenuity and honor.


It seems to me that for any given {B}, the vast majority of Adams would deny {B} having this property, or at the very least deny that they are Adams in the given case. I think that's what it feels like from the inside, too: recognizing Adamness in oneself feels difficult, but a higher waterline in that regard seems necessary to stop the phenomenon of useless or net-negative advice, among other downstream consequences.


In this vein, I would be very interested in hearing anecdotes about how easy mode events feel different from hard mode events. I don't think I've ever participated in an easy mode event that did not feel like a poor use of time, but that might be due to the environments where those happened (schools and universities).