[Feedback from a private channel which I am reposting here at Lorxus' request.] I think this post got little karma because the opening is weak. What I'd like in the first two paragraphs of this post is (a) a concrete example of whatever the post is about, and (b) some indication of how "helpfulness" and "theory" and "principles" show up in that example, so I know what the post is even going to talk about.
When I read that, I feel both like it's describing a phenomenon which probably does happen a fair bit, and like it basically misses the core drivers behind most of my own BDSM-esque experiences. There's more than one other pattern I've experienced, but here's one which I feel able to flesh out right now.
For certain relationships, a strong dom/sub dynamic feels like the natural shape of the relationship. Ordinary social norms dictate maintaining a much greater semblance of equality, but adhering to those norms feels psychologically forced and uncomfortable in a way analogous to holding one's body in an awkward shape for a long time.
In the cases I'm thinking of, there are domains in which she usually feels either anxious or insecure or both. She mostly doesn't understand what's going on, or mostly doesn't know what to do, or feels embarrassed or decision-paralyzed, or doesn't know what she wants or what I want, etc. (Sex is one such domain, but often there are many.) And on the other side, I mostly do understand what's going on, mostly do have a good plan, feel comfortable making decisions, know what we both want, etc. In such a domain, asking her to make decisions is frustrating for both of us; it's much smoother if I just take charge and tell her what to do (and ask narrow questions of her when I need information, or delegate narrow tasks when I need her to do things).
That's one type of pressure which pushes toward dom/sub shaped relationships. And when the relationship naturally takes that shape, it feels so much smoother to step into explicit dom/sub roles, rather than trying to maintain a semblance of power balance.
Another pattern I've experienced: rather than an inability to make decisions or stress about making decisions, she wants to shut down her brain and let someone else handle everything. That's the sort of thing Aella talks about in her "Good at Sex" series; it applies to the sexual trance state, but also to some similar states outside the bedroom, like e.g. following in a lot of dances. And on my side, that leaves me a lot of freedom to take things wherever I please; I can act relatively unconstrained. The upshot is similar: the relationship naturally falls into a dom/sub shape, and making it take some other shape feels psychologically forced and uncomfortable.
My updates over the years:
(Update 6)
The most general version of the chainability conjecture (for arbitrary graphs) has now been numerically falsified by David, but the version specific to the DAGs we need (i.e. the redundancy conditions, or one redundancy condition plus the mediation condition) still looks good.
The most likely proof structure would use this lemma:
Lemma
Let $f_1, f_2$ be nonexpansive maps under distance metric $d$. (Nonexpansive maps are the non-strict version of contraction maps.) Then for any point $P$:

$$d(f_2(f_1(P)), P) \le d(f_1(P), P) + d(f_2(P), P)$$

By the nonexpansive map property, $d(f_2(f_1(P)), f_2(P)) \le d(f_1(P), P)$. And by the triangle inequality for the distance metric, $d(f_2(f_1(P)), P) \le d(f_2(f_1(P)), f_2(P)) + d(f_2(P), P)$. Put those two together, and we get

$$d(f_2(f_1(P)), P) \le d(f_1(P), P) + d(f_2(P), P)$$
(Note: this is a quick-and-dirty comment so I didn't draw a nice picture, but this lemma is easiest to understand by drawing the picture with the four points and distances between them.)
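As a quick numerical sanity check on the lemma, here's a minimal sketch: the two maps below (projection onto a box, and averaging toward a fixed anchor point) are just generic nonexpansive maps under Euclidean distance, standing in for the DAG operators, and the chained bound should hold on every sample.

```python
# Minimal numerical sanity check of the lemma. The two maps are generic
# nonexpansive maps under Euclidean distance (illustrative stand-ins, not the
# DAG operators):
#   f1 = projection onto the unit box (projections onto convex sets are nonexpansive)
#   f2 = averaging toward a fixed anchor (Lipschitz constant 0.5)
import numpy as np

rng = np.random.default_rng(0)
anchor = rng.normal(size=5)

def f1(x):
    return np.clip(x, 0.0, 1.0)

def f2(x):
    return 0.5 * x + 0.5 * anchor

def d(a, b):
    return np.linalg.norm(a - b)

for _ in range(10_000):
    P = rng.normal(scale=3.0, size=5)
    lhs = d(f2(f1(P)), P)
    rhs = d(f1(P), P) + d(f2(P), P)
    assert lhs <= rhs + 1e-9, (lhs, rhs)

print("Chained bound d(f2(f1(P)), P) <= d(f1(P), P) + d(f2(P), P) held on all samples.")
```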
I think that lemma basically captures my intuitive mental picture for how the chainability conjecture "should" work, for the classes of DAGs on which it works at all. Each DAG would correspond to one of the functions $f_1, f_2$, where $f_i$ takes in a distribution $P$ and returns the distribution factored over the DAG $G_i$, i.e.

$$f_i(P)[X] = \prod_j P(X_j \mid X_{\mathrm{pa}_{G_i}(j)})$$
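For concreteness, here's a minimal sketch of that factorization operator on a toy joint distribution over three binary variables; the chain DAG used below is purely illustrative, not one of the redundancy/mediation DAGs.

```python
# Sketch of the DAG-factorization operator f_G: given a joint distribution P
# (a numpy array with one axis per variable) and a parent map for DAG G,
# return the distribution prod_j P(X_j | X_pa_G(j)).
import numpy as np

def conditional(P, child, parents):
    """P(X_child | X_parents), broadcastable against the full joint array."""
    n = P.ndim
    keep = set(parents) | {child}
    joint = P.sum(axis=tuple(a for a in range(n) if a not in keep), keepdims=True)
    parent_marginal = joint.sum(axis=child, keepdims=True)
    return joint / parent_marginal

def factor_over_dag(P, parent_map):
    """Apply f_G to P, where parent_map maps each variable index to its parents."""
    out = np.ones_like(P)
    for child, parents in parent_map.items():
        out = out * conditional(P, child, parents)
    return out

# Toy usage: random joint over (X0, X1, X2), factored over the chain X0 -> X1 -> X2.
rng = np.random.default_rng(0)
P = rng.random((2, 2, 2))
P /= P.sum()
chain_dag = {0: (), 1: (0,), 2: (1,)}   # illustrative DAG only
fP = factor_over_dag(P, chain_dag)
print(fP.sum())                          # ~1.0: still a distribution
print(np.linalg.norm(fP - P))            # Euclidean distance d(f_G(P), P)
```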
In order to apply the lemma to get our desired theorem, we then need to find a distance metric $d$ which:
- makes $f_1$ (the first DAG's factorization operator) nonexpansive,
- makes $f_2$ (the second DAG's factorization operator) nonexpansive, and
- matches the approximation error in the conditions we actually want to chain.
The first two of those are pretty easy to satisfy for the redundancy condition DAGs: those two DAG operators are convex combinations, so good ol' Euclidean distance on the distributions should work fine. Making the metric match the approximation error is trickier; still working that out.
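As a minimal illustration of the "convex combinations play nicely with Euclidean distance" point (using mixing with a fixed distribution as a stand-in operator, not the actual redundancy-condition operators): if $f(P) = \alpha P + (1-\alpha) Q$ for some fixed distribution $Q$ and $\alpha \in [0,1]$, then

$$\|f(P_1) - f(P_2)\|_2 = \|\alpha (P_1 - P_2)\|_2 = \alpha \|P_1 - P_2\|_2 \le \|P_1 - P_2\|_2$$

so $f$ is nonexpansive under Euclidean distance, and the lemma's chained bound applies to any pair of such operators.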
So there's this ethos/thought-pattern where one encounters some claim about some thing X which is hard to directly observe/measure, and this triggers an attempt to find some easier-to-observe thing Y which will provide some evidence about X. This ethos is useful on a philosophical level for identifying fake beliefs, which is why it featured heavily in the Sequences. But I claim that, to a rough approximation, this ethos basically does not work in practice for measuring things X, and people keep shooting themselves in the foot by trying to apply it to practical problems.
What actually happens, when people try to apply that ethos in practice, is that they Do Not Measure What They Think They Are Measuring. The person's model of the situation is just totally missing the main things which are actually going on, their whole understanding of how X relates to Y is wrong, it's a coinflip whether they'd even update in the correct direction about X based on observing Y. And the actual right way for a human (as opposed to a Solomonoff inductor) to update in that situation is to just ignore Y for purposes of reasoning about X.
The main thing which jumps out at me in your dialogue is your self-insert repeatedly trying to apply this ethos which does not actually work in practice.
I can imagine a counter-argument that says "you're noticing deep problems and then your wishful thinking is saying 'but maybe they won't bite' but you should notice how deep and pernicious they are." But this argument feels like it proves too much. Don't plenty of fields have pernicious problems of a similar character, but manage to make progress anyway?
My answer to this is actually "no", for the most part. There are fields which make lots of progress by avoiding this flavor of perniciousness in various ways - e.g. market feedback pressures are a big one which can work insofar as a nontrivial fraction of downstream consumers are capable of recognizing problems. Then there are fields which don't have ways of avoiding this flavor of perniciousness, and they mostly either don't make progress, or end up "faking it".
Problem-hiding and deception are naturally countered by corrigibility, and I expect a pseudo-corrigible agent to spend a bunch of cognitive effort hunting in their own mind for schemes and problems.
This part is, I claim, off. The kind of "pseudo-corrigibility" one would get from training a model to output corrigible-looking things would importantly not involve any selection pressure directly on how the thing thinks internally. Training could select for the system to output things which sound like they result from searching its own mind for schemes and problems, but it could not select for actually doing that (separate from the outputs having that appearance). And that's the sort of thing which is relatively easy to fake, very hard to verify, and plausibly easier to fake than to actually do (depending on how alien the mind's internals actually are). Very likely, there will be at least some cases where the thing can get better scores from the humans by faking hunting its own mind for schemes and problems - much like e.g. how today's models (and humans) will settle on an answer and then make up a totally-fake retrospective story about how they arrived at that answer.
I don't think this point is cruxy on its own, but I think it points toward some central important difference between however you're thinking about things and however I'm thinking about things. Like, there's this jump in reasoning from "behavior which humans label as corrigible-looking" to "even vaguely corrigible cognition internally", which is a really big jump; that jump is not something which would easily follow from the selection pressures involved, especially when combined with the philosophical problems.
As for how that gets to "definitely can't": the problem above means that, even if we nominally have time to fiddle and test the system, iteration would not actually be able to fix the relevant problems. And so the situation is strategically equivalent to "we need to get it right on the first shot", at least for the core difficult parts (like e.g. understanding what we're even aiming for).
And as for why that's hard to the point of de-facto impossibility with current knowledge... try the ball-cup exercise, then consider the level of detailed understanding required to get a ball into a cup on the first shot, and then imagine what it would look like to understand corrigible AI at that level.
That seems like a cool idea for the mediation condition, but isn't it trivial for the redundancy conditions?
Indeed, that specific form doesn't work for the redundancy conditions. We've been fiddling with it.
[Feedback from a private channel which I am reposting here at Lorxus' request.] This post is excellent.