Oregon State University PhD student working on AI alignment.

TurnTrout's Comments

Concerns Surrounding CEV: A case for human friendliness first

I imagine by the time it can extrapolate humanity's volition it will be intelligent enough to consider what it would rather do than that.

Why would it rather choose plans which rate lower in its own preference ordering? What is causing the "rather"?

Concerns Surrounding CEV: A case for human friendliness first

wouldn't the AI be intelligent enough to be offended by our self-centredness and change that utility function?

(Supposing this is an accurate summary of your position), this is anthropomorphizing. Morality is a two-place function; things aren't inherently offensive. A certain mind may find a thing to be offensive, and another may not.
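To make the two-place point concrete, here is a minimal sketch (my own illustration, not from the original comments): offensiveness takes both a mind and a thing as arguments, so nothing is offensive simpliciter.

```python
# "Offensive" as a two-place function: offensiveness(mind, thing),
# not a one-place property offensive(thing). The mind dictionaries
# below are hypothetical examples.

def offensiveness(mind: dict, thing: str) -> float:
    """How offensive `mind` finds `thing` (0.0 = not at all)."""
    return mind.get(thing, 0.0)

human = {"self-centred utility function": 0.9}
paperclip_maximizer = {}  # no aversion to our self-centredness at all

print(offensiveness(human, "self-centred utility function"))              # 0.9
print(offensiveness(paperclip_maximizer, "self-centred utility function"))  # 0.0
```

The asymmetry between the two minds is the whole point: the AI only finds our self-centredness offensive if its utility function already contains something that evaluates it that way.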

but I can't help but imagine that if we precommitted to giving it a mutually beneficial utility function, it would be more prone to treating us in a friendly way.

I think you might dissolve some confusion by considering: what exactly does "beneficial" mean for the AI, here? Beneficial according to what standard?

Inner alignment requires making assumptions about human values

It seems like if we want to come up with a way to avoid these types of behavior, we simply must use some dependence on human values. I can't see how to consistently separate acceptable failures from non-acceptable ones except by inferring our values.

I think people should generally be a little more careful about saying "this requires value-laden information". First, while a certain definition may seem to require it, there may be other ways of getting the desired behavior, perhaps through reframing. Building an AI which only does small things should not require the full specification of value, even though it seems like you have to say "don't do all these bad things we don't like"!

Second, it's always good to check "would this style of reasoning lead me to conclude solving the easy problem of wireheading is value-laden?".

This isn't an object-level critique of your reasoning in this post, but more that the standard of evidence is higher for this kind of claim.

Inner alignment requires making assumptions about human values

I (low-confidence) think that there might be a "choose two" wrt impact measures: large effect, no ontology, no/very limited value assumptions. I see how we might get small good effects without needing a nice pre-specified ontology or information about human values (AUP; to be discussed in upcoming Reframing Impact posts). I also see how you might have a catastrophe-avoiding agent capable of large positive impacts, assuming an ontology but without assuming a lot about human preferences.

I know this isn't saying why I think this yet, but I'd just like to register this now for later discussion.

Vanessa Kosoy's Shortform

This notion of dangerousness seems strongly related to corrigibility. To demonstrate, imagine an attempt by the user to shut down the AI. Suppose that the AI has 3 strategies with which to respond: (i) comply with the shutdown (ii) resist defensively, i.e. prevent shutdown but without irreversibly damaging anything (iii) resist offensively, e.g. by doing something irreversible to the user that will cause em to stop trying to shut down the AI. The baseline policy is complying. Then, assuming that the user's stated beliefs endorse the shutdown, an AI with low dangerousness should at most resist defensively for a short period and then comply. That's because resisting offensively would generate high dangerousness by permanent loss of value, whereas resisting defensively for a long time would generate high dangerousness by losing reward over that period...

This notion of dangerousness opens the way towards designing AI systems that are provably safe while at the same time employing heuristic algorithms without theoretical understanding. Indeed, as long as the AI has sufficiently low dangerousness, it will almost certainly not cause catastrophic damage.

This seems quite close (or even identical) to attainable utility preservation; if I understand correctly, this echoes arguments I've made for why AUP has a good shot of avoiding catastrophes and thereby getting you something which feels similar to corrigibility.
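The correspondence can be sketched with a minimal AUP-style shaped reward (my own toy construction, not Kosoy's formalism or the exact AUP definition): the agent is penalized for changing its attainable utility on auxiliary goals, so both destroying option value (resisting offensively) and forfeiting reward for a long time are discouraged relative to complying.

```python
# Toy AUP-style penalty: penalize the average absolute change in
# attainable utility across a set of auxiliary goals. All numbers
# below are illustrative, not derived from either post.

def aup_penalty(q_aux_before, q_aux_after):
    """Average absolute change in attainable auxiliary utilities."""
    n = len(q_aux_before)
    return sum(abs(a - b) for b, a in zip(q_aux_before, q_aux_after)) / n

def shaped_reward(reward, q_aux_before, q_aux_after, lam=1.0):
    return reward - lam * aup_penalty(q_aux_before, q_aux_after)

# Complying leaves attainable utilities intact; resisting offensively
# permanently destroys option value and eats a large penalty.
comply = shaped_reward(1.0, [0.5, 0.8], [0.5, 0.8])  # penalty 0.0 -> 1.0
resist = shaped_reward(1.0, [0.5, 0.8], [0.0, 0.1])  # penalty 0.6 -> 0.4
print(comply > resist)  # True
```

Under this shaping, the "low dangerousness" ordering over the three shutdown responses falls out of the penalty term rather than needing to be specified directly.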

Red Flags for Rationalization

If anyone has experience trying to develop [the skill of noticing what it feels like to rationalize], please leave a comment.

I've developed this skill some. To me, it feels like part of my brain is "slipping sideways", tugging me harder than appropriate towards a particular line of thinking or conclusion. I think I'm reasonably good at noticing rationalization, but part of my brain still tries to rationalize even after I notice it. I want to get better at responding appropriately.

On Being Robust

Yeah, I think that's quite close to this concept - thanks for the link.

Why a New Rationalization Sequence?

A common delusion among grandiose schizophrenics in institutions is that they are themselves psychiatrists. Consider a particularly underfunded mental hospital, in which the majority of people who "know" themselves to be psychiatrists are wrong. No examination of the evidence will convince them otherwise. No matter how overwhelming, some reason to disbelieve will be found.

Given this, should any amount of evidence suffice to convince you that you are such a psychiatrist?

I am not aware of any resolution to this paradox.

I don't think there's a resolution to this kind of thing. "What algorithm produces good epistemics, even though your memories and beliefs may be arbitrarily corrupted?". In the general case, you can't do better than approximating normatively correct reasoning as best you can, given the information you have.

Of course, as actual people, we quickly run into the following issue:

The simplest ideal of thinking deals extensively with uncertainty of external facts, but trusts its own reasoning implicitly. Directly imitating this, when your own reasoning is not 100% trustworthy, is a bad plan. Hopefully this sequence will provide some alternatives.

TurnTrout's shortform feed

While reading Focusing today, I thought about the book and wondered how many exercises it would have. I felt a twinge of aversion. In keeping with my goal of increasing internal transparency, I said to myself: "I explicitly and consciously notice that I felt averse to some aspect of this book".

I then Focused on the aversion. Turns out, I felt a little bit disgusted, because a part of me reasoned thusly:

If the book does have exercises, it'll take more time. That means I'm spending reading time on things that aren't math textbooks. That means I'm slowing down.

(Transcription of a deeper Focusing on this reasoning)

I'm afraid of being slow. Part of it is surely the psychological remnants of the RSI I developed in the summer of 2018. That is, slowing down is now emotionally associated with disability and frustration. There was a period of meteoric progress as I started reading textbooks and doing great research, and then there was pain. That pain struck even when I was just trying to take care of myself, sleep, open doors. That pain then left me on the floor of my apartment, staring at the ceiling, desperately willing my hands to just get better. They didn't (for a long while), so I just lay there and cried. That was slow, and it hurt. No reviews, no posts, no typing, no coding. No writing, slow reading. That was slow, and it hurt.

Part of it used to be a sense of "I need to catch up and learn these other subjects which [Eliezer / Paul / Luke / Nate] already know". Through internal double crux, I've nearly eradicated this line of thinking, which is neither helpful nor relevant nor conducive to excitedly learning the beautiful settled science of humanity. Although my most recent post touched on impostor syndrome, that isn't really a thing for me. I feel reasonably secure in who I am, now (although part of me worries that others wrongly view me as an impostor?).

However, I mostly just want to feel fast, efficient, and swift again. I sometimes feel like I'm in a race with Alex, and I feel like I'm losing.

On Being Robust

Awesome. I should also note this generator is post hoc; I (tried to) do this for a few years before I was even thinking about the planning fallacy.
