A voice for the unknown human who hides within the binary.




Reflections on Larks’ 2020 AI alignment literature review

I agree with the thesis but suspect a slightly different mechanism. I don't think people have trouble ignoring noise at an epistemic level - I think people have other reasons for paying lip service to genre-aligned content, independent of its epistemic content, and so noise creates a conflict of interest.

This suggests another possible approach: sharpening the incentives to preferentially reward epistemic content, rather than flat-out reducing variance, which has other negative side effects (eg fragility).

Motive Ambiguity

In practice this often plays out as "predictability is useful", but I want to slightly shift the framing away from "predictability is instrumental for specific subgames" and towards "'good' patterns look like coordination of smaller patterns, so we should expect valuable things to rest on predictable substrates".

Eg, you could consider "being alive" as nothing more than a collection of predictabilities, yet it seems to have immense intrinsic value by most standards.

Motive Ambiguity

Here is my take:
Value is a function of the entire state space, and can't be neatly decomposed as a sum of subgames.
Rather (dually), value on ("quotient") subgames must be confluent with the total value on the joint game.
Eg, there's an "enjoying the restaurant food" game and a "making your spouse happy" game, but the joint "enjoying a restaurant with your spouse" game has more moves available, and more value terms that don't show up in either subgame, like "be a committed couple".
"Confluence" here means that what you need to forget to zoom in on the "enjoying the restaurant food" subgame causes your value judgement of "enjoying the restaurant food" and "enjoying a restaurant with your spouse, ignoring everything except food" to agree.
The individual subgames aren't "closed" - they never were - because the primitives used to define their value refer to the larger context; their value only makes sense there. From the perspective of the larger game, no value is "destroyed"; it only appears that way when projecting into the subgames, which were only ever virtual.
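A minimal toy sketch of this picture (all value functions and numbers here are hypothetical, chosen only for illustration): the joint game's value carries a cross-term that neither subgame can see, so summing subgame values disagrees with the joint value - but projecting the joint game down to one subgame's state space (forgetting the other dimension) is confluent with that subgame's own judgement.

```python
# Toy illustration (hypothetical values): value on the joint state space
# has a cross-term that no individual subgame can express.

def food_value(food_quality):
    # "enjoying the restaurant food" subgame, judged on its own primitives
    return food_quality

def spouse_value(spouse_happiness):
    # "making your spouse happy" subgame
    return spouse_happiness

def joint_value(food_quality, spouse_happiness):
    # The "be a committed couple" term only exists on the joint state space.
    return food_quality + spouse_happiness + 2 * food_quality * spouse_happiness

state = {"food_quality": 1.0, "spouse_happiness": 1.0}

# A sum-of-subgames decomposition undercounts the joint value:
subgame_sum = food_value(state["food_quality"]) + spouse_value(state["spouse_happiness"])
print(subgame_sum)           # 2.0
print(joint_value(**state))  # 4.0 - the cross-term is invisible to either subgame

# "Confluence": forgetting the spouse dimension entirely (projecting it to a
# neutral element) makes the joint value agree with the food subgame's own value.
print(food_value(1.0) == joint_value(1.0, 0.0))  # True
```

The point of the last line: no value is destroyed in the joint game; the apparent loss only shows up when you insist on reading value off the virtual subgames.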

Do Sufficiently Advanced Agents Use Logic?

In the opening section, I think it's informative to distinguish two representations of what it means to "use logic" in a strategy:

1. The case I expect people usually mean: strategies are explicitly parameterized by things that look like "models" and process data that behaves like "propositions", possibly generating some concrete predictions about the base game.

2. Strategies implicitly implement models, by virtue of their algorithmic trace uniquely factoring through a model that is logically subsequent to it.

The latter is possible even for "model-free" RL agents.

But I argue (2) is weaker than (1), the essential capability of (1) being quantification over models: because strategies are explicitly parameterized by models, those models can be swapped out for internalized propositions in a way the baked-in models of (2) can't.
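The distinction can be sketched as follows (a toy sketch with hypothetical interfaces - `explicit_strategy`, `baked_in_strategy`, and the model signatures are all invented for illustration): in (1) the model is a separate parameter, so the same strategy can be re-run under a different model, whereas in (2) the implicit model is fused into the policy and has no slot to re-plug.

```python
# (1) Strategy explicitly parameterized by a model: the model is an argument,
# so we can quantify over models / substitute an internalized proposition.
def explicit_strategy(model, observation):
    prediction = model(observation)  # the model maps observations to predictions
    return "act" if prediction > 0.5 else "wait"

# (2) "Model-free" policy: whatever model it implements is baked into its
# parameters; its trace may factor through a model, but nothing can be re-plugged.
def baked_in_strategy(observation):
    # a fixed threshold, as if learned directly by model-free RL
    return "act" if observation > 0.3 else "wait"

# Two hypothetical models of the same situation:
optimistic = lambda obs: 0.9
pessimistic = lambda obs: 0.1

# (1) supports swapping models without touching the strategy itself:
print(explicit_strategy(optimistic, 0.0))   # act
print(explicit_strategy(pessimistic, 0.0))  # wait

# (2) has no such slot - changing its implicit model means rewriting the strategy:
print(baked_in_strategy(0.0))               # wait
```

This is the sense in which (1) is strictly stronger: the model slot is exactly what "quantification over models" ranges over.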

So in light of this distinction, I think that under reading (2) the answer is strongly "yes", but under (1) it's less clear, and often limited by overfitting (not being iterated enough).

I suspect a sort of 'cut elimination' principle, whereby no concrete task actually requires the full reflective capabilities of (1), but (1) has better (space) complexity, stability, and adaptability properties, and so is favored in existential scenarios where the reward criteria contain a lot of logical uncertainty.