Jeremy Gillen — LessWrong

LESSWRONG
LW

Resampling Conserves Redundancy (Approximately)

our current thread is an adjustment to the error measure.

We're not sure that this is necessary. I quite like the current form of the errors. I've spent much of the past week searching for counterexamples to the ∃ deterministic latent theorem and I haven't found anything yet (although it's partially a manual search). My current approach takes a P(X_1,X_2) distribution, finds a minimal stochastic NL, then finds a minimal deterministic NL. The deterministic error has always been within a factor of 2 of the stochastic error. So currently we're expecting the theorem can be rescued.

rather than $D_{K L} (P [X, Λ] | | Q [X, Λ])$

That seems like a cool idea for the mediation condition, but Isn't it trivial for the redundancy conditions?

Resampling Conserves Redundancy (Approximately)

Jeremy Gillen11d70

That'll be the difference between max and sum in the denominator. If you use sum it's 3.39.

Here's one we worked out last night, where the ratio goes to infinity.

Resampling Conserves Redundancy (Approximately)

Jeremy Gillen12d110

but thought it was just numerical error

I was totally convinced it was a numerical error. I spent a full day trying to trace it in my numpy code before I started to reconsider. At that point we'd worked through the proof carefully and felt confident of every step. But we needed to work out what was going on because we wanted empirical support for a tighter bound before we tried to improve the proof.

Resampling Conserves Redundancy (Approximately)

Jeremy Gillen12d90

Oh nice, we tried to wrangle that counterexample into a simple expression but didn't get there. So that rules out a looser bound under these assumptions, that's good to know.

Resampling Conserves Redundancy (Approximately)

Jeremy Gillen13d*921

Alfred Harwood and I were working through this as part of a Dovetail project and unfortunately I think we’ve found a mistake. The Taylor expansion in Step 2 has the 3rd order term . This term should disappear as $δ [X]$ goes to zero, but this is only true if $\sqrt{P [X]}$ stays constant. The $Γ$ transformation in Part 1 reduces (most terms of) $P [X]$ and $Q [X]$ at the same rate, so $\sqrt{P [X]}$ decreases at the same rate as $δ [X]$ . So the 2nd order approximation isn’t valid.

For example, we could consider two binary random variables with probability distributions

$P (x = 0) = z p$ and $P (X = 1) = 1 - z p$ and $Q (X = 0) = z q$ and $Q (X = 1) = 1 - z q$ .

If $δ [X] = \sqrt{P (X)} - \sqrt{Q (X)}$ , then $δ [X] \to 0$ as $z \to 0$ .

But consider the third order term for $X = 0$ which is

$\frac{1}{3} (\frac{\sqrt{Q (0)} - \sqrt{P (0)}}{\sqrt{P (0)}})^{3} = \frac{1}{3} (\frac{\sqrt{z q} - \sqrt{z p}}{\sqrt{z p}})^{3} = \frac{1}{3} (\frac{\sqrt{q} - \sqrt{p}}{\sqrt{p}})^{3}$

This is a constant term which does not vanish as $z \to 0$ .

We found a counterexample to the whole theorem (which is what led to us finding this mistake), which has $\frac{K L (X_{2} \to X_{1} \to Λ^{'})}{max [K L (X_{1} \to X_{2} \to Λ), K L (X_{2} \to X_{1} \to Λ)]} > 10$ , and it can be found in this colab. There are some stronger counterexamples at the bottom as well. We used sympy because we were getting occasional floating point errors with numpy.

Sorry to bring bad news! We’re going to keep working on this over the next 7 weeks, so hopefully we’ll find a way to prove a looser bound. Please let us know if you find one before us!

eggsyntax's Shortform

Jeremy Gillen16d40

Nice, agreed. This is basically why I don't see any hope in trying to align super-LLMs. (this and several similar categories of plausible failures that don't seem avoidable without dramatically more understanding of the algorithms running on the inside).

Synthesizing Standalone World-Models, Part 4: Metaphysical Justifications

Jeremy Gillen18d20

Okay. I think this anthropic theory makes a falsifiable prediction (in principle). The infinite precision real numbers could be algorithmically simple, or they could be unstructured. The theory predicts that they are not algorithmically simple. If it were the case that they were algorithmically simple, we could run a solomonoff inductor on the macrostates and it would recover the full microstates (and this would probably be simpler than the abstraction-based compression).

eggsyntax's Shortform

Jeremy Gillen19d50

While I agree with this as a philosophical stance, it seems clear that, at least in humans, facts and values are in practice entangled.

I agree humans absorb (terminal) values from people around them. But this property isn't something I want in a powerful AI. I think it's clearly possible to design an agent that doesn't have the "absorbs terminal values" property, do you agree?

Even if the terminal value doesn't change, there are facts that could result in behavior that, to a reasonable outside observer, seems to reflect a conflicting value.

Yeah! I see this as a different problem from the value binding problem, but just as important. We can split it into two cases:

The new beliefs (that lead to bad actions) are false.^[1] To avoid this happening we need to do a good job designing the epistemics of the AI. It'll be impossible to avoid misleading false beliefs with certainty, but I expect there to be statistical learning type results that say that it's unlikely and becomes more unlikely with more observation and thinking.
The new beliefs (that lead to bad actions) are true, and unknown to the human AI designers. (e.g. we're in a simulation and the gods of the simulation have set things up such that the best thing for the AI to do looks evil from the human perspective. The AI is acting in our interest here. Maybe out of caution we want to design the AI values such that it wants to shut down in circumstances this extreme, just in case there's been an epistemic problem and it's actually case 1.

^{^}
I'm assuming a correspondence theory of truth, where beliefs can be said to be true or false without reference to actions or values. This is often a crux with people that are into active inference or shard theory.

eggsyntax's Shortform

Jeremy Gillen20d110

I think you're mostly right about the problem but the conclusion doesn't follow.

First a nitpick: If you find out you're being manipulated, your terminal values shouldn't change (unless your mind is broken somehow, or not reflectively stable).

But there's a similar issue: You could discover that all your previous observations were fed to you by a malicious demon, and nothing that you previously cared about actually exists. So your values don't bind to anything in the new world you find yourself in.

In that situation, how do we want an AI to act? There's a few options, but doing nothing seems like a good default. Constructing the value binding algorithm such that this is the resulting behaviour doesn't seem that hard, but it might not be trivial.

Does this engage with what you're saying?

shortplav

Jeremy Gillen1mo20

Yeah makes sense. I don't want to make it harder to write stuff though. The contrast does make the shortform rhetorically better and that is good. With these comments as context, it doesn't seem super necessary to edit it.

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments