Jeremy Gillen

I'm interested in doing in-depth dialogues to find cruxes. Message me if you are interested in doing this.

I do alignment research, mostly stuff that is vaguely agent foundations. Currently doing independent alignment research on ontology identification. Formerly on Vivek's team at MIRI.

Comments

Resampling Conserves Redundancy (Approximately)
Jeremy Gillen · 19h

That'll be the difference between max and sum in the denominator. If you use sum it's 3.39.

Here's one we worked out last night, where the ratio goes to infinity.

Resampling Conserves Redundancy (Approximately)
Jeremy Gillen · 2d

but thought it was just numerical error

I was totally convinced it was a numerical error. I spent a full day trying to trace it in my numpy code before I started to reconsider. At that point we'd worked through the proof carefully and felt confident of every step. But we needed to work out what was going on because we wanted empirical support for a tighter bound before we tried to improve the proof.

Resampling Conserves Redundancy (Approximately)
Jeremy Gillen · 2d

Oh nice, we tried to wrangle that counterexample into a simple expression but didn't get there. So that rules out a looser bound under these assumptions, that's good to know.

Resampling Conserves Redundancy (Approximately)
Jeremy Gillen · 2d

Alfred Harwood and I were working through this as part of a Dovetail project and unfortunately I think we’ve found a mistake. The Taylor expansion in Step 2 has the 3rd order term $o(\delta^3)=\frac{1}{6}\left[\frac{2}{(\sqrt{P[X]})^3}\right](-\delta[X])^3$. This term should disappear as $\delta[X]$ goes to zero, but this is only true if $\sqrt{P[X]}$ stays constant. The $\Gamma$ transformation in Part 1 reduces (most terms of) $P[X]$ and $Q[X]$ at the same rate, so $\sqrt{P[X]}$ decreases at the same rate as $\delta[X]$. So the 2nd order approximation isn’t valid.

For example, we could consider two binary random variables with probability distributions

$P(X=0)=zp$, $P(X=1)=1-zp$, $Q(X=0)=zq$, and $Q(X=1)=1-zq$,

for fixed constants $p$ and $q$ and a small parameter $z>0$.

If $\delta[X]=\sqrt{P(X)}-\sqrt{Q(X)}$, then $\delta[X]\to 0$ as $z\to 0$.

But consider the third order term for X=0 which is

$\frac{1}{3}\left(\frac{\sqrt{Q(0)}-\sqrt{P(0)}}{\sqrt{P(0)}}\right)^3=\frac{1}{3}\left(\frac{\sqrt{zq}-\sqrt{zp}}{\sqrt{zp}}\right)^3=\frac{1}{3}\left(\frac{\sqrt{q}-\sqrt{p}}{\sqrt{p}}\right)^3$

This is a constant term which does not vanish as z→0.
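
For concreteness, here is a minimal sympy sketch of this check (purely illustrative; $p$ and $q$ are treated as fixed positive constants and $z \to 0$):

```python
import sympy as sp

# Symbolic check of the claim above: delta[X] at X=0 vanishes as z -> 0,
# but the third-order Taylor term does not.
z, p, q = sp.symbols('z p q', positive=True)

P0 = z * p  # P(X=0)
Q0 = z * q  # Q(X=0)

delta0 = sp.sqrt(P0) - sp.sqrt(Q0)
print(sp.limit(delta0, z, 0))  # 0: the expansion parameter really does go to zero

third_order = sp.Rational(1, 3) * ((sp.sqrt(Q0) - sp.sqrt(P0)) / sp.sqrt(P0))**3
print(sp.simplify(third_order))     # z cancels: a constant, (sqrt(q) - sqrt(p))**3 / (3*p**(3/2)) up to rearrangement
print(sp.limit(third_order, z, 0))  # the same nonzero constant whenever p != q
```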

We found a counterexample to the whole theorem (which is what led to us finding this mistake), which has $\frac{KL(X_2\to X_1\to\Lambda')}{\max\left[KL(X_1\to X_2\to\Lambda),\,KL(X_2\to X_1\to\Lambda)\right]}>10$, and it can be found in this colab. There are some stronger counterexamples at the bottom as well. We used sympy because we were getting occasional floating point errors with numpy.

Sorry to bring bad news! We’re going to keep working on this over the next 7 weeks, so hopefully we’ll find a way to prove a looser bound. Please let us know if you find one before us!

eggsyntax's Shortform
Jeremy Gillen · 6d

Nice, agreed. This is basically why I don't see any hope in trying to align super-LLMs (this, and several similar categories of plausible failures that don't seem avoidable without dramatically more understanding of the algorithms running on the inside).

Synthesizing Standalone World-Models, Part 4: Metaphysical Justifications
Jeremy Gillen · 8d

Okay. I think this anthropic theory makes a falsifiable prediction (in principle). The infinite-precision real numbers could be algorithmically simple, or they could be unstructured. The theory predicts that they are not algorithmically simple. If they were algorithmically simple, we could run a Solomonoff inductor on the macrostates and it would recover the full microstates (and this would probably be simpler than the abstraction-based compression).

eggsyntax's Shortform
Jeremy Gillen · 9d

While I agree with this as a philosophical stance, it seems clear that, at least in humans, facts and values are in practice entangled.

I agree humans absorb (terminal) values from people around them. But this property isn't something I want in a powerful AI. I think it's clearly possible to design an agent that doesn't have the "absorbs terminal values" property; do you agree?

Even if the terminal value doesn't change, there are facts that could result in behavior that, to a reasonable outside observer, seems to reflect a conflicting value.

Yeah! I see this as a different problem from the value binding problem, but just as important. We can split it into two cases:

  1. The new beliefs (that lead to bad actions) are false.[1] To avoid this happening we need to do a good job designing the epistemics of the AI. It'll be impossible to avoid misleading false beliefs with certainty, but I expect there to be statistical learning type results that say that it's unlikely and becomes more unlikely with more observation and thinking (an illustrative bound of this flavour is sketched below).
  2. The new beliefs (that lead to bad actions) are true, and unknown to the human AI designers (e.g. we're in a simulation and the gods of the simulation have set things up such that the best thing for the AI to do looks evil from the human perspective). The AI is acting in our interest here. Maybe out of caution we want to design the AI values such that it wants to shut down in circumstances this extreme, just in case there's been an epistemic problem and it's actually case 1.
[1] I'm assuming a correspondence theory of truth, where beliefs can be said to be true or false without reference to actions or values. This is often a crux with people who are into active inference or shard theory.
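
To gesture at the flavour of result meant in case 1 (this particular bound is only an illustration, not the result I'd expect to apply directly): if $\hat{p}_n$ is the empirical frequency of some binary fact after $n$ independent observations and $p$ is its true frequency, a Hoeffding-type bound gives

$$\Pr\big(|\hat{p}_n - p| \ge \varepsilon\big) \le 2\exp(-2n\varepsilon^2),$$

i.e. the probability of being badly misled about that fact falls exponentially with more observations. The hope is that something of this shape holds for the beliefs that actually drive the AI's actions.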

eggsyntax's Shortform
Jeremy Gillen · 10d

I think you're mostly right about the problem but the conclusion doesn't follow.

First a nitpick: If you find out you're being manipulated, your terminal values shouldn't change (unless your mind is broken somehow, or not reflectively stable).

But there's a similar issue: You could discover that all your previous observations were fed to you by a malicious demon, and nothing that you previously cared about actually exists. So your values don't bind to anything in the new world you find yourself in.

In that situation, how do we want an AI to act? There's a few options, but doing nothing seems like a good default. Constructing the value binding algorithm such that this is the resulting behaviour doesn't seem that hard, but it might not be trivial.
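
To make the "doing nothing seems like a good default" behaviour concrete, here is a toy sketch (everything in it, including the names WorldModel, bind_values and choose_action, is a made-up illustration of where the fallback would sit, not an actual proposal for a value binding algorithm):

```python
from typing import Optional

class WorldModel:
    """Stand-in for the agent's current best model of its situation."""
    def __init__(self, objects: dict):
        self.objects = objects  # name -> internal representation

def bind_values(world: WorldModel, value_referents: list[str]) -> Optional[dict]:
    """Try to locate every object the values refer to; None means binding failed."""
    bound = {name: world.objects.get(name) for name in value_referents}
    if any(v is None for v in bound.values()):
        return None
    return bound

def choose_action(world: WorldModel, value_referents: list[str]) -> str:
    binding = bind_values(world, value_referents)
    if binding is None:
        # The values don't bind to anything in this world: default to inaction.
        return "no-op"
    return "optimize"  # placeholder for actually planning toward the bound referents

# After the malicious-demon discovery, none of the old referents exist:
post_demon_world = WorldModel(objects={"unfamiliar_thing": object()})
print(choose_action(post_demon_world, ["humans", "earth"]))  # prints "no-op"
```

All of the real difficulty is hidden inside bind_values and the world model; the sketch only shows where the default-to-inaction fallback would sit.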

Does this engage with what you're saying?

shortplav
Jeremy Gillen · 16d

Yeah makes sense. I don't want to make it harder to write stuff though. The contrast does make the shortform rhetorically better and that is good. With these comments as context, it doesn't seem super necessary to edit it. 

shortplav
Jeremy Gillen · 16d

Thus, the original argument (the "Value Misspecification Argument") is wrong and the people who believed it should at least stop believing it.

That post is confused about what MIRI ever believed, and john's comment does a really good job of explaining why. I'm guessing you put that first paragraph in for rhetorical contrast, rather than thinking it's a true summary of any particular person's past beliefs, but I think doing this is corrosive to group epistemics. It's really handy to be able to keep track of what people believed and whether they updated on new evidence, and this process is damaged when people misrepresent what other people previously believed.

Posts

Detect Goodhart and shut down · 70 karma · 9mo · 21 comments
Context-dependent consequentialism · 31 karma · Ω · 1y · 6 comments
Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 161 karma · Ω · 2y · 60 comments
Thomas Kwa's MIRI research experience · 174 karma · 2y · 53 comments
AISC team report: Soft-optimization, Bayes and Goodhart · 38 karma · 2y · 2 comments
Soft optimization makes the value target bigger · 119 karma · Ω · 3y · 20 comments
Jeremy Gillen's Shortform · 6 karma · 3y · 57 comments
Neural Tangent Kernel Distillation · 76 karma · 3y · 20 comments
Inner Alignment via Superpowers · 37 karma · 3y · 13 comments
Finding Goals in the World Model · 59 karma · Ω · 3y · 8 comments

Wikitag Contributions

Eurisko · 6 months ago
Eurisko · 6 months ago · (+7/-6)