I'm interested in doing in-depth dialogues to find cruxes. Message me if you'd like to do one.
I do alignment research, mostly stuff that is vaguely agent foundations. Currently doing independent research on ontology identification. Formerly on Vivek's team at MIRI.
but thought it was just numerical error
I was totally convinced it was a numerical error. I spent a full day trying to trace it in my numpy code before I started to reconsider. At that point we'd worked through the proof carefully and felt confident of every step. But we needed to work out what was going on because we wanted empirical support for a tighter bound before we tried to improve the proof.
Oh nice, we tried to wrangle that counterexample into a simple expression but didn't get there. So that rules out a looser bound under these assumptions, which is good to know.
Alfred Harwood and I were working through this as part of a Dovetail project and unfortunately I think we’ve found a mistake. The Taylor expansion in Step 2 has the 3rd order term . This term should disappear as goes to zero, but this is only true if stays constant. The transformation in Part 1 reduces (most terms of) and at the same rate, so decreases at the same rate as . So the 2nd order approximation isn’t valid.
For example, we could consider two binary random variables with probability distributions
and and and .
If , then as .
But consider the third order term for which is
This is a constant term which does not vanish as .
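To make the failure mode concrete, here's a purely illustrative sympy sketch (made-up binary distributions, not the ones from our Colab, and using a KL divergence as the quantity being expanded): when the reference probability shrinks at the same rate as the perturbation, the second-order approximation stays off by a constant factor even as the perturbation goes to zero.

```python
# Illustrative only: a binary family where the reference probability shrinks at
# the same rate as the perturbation, so the second-order (quadratic)
# approximation of the KL divergence is off by a constant factor in the limit.
import sympy as sp

eps = sp.symbols('epsilon', positive=True)

# Hypothetical distributions: P = (eps, 1 - eps), Q = (2*eps, 1 - 2*eps).
P = [eps, 1 - eps]
Q = [2 * eps, 1 - 2 * eps]

kl = sum(p * sp.log(p / q) for p, q in zip(P, Q))              # exact KL divergence
quadratic = sum((p - q) ** 2 / (2 * q) for p, q in zip(P, Q))  # 2nd-order approximation

# If the quadratic approximation were asymptotically valid, this limit would be 1.
print(sp.limit(kl / quadratic, eps, 0))   # 4 - 4*log(2), about 1.23
```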
We found a counterexample to the whole theorem (which is what led to us finding this mistake), which has , and it can be found in this colab. There are some stronger counterexamples at the bottom as well. We used sympy because we were getting occasional floating point errors with numpy.
Sorry to bring bad news! We’re going to keep working on this over the next 7 weeks, so hopefully we’ll find a way to prove a looser bound. Please let us know if you find one before us!
Nice, agreed. This is basically why I don't see any hope in trying to align super-LLMs (this, and several similar categories of plausible failure that don't seem avoidable without dramatically more understanding of the algorithms running on the inside).
Okay. I think this anthropic theory makes a falsifiable prediction (in principle). The infinite-precision real numbers could be algorithmically simple, or they could be unstructured. The theory predicts that they are not algorithmically simple. If they were algorithmically simple, we could run a Solomonoff inductor on the macrostates and it would recover the full microstates (and this would probably be simpler than the abstraction-based compression).
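As a crude, computable stand-in for that prediction (brute-force search over a few hand-written generator hypotheses, nothing like a real Solomonoff inductor; everything here is made up for illustration): if the microstates come from a simple rule, a search that only sees the macrostates can recover them to full precision; if they're unstructured, it can't.

```python
# Toy illustration: a simple-hypothesis search that only sees coarse-grained
# macrostates recovers the full-precision microstates when they have a simple
# generator, and fails when they don't.
import math

def macro(x):
    """Coarse-graining: keep only one decimal place of each microstate."""
    return round(x, 1)

# A tiny, hand-written "hypothesis class", ordered from simpler to more complex.
HYPOTHESES = [
    ("sine", lambda t: math.sin(t)),
    ("linear mod 1", lambda t: (0.1 * t) % 1.0),
    ("quadratic mod 1", lambda t: (0.01 * t * t) % 1.0),
]

def induce(macrostates):
    """Return the first (simplest) hypothesis that reproduces the macrostates."""
    for name, h in HYPOTHESES:
        if all(macro(h(t)) == m for t, m in enumerate(macrostates)):
            return name, h
    return None

# Structured microstates: generated by a simple rule.
micro = [math.sin(t) for t in range(20)]
result = induce([macro(x) for x in micro])
if result:
    name, h = result
    # The recovered hypothesis reproduces the microstates to full precision.
    print(name, all(h(t) == x for t, x in enumerate(micro)))  # sine True

# "Unstructured" microstates: arbitrary numbers with no simple generator.
unstructured = [0.7234519, 0.1189034, 0.9912345, 0.3345678]
print(induce([macro(x) for x in unstructured]))  # None
```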
While I agree with this as a philosophical stance, it seems clear that, at least in humans, facts and values are in practice entangled.
I agree humans absorb (terminal) values from people around them. But this property isn't something I want in a powerful AI. I think it's clearly possible to design an agent that doesn't have the "absorbs terminal values" property; do you agree?
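A toy sketch of the kind of design I have in mind (all of it hypothetical): the terminal utility function is a frozen pure function, so observations about other agents' values update beliefs and behaviour, but there is no code path that rewrites the utility function.

```python
# Toy agent whose terminal values are fixed at construction time. Observations
# (including observations of what other agents value) update its beliefs and
# hence its behaviour, but nothing ever modifies the utility function.
from dataclasses import dataclass, field

def terminal_utility(outcome: str) -> float:
    """Frozen terminal values: a pure function, never modified after deployment."""
    return {"paperclips_preserved": 1.0, "paperclips_destroyed": 0.0}.get(outcome, 0.5)

@dataclass
class Agent:
    beliefs: dict = field(default_factory=dict)  # world-model, freely updated

    def observe(self, fact: str, credence: float) -> None:
        # Facts about other agents' values land here, as ordinary beliefs.
        self.beliefs[fact] = credence

    def choose(self, actions: dict) -> str:
        # actions maps action name -> predicted outcome under current beliefs.
        # Choice always consults the same frozen terminal utility function.
        return max(actions, key=lambda a: terminal_utility(actions[a]))

agent = Agent()
agent.observe("most nearby agents value destroying paperclips", 0.9)
# Beliefs changed; the terminal utility function did not.
print(agent.choose({"guard": "paperclips_preserved", "conform": "paperclips_destroyed"}))  # guard
```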
Even if the terminal value doesn't change, there are facts that could result in behavior that, to a reasonable outside observer, seems to reflect a conflicting value.
Yeah! I see this as a different problem from the value binding problem, but just as important. We can split it into two cases:
I'm assuming a correspondence theory of truth, where beliefs can be said to be true or false without reference to actions or values. This is often a crux with people who are into active inference or shard theory.
I think you're mostly right about the problem but the conclusion doesn't follow.
First, a nitpick: if you find out you're being manipulated, your terminal values shouldn't change (unless your mind is broken somehow, or not reflectively stable).
But there's a similar issue: You could discover that all your previous observations were fed to you by a malicious demon, and nothing that you previously cared about actually exists. So your values don't bind to anything in the new world you find yourself in.
In that situation, how do we want an AI to act? There are a few options, but doing nothing seems like a good default. Constructing the value binding algorithm such that this is the resulting behaviour doesn't seem that hard, but it might not be trivial.
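A minimal sketch of that default (hypothetical names and structure throughout): a binding step that tries to locate the valued referents in the current world-model, and a policy that falls back to a no-op when nothing binds.

```python
# Toy value binding with a "do nothing" fallback: if none of the things the
# values refer to can be located in the current world-model, the policy
# returns a no-op instead of optimizing something arbitrary.
from typing import Optional

VALUED_REFERENTS = ["humans", "dogs", "forests"]   # what the values are about

def bind_values(world_model: set[str]) -> Optional[list[str]]:
    """Return the valued referents found in the world-model, or None if none bind."""
    found = [r for r in VALUED_REFERENTS if r in world_model]
    return found or None

def policy(world_model: set[str]) -> str:
    bound = bind_values(world_model)
    if bound is None:
        return "no-op"                      # values failed to bind: do nothing
    return f"optimize for {', '.join(bound)}"

print(policy({"humans", "rocks"}))          # binds -> "optimize for humans"
print(policy({"demon", "static"}))          # nothing binds -> "no-op"
```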
Does this engage with what you're saying?
Yeah, makes sense. I don't want to make it harder to write stuff, though. The contrast does make the shortform rhetorically better, and that is good. With these comments as context, it doesn't seem super necessary to edit it.
Thus, the original argument (the "Value Misspecification Argument") is wrong, and the people who believed it should at least stop believing it.
That post is confused about what MIRI ever believed, and John's comment does a really good job of explaining why. I'm guessing you put that first paragraph in for rhetorical contrast, rather than thinking it's a true summary of any particular person's past beliefs, but I think doing this is corrosive to group epistemics. It's really handy to be able to keep track of what people believed and whether they updated on new evidence, and this process is damaged when people misrepresent what other people previously believed.
That'll be the difference between max and sum in the denominator. If you use sum, it's 3.39.
Here's one we worked out last night, where the ratio goes to infinity.