There is one catch: in principle, there could be multiple codes/descriptions which decode to the same message. The obvious thing to do is then to add up the implied probabilities of each description which produces the same message. That indeed works great. However, it turns out that just taking the minimum description length - i.e. the length of the shortest code/description which produces the message - is a good-enough approximation of that sum in one particularly powerful class of codes: universal Turing machines.
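To make the two quantities concrete, here is a minimal sketch comparing the summed implied probability of a message with the probability implied by its shortest description alone. The code table `TOY_CODE` and all function names are hypothetical, invented for illustration; the table is a tiny prefix-free code, not a universal Turing machine, so it only shows what is being compared and says nothing about how tight the approximation is in the universal case.

```python
from collections import defaultdict
from math import log2

# Hypothetical toy code table: binary description -> decoded message.
# The descriptions are prefix-free, so the implied probabilities
# 2**-len(description) sum to at most 1 (here, exactly 1).
TOY_CODE = {
    "0": "A",
    "10": "B",
    "110": "A",   # a second, longer description of "A"
    "1110": "B",  # a second, longer description of "B"
    "1111": "C",
}


def implied_probability(description: str) -> float:
    """Probability implied by a binary description of the given length."""
    return 2.0 ** -len(description)


def summed_probabilities(code: dict) -> dict:
    """Add up the implied probabilities of all descriptions of each message."""
    totals = defaultdict(float)
    for description, message in code.items():
        totals[message] += implied_probability(description)
    return dict(totals)


def shortest_description_probabilities(code: dict) -> dict:
    """Keep only the probability implied by each message's shortest description."""
    best = {}
    for description, message in code.items():
        p = implied_probability(description)
        best[message] = max(best.get(message, 0.0), p)
    return best


if __name__ == "__main__":
    summed = summed_probabilities(TOY_CODE)
    shortest = shortest_description_probabilities(TOY_CODE)
    for message in sorted(summed):
        print(
            message,
            f"sum: {-log2(summed[message]):.3f} bits,",
            f"shortest only: {-log2(shortest[message]):.3f} bits",
        )
```

In this toy table, "A" gets probability 0.625 from the sum but 0.5 from its shortest description alone, so using only the minimum description length overstates its cost by a fraction of a bit.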
Is this about "K-complexity is silly; use cross-entropy instead"?
Do we really have such good interpretations for such examples? It seems to me that we have big problems in the real world because we don't.
We do have very high-level interpretations, but not enough to have solid guarantees. After all, we have a very high-level trivial interpretation of our ML models: they learn! The challenge is not just to have clues, but clues that are relevant enough to address safety concerns in relation to impact scale (which is the unprecedented feature of the AI field).
Pretty cool!
Just to add, although I think you already know: we don't need a reflexive understanding of your DT to put it into practice, because we run on messy brains rather than provable algorithms, etc.
And I always feel it's kinda unfair to dismiss as orthogonal motivations "valuing friendliness or a sense of honor" because they might be evolutionarily selected heuristics to (sort of) implement such acausal DT concerns!
Great! Isn't it generalizable to any argmin/argmax issue? I'm especially thinking about the argmax decision theory framework, which is a well-known difficulty for safety concerns.
Similarly, in EA/action-oriented discussions, there is a recurring pattern like:
Eager-to-act padawan: If world model/moral theory X is most likely to be true (due to evidence y, z...), we need to act accordingly with the controversial Z! Seems like the best EU action!
Experienced jedi: Wait a minute. You have to be careful with this way of thinking, because there are unknown unknowns, the unilateralist's curse, and so on. A better decision-making procedure is to listen to several models and several moral theories, and to look for strategies acceptable to most of them.
I'm confused, could you clarify? I interpret your "Wawaluigi" as two successive layers of deception within a simulacrum, which is unlikely if the WE is reliable, right?
I didn't say anything about Wawaluigis, and I agree that they are not Luigis, because, as I said, a layer of Waluigi is not a one-to-one operator. My guess is about a normal Waluigi layer, but with a desirable Waluigi rather than a harmful Waluigi.
"Good" simply means "our targeted property" here. So my point is, if the WE holds for any property P, we could get a P-Waluigi through some anti-P (pseudo-)naive targeting.
I don't get your second point: we're talking about simulacra, not agents, and obviously this idea would only be part of a larger solution at best. For any property P, I expect there are several anti-Ps, so you don't have to instantiate an actually bad Luigi. My idea is more about trapping deception within a single layer.
Honest why-not-just question: if the WE is roughly "you'll get exactly one layer of deception" (aka a Waluigi), why not just anticipate it and steer through that effect? That is, choose an anti-good Luigi to get a good Waluigi?
Natural language is lossy because the communication channel is narrow, hence the need for lower-dimensional representations (see ML embeddings) of what we're trying to convey. Lossy representations are also what abstractions are about.
But in practice, do you expect that Natural Abstractions (if discovered) cannot be expressed in natural language?