I think that’s what he meant: more aluminum in the brain is worse than less. What he was trying to say in that sentence is this: high levels in the blood may not mean high levels in the brain unless the blood level stays high for a long time.
“Bob isn't proposing a way to try to get less confused about some fundamental aspect of intelligence”
This might be what I missed. I thought he might be. (E.g., “let’s suppose we have” sounds to me more like a brainstorming “mood” than a solution proposal.)
This feels like a rather different attitude compared to the “rocket alignment” essay. They’re maybe both compatible but the emphasis seems very different.
I normally am nervous about doing anything vaguely resembling making a commitment, but my curiosity is getting the better of me. Are you still looking for beta readers?
And answer came there none?
Okay, so if the builder solution can't access the human Bayes net directly that kills a "cheap trick" I had. But I think the idea behind the trick might still be salvageable. First, some intuition:
If the diamond was replaced with a fake, and the owner asks, "is my diamond still safe?" and we're limited to a "yes" or "no" answer, then we should say "no". Why? Because that will improve the owner's world model, and lead them to make better predictions, relative to hearing "yes". (Not across the board: they will be surprised to see something shiny in the vault, whereas hearing "yes" would have prepared them better for that. But overall accuracy, weighted by how much they CARE about being right about it, should be higher for "no".)
So: maybe we don't want to avoid the human simulator. Maybe we want to encourage it and try to harness it to our benefit! But how to make this precise? Roughly speaking, we want our reporter to "quiz" the predictor ("what would happen if we did a chemical test on the diamond to make sure it has carbon?") and then give the same quiz to its model of the human. The reporter should output whichever answer causes the human model to get the same answers on the reporter's quiz as the predictor gets.
Okay that's a bit vague but I hope it's clear what I'm getting at. If not, I can try to clarify. (Unless the vagueness is in my thoughts rather than in my "writeup"/paragraph.) Possible problem: how on earth do we train in such a way as to incentivize the reporter to develop a good human model? Just because we're worried it will happen by accident doesn't mean we know how to do it on purpose! (Though if it turns out we can't do it on purpose, maybe that means it's not likely to happen by accident and therefore we don't need to worry about dishonesty after all??)
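For concreteness, here is a toy sketch of the quiz-matching rule in Python. Everything in it is made up for illustration: the names `choose_report`, `weighted_agreement`, `predictor`, `human_model`, and `care`, the three quiz questions, and the weights. It assumes the reporter already has a working human model, which is exactly the part I don't know how to train, so treat it as a statement of the objective rather than a solution.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces, assumed for illustration only.
Predictor = Callable[[str], str]        # quiz question -> predictor's answer
HumanModel = Callable[[str, str], str]  # (report heard, quiz question) -> human's answer


def weighted_agreement(report: str, quiz: List[str], predictor: Predictor,
                       human_model: HumanModel, care: Dict[str, float]) -> float:
    """Sum the 'care' weights of quiz questions on which the human, after
    hearing `report`, would give the same answer as the predictor."""
    return sum(care[q] for q in quiz if human_model(report, q) == predictor(q))


def choose_report(candidates: List[str], quiz: List[str], predictor: Predictor,
                  human_model: HumanModel, care: Dict[str, float]) -> str:
    """Pick the report that makes the human model best match the predictor."""
    return max(candidates,
               key=lambda r: weighted_agreement(r, quiz, predictor, human_model, care))


# Toy scenario: the diamond has been swapped for a shiny fake.
quiz = ["carbon test passes?", "something shiny in the vault?", "resells at full price?"]
care = {"carbon test passes?": 3.0,
        "something shiny in the vault?": 1.0,
        "resells at full price?": 5.0}

# What the predictor expects to actually happen.
truth = {"carbon test passes?": "no",
         "something shiny in the vault?": "yes",
         "resells at full price?": "no"}


def predictor(question: str) -> str:
    return truth[question]


def human_model(report: str, question: str) -> str:
    # Crude stand-in: a human told "yes" expects a real diamond in the vault,
    # and a human told "no" expects an empty vault.
    return "yes" if report == "yes" else "no"


score_yes = weighted_agreement("yes", quiz, predictor, human_model, care)
score_no = weighted_agreement("no", quiz, predictor, human_model, care)
report = choose_report(["yes", "no"], quiz, predictor, human_model, care)
print(score_yes, score_no, report)
```

With these made-up weights, "yes" only gets the shiny-object question right, while "no" gets the two questions the owner cares about more, so the rule picks "no". That matches the intuition above, including the part where "no" leaves the owner surprised by the shiny thing in the vault.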
I want to steal the diamond. I don't care about the chip. I will detach the chip and leave it inside the vault and then I will run away with the diamond.
Or perhaps you say that you attached the chip to the diamond very well, so I can't just detach it without damaging it. That's annoying but I came prepared! I have a diamond cutter! I'll just slice off the part of the diamond that the chip is attached to and then I will steal the rest of the diamond. Good enough for me :)
A man-in-the-middle attack has 3 parties: Bob wants to talk to Alice, and Eve wants to eavesdrop.
Here we have just 2 parties: Harry the human wants to talk to Alexa the AI, but is worried that Alexa is a liar.
Clarification request. In the writeup, you discuss the AI Bayes net and the human Bayes net as if there's some kind of symmetry between them, but it seems to me that there's at least one big difference.
In the case of the AI, the Bayes net is explicit, in the sense that we could print it out on a sheet of paper and try to study it once training is done, and the main reason we don't do that is that it's likely to be too big to make much sense of.
In the case of the human, we have no idea what the Bayes net looks like, because humans don't have that kind of introspection ability. In fact, there's not much difference between saying "the human uses a Bayes net" and "the human uses some arbitrary function F, and we worry the AI will figure out F and then use it to lie to us".
Or am I actually wrong and it's okay for a "builder" solution to assume we have access to the human Bayes net?