I was asking more "how does the AI get a good model of itself", but your answer was still interesting, thanks. I'm still not sure whether you think there are straightforward ways for future AI to get such a model, all of which come out more or less at the starting point of your proposal. (Or not.)
Here's another take for you: this is like Eliciting Latent Knowledge (with extra hope placed on cogsci methods), except where I take ELK to be asking "how do you communicate with humans to make them good at RL feedback," you're asking "how do you communicate with humans to make them good at participating in verbal chain of thought?"
And don't you think 500 lines of Python also "fails due to" having unintended optima?
I've put "fails due to" in scare quotes because what's failing is not every possible approach, merely almost all samples from the approaches we currently know how to take. If we knew how to select Python code much more cleverly, it suddenly wouldn't fail anymore. And the same goes for knowing how to better construct reward functions from big AI systems plus small amounts of human text or human feedback.
Do you have ideas about how to do this?
I can't think of much besides trying to get the AI to richly model itself, and build correspondences between that self-model and its text-production capability.
But this is, like, probably not a thing we should just do first and think about later. I'd like it to be part of a pre-meditated plan to handle outer alignment.
Edit: after thinking about it, that's too cautious. We should think first, but some experimentation is necessary. The thinking first should plausibly be more like having some idea about how to bias further work towards safety rather than building self-improving AI as fast as possible.
Maybe you could do something with LLM sentiment analysis of participants' conversations (e.g. when roleplaying a discussion of what the best thing to do for the company would be, genuinely trying to do a good job, both before and after).
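Here's a rough sketch of what I mean, just to make it concrete. The rater prompt and the `query_llm` helper are placeholders for whatever LLM access you have, not a worked-out protocol:

```python
from statistics import mean

def rate_turn(turn_text, query_llm):
    """Score one conversational turn; query_llm is a hypothetical LLM completion call."""
    prompt = (
        "Rate this contribution to a group decision-making discussion from -1 "
        "(point-scoring, bad faith) to 1 (genuinely trying to find the best answer). "
        "Reply with just a number.\n\n" + turn_text
    )
    return float(query_llm(prompt))

def before_after_shift(before_turns, after_turns, query_llm):
    """Positive value = the conversation looks more constructive after the intervention."""
    return (mean(rate_turn(t, query_llm) for t in after_turns)
            - mean(rate_turn(t, query_llm) for t in before_turns))
```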
Though for such a scenario, an important thing I imagine is that learning about fallacies bears only a limited relation to this, and helps only if people learn to notice fallacies in themselves, not just in someone they already disagree with.
What happens if humans have a systematic bias? E.g. we always rate claims with negative sentiment as improbable and claims with positive sentiment as probable. It seems like Alice dominates, because Alice gets to write and pick the subclaims. Does Bob have a defense, maybe predicting the human probability and just reporting that? Because the human probability assignment isn't required to be consistent, I think Bob is sunk: Alice can force the human assignment to be inconsistent and then gotcha Bob either for disagreeing with the human or for being inconsistent.
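To spell out the gotcha with a toy example (my numbers, and a deliberately cartoonish version of the bias):

```python
def biased_human(framing):
    """Stand-in for the judge: positively framed claims get rated probable, negative ones improbable."""
    return 0.8 if framing == "positive" else 0.2

# Alice phrases both a claim and its negation with positive sentiment:
p_A     = biased_human("positive")  # "The team will hit the deadline."
p_not_A = biased_human("positive")  # "Missing the deadline gives the team time to polish the work."

assert p_A + p_not_A != 1.0  # 1.6 - the human's assignment is incoherent

# Bob's dilemma: report the human's numbers (0.8 and 0.8) and be inconsistent,
# or report a coherent pair (e.g. 0.8 and 0.2) and visibly disagree with the human on one subclaim.
```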
William Lane Craig is great to watch from a meta-perspective. How do you go into someone else's field of expertise and try to beat them in a debate? He clearly thinks about it very carefully, in a way kinda like planning for political debates but with a much higher-quality intended output.
I had a pretty different interpretation: the dirty secrets were plenty conscious (he consciously knew they might be stealing a boat); what was unconscious was his mastery of a sort of people-modeling skill, including self-modeling, which let him take self-aware actions in response to that dirty secret.
For math specifically, this seems useful. Maybe also for some notion of "general knowledge."
I had a music class in elementary school. How would you test whether the students have learned to make music? I had a Spanish class - how do you test kids' conversational skills?
Prior to good multimodal AI, the answer was (or maybe still is, I'm not sure) to send a skilled proctor to interact with students one-on-one. But I think this is too unpalatable for reliability, cost, and objectivity reasons.
(Other similar skills: writing fiction, writing fact, teamwork, conflict resolution, debate, media literacy, cooking, knowledge of your local town)
I'm always really curious what the reward model thinks of this. E.g. do the trajectories that avoid shutdown get higher reward on average than the trajectories that permit it? The avoid-shutdown behavior could be something naturally likely in the base model, or it could be an unintended generalization by the reward model, or it could be an unintended generalization by the agent model.
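If someone ran this, the check I'd want is roughly the following (placeholder names; `reward_model` and `avoided_shutdown` stand in for whatever scoring and labeling is actually available):

```python
from statistics import mean

def shutdown_reward_gap(rollouts, reward_model, avoided_shutdown):
    """reward_model: trajectory -> scalar score; avoided_shutdown: trajectory -> bool. Both hypothetical."""
    avoid  = [t for t in rollouts if avoided_shutdown(t)]
    permit = [t for t in rollouts if not avoided_shutdown(t)]
    # A positive gap would suggest the reward model itself prefers shutdown-avoiding
    # trajectories, pointing at reward-model generalization rather than the agent or base model.
    return mean(reward_model(t) for t in avoid) - mean(reward_model(t) for t in permit)
```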