Below is a summary of my (honorable mention) proposal, which tries to induce mind-blindness on the Reporter. I've also posted the full contest entry as a blog post here: https://calvinmccarter.wordpress.com/2022/02/19/mind-blindness-strategy-for-eliciting-latent-knowledge/ . I've not fully analyzed whether it survives previously mentioned counterexamples, or whether it has additional unique vulnerabilities compared to other strategies, so I'd welcome feedback on these questions.
The overall strategy is to avoid training a “human simulator” reporter by regularizing its internal state to have mind-blindness. One could imagine training a “Human Simulator” that takes as input the “what’s going on” state, plus a question about what a human believes about the world, and is trained to maximize its accuracy at... (read more)
Below is a summary of my (honorable mention) proposal, which tries to induce mind-blindness on the Reporter. I've also posted the full contest entry as a blog post here: https://calvinmccarter.wordpress.com/2022/02/19/mind-blindness-strategy-for-eliciting-latent-knowledge/ . I've not fully analyzed whether it survives previously mentioned counterexamples, or whether it has additional unique vulnerabilities compared to other strategies, so I'd welcome feedback on these questions.
The overall strategy is to avoid training a “human simulator” reporter by regularizing its internal state to have mind-blindness. One could imagine training a “Human Simulator” that takes as input the “what’s going on” state, plus a question about what a human believes about the world, and is trained to maximize its accuracy at... (read more)