I recently spent ~2 weeks on this. I iterated on the approach over time, and didn't really try this "design training" exercise at the beginning.
My starting approach was to "aim for 95% confidence" (now listed as a requirement in the OP), based on receiving that advice from a friend and finding the general idea compelling. Initially I also gave myself at least a full day to answer each question; I pretty quickly decided that wasn't actually the right approach, though I eventually came back to it.
I ended up with a separation between "training" and "testing." During training, I'm optimizing for learning quickly. This can include erring on the side of looking up answers, working with partners, etc.
During testing, I focused on evaluating whether I-specifically-learned-things, so I didn't talk to friends much about my thought process, to avoid spoilers. And I gave myself a very long time (sometimes spending more than a full day on each question).
I was experimenting with workshops throughout this time, and a lot of my effort ended up going towards managing other people and making sure they were having a good time. One of the things I'd go back-in-time and tell myself is "don't try to mix large workshops and doing-it-myself. Invite friends to partner with, but focus on a few people you know well."
One major update was that I shouldn't just be trying to get the right answer; I should be trying to identify the explanation the author was primarily aiming at. (Sometimes the author's explanations are confusing or incomplete, but "generate lots of relevant explanations, at least one of which is the one the author had in mind" still seems useful for making sure I actually modeled the situation well.)
I figured out partway through the process that I should be optimizing for "learning as much as I could from each question", which suggested a follow-up strategy: "choose problems I expect to learn a lot from". (The most obvious implication being "not too easy or too hard", and a trickier implication being "requires skills that I'd still benefit from focusing on improving".)
One of the biggest problems was setting aside time to do it at all. This is a lot of cognitive work. I ultimately found I could only do this for a few hours a day, and even then I was pretty exhausted in the evening. I think it's relatively achievable to set aside one weekend for this, but the time needed to verify "you have meaningfully improved" is pretty expensive.
I was lucky to be able to take a 2-week break where I was professionally focused on this. If rationality weren't part of my Day Job and I couldn't take a vacation for it, I think my approach would be to allocate one weekend day each week to this for a few weekends (aiming to look up the answer after an hour per question). And then, for testing... well, that feels fairly tricky. An obvious answer is just to keep allocating weekend time to it, but that feels like it'd take a long time. Hrmm.
It'd be easier if "people's ability to solve Thinking Physics problems" were better studied, and it were known that, say, a given exercise generally takes an average undergrad 2 hours to deconfuse themselves on. (Then you could set yourself a 2-hour timer and submit your best answer when it goes off, rather than potentially spending days double-checking yourself.)
I think, for the immediate future, "take as long as you want to thoroughly understand the scenario" is a better test of thinking-skill for people doing open-ended research, and the fact is that it mostly makes sense to do this if you're already planning to invest years in open-ended research with poor feedback loops.