Looking at states still throws away information. Trump lost by a margin of slightly over 0.6% in the states that he'd have needed to win. The polls were off by a margin of slightly under 6%. If those numbers are correct, I don't see how your conclusion about the relative predictive power of 538 and betting markets can be very different from what it would have been if Trump had narrowly won. Obviously, if something almost happens, that's normally going to favor a model that assigned it 35% over a model that assigned it 10%. Both Nate Silver and Metaculus users seem to me to be in denial about this.
Does it make sense to calculate the score like this for events that aren't independent? You no longer have the cool property that it doesn't matter how you chop up your observations.
I think the correct thing to do would be to score the single probability that each model gave to this exact outcome. Equivalently, you could add up the scores for each state, but for each state use the probability conditional on the results in the states you've already scored. For 538 these probabilities are available via their interactive forecast.
Otherwise you're counting the correlated part of the...
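To make the "equivalently" step concrete, here is a rough sketch with a toy two-state forecast. The joint probabilities are made up for illustration and are not taken from 538 or any market, and the function names are my own. It assumes the score in question is the log score, and compares three things: scoring the exact outcome once, summing per-state scores from marginal probabilities, and summing per-state scores from probabilities conditional on the states already scored.

```python
import math
from itertools import permutations

# Hypothetical joint forecast over two positively correlated states (A, B).
# Keys are (A_result, B_result); 1 means the candidate wins that state.
# These numbers are invented for illustration only.
joint = {
    (1, 1): 0.30,
    (1, 0): 0.05,
    (0, 1): 0.05,
    (0, 0): 0.60,
}

def prob(joint, fixed):
    """Probability that every state index in `fixed` takes the given result."""
    return sum(
        p for outcome, p in joint.items()
        if all(outcome[i] == r for i, r in fixed.items())
    )

def joint_log_score(joint, observed):
    """Log score of the single probability given to the exact outcome."""
    return math.log(joint[observed])

def marginal_sum_log_score(joint, observed):
    """Sum of per-state log scores using marginals (treats states as independent)."""
    return sum(math.log(prob(joint, {i: r})) for i, r in enumerate(observed))

def chained_log_score(joint, observed, order):
    """Sum of per-state log scores, each conditional on the states scored before it."""
    total = 0.0
    seen = {}
    for i in order:
        before = prob(joint, seen)   # probability of what has been scored so far
        seen[i] = observed[i]
        total += math.log(prob(joint, seen) / before)
    return total

observed = (0, 0)  # the candidate loses both states
print("joint:   ", joint_log_score(joint, observed))
print("marginal:", marginal_sum_log_score(joint, observed))
for order in permutations(range(len(observed))):
    print("chained", order, chained_log_score(joint, observed, order))
```

In this toy case the chained sum matches the joint score whichever state you score first, so the "doesn't matter how you chop it up" property comes back. The marginal sum gives a different number because it ignores the correlation between the two states.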