This is just an idea that came to me today. I didn't do any literature search and don't know if it's new.
The "depth" of a game is the length of the longest chain of players where each one beats the next e.g. 60% of the time. Games with more depth are supposed to involve more knowledge, strategy, learning and so on. For example, chess has about double the depth of checkers.
But I don't think that's right. Consider these games:

Decachess: two players play 10 games of chess against each other, and whoever wins more, wins.

Coinchess: a threesided coin is flipped. Heads > player 1 wins, tails > player 2 wins, wings > a game of chess is played and whoever wins wins.
Under the above definition, decachess is "deeper" than chess, because the better player has higher chance to win the whole match than to win a given game. And coinchess is more "shallow", because the better player has less of an edge. Even though the amount of knowledge, strategy and learning involved in these games is exactly the same!
The same problem happens with games like poker. Even though poker involves a lot of skill, each round has so much randomness that the depth comes out even smaller than checkers. We could introduce repeated rounds like in decachess, but how many? It seems arbitrary.
Can the concept of game depth be fixed, made independent from repeated trials and luck? I think yes, by introducing another variable: effort.
Imagine an individual player, let's call him Bob. Have Bob play in a tournament where every match has many rounds, to rule out luck. But in some matches tell Bob to play halfheartedly, and in others tell him to use maximum effort. (Ignore the unreality of the hypothetical, I'm trying to make a point.)
By the end of the tournament we'll have all players arranged by skill, and also know which subranges of skill are covered by individual players' ranges of effort. In other words, we'll end up knowing this: "The difference between the worst and best player in the tournament is covered by about N intervals between a player's slack and best effort at each level".
That number N could be used as a new measure of game depth. Under this measure, chess, decachess and coinchess will be equally deep, and checkers will be less deep. Intuitively it makes sense: "the best player surpasses me by about 5x the difference between my slack and my best effort". It lets you feel how much work is ahead of you. (Measuring against the worst player would be less informative, but everyone only cares about their distance to the top, so that's ok.)
The same idea could also work for test scores. I'd be curious to apply it to something like Raven's matrices: draw a histogram with test score as X and number of people as Y, and renormalize the X axis so that a distance of 1 corresponds to the difference between slack and best effort for a typical person at that level. Then when a new person takes the test or plays the game, we match them against a database of previous results, and tell them "ok, you performed at X units of maximum effort above the median person".
An alternative measure might be looking at the curve of Elo vs amount of computation. Deeper games presumably have greater rewards for thinking, so the curve will level off more slowly. This is measurable for computer players  at least after controlling for architecture (neural network vs traditional).
"How many Elo levels until diminishing returns on effort" seems like a sensible idea, and might work on humans as well as computers. But I still think coinchess and poker show that using win probabilities to infer rating differences (as Elo does) isn't very meaningful, it's better to look only at the binary fact of whether Alice wins against Bob more than half the time.
Probabilistically, this isn't the case. If that's a 'fair 3sided coin', then there's only a one in three chance of playing chess. Also, playing 1 chess game, and 10 are different (if they're done one after another).
Go all out, versus pace yourself (so you'll do better towards the end when other players will be running low). Might not be a hypothetical.
How about Elo range normalized to time? If someone is 100 Elo points higher than me they win 64% of the time  if we play 3 games they'll win best of 3 71% of the time, which is like 160 points higher rather than 100? So you figure out this function and then rank games by Elo per time
This was one of my first ideas. It fixes the decachess bug, but not the coinchess bug, I think.
Is there some reason to think that the depth of a game by this sort of definition is a good measure to be using in the first place? Presumably the 'deepest' game available is 'test your height and whoever is taller wins', which with accurate enough measuring instruments has a depth of 7 billion?
Yes, my definition fixes some bugs in the standard definition, but not this bug in particular. I'm not sure it's even fixable without losing the intuition of "depth" as "number of distinguishable levels".
I don't think that solves the problem, but a good thought.
If you model each person as having two skill levels (min and max effort) and then look at (highest max  lowest min) / (median (max  min)) you end up with the "deepest" games being things where the interplayer differences dwarf the intraplayer ones. But that also includes games like "who is the tallest?" Or situations like "who spent the most $ on their Pokémon CCG deck?"
Yeah. Maybe it would make sense to pull in even more information  not just how well people play with effort vs without, but also how people improve over time with effort vs without. I'm not sure how to do this cleanly yet.