This is a follow-up to last week's D&D.Sci scenario: if you intend to play that, and haven't done so yet, you should do so now before spoiling yourself.

There is a web interactive here where you can test your submission against your rival's deck (or against a couple of other NPCs I added for your amusement, if you think yourself mighty enough to challenge those who wield the legendary power of the Sumerian Souvenirs).

NOTE: Win rates in the interactive are Monte Carlo estimates with small sample sizes, and should be taken as less accurate than the ones in the leaderboards below.

RULESET

Code is available here for those who are interested.

A game is played as follows (a code sketch of this loop follows the list):

  • There are up to 6 turns in the game.
  • Each turn, each player simultaneously draws 2 cards from their deck.  (So over the full duration of the game, each player will draw their entire deck.)
  • Then, each player will simultaneously play either 0 or 1 of those cards:
    • Every card has a mana cost, ranging from 1 to 6.
    • On turn X, you can play a card that has a cost of X or less.
    • If both your cards are valid plays, you will play the higher-cost one.
  • For example, if you draw a 2-cost and a 4-cost card:
    • If it is Turn 1, you will play nothing.
    • If it is Turn 2 or 3, you will play the 2-cost card.
    • If it is Turn 4-6, you will play the 4-cost card.
  • Any cards you didn't play are discarded - you will not be able to play them on later turns.
  • Note that:
    • You cannot play two cards in the same turn, even if you would in theory have enough mana to do so.
    • You cannot be clever about which card you play.  In some cases a lower-cost card might be better than a higher-cost card: you will play the higher-cost one anyway if you can.  If you draw two cards of the same cost, you will play one at random.
  • After both players have played a card (if they can), each player calculates their Total Power by adding up the Power of all cards they have on their board.
  • If one player's Total Power exceeds the other's by at least 2000 Power, that player wins an immediate victory and the game ends.
  • If not, the game continues to the next turn.
  • After 6 turns, if no player has won an immediate victory, the game ends anyway, and whoever has higher Total Power wins.
  • If players are tied on Total Power at this point, their cards clash in a mighty battle and the winner is decided randomly.
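
Putting those rules together: below is a minimal Python sketch of the game loop as I understand it (the Card class and the return convention are my own scaffolding, and the special card abilities described in the next section are omitted for brevity):

```python
import random
from dataclasses import dataclass

@dataclass
class Card:
    name: str
    cost: int
    power: int

def play_one_game(deck_a, deck_b, threshold=2000, rng=random):
    """Play one game between two 12-card decks under the rules above.
    Returns 0 if deck_a wins, 1 if deck_b wins.  Special abilities
    (Lotus/Sword/Battalion/Angel) are ignored in this sketch."""
    decks = [list(deck_a), list(deck_b)]
    for deck in decks:
        rng.shuffle(deck)
    boards = [[], []]
    for turn in range(1, 7):
        for deck, board in zip(decks, boards):
            drawn = [deck.pop(), deck.pop()]                 # draw 2 cards
            playable = [c for c in drawn if c.cost <= turn]  # cost X or less on turn X
            if playable:
                rng.shuffle(playable)  # random tie-break between equal costs
                board.append(max(playable, key=lambda c: c.cost))
            # any unplayed cards are simply discarded
        power = [sum(c.power for c in b) for b in boards]
        if abs(power[0] - power[1]) >= threshold:            # immediate victory
            return 0 if power[0] > power[1] else 1
    if power[0] == power[1]:
        return rng.randrange(2)        # tied: the mighty battle is a coin flip
    return 0 if power[0] > power[1] else 1
```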

CARD LIST

Cards available to you were the following:

Card Name                         | Mana Cost | Power | Alignment
Gentle Guard                      | 1         | 700   | Good
Lilac Lotus                       | 1         | 200*  | Artefact
Patchy Pirate                     | 1         | 700   | Evil
Horrible Hooligan                 | 2         | 900   | Evil
Kindly Knight                     | 2         | 900   | Good
Sword of Shadows                  | 3         | 2000* | Artefact
Virtuous Vigilante                | 3         | 1300  | Good
Bold Battalion                    | 4         | 600x* | Good
Murderous Minotaur                | 4         | 1700  | Evil
Alessin, Adamant Angel            | 5         | 1800* | Good
Dreadwing, Darkfire Dragon        | 5         | 2500  | Evil
Evil Emperor [omitted for length] | 6         | 4500  | Evil

*Some cards have special abilities (a code sketch of these follows the list):

  • Lilac Lotus provides additional Mana.  For every Lotus you have on the board, you have 1 extra mana each turn.  (Yes, I know this is more like a Mox than like a Lotus.  I don't need to make the hints give that much away.)
  • Sword of Shadows requires a Wielder.  If you have at least one Evil creature on your board to wield it, it adds 2000 Power.  If you do not, it adds 0 Power.
  • Bold Battalion draws Power from your Good creatures.  Its Power is 600 per Good creature you have on your board (including itself).
  • Alessin, Adamant Angel shows Mercy on your bad draws.  When you draw it without enough mana to play it, you shuffle it back into your deck and draw again. (You can do this many times if you are unlucky enough to draw the Angel repeatedly, but if your deck is entirely Angels you will draw the Angel anyway rather than getting stuck in a loop).
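
In code, the mana and Total Power calculations might look like the sketch below (built on the card list above; "Evil Emperor" stands in for the omitted full name, and the Angel's ability affects the draw step rather than Power, so it doesn't appear here):

```python
BASE_POWER = {"Gentle Guard": 700, "Lilac Lotus": 200, "Patchy Pirate": 700,
              "Horrible Hooligan": 900, "Kindly Knight": 900,
              "Virtuous Vigilante": 1300, "Murderous Minotaur": 1700,
              "Alessin, Adamant Angel": 1800, "Dreadwing, Darkfire Dragon": 2500,
              "Evil Emperor": 4500}
GOOD = {"Gentle Guard", "Kindly Knight", "Virtuous Vigilante",
        "Bold Battalion", "Alessin, Adamant Angel"}
EVIL = {"Patchy Pirate", "Horrible Hooligan", "Murderous Minotaur",
        "Dreadwing, Darkfire Dragon", "Evil Emperor"}

def available_mana(turn, board):
    # each Lotus on the board grants 1 extra mana every turn
    return turn + board.count("Lilac Lotus")

def total_power(board):
    """Total Power of a board, given as a list of card names."""
    n_good = sum(1 for c in board if c in GOOD)
    has_wielder = any(c in EVIL for c in board)
    total = 0
    for c in board:
        if c == "Sword of Shadows":
            total += 2000 if has_wielder else 0  # needs an Evil creature to wield it
        elif c == "Bold Battalion":
            total += 600 * n_good                # 600 per Good creature, itself included
        else:
            total += BASE_POWER[c]
    return total
```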

Congratulations here to: 

  • benjamincosman, who I believe was the first to identify Good & Evil and the existence of internal synergies.
  • gammagurke, who I believe was the first to localize the Good and Evil synergies in Battalion and Sword specifically.
  • GuySrinivasan, who I believe was the first to identify the Lotus's ability.

STRATEGY

There were three available 'archetypes': coherent decks that fit together and did something powerful.  All three, once optimized, ended up with roughly a 75% winrate against your rival, but they performed differently against one another.

Good Tribal is based around Bold Battalion.  It tries to play lots of cheap Good creatures (Gentle Guard) in early turns, and then lots of Battalions to gain high Power.  The best build for this against your rival is very simple: 6xGuard and 6xBattalion.  This gets a 75.4% winrate, the best available.

Sword Aggro is based around Sword of Shadows.  This time, your goal is to get cheap Evil creatures to wield Sword of Shadows, and then use the Sword's very high Power for its cost to win on turn 3 or 4.  The best build for this against your rival is 3xPirate, 5xSword, and 4xAngel.  (The Angels make your lategame a little more reliable, and don't get in the way of your early game thanks to their ability.)  This gets a 74.6% winrate.

Lotus Ramp is based around Lilac Lotus.  Your goal is to play Lotuses in the early turns and use them to play out large late-game threats like the Emperor.  The best build for this against your rival is 5xLotus, 3xAngel, and 4xEmperor.  This gets a 74.5% winrate.

When these decks as-written play against each other, their relative speeds lead to some rock-paper-scissors (where being slightly stronger and slower than your opponent advantages you as you end up with more power, but being much stronger and slower disadvantages you as you can lose before getting off the ground):

  • Good Tribal has a slight advantage over Sword Aggro by being slightly slower and stronger (~55%)
  • Sword Aggro has a slight advantage over Lotus Ramp by being much faster and weaker (~55%)
  • Lotus Ramp has a substantial advantage over Good Tribal by being slightly slower and stronger (~63%)

It is possible to build these decks differently - most notably, you could build more defensive Lotus Ramp decks (with more Angels or even Vigilantes, and fewer Emperors).  This improves the matchup against Sword Aggro and other fast decks substantially, but worsens the matchups against Good Tribal, other Lotus Ramp, and any other slow decks.

Particular congratulations are due here to gammagurke, who managed not only to identify these three archetypes but to actually give two of them exactly their correct names.  (Sadly, 'Evil Equip' doesn't quite work out as a name when the deck often contains more Good creatures than it does Evil ones).

I was expecting Good Tribal to be the simplest available deck, and for Sword Aggro and particularly Lotus Ramp to take the most work to optimize - it seems I was mistaken in this regard: many players submitted Sword- or Lotus-based decks, while no one submitted a Good-based deck at all.

PVE LEADERBOARD

Note: all winrates below were Monte-Carlo calculated rather than explicitly derived. 
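
(For a sense of the precision involved: a Monte Carlo estimate of a winrate p from n simulated games has a standard error of roughly √(p(1−p)/n), i.e. about ±0.16 percentage points at p = 0.5 with n = 100,000.  Reusing the play_one_game sketch from the ruleset section, such an estimator might look like:)

```python
def estimate_winrate(deck_a, deck_b, n_games=100_000, rng=random):
    """Monte Carlo estimate of deck_a's winrate against deck_b.
    (Uses play_one_game and `random` from the earlier sketch.)"""
    wins = sum(play_one_game(deck_a, deck_b, rng=rng) == 0 for _ in range(n_games))
    return wins / n_games
```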

Submitted decks were as follows:

abstractapplic: Sword Aggro (3xPirate, 3xHooligan, 4xSword, 2xAngel)

gammagurke: Lotus Ramp (5xLotus, 2xPirate, 3xAngel, 2xEmperor)

GuySrinivasan: Dragon Ramp (6xLotus, 4xDragon, 2xEmperor)

jsevillamol: Knight Aggro (1xPirate, 9xKnight, 1xBattalion, 1xEmperor)

Maxwell Peterson: Sword Aggro (6xPirate, 2xSword, 1xMinotaur, 3xAngel)

Measure: Lotus Midrange (3xLotus, 2xPirate, 1xSword, 1xVigilante, 2xAngel, 1xDragon, 2xEmperor)

Pablo Repetto: Lotus Aggro (3xLotus, 3xPirate, 2xSword, 4xAngel)

Yonge: Lotus Midrange (2xLotus, 1xPirate, 1xHooligan, 2xVigilante, 1xMinotaur, 2xAngel, 1xDragon, 2xEmperor)

Player           | Winrate
Optimal Play     | 75.4%
abstractapplic   | 68.46%
GuySrinivasan    | 66.16%
Measure          | 61.38%
Maxwell Peterson | 61.06%
gammagurke       | 59.00%
Yonge            | 57.76%
Pablo Repetto    | 52.57%
Random Play      | 40.5%
jsevillamol      | 31.39%

Congratulations to all submitters, in particular to abstractapplic (whose Sword Aggro build was fairly well-tuned despite the Hooligans not being quite optimal) and GuySrinivasan (who managed to place a very close second despite playing an archetype I didn't even think existed).

PVP LEADERBOARD

Note: all winrates below were Monte-Carlo calculated rather than explicitly derived.

Note: The commentary below should be considered non-final for a few days to give people time to point out that I've misread the decks they submitted/added up win percentages wrong/made other obvious mistakes.  If I have messed something like that up I'll have to recalculate, so don't count on victory/defeat until some more eyes have confirmed.

Submitted decks were as follows:

abstractapplic: Sword Aggro (3xPirate, 3xHooligan, 4xSword, 2xAngel)

gammagurke: Lotus Ramp (6xLotus, 2xAngel, 4xEmperor)

GuySrinivasan: Dragon Ramp (6xLotus, 1xMinotaur, 5xDragon)

jsevillamol: Knight Aggro (1xPirate, 9xKnight, 1xBattalion, 1xEmperor)

Maxwell Peterson: Sword Aggro (2xPirate, 4xHooligan, 2xSword, 4xAngel)

Measure: Sword Aggro with Emperors (4xPirate, 4xSword, 1xVigilante, 3xEmperor)

Pablo Repetto: Lotus Aggro (3xLotus, 3xPirate, 2xSword, 4xAngel)

Yonge: Lotus Midrange (2xLotus, 1xPirate, 1xHooligan, 2xVigilante, 1xMinotaur, 2xAngel, 1xDragon, 2xEmperor)

Standings were as follows:

Player           | abstractapplic | gammagurke | Measure | Maxwell Peterson | Yonge  | GuySrinivasan | Pablo Repetto | jsevillamol | Total Score
abstractapplic   | 50%            | 56.44%     | 55.51%  | 59.99%           | 59.4%  | 59.78%        | 68.85%        | 85.62%      | 4.96
gammagurke       | 43.56%         | 50%        | 53.28%  | 51.37%           | 67.67% | 58.18%        | 60.17%        | 82.46%      | 4.67
Measure          | 44.49%         | 46.72%     | 50%     | 50.99%           | 52.98% | 56.78%        | 59.87%        | 78.97%      | 4.41
Maxwell Peterson | 40.01%         | 48.63%     | 49.01%  | 50%              | 53.28% | 53.91%        | 60.98%        | 82.4%       | 4.38
Yonge            | 40.6%          | 32.33%     | 47.02%  | 46.72%           | 50%    | 51.45%        | 56.43%        | 74.96%      | 4.00
GuySrinivasan    | 40.22%         | 41.82%     | 43.22%  | 46.09%           | 48.55% | 50%           | 53.58%        | 70.46%      | 3.94
Pablo Repetto    | 31.15%         | 39.83%     | 40.13%  | 39.02%           | 43.57% | 46.42%        | 50%           | 71.15%      | 3.61
jsevillamol      | 14.38%         | 17.54%     | 21.03%  | 17.60%           | 25.04% | 29.54%        | 28.85%        | 50%         | 2.04
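
(Reading the table: each cell is the row player's winrate against the column player, and Total Score appears to be the row's winrates summed across all eight columns, mirror match included - i.e. expected games won out of 8.  For example:)

```python
# abstractapplic's row from the table above
row = [50, 56.44, 55.51, 59.99, 59.4, 59.78, 68.85, 85.62]
print(sum(row) / 100)  # 4.9559, matching the 4.96 Total Score shown
```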

gammagurke's Lotus Ramp deck was designed to prey on other midrange and ramp decks by going over the top with more Emperors.  Sadly, the field held many Sword-based aggro decks and few midrange and ramp decks, making this otherwise-reasonable deck poorly positioned for the opponents it actually faced and pushing it back to second place.

Measure and abstractapplic both submitted solid Sword Aggro lists, but abstractapplic's was somewhat better-tuned and came out on top.

Congratulations abstractapplic!  Once you've figured out what theme/work* you want to request an upcoming scenario be based on, PM or comment and I'll try to get it to happen.  I can't promise it'll happen soon (it takes some time to write one of these, I have at least one other scenario that'll likely get posted before it, and other people may also submit some), so you'll most likely end up waiting a few months.

*Ability to select a specific work is contingent on me being familiar with that work and thinking I can write a scenario based on it.

FEEDBACK REQUEST

As usual, I'm interested to hear feedback on what people thought of this scenario.  If you played it, what did you like and what did you not like?  If you might have played it but decided not to, what drove you away?  What would you like to see more of/less of in future?  Do you think the underlying data model was too complicated to decipher?  Or too simple to feel realistic?  Or both at once?

It also looks like we had a few new players.  Congratulations to you, and I hope you enjoyed the game - if you liked it, the sequence here contains past scenarios, and you can subscribe to that to get notifications when new ones are posted (I try to make sure this happens around once a month).

Thanks for playing!

COMMENTS

Reflections on my attempt:

My PvE approach, as I mentioned, was to copy the plan that worked best in a comparable game: train a model to predict deck success, feed it the target deck, then optimize the opposing deck for maximum success chance. I feel pretty good about how well this worked. If I'd allocated more time, I would have tried to figure out analytically why the local maxima I found worked (my model noticed Lotus Ramp as well as Sword Aggro but couldn't optimize it as competently for some reason), and/or tried multiple model types to see what they agree on (I used a GBT, which has high performance but doesn't extrapolate well).
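
(For illustration, a loose sketch of such a train-then-optimize loop - the letter encoding, feature scheme, and hill-climb are illustrative guesses, not the actual code used:)

```python
import random
from sklearn.ensemble import GradientBoostingClassifier

CARDS = "ABDEGHKLMPSV"  # one hypothetical letter per card type

def featurize(deck, opp_deck):
    # counts of each card type on both sides of the matchup
    return [deck.count(c) for c in CARDS] + [opp_deck.count(c) for c in CARDS]

# model = GradientBoostingClassifier().fit(X, y)  # X: featurized matchups, y: wins

def hill_climb(model, opp_deck, iters=2000, rng=random):
    """Greedily mutate a 12-card deck to maximize predicted winrate vs opp_deck."""
    deck = "".join(rng.choices(CARDS, k=12))
    score = model.predict_proba([featurize(deck, opp_deck)])[0, 1]
    for _ in range(iters):
        i = rng.randrange(12)
        cand = deck[:i] + rng.choice(CARDS) + deck[i + 1:]
        cand_score = model.predict_proba([featurize(cand, opp_deck)])[0, 1]
        if cand_score > score:
            deck, score = cand, cand_score
    return deck, score
```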

My PvP approach was a lot more scattershot. After some hilariously bad attempts to get an uncounterable deck by having two decks repeatedly optimize against each other, I decided to just recycle my PvE deck and hope I happened to win the rock-paper-scissors game. As it happened, Fate smiled on me, but if there had been any Good Tribal decks in play I wouldn't be looking quite so clever right now.

Reflections on the challenge:

This was fun. I particularly like that it was superficially similar to Defenders of the Storm, while having profoundly different mechanics: I came in expecting another game that's mostly about counters, and instead got a game that's mostly about synergy. And, as everyone (including me) has already said, the premise and writing are hilarious.

My only problem with this game was the extra difficulty associated with approaching it analytically if you don't happen to know about mtg-style card games (I remember looking at the comments on the main post late last week and wondering what a 'ramp' was). However, this issue is mitigated by the facts that:

  • It (presumably) gave card game fans a chance to practice balancing-priors-against-new-evidence skills and not just ML/analysis skills.
  • It's not unreasonable for card game knowledge to help pick cards in a game centered on card games.
  • I won despite lacking this background.

> My only problem with this game was the extra difficulty associated with approaching it analytically if you don't happen to know about mtg-style card games (I remember looking at the comments on the main post late last week and wondering what a 'ramp' was).

I actually considered this to be mostly a feature rather than a bug?  I think real-world data science problems also benefit from having some knowledge of the domain in question.

It's possible to apply data science techniques to a completely unfamiliar domain - you don't need to know anything about card games to notice that 'P' and 'S' showing up together, or 'L' and 'E' showing up together, improves your payoff function, and to try submitting an answer that has lots of 'P's and lots of 'S's in it.

But if you have some level of domain knowledge, you have more ability to guess what kind of patterns are likely to appear, and to extrapolate details.  When you see that 'L' works well with 'D', 'E' and 'A' that doesn't tell you much else: when you notice that 'L' works well with all three of the cards that have long and bombastic names, that lets you start guessing things like 'there are some kind of costs to playing these powerful cards, and L helps you pay those costs to play them'.  This lets you guess in turn things like 'adding more Emperors might make the deck stronger against other decks like itself but weaker against faster decks' that would be very hard to pull out of the data directly without some amount of domain knowledge to help.
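
(As a toy illustration of the domain-free version, assuming the dataset is loaded with one-letter-per-card deck strings and a win flag - the 'deck'/'win' column names are hypothetical:)

```python
import pandas as pd

def pair_lift(games: pd.DataFrame, a: str, b: str) -> float:
    """Winrate lift when a deck contains both card-letters a and b,
    relative to the overall base rate."""
    both = games[games["deck"].str.contains(a) & games["deck"].str.contains(b)]
    return both["win"].mean() - games["win"].mean()

# e.g. a positive pair_lift(games, "P", "S") hints that the two cards synergize
```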

This is part of the reason why I gave the cards names instead of just saying 'Card ID 1', 'Card ID 2', etc.  (The other part is of course to sound cooler :P)

It's definitely a feature as well; the exact tradeoff comes down to personal taste.

This was the most fun I've had analysing data and writing code, probably ever. Unfortunately I missed the previous editions, but I'm looking forward to the next one. If I had played the previous ones, I might have steered away from trying to explain effects via the complex interactions common in this sort of card game, and towards the simpler interactions more likely to be put into this sort of ruleset (for example, things having one combat-relevant stat instead of two).

Because this was such a blast to play around with, I don't really have any specific things I would change.
The way the decks in the dataset were generated, along with the dataset's size, made it easy to check how some specific decktype did in general, but hard to check how it did against some other specific decktype - which seems like the perfect middle ground.

I really appreciate the work that was put into the theme, letting me roleplay as Kaiba playing children's card games and trying to win using the power of arcane computer analysis tools.

I enjoyed this one a lot! It was simple enough to inspire educated guesses about how the system worked, but complex enough that probably no exact values would be forthcoming. Intuitions from similar games ported pretty well. I was "coerced" into writing some fun code. Thank you for explicitly stating the random nature of the dataset's decks; I think this particular challenge would have been worse if we needed to try to model [something] about that, too.

First of all: thank you for setting up the problem, I had lots of fun!

This one reminded me a lot of D&D.Sci 1, in that the main difficulty I encountered was the curse of dimensionality. The space had lots of dimensions so I was data-starved when considering complex hypotheses (performance of individual decks, for instance). Contrast with Voyages of the Grey Swan, where the main difficulty is that broad chunks of the data are explicitly censored.

I also noticed that I'm getting less out of active competitions than I was from the archived posts. I'm so concerned with trying to win that I don't write about and share my process, which I believe is a big mistake. Carefully composed posts have helped me get my ideas in order, and I think they were far more interesting to observers. So I'll step back from active competitions for a bit. I'll probably do the research summaries I promised, "Monster Carcass Auction", "Earwax" (maybe?), then come back to active competitions.

For my PvE approach, I filtered the dataset for decks similar to our opponent's deck (the rule I used was "decks with at least eight different card types"), and looked at which single card inclusion (zero vs. one-or-more) yielded the best win rate. Then I further filtered for matchups that included that card and looked at adding an additional copy of that card vs. adding a different card. I repeated this process until I had filled eight or so slots (I think I had AADEELLL____) and then filled the rest with generally-good diverse cards (PPSV).

For PvP, I guessed that lots of people would find similar Lotus ramp decks to my PvE deck, so I filtered the dataset for decks with a lot of Lotuses and a lot of Angels+Dragons+Emperors. I then used the same process as above to fill one card at a time until I had PPPPSSSS____. At this point, there were very few matchups in the dataset that passed the filters, so I wasn't confident in how to finish the deck, but the process was weakly pointing to Emperor and Vigilante, and I wanted a bit more diversity, so I filled it out with EEEV.
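
(In code, that filter-and-extend loop might look roughly like the sketch below - the DataFrame columns, containment test, and sample-size cutoff are illustrative guesses, not the actual process:)

```python
from collections import Counter
import pandas as pd

def contains(deck: str, needed: str) -> bool:
    # True if `deck` has at least as many copies of each letter as `needed`
    have, need = Counter(deck), Counter(needed)
    return all(have[c] >= k for c, k in need.items())

def best_next_card(games: pd.DataFrame, partial: str, opp_filter,
                   cards: str = "ABDEGHKLMPSV", min_games: int = 30):
    """Greedy step: among games whose opponent passes opp_filter, find the
    single card whose addition to `partial` gives the best observed winrate."""
    pool = games[games["opp_deck"].map(opp_filter)]
    best, best_wr = None, -1.0
    for c in cards:
        matches = pool[pool["deck"].map(lambda d: contains(d, partial + c))]
        if len(matches) >= min_games and matches["win"].mean() > best_wr:
            best, best_wr = c, matches["win"].mean()
    return best, best_wr
```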

Interesting!  It looks to me like your initial algorithm was excellent but your 'filling-out' process may have shot you in the foot a little: both of your decks ended up sort of indecisive about whether to go for aggro or ramp. Your PVP deck would have done much better with more Pirates and Swords rather than switching over to Emperors, and your PVE deck would have done much better with more ramp stuff rather than switching to Pirates and Swords.

Can you play any of the previously drawn cards, or only one of the two drawn this turn?

Only one of the two drawn this turn, I'll edit to clarify.

I just noticed that this fictional game is surprisingly similar to Marvel Snap (which was released later the same year); I assume based on the timing that this is a coincidence but I thought it was amusing.

...they stole my game!!!