You are studying to become an Adventurer. You’re excited to begin your Great Quest, but also anxious: over a third of the graduates from your program fail to accomplish their Great Quests. And if you’re being brutally honest with yourself, your odds are probably worse than that, since your stats – while about average for the general population – are pretty lousy by the standards of Adventurer College.
On the eve of your graduation, you’re visited by a mysterious fairy offering to add a total of ten extra points to whichever attributes you most want to improve. Following the college’s standard mysterious fairy protocol, you humbly request a week for research and contemplation before deciding how best to use this once-in-a-lifetime opportunity. (Your low Charisma ensures you come off as simultaneously entitled and disinterested when saying this, but she agrees regardless.)
The college Archivist provides you a complete but anonymised record of the stats of everyone who graduated last year, and whether they succeeded at their Great Quests. (The record-keeping is magically perfect, and as Great Quests never take more than a year there are no incomplete Great Quests to account for.) The rest is up to you. Where will you allocate those ten points?
I’ll be posting an interactive letting you test your decision, along with a complete explanation of the dataset, sometime next Saturday. I’m giving you a week, but the task shouldn’t take more than a few hours; use Excel, R, Python, random guessing, or whatever other tools you think are appropriate. Let me know in the comments if you have any questions about the scenario.
ETA: If you want to investigate this collaboratively with other lesswrongers (or just share your conclusions without waiting a week), feel free to do so in the comments; however, please use spoiler tags when sharing inferences, so people intending to fly solo can look for clarifications without being spoiled.
There are 3 pairs of duplicate stats in the dataset. (1011, 4696) - different results. (3460, 5146) - different results. (4399, 5963) - same result. This reassures me that there was some random element and not some Zendo-like ruleset, unless we're in the real complicated case where people's outcome was a deterministic function of their surrounding graduating class.
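A check like this one can be sketched in a few lines. This is only an illustration: the function names and the toy records are made up, and rows are assumed to be `(stats, succeeded)` pairs.

```python
from collections import defaultdict

def find_duplicate_stat_lines(records):
    """Group rows by identical stat lines; keep groups with 2+ members.

    records: list of (stats_tuple, succeeded) pairs (assumed layout).
    Returns {stats_tuple: [(row_index, succeeded), ...]}.
    """
    groups = defaultdict(list)
    for idx, (stats, succeeded) in enumerate(records):
        groups[tuple(stats)].append((idx, succeeded))
    return {s: rows for s, rows in groups.items() if len(rows) > 1}

def has_conflicting_outcomes(rows):
    """True if the same stat line both succeeded and failed."""
    return len({succeeded for _, succeeded in rows}) > 1

# Toy data, invented for illustration only.
toy_records = [
    ((8, 9, 12, 15, 12, 15), True),
    ((10, 10, 10, 10, 10, 10), False),
    ((8, 9, 12, 15, 12, 15), False),
]
duplicates = find_duplicate_stat_lines(toy_records)
conflicts = {s for s, rows in duplicates.items() if has_conflicting_outcomes(rows)}
```

Any stat line that lands in `conflicts` rules out a purely deterministic function of a graduate's own stats.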
Hmm, normally I don't participate in things like these... but something about this one appeals to me. So why not! Let's give it a shot.
What are the guidelines for posting my research / insights?
I've added some guidelines to the main post. Thanks for asking, I'm embarrassed to admit that angle didn't occur to me.
I'm late for the party. I put my blind analysis on a full post, and will be going through all the problems in order.
Retrospective thoughts after seeing the solution.
Definitely enjoyed this, would very much appreciate a few more posts of this style. The relatively basic solution I implemented is as follows:
I created a dictionary in Python for the total successes. For each stat (cha, str, etc), I found the number of successes and total attempts at each score (1-20). Dividing the successes by total attempts gave me a rough success rate for each stat score.
Then, I set my character up as a dictionary, and iterated over it, increasing each value by one and seeing what the change was for the success rate from that increase. After a single iteration, I increased my character's stat where the greatest positive change was found.
Iterating over that 10 times, or until all my points were gone, gave me this spread:
STR: Eight (increased by two)
CON: Sixteen (increased by two)
DEX: Fourteen (increased by one. Weird, I know, but there was an increase to be had from 13 - 14, even though afterward it's negative returns)
INT: Thirteen (no change)
WIS: Twelve (no change)
CHA: Nine (increased by 5)
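The greedy procedure described in this comment might look roughly like the sketch below. It is a guess at the shape of the method, not the commenter's code: the record layout, the function names, and the choice to stop when no single-point increase has a positive gain are all assumptions.

```python
from collections import defaultdict

STATS = ("cha", "con", "dex", "int", "str", "wis")

def rate_tables(records):
    """records: list of (stats_dict, succeeded). Returns {stat: {score: rate}}."""
    wins = {s: defaultdict(int) for s in STATS}
    totals = {s: defaultdict(int) for s in STATS}
    for stats, succeeded in records:
        for s in STATS:
            totals[s][stats[s]] += 1
            wins[s][stats[s]] += int(succeeded)
    return {s: {v: wins[s][v] / totals[s][v] for v in totals[s]} for s in STATS}

def greedy_allocate(character, tables, points=10):
    """Spend points one at a time on the stat with the largest rate increase."""
    build = dict(character)
    for _ in range(points):
        best_stat, best_gain = None, 0.0
        for s in STATS:
            cur, nxt = build[s], build[s] + 1
            if cur not in tables[s] or nxt not in tables[s]:
                continue  # no data for this score, so no estimate
            gain = tables[s][nxt] - tables[s][cur]
            if gain > best_gain:
                best_stat, best_gain = s, gain
        if best_stat is None:
            break  # assumption: stop when no +1 step helps
        build[best_stat] += 1
    return build
```

Because each step only looks one point ahead, a stat whose rate dips at +1 and jumps at +3 will never be chosen, which is the limitation raised in the reply.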
You do indeed miss out on some gains from a jump - WIS gets you a decline in success at +1 but a big gain at +3. (Edit: actually my method uses odds ratio (successes divided by failures) not probabilities (successes divided by total). So, may not be equivalent to detecting jump gains for your method. Also my method tries to maximize multiplicative gain, while your words "greatest positive" suggest you maximize additive gain.)
STR - 8 (increased by 2)
CON - 15 (increased by 1)
DEX - 13 (no change)
INT - 13 (no change)
WIS - 15 (increased by 3)
CHA - 8 (increased by 4)
calculation method: spreadsheet adhockery resulting in tables for each stat of:
per point gain = ((success odds ratio for current stat)/(success odds ratio for current stat + n))^(1/n), find n and table resulting in highest per point gain, generate new table for that stat for new stat start point and repeat.
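A small sketch of that per-point calculation, taking the ratio as odds at the higher score over odds at the current score (an assumption about the intended direction, so that values above 1 mean an improvement). The function name and the odds table are invented for illustration.

```python
def best_jump(odds_by_score, current, max_points):
    """Return (n, per_point_gain) maximising
    (odds[current + n] / odds[current]) ** (1 / n) over n = 1..max_points.

    odds_by_score maps a stat score to successes / failures at that score.
    Returns None if no higher score has data.
    """
    best = None
    for n in range(1, max_points + 1):
        target = current + n
        if target not in odds_by_score:
            continue
        gain = (odds_by_score[target] / odds_by_score[current]) ** (1.0 / n)
        if best is None or gain > best[1]:
            best = (n, gain)
    return best

# Made-up odds: a dip at +1 followed by a large gain at +3.
toy_odds = {12: 1.0, 13: 0.9, 14: 1.1, 15: 2.0}
jump = best_jump(toy_odds, current=12, max_points=3)
```

With these toy numbers the best jump is three points, even though a single point looks like a loss, which is the case a one-step greedy search misses.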
~2 hours of analysis here: https://github.com/sclamons/LW_Quest_Analysis, notebook directly viewable at https://nbviewer.jupyter.org/github/sclamons/LW_Quest_Analysis/blob/main/lw_dnd.ipynb.
1) From simple visualizations, it doesn't look like there are correlations between stats, either in the aggregated population or in either the hero or failed-hero populations.
2) I decided to base my stat increases on what would add the most probability of success for improving that stat, looking at each stat in isolation, where success probabilities were estimated by simply tabulating the fraction of students with that particular stat value who ended up heroes.
3) Based on that measure, I decided to go with +4 Cha, +1 Str, +2 Wis, +3 Con, and I wish I could reduce my Dex.
My research so far:
Made graphs of stat vs prob of success. Pretty clean linear relationships between each stat and increase in success, except for dexterity. That seems to hurt.
Checked for correlations between stats; none detected.
Given that we can't go down in stats, I also looked at the data for students whose stats are at least as high as ours. Did linear regression on that; seems like Dex helps in this case, but there are a lot fewer samples, so I'm going to chalk it up to noise.
Going off all the data, Wis and Cha have the highest slope. (Cha is slightly higher.) So I'd invest evenly in both. Going off the conditioned data, Cha has the highest slope. So I'll shift to +7 Cha and +3 Wis.
One thing I also thought about is coming up with various hypotheses, Zendo style. E.g. "you win if your Str > Dex" or "you win if the sum of your lowest three stats is > 10" But I don't think that's the nature of the problem.
Stats don't appear too correlated, although all cross correlations are negative around -7% to -10% which is interesting. I guess it might have to do with data construction. Simple logistic regression gives coeffs: CHA=0.143 CON=0.141 DEX=-0.016 INT=0.099 STR=0.116 WIS= 0.156. Based on these one would push WIS+8 to 20 and allocate the remaining two points CHA+2.
However values are close, and there are standard errors around the estimates. Bootstrapping strategies to account for that allocates: WIS+7, CHA+2, CON+1.
Accounting for cross effects flips them around, with allocation: CHA+6, WIS+3, CON+1. Going full second order allocates CHA+6, WIS+3, STR+1, but obviously with higher complexity.
My answer would overweigh the linear when blending with the ones with more parameters. Final answer WIS+6, CHA+3, CON+1.
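The first, purely linear allocation in this comment follows mechanically from the coefficients: under a model that is linear in the log-odds, every point is worth the most in the stat with the largest coefficient until that stat hits its ceiling. A sketch, assuming a cap of 20 per stat and a starting stat line inferred from the builds quoted elsewhere in the thread (both are assumptions, as is the function name):

```python
# Coefficients as quoted in the comment above.
COEFFS = {"cha": 0.143, "con": 0.141, "dex": -0.016,
          "int": 0.099, "str": 0.116, "wis": 0.156}

# Assumed starting stats (inferred from other comments, not given here).
BASE = {"cha": 4, "con": 14, "dex": 13, "int": 13, "str": 6, "wis": 12}

def allocate_by_coefficient(base, coeffs, points=10, cap=20):
    """Fill stats in descending coefficient order, respecting the cap."""
    build = dict(base)
    for stat in sorted(coeffs, key=coeffs.get, reverse=True):
        if points == 0 or coeffs[stat] <= 0:
            break  # nothing left to spend, or remaining stats do not help
        spend = min(points, cap - build[stat])
        build[stat] += spend
        points -= spend
    return build

linear_build = allocate_by_coefficient(BASE, COEFFS)
```

This reproduces the WIS+8, CHA+2 split. It ignores the standard errors on the coefficients, which is what the bootstrapped allocations try to account for.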
Stupid question: STR and CHA are given in different orders in the data vs the above description. (And, because both values given are "low" enough to be the CHA stat, it's ambiguous if the values were switched). Does this secretly mean something, or am I just reading too much into it?
Your paranoia does you credit, but I'm not doing anything close to that subtle; what you're seeing is Pandas putting the columns in alphabetical order when saving the dataset as csv. (I had to manually edit it to make 'results' be the last column instead of third-to-last)
Ah, okay thanks!
Choice and reasoning:
Graduate stats likely come from 2d10 drop anyone under 60 total. No obvious big jumps at particular thresholds, so assume each extra point helps about the same given the stat type.
For completing my Great Quest: +8 WIS, +2 CHA, based on assuming each stat point provides the equivalent of x bits of evidence you'll complete it, depending on the stat, estimated by looking at prior history in your range of stats of the change in prob given that total stats didn't change.
For breaking the system: +10 CON. Best chance of surviving while not on a Great Quest, breaks the theoretical limit by the most, not awful for Great Quest.
Life after Questing: +6 CHA, +4 STR. Really quite good for your Great Quest even if not the best, and you no longer have silly weaknesses like talking and jars, so e.g. if there's another fairy later you don't run such a big risk of losing out on a free +10 stats by sounding simultaneously entitled and disinterested.
Some basics: Each stat has range 2-20 (and maybe comes from 2d10 somehow?). Sum of stats is in range 60-100. You have 62 and are going to 72. More stats generally gets better results. Baseline 62 gives 40% to quest; baseline 72 gives 69% to quest. Average graduate stat sum is 70.4. Total graduates 7387. Maybe stats come from a roll of 2d10 and you only graduate if your stats are at least 60 in total? P(12d10>=60)=74% so probably generated 10K folks and filtered to >=60, yeah. Stats are probably anti-correlated in our sample?
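The P(12d10>=60) figure can be checked exactly by convolving the die distribution with itself. A minimal sketch (function names are illustrative), assuming six stats of 2d10 each, so twelve fair ten-sided dice in total:

```python
from fractions import Fraction

def sum_distribution(num_dice, sides=10):
    """Return {total: number of ways} for the sum of num_dice fair dice."""
    dist = {0: 1}
    for _ in range(num_dice):
        nxt = {}
        for total, ways in dist.items():
            for face in range(1, sides + 1):
                nxt[total + face] = nxt.get(total + face, 0) + ways
        dist = nxt
    return dist

def prob_at_least(threshold, num_dice=12, sides=10):
    """Exact probability that the dice total is >= threshold."""
    dist = sum_distribution(num_dice, sides)
    favourable = sum(ways for total, ways in dist.items() if total >= threshold)
    return Fraction(favourable, sides ** num_dice)

p_graduate = prob_at_least(60)
expected_graduates = float(p_graduate) * 10_000
```

Comparing `expected_graduates` with the 7387 graduates in the data is a quick sanity check on the "10K candidates, filtered to >=60" guess.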
Let's try simple logistic regression. Normalize, fit, predict. You're 38% to succeed, that checks out. Try some simple changes? +10 to any stat, even though that brings you above 20. WIS gets to 73%, CHA/CON to 70%, INT/STR to 65%/61%, and DEX down to 34%. Huh! groupby('dex').mean() ==> yeah, much higher chances with low dex, dunno if that's because dex is useless and stats anti-correlated, or dex is harmful. Anyway this model's got CHA/CON/DEX/INT/STR/WIS coeffs at [2.5, 2.5, -0.3, 1.7, 2.0, 2.7]. As I see it so far, there are three main considerations: pump WIS to 20 and CHA to 6 to maximize chance of quest, pump CON to 24/20 to maximize survivability past that of any adventurer who has ever lived (can we pass 20?? :D), or mostly ignore quest considerations because we have other goals, which probably means maxing some stat or shoring up CHA/STR.
Let's check a random forest to see if there are major discontinuities. Oh, it's way different! Here +10 to CHA does very very well, almost 90% quest success. groupby('cha').mean() ==> I see a jump of almost +10pp from 5->6 and 13->14 CHA. Maybe we invest 10 in CHA? Or maybe 2 in CHA and then... nah, not really better. But this is misleading too, because folks with CHA=14 just happened to have better stats on average. Better than CHA=13 for everything but DEX which has negative predictive success.
Okay fine, let's try to do something like the right thing. I'd like to know the change in success rate when adding one point to one stat, with the sum of the other stats remaining constant. And I might only care about this in the lowish range of stat sums, 60 to 75, say. We'll just grab the average for a sec. The average what. ...evidence of success provided by seeing +1 in a certain stat given that all other stats are equal? Sure, maybe that's the model used to generate quest prob. Laplace to estimate prob of success with total stats = x, wlog cha = y. Got CHA/CON/DEX/INT/STR/WIS [1.4, 1.1, -0.1, 0.6, 1.4, 1.8] for the whole 60-100 range, or [0.4, 0.3, -1.0, 0.3, 0.4, 0.8] for just 60-75.
Should also check that there's no obvious reason the model assumption of e.g. 4->5 is in some ways the same as 18->19, but meh, we're done here.
>! in reply to:
Graduate stats likely come from 2d10 drop anyone under 60 total
I think you're right. The character stats data seems consistent with starting with 10000 candidates, each with 6 stats independently chosen by 2d10, and tossing out everything with a total below 60.
One possible concern with this is the top score being the round number of 100, but I tested it and got only one score above 100 (it was 103), so this seems consistent with the 100 top score being coincidence.
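A test of this kind can be run with a short simulation. The sketch below follows the hypothesis as stated (10000 candidates, six independent 2d10 stats, drop totals below 60); the function name and the seed are arbitrary choices.

```python
import random

def simulate_class(candidates=10_000, cutoff=60, seed=0):
    """Roll six 2d10 stats per candidate; keep those with total >= cutoff."""
    rng = random.Random(seed)
    graduates = []
    for _ in range(candidates):
        stats = [rng.randint(1, 10) + rng.randint(1, 10) for _ in range(6)]
        if sum(stats) >= cutoff:
            graduates.append(stats)
    return graduates

graduates = simulate_class()
class_size = len(graduates)
top_total = max(sum(stats) for stats in graduates)
```

Rerunning with different seeds shows how much the class size and the highest total vary from run to run, which is the relevant comparison for the 7387 graduates and the top score of 100.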
I only just realized that 6 * 20 != 100.
I don't think this comment needs a spoilerbox.
Fixed your spoilers for you. You used the markdown syntax but you are not in the markdown editor, so instead you should just start with >! and then proceed as usual.
Not to nitpick, but does this mean classes like fighter, wizard, etc.. were merged in a generic "adventurer" class?
If not, I get the point of the post anyway, but it seems we are missing a pretty big part of the information we need to choose.
Your interpretation is correct: there are no character classes in this world.
Seems like there's a lot of room for easy improvement by making teams then, do Great Quests have to be a solo effort? Are they actually important to accomplish or is one person failing not a big deal for anyone but him? If this is a sort of College, is the Great Quest a Final or also the Job career itself? What do they do afterwards? Can I just dump 5 and 5 on INT and WIS and set up a matchmaking business?
Anyways checked if there were people with identical stats where one succeeded and one failed, just in case whichever system translated stats to outcomes was fully deterministic, sadly, 2 pairs met those conditions. My first observations before I feed this data into a neural network:
the highest stats loser: [10, 6, 13, 10, 15, 10]
the lowest stats winner: [8, 13, 5, 8, 11, 16]
average winner specialization (standard deviation): 3.5340118495400863
average loser specialization (standard deviation): 3.6968854737691705
Stats that yield both victory and loss: [[8, 9, 12, 15, 12, 15], [6, 8, 10, 16, 10, 15]]
Do we have to spend all 10 points?
Do adventurers gain additional status points during their Great Quest, and if yes, are these stats measured at the beginning or at the end of their quest?
Is this data from some real Dungeons and Dragons game results?
No to both questions.
Is it a secret / part of the puzzle, where this data came from?
I generated the dataset. The rules I used to do so will be provided on Saturday, so everyone can see how close they got to the truth.
Threw all the data in a small neural network, and let it optimize (pretty mediocre: only resulted in an accuracy of 70%). I used this network to test quite a few different combinations of possible stats (base stats + spending all 10 points), resulting in [7, 16, 13, 13, 12, 11] as best and [6, 14, 18, 13, 16, 5] as worst chance of succeeding. A lot of things could still be optimized in this approach, but it seems like dexterity and wisdom should be left alone, and charisma and constitution could use a boost.
Here's the algorithms i tried, you can see the python code on GitHub:
(previously used rot13 cause i wasn't able to figure out how to do spoilers, it should really be a button in the editor pop up)
for each point i wish to assign i loop over the whole list.
for each row i check whether it's someone who succeeded or failed.
then i compare each stat to my own. in the case of someone who succeeded, if their stat is higher i add 1 to a counter for that stat, if their stat is lower i reduce 1. if they failed i do the opposite (add 1 if their stat is lower, reduce 1 if their stat is higher).
then i add one point to the stat which got the highest score, and loop again until i spent the last point.
this algorithm gave me the following stats at the end:
cha: 9, str: 11, con: 14, dex: 13, int: 13, wis: 13
This algorithm would usually just increase your lowest stat until it's close to your other stats, which i'm not sure is such a great strategy.
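For readers who want to follow along without opening the repository, the counting scheme described above can be sketched like this. It is a reconstruction from the description, not the linked code; the record layout and the handling of equal stats (skipped) are assumptions.

```python
STATS = ("cha", "str", "con", "dex", "int", "wis")

def score_stats(build, records):
    """Count, per stat, how strongly the data 'votes' for raising it.

    records: list of (stats_dict, succeeded) pairs (assumed layout).
    """
    scores = {s: 0 for s in STATS}
    for stats, succeeded in records:
        for s in STATS:
            if stats[s] == build[s]:
                continue  # assumption: ties contribute nothing
            higher = stats[s] > build[s]
            # Winner with a higher stat, or loser with a lower stat: +1.
            # Winner with a lower stat, or loser with a higher stat: -1.
            scores[s] += 1 if higher == succeeded else -1
    return scores

def allocate(build, records, points=10):
    """Add one point at a time to the stat with the highest score."""
    build = dict(build)
    for _ in range(points):
        scores = score_stats(build, records)
        best = max(STATS, key=lambda s: scores[s])
        build[best] += 1
    return build
```

Since every row counts equally regardless of how far its stat is from yours, the score mostly tracks how many graduates sit above you in each stat.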
You can do spoilers with
>!although it's kind of finicky.
One problem with your strategy is that the stats follow a normal distribution. This means that there are more students with average stats. That's why your algo invests in the lowest stat: because that's the biggest delta in terms of students.
Thanks i updated my comment :)
I took a fairly black-box approach to this problem. Basically, we want a function f(str, dex, con, int, wis, cha) which outputs a chance of success, and then we want to optimize our selection so that we have the highest chance. The optimization part is easy because it's discrete; once we have a function, we can simply evaluate it at all of the possible inputs and select the best one.
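The "evaluate every possible input" step is small if each build is treated as an unordered split of the ten points. A minimal sketch, assuming a starting stat line of CHA 4, CON 14, DEX 13, INT 13, STR 6, WIS 12 (inferred from builds quoted elsewhere in the thread), an optional cap of 20 per stat (also an assumption), and a hypothetical scoring function `f` standing in for whatever model was trained:

```python
# Order used here: CHA, CON, DEX, INT, STR, WIS (assumed starting stats).
BASE = (4, 14, 13, 13, 6, 12)

def allocations(points, n_stats):
    """Yield every tuple of n_stats non-negative ints that sums to points."""
    if n_stats == 1:
        yield (points,)
        return
    for first in range(points + 1):
        for rest in allocations(points - first, n_stats - 1):
            yield (first,) + rest

def candidate_builds(base, points=10, cap=None):
    """Yield every reachable stat line, optionally dropping any above cap."""
    for alloc in allocations(points, len(base)):
        build = tuple(b + a for b, a in zip(base, alloc))
        if cap is None or all(v <= cap for v in build):
            yield build

def best_build(f, base, points=10, cap=None):
    """Return the candidate build with the highest score under f."""
    return max(candidate_builds(base, points, cap), key=f)

n_uncapped = sum(1 for _ in candidate_builds(BASE))
n_capped = sum(1 for _ in candidate_builds(BASE, cap=20))
```

There are only a few thousand candidates either way, so even a slow model can score all of them.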
I used a number of different ML models to estimate f, and I got pretty consistent brier scores on reserved test data of ~0.2, which isn't great, but isn't awful. I used scikit-learn, and used a MLPClassifier, LogisticRegression, GaussianNB, and RandomForestClassifier, along with CalibratedClassifierCV so that they had calibrated probability scores. Most of them I left on their defaults, but I played around with the layers in the MLPClassifier until it had a pretty good brier score.
Despite the fact that these models all had similar brier scores, they had surprisingly different recommendations. The Neural Net wanted to give small bumps to strength, wisdom, and charisma. Logistic Regression wanted to go all-in on wisdom, and putting any remaining points into charisma. Gaussian Naive Bayes wanted to put most of the points into charisma, but oddly, not all; it wanted to also sprinkle a few points into wisdom. The Random Forest Classifier wanted to bring strength and charisma up a little, but mostly sink points into wisdom, and occasionally scatter points into constitution or intelligence.
The top recommendation for each method is as follows:
Neural Net: 8, 14, 13, 13, 15, 9
Logistic Regression: 6, 14, 13, 13, 20, 6
Naive Bayes: 6, 14, 13, 13, 14, 12
Random Forest: 8, 14, 13, 13, 15, 9
I decided to test whether the negative correlation between DEX score and success rate is caused by the 60-point cutoff or if DEX really is counterproductive to success.
I bucketed the data by the sum of all trait scores except DEX and ran a linear regression for DEX score vs. binary success.
The high and low ends are obviously noisy due to small sample size, but the middle is pretty consistently neutral or slightly negative without any significant difference between low and high score totals.
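The bucketing step can be sketched as follows: group graduates by the sum of their other traits, then fit an ordinary least-squares slope of success (0 or 1) on the trait of interest within each group. The record layout and function names are assumptions made for illustration.

```python
from collections import defaultdict

def ols_slope(xs, ys):
    """Least-squares slope of y on x; None if x has no variation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def slopes_by_bucket(records, trait_index):
    """records: list of (stats_tuple, succeeded).

    Buckets rows by the sum of all traits except trait_index and returns
    {other_traits_total: slope of success on that trait}.
    """
    buckets = defaultdict(list)
    for stats, succeeded in records:
        rest = sum(stats) - stats[trait_index]
        buckets[rest].append((stats[trait_index], int(succeeded)))
    slopes = {}
    for rest, rows in sorted(buckets.items()):
        if len(rows) >= 2:
            slopes[rest] = ols_slope([x for x, _ in rows], [y for _, y in rows])
    return slopes
```

Holding the other traits' total fixed removes the selection effect of the 60-point cutoff, so a slope that stays negative across buckets would point to the trait itself.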
I took the average of the 50-65 range (to avoid the noise at the ends) and compared this to the same analysis for the other traits:
Made a quick neural network (reaching about 70% accuracy), and checked all available scores.
Its favorite result was: +2 Cha, +8 Wis. It would have liked +10 Wis if it were possible.
For at least the top few results, it wanted to (a) apportion as much to Wis as possible, then (b) as much to Cha, then (c) as much to Con. So we have for (Wis, Cha, Con): 1. (8, 2, 0) 2. (8, 1, 1) 3. (8, 0, 2) 4. (7, 3, 0) 5. (7, 2, 1) ...
STR +3, WIS +3, INT +0, CHA +2-4, evenly distribute among the rest. Pretty unsophisticated, and misses out on larger gains from adding many points to a stat in one go.
My first thought is to look for the lowest stat in each category which succeeded. I will probably want at least this. Unfortunately this is 2 in every case, so this doesn't help.
My second thought is to look for a patch in stat space where there are a disproportionately large number of successes, however of the stats I can access none has a meaningful number of adventurers particularly close to them.
My third idea is, for every possible set of stats we could choose look at the adventurers whose stats were strictly worse than or equal to those, and see which ones enclosed the highest proportion of successes. There are several with a 100 percent success rate, but none with more than 2 data points, which isn't much. There are however 2 with 6 datapoints and an 83 percent success rate, which seems better established:
str: 8 con: 14 dex: 13 int: 20 wis: 12 cha: 5
str: 8 con: 14 dex: 13 int: 19 wis: 13 cha: 5
Both seem roughly evenly balanced, and either seems to be a reasonable choice. I would go with the first purely on the intuition that if you are going to have one really strong stat, better to go all the way.
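The third idea amounts to a dominance check: for a candidate build, keep only the adventurers whose every stat is less than or equal to the build's, then look at how many there are and how often they succeeded. A sketch with an assumed record layout and invented toy data:

```python
def enclosed_success(build, records):
    """Return (count, success_rate) over rows dominated by build.

    records: list of (stats_tuple, succeeded) in the same stat order as
    build. A row is dominated if each of its stats is <= the build's.
    The rate is None when no row is dominated.
    """
    inside = [
        bool(succeeded)
        for stats, succeeded in records
        if all(theirs <= mine for theirs, mine in zip(stats, build))
    ]
    if not inside:
        return 0, None
    return len(inside), sum(inside) / len(inside)

# Toy rows, invented for illustration (order: str, con, dex, int, wis, cha).
toy_rows = [
    ((8, 14, 13, 19, 12, 5), True),
    ((7, 12, 13, 20, 11, 4), False),
    ((9, 14, 13, 20, 12, 5), True),   # str too high, not dominated
]
count, rate = enclosed_success((8, 14, 13, 20, 12, 5), toy_rows)
```

With only a handful of dominated rows per build, the rate is a very noisy estimate, as the comment notes.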
Fun! I wish I had a lot more time to spend on this, but here's a brief and simple basis for a decision:
Gonna go with wiseAndCharismatic (+8 Wisdom, +2 Charisma).
If I wasn't trying to not-spend-time-on-this, I would fit a Random Forest or a Neural Network (rather than a logistic regression) to capture some non-linear signal, and, when it predicted well, fire up an optimizer to see how much in which stats really helps.
The first thing I did was plot each trait's score against the success rate for all students with that score in that trait. All the graphs looked fairly linear, if noisy, but that seems reasonable for this size of dataset. I added a best-fit line in excel and got these values:
Trait Slope R^2
CHA 2.20 0.861
CON 1.72 0.760
DEX -1.52 0.868
INT 1.18 0.641
STR 1.69 0.734
WIS 2.33 0.908
DEX appears to have a negative correlation with success rate while WIS and CHA are most important.
Since I'm only able to increase my scores, it might be interesting to only look at students in the range I could reach by adding at most 10 points to my current trait scores. However, there are only 7 students in this group. (This seemed surprisingly small at first, but looking at this same group for other students gives an average size of 4.15 and a median size of 2.5, and 22.4% of students have a reachable group size of at least 7, so I'm actually above average here though not a major outlier.)
It seems like the best plan is to put all ten points into some combination of WIS and CHA. If allowed, I would also remove most of my DEX even if that didn't give me more points to spend elsewhere. WIS appears to have a slightly larger effect than CHA, but there are fewer students with very high scores, so it's hard to tell if the linear relationships hold at the extremes. I'm thinking somewhere between (WIS +8, CHA +2) and (WIS +6, CHA +4) will be my best bet.
In case the trait contributions are not independent, I tried filtering for both low-mid CHA and mid-high WIS, but this still showed fairly clean linear relationships for both traits with WIS still slightly stronger than CHA. There is a high outlier at exactly WIS=15, so I looked at the range WIS(12-20) and CHA(6-14) and got this:
WIS CHA N Win%
12 14 56 79%
13 13 54 72%
14 12 65 72%
15 11 75 79%
16 10 43 81%
17 9 34 76%
18 8 15 80%
19 7 10 80%
20 6 1 100%
There's still a bump at WIS(15-16), but it looks like this is probably an artifact of small sample size.
I did one last run filtering for low-mid on CHA and STR and mid-high on the other traits (N=172):
Here DEX is slightly positive and CHA is slightly better than WIS, although with more than an order of magnitude smaller sample size.
My final decision is (CHA +4, WIS +6) for a resulting stat line of:
I like to read blog posts by people who do real statistics, but with a problem in front of me I'm very much making stuff up. It's fun, though!
The approach I settled on was to estimate the success chance of a possible stat line by taking a weighted success rate over the data, weighted by how similar the hero's stats are to the stats being evaluated. My rationale is that based on intuitions about the domain I would not assume linearity or independence of stats' effects or such, but I would assume that heroes with similar stats would have similar success chances.
estimatedchance(stats) = sum(weightfactor(hero.stats, stats) * hero.succeeded) / sum(weightfactor(hero.stats, stats))
weightfactor(hero.stats, stats) = k ^ distance(hero.stats, stats)
(Assuming 0 < k < 1, and hero.succeeded is 1 if the hero succeeded and 0 otherwise)
I tried using both Euclidean and Manhattan distances, and various values for k as well. I also tried a hacky variant of Manhattan distance that added abs(sum(statsA) - sum(statsB)) to the result, but it didn't seem to change much.
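The estimator defined above translates almost directly into code. The sketch below is one possible rendering, with heroes assumed to be `(stats, succeeded)` pairs and the distance function passed in so that Euclidean and Manhattan can be swapped:

```python
import math

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def estimated_chance(stats, heroes, k=0.5, distance=manhattan):
    """Similarity-weighted success rate.

    Each hero gets weight k ** distance(hero_stats, stats), with
    0 < k < 1, so closer heroes count for more.
    heroes: list of (stats_tuple, succeeded) with succeeded as 0 or 1.
    """
    weighted_wins = 0.0
    total_weight = 0.0
    for hero_stats, succeeded in heroes:
        weight = k ** distance(hero_stats, stats)
        weighted_wins += weight * succeeded
        total_weight += weight
    return weighted_wins / total_weight
```

Smaller values of `k` concentrate the weight on near-identical heroes, while values close to 1 approach the plain success rate over the whole dataset.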
Lastly, I tried replacing (hero.succeeded) with (hero.succeeded - linearprediction(sum(hero.stats))) to try to isolate builds that do well relative to their stat total. linearprediction is a simple model I threw together by eyeballing the data: 40% chance to succeed with total stats of 60, 100% chance with total stats >= 95, linear in between. Could probably be improved with not too much effort, but I have to stop somewhere.
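The eyeballed baseline is a clamped straight line, which could be written as below. Treating totals under 60 as 40% is an assumption added here; the dataset has no such graduates, so it never matters in practice.

```python
def linear_prediction(total):
    """Baseline success chance from total stats.

    40% at a total of 60, rising linearly to 100% at 95 and above.
    Totals below 60 are clamped to 40% (assumption, not from the comment).
    """
    clamped = min(max(total, 60), 95)
    return 0.4 + 0.6 * (clamped - 60) / 35

def residual(succeeded, stats):
    """Outcome (0 or 1) minus the baseline for that hero's stat total."""
    return succeeded - linear_prediction(sum(stats))
```

Substituting `residual` for the raw outcome in a weighted average highlights builds that beat what their stat total alone would predict.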
I generally found two clusters of optima, one around (8, 14, 13, 13, 8, 16)—that is, +4 CHA, +2 STR, +4 WIS—and the other around (4, 16, 13, 14, 9, 16)—that is, +2 CON, +1 INT, +3 STR, +4 WIS. The latter was generally favored by low k values, as the heroes with stats closest to that value generally did quite well but those a little farther away got less impressive. So it could be a successful strategy that doesn't allow too much deviation, or just a fluke. Using the linear prediction didn't seem to change things much.
If I had to pick one final answer, it's probably (8, 14, 13, 13, 8, 16) (though there seems to be a fairly wide region of variants that tend to do pretty well—the rule seems to be 'some CHA, some WIS, and maybe a little STR'), but I find myself drawn towards the maybe-illusory (4, 16, 13, 14, 9, 16) niche solution.
ETA: Looks like I was iterating over an incomplete list of possible builds... but it turned out not to matter much.
ETA again (couldn't leave this alone): I tried computing log-likelihood scores for my predictors (restricting the 'training' set to the first half of the data and using only the second half for validation). I do find that with the right parameters some of my predictors do better than simple linear regression on sum of stats, and also better than the apparently-better predictor of simple linear regression on sum of non-dex stats. But they don't beat it by much. And it seems the better parameter values are the higher k values, meaning the (8, 14, 13, 13, 8, 16) cluster is probably the one to bet on.
CHA+4, STR+2, WIS+4
Just noting my answer without commentary: STR +2, CON +1, WIS +3, CHA +4
str +2 points to 8, con +1 point to 15, cha +4 points to 8, wis +3 points to 15, based on assuming that a) different stats have multiplicative effect (no other stat interactions) and b) that the effect of any stat is accurately represented by looking at the overall data in terms of just that stat and that c) the true distribution is exactly the data distribution with no random variation. I have not done anything to verify that these assumptions make sense.
dex looks like it actually has a harmful effect. I don't know whether the apparent effect is or is not too large to be explained by it helping bad candidates meet the college's apparent 60-point cutoff.
Was playing around with neural nets the last couple days, and when I came across this problem it immediately looked very nail-shaped to me. Probably isn't the most efficient tool for the job, but here's my approach: https://gist.github.com/Deccludor/d42a91712b427a45ff61aacfc02d0abe.
I trained a neural net to predict success based on ability scores, then ran a few different search algorithms to find the best possible use of 10 points. The ~60m permutations were slightly too many for me to search exhaustively, so I tried predicting on a random sample, using a greedy algorithm to add one point at a time, and another one that added 7 points at a time (maxing out the number of permutations I could fit into memory at a time).
The best-scoring distribution of stats I could find was CHA: 5, CON: 20, DEX: 13, INT: 13, STR: 6, WIS: 15. According to the calibration curve, that should have a roughly ~75% chance of success.