*This is an entry in the 'Dungeons & Data Science' series, a set of puzzles where players are given a dataset to analyze and an objective to pursue using information from that dataset.*

You are the Allocation Helm, a piece of magical headwear employed at Swineboils College of Spellcraft and Sorcery. Your purpose is to read the minds of incoming students, and use the information you glean to Allocate them between the school’s four Houses: Dragonslayer, Thought-Talon, Serpentyne and Humblescrumble.

You’ve . . . not been doing a terribly good job lately. You *were* impressively competent at assigning students when newly enchanted, but over the centuries your skill and judgement have steadily unraveled, to the point where your Allocations over the most recent decade have been completely random.

Houses have begun to lose their character, Ofspev[1] ratings have plummeted, and applications have declined precipitously. There is serious talk of Swineboils being shut down. Under these circumstances, the Headmistress has been moved to desperate action, and performed a Forbidden Ritual to temporarily restore your former brilliance.

This boost will only last you for one Allocation, so you intend to make it count. Using the records of past years’ readings and ratings, you hope to raise this class’ average score to match or exceed the glory of yore. (And if you do well enough, you might even be able to convince the Headmistress to keep performing rituals . . .)

There are twenty incoming students this year. You may place them however you wish. Who goes where?

I’ll post an interactive you can use to test your choices, along with an explanation of how I generated the dataset, sometime on Monday the 26th. I’m giving you nine days, but the task shouldn’t take more than an evening or two; use Excel, R, Python, Haruspicy, or whatever other tools you think are appropriate. Let me know in the comments if you have any questions about the scenario.

If you want to investigate collaboratively and/or call your decisions in advance, feel free to do so in the comments; however, please use spoiler tags or rot13 when sharing inferences/strategies/decisions, so people intending to fly solo can look for clarifications without being spoiled.

[1] The Oracle for Spellcaster Evaluations, who predicts a quantitative measure of the lifetime impact each student will have on the world shortly after they’re Allocated. (No-one knows how to make him predict anything else, or predict at any other time, or stop predicting, or be affected by the passage of time.)

First-order attempt:

I used scikit-learn to build several random-forest regressors mapping attributes + house to Ofspev rating, and verified that early on the Helm tended to allocate students to the house for which the regressor predicted the best rating, and that at the end it didn't. Then I Sorted ... excuse me, Allocated ... the students to the houses for which the regressors predicted the best rating. In cases where they disagreed I tried to eyeball the distributions and use my judgement :-).

Resulting allocation:

Serpentyne gets C, F, K. Humblescrumble gets E, I, L, M, P, Q, R, T. Dragonslayer gets D, G, H, N. Thought-Talon gets A, B, J, O, S.

Most of these results are pretty clear-cut in that every prediction for the winning house was better than any prediction for any other house. Notable exceptions were E (who might do well in Thought-Talon or maybe in Dragonslayer; certainly not Serpentyne, though) and Q (for whom all houses gave rather similar predictions).

With these allocations I cautiously predict the following Ofspev ratings: A 36..39, B 15..18, C 28..30, D 17..19, E 17..19, F 34..40, G 21..23, H 15..19, I 12..14, J 27..30, K 23..26, L 22..26, M 24..29, O 18..23, P 22..25, Q 30..32, R 40..44, S 29..32, T 28..34. These intervals are probably too narrow; they are determined by the range of variation among the 16 regressors I used, but the overall prediction errors for these regressors are wider than those ranges.

Possible reasons why this might suck:

I have made a cursory attempt to understand what my black boxes are doing

by feeding in all the 2^5 attribute-vectors where each one is either 10 (low) or 40 (high) and seeing what the predictions for each house look like. Crudely, it seems as if: students do well in Serpentyne when they have high Intellect and either Reflexes or Patience; in Humblescrumble when they have high Intellect and Integrity, with Patience serving as a partial stand-in for either; in Dragonslayer when they have high Courage and Reflexes; in Thought-Talon when they have high Intellect and Patience. Students high in all five attributes do exceptionally well in Dragonslayer and Thought-Talon; for Thought-Talon but not for Dragonslayer it's almost as good to be high in everything except Reflexes.
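The probing described above can be sketched roughly as follows. This is only an illustration: `toy_predict` is a made-up stand-in for whatever fitted regressor is being inspected, and its rule is not a claim about the real data.

```python
from itertools import product

ATTRIBUTES = ("Intellect", "Integrity", "Courage", "Reflexes", "Patience")
HOUSES = ("Dragonslayer", "Thought-Talon", "Serpentyne", "Humblescrumble")
LOW, HIGH = 10, 40  # the "low" and "high" probe values mentioned above


def probe_grid():
    """Return all 2^5 students whose attributes are each LOW or HIGH."""
    return [
        dict(zip(ATTRIBUTES, values))
        for values in product((LOW, HIGH), repeat=len(ATTRIBUTES))
    ]


def toy_predict(student, house):
    """Hypothetical stand-in for a fitted regressor (assumption, not real)."""
    if house == "Dragonslayer":
        # Toy rule only: reward having both Courage and Reflexes.
        return min(student["Courage"], student["Reflexes"])
    return sum(student.values()) / len(student)


def probe(predict):
    """Pair every corner student with its predicted rating in each house."""
    return [
        (student, {house: predict(student, house) for house in HOUSES})
        for student in probe_grid()
    ]


table = probe(toy_predict)
```

Swapping `toy_predict` for a real model's prediction function gives 32 rows that are small enough to read by eye.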

I think I could quantify those observations (and maybe a few second-order effects I didn't mention explicitly) and get an explicit model that would serve the Helm pretty well in practice, though I doubt it would outperform the brute-force random forests.

Further noodling around with ad hoc models suggests that

in at least some cases, some of the students' attributes are best thought of as having limits such that increases above the limits make no difference. Specifically, I played around a little with Serpentyne and it seems that we probably want to look at min(40,Intellect) and min(65,Reflexes) rather than using those values unaltered. The limits might well be different for different houses (analogy: intelligence is probably an advantage both for theoretical physicists and for taxi drivers, but most likely being 1-in-a-million smart rather than "just" 1-in-a-thousand is more beneficial for the theoretical physicists); so far this is just the result of idly looking at one particular house.
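As a small sketch of that capping idea: the two Serpentyne limits below are the ones guessed at above, while the function name and the example student are just illustrative.

```python
# Tentative Serpentyne limits from the paragraph above; other houses may differ.
SERPENTYNE_CAPS = {"Intellect": 40, "Reflexes": 65}


def apply_caps(student, caps):
    """Clip each attribute at its cap; attributes without a cap pass through."""
    return {
        name: min(value, caps.get(name, float("inf")))
        for name, value in student.items()
    }


# Made-up student, purely for illustration.
example = {"Intellect": 55, "Reflexes": 60, "Patience": 70}
capped = apply_caps(example, SERPENTYNE_CAPS)
```

The capped values would then be fed to whatever model is being fitted in place of the raw ones.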

Continuing with the principle "when in doubt, use brute force",

I did the same thing with gradient-boosted trees; these had somewhat more prediction error on each validation set (oh, I forgot to mention that each regressor was trained on 90% of the data and evaluated on the remaining 10%). And with SVMs using radial basis functions; these were comparable in accuracy to the random forests. (Note: There's much less diversity in my ensemble of SVMs, because the only difference between them is the training/validation split, whereas for RFs and GBTs there is randomness in the fitting process itself.)

Did this make a difference to my predictions or suggestions?

Not much; usually all three agreed; where they didn't, usually the SVM agreed with the RF. However, the SVM regressors fairly confidently want to put K in Dragonslayer, and the GBT ones less confidently agree. On the other hand, they predict less loss from putting K in the RFs' suggestion of Serpentyne than the RFs do from putting K in Dragonslayer, so it's a tough call. I'll switch to putting K in Dragonslayer. And for Q, the RFs and GBTs are fairly indifferent between all houses and slightly prefer Humblescrumble (for the RFs) and Thought-Talon (for the GBTs), but the SVMs think that Dragonslayer and Thought-Talon are much better than the other two, and give the nod to Dragonslayer. Looking at all their numbers, I'll move Q into Dragonslayer.

So my revised allocations are:

Serpentyne gets C, F. Humblescrumble gets E, I, L, M, P, R, T. Dragonslayer gets D, G, H, K, N, Q. Thought-Talon gets A, B, J, O, S.

And my revised predicted ratings with these allocations are:

A 36..39, B 15..18, C 28..30, D 17..20, E 17..19, F 33..40, G 21..24, H 15..20, I 12..15, J 24..30, K 23..26, L 21..26, M 24..29, N 19..21, O 18..23, P 22..25, Q 27..34, R 40..44, S 29..32, T 28..34.

It occurs to me that there is a possible source of bias in the approach I am taking:

perhaps not everyone gets into Swineboils, so that e.g. if we looked for correlations between the attributes we would get spurious negative correlations because people who are bad at everything don't get in. If such effects are strong, then our model will have bias if we apply it to the population at large. That's OK because we *aren't* applying it to the population at large, we're applying it to the students who got in ... but if e.g. Swineboils is less selective than it used to be because there are way fewer applications, then the bias will distort our predictions for this year's students. (This is the phenomenon sometimes called "Berkson's paradox".)

I don't expect this bias to be very large, but I have made no attempt to check that expectation against reality. Still less have I made any attempt to correct for it, and probably I won't.

I trained a boosting model on the whole dataset (minus the year column) that predicts the Ofspev score. Allocating a student is then basically just iterating through the four houses and picking the one with the maximum predicted score.
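In sketch form, the "try every house, keep the best" step looks something like this. The `toy_predict` function and the two students are invented placeholders; a real run would plug in the trained model's prediction instead.

```python
HOUSES = ("Dragonslayer", "Thought-Talon", "Serpentyne", "Humblescrumble")


def allocate(students, predict):
    """Give each student the house with the highest predicted rating."""
    return {
        name: max(HOUSES, key=lambda house: predict(stats, house))
        for name, stats in students.items()
    }


def toy_predict(stats, house):
    """Hypothetical scorer: each house just reads one attribute (assumption)."""
    favourite = {
        "Dragonslayer": "Courage",
        "Thought-Talon": "Intellect",
        "Serpentyne": "Reflexes",
        "Humblescrumble": "Integrity",
    }
    return stats[favourite[house]]


# Made-up students, not taken from the puzzle data.
students = {
    "X": {"Intellect": 50, "Integrity": 20, "Courage": 10, "Reflexes": 30, "Patience": 25},
    "Y": {"Intellect": 15, "Integrity": 20, "Courage": 60, "Reflexes": 30, "Patience": 25},
}
allocation = allocate(students, toy_predict)
```

Ties go to whichever house comes first in `HOUSES`, which is an arbitrary choice.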

As a sanity check of my model I sliced the dataset into a few parts to confirm that we (the Allocation Helm) got worse over time. This wasn't very rigorous, and spending more time would definitely have helped to work out how to mathematically define our degradation. But my testing generally confirmed the downwards trend.

In the end these are my allocations:

| Student | House |
|---|---|
| A | Thought-Talon |
| B | Humblescrumble |
| C | Serpentyne |
| D | Dragonslayer |
| E | Humblescrumble |
| F | Serpentyne |
| G | Dragonslayer |
| H | Dragonslayer |
| I | Humblescrumble |
| J | Thought-Talon |
| K | Dragonslayer |
| L | Humblescrumble |
| M | Humblescrumble |
| N | Dragonslayer |
| O | Thought-Talon |
| P | Humblescrumble |
| Q | Thought-Talon |
| R | Humblescrumble |
| S | Thought-Talon |
| T | Humblescrumble |

Current model of how your mistakes work:

Your mistakes have always taken the form of giving random answers to a random set of students. You did not e.g. get worse at solving difficult problems earlier, and then gradually lose the ability to solve easy problems as well.

The probability of you giving a random answer began at 10% in 1511. (You did not allocate perfectly even then). Starting in 1700, it began to increase linearly, until it reached 100% in 2000.

This logic is based on: student 37 strongly suggesting that you can make classification mistakes early, and even in obvious cases; and looking at '% of INT<10 students in Thought-Talon' and '% of COU<10 students in Dragonslayer' as relatively unambiguous mistakes we can track the frequency of.
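Written out as a function, the mistake model above looks roughly like this. The second helper adds an assumption of my own that is not in the model as stated: that a random answer is uniform over the four houses, so it still lands on the intended house a quarter of the time.

```python
def p_random(year):
    """Probability the Helm gives a random answer in a given year (1511 onward)."""
    if year <= 1700:
        return 0.10  # flat 10% from 1511 until the decline starts
    if year >= 2000:
        return 1.00  # fully random by 2000
    # Linear ramp from 10% in 1700 to 100% in 2000.
    return 0.10 + 0.90 * (year - 1700) / 300


def p_intended_house(year, n_houses=4):
    """Chance a student ends up where the Helm 'meant' to put them.

    Assumes random answers are uniform over the houses (my assumption).
    """
    p = p_random(year)
    return (1 - p) + p / n_houses
```

For example, `p_random(1850)` sits halfway up the ramp at about 0.55, which is the sort of figure one could compare against the observed rate of obvious misallocations.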

Tho presumably it could be the case that even if a student will be a poor fit for Thought-Talon, they would be an *even poorer* fit everywhere else?

Students may reach their potential in many ways, as long as they are not actively prevented.

Through sophisticated techniques (eyeballing), my own hat has recommended:

Dragonslayer [G, K, N]

Humblescrumble [A, E, R, T]

Humblescrumble? [L, M]

Serpentyne [C, F, H, O, S]

Serpentyne :( [B, D]

Serpentyne/Humblescrumble [Q]

Serpentyne? [P]

Thought-Talon [J]

Thought-Talon :( :( [I]

Otherwise known as:

~~Dragonslayer: [G, K, N]~~

~~Thought-Talon: [I, J]~~

~~Serpentyne: [B, C, D, F, H, O, P, Q, S]~~

~~Humblescrumble: [A, E, L, M, R, T]~~

(Completely revised in followup comment)

The Ofspev rating of someone sorted into Thought-Talon can be modeled as follows:

lower = 1/2 x min(Intellect, Patience)

upper = 3/2 x min(Intellect, Patience)

~triangular distribution with min=lower, max=upper, mode=30% of the way from lower to upper
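A quick sketch of that Thought-Talon guess in code, for anyone who wants to compute expected values from it. The function names are mine; the only outside fact used is that the mean of a triangular distribution is the average of its minimum, maximum and mode.

```python
import random


def thought_talon_bounds(intellect, patience):
    """Lower bound, upper bound and mode of the guessed rating distribution."""
    base = min(intellect, patience)
    lower = 0.5 * base
    upper = 1.5 * base
    mode = lower + 0.3 * (upper - lower)  # 30% of the way from lower to upper
    return lower, upper, mode


def expected_rating(intellect, patience):
    """Mean of the triangular distribution: (min + max + mode) / 3."""
    lower, upper, mode = thought_talon_bounds(intellect, patience)
    return (lower + upper + mode) / 3


def sample_rating(intellect, patience, rng=random):
    """Draw one simulated rating from the guessed distribution."""
    lower, upper, mode = thought_talon_bounds(intellect, patience)
    return rng.triangular(lower, upper, mode)
```

Under this guess the expected rating works out to 14/15 of min(Intellect, Patience), so maximizing EV for this house amounts to maximizing that minimum.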

Each other house can be modeled similarly. ...not that I fully *succeeded* at doing so. Just a guess. But sketching:

Serpentyne is between 3/4 x [min(Intellect, Reflexes, Patience) - 10] and max(Reflexes, Patience) - 5

Humblescrumble is between, uh,

max(max(8, 3/4 x (min(Integrity, Intellect) - 15)), 1/4 x (max(Integrity, Patience) + 5))

and

min(max(30, 3/4 x (min(Integrity, Intellect) + 15)), 3/4 x (max(Integrity, Patience) + 5))

which is definitely 100% accurate

Dragonslayer is between max(5/6 x min(everything) - 4, 3/2 x min(everything) - 20) and 3/2 x min(Courage, max(everything else)), and also this one doesn't yield something that looks triangular, so yeah, probably not that.

In any case, trying to maximize EV assuming those are right yields my new submission:

Dragonslayer [A, E, H]

Humblescrumble [D, I, R]

Serpentyne [G, L, M, N, P]

Thought-Talon [B, C, F, J, K, O, Q, S, T]

Edited to put my final answer at the top for ease of reading:

Thought-Talon: A, J, O, S

Serpentyne: C, F, P

Dragonslayer: D, G, H, K, N, Q

Humblescrumble: B, E, I, L, M, R, T

A starting approach and some basic analysis:

I'm going to approach this by trying to minimize the amount of unpredicted variance in the data.

Our initial prediction, without using the houses or stats at all, is to predict all students at the average rating of 25.9436. The residual has std 9.8649. Over the course of improving our model, we'll try to reduce this.

Additionally, we can calculate a correlation of this residual with year, getting -0.1711. This reflects that, while we don't yet know why, the earlier years performed better and the later ones worse (since in earlier years we were assigning them better). As we get better at predicting ourselves, this correlation should shrink - if it hits zero, that will suggest that we've figured out everything that we used to know at our height.
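For concreteness, the two diagnostics tracked here (residual std and the residual's correlation with year) can be computed along these lines. The six data points are invented just to exercise the functions, and using the population standard deviation is my assumption.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def diagnostics(ratings, predictions, years):
    """Return (std of residuals, correlation of residuals with year)."""
    residuals = [r - p for r, p in zip(ratings, predictions)]
    return pstdev(residuals), pearson(residuals, years)


# Toy numbers only (not from the dataset): ratings drift down over time.
years = [1511, 1600, 1700, 1800, 1900, 2000]
ratings = [30, 29, 27, 25, 23, 22]
baseline = [mean(ratings)] * len(ratings)  # predict the average for everyone
residual_std, year_corr = diagnostics(ratings, baseline, years)
```

Any improved model just supplies a different `predictions` list, and the same two numbers show whether it explains more variance and more of the time trend.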

First, most basic, improvement: we run a regression model to predict rating based on the five stats (ignoring house for now). This predicts score of:

-1.2239 + (0.2519 * Intellect) + (0.1314 * Integrity) + (0.1441 * Courage) + (0.1307 * Reflexes) + (0.1765 * Patience).

We expect this to reduce std of residual, but to somewhat increase the correlation of residual with year, and indeed: unexplained std drops to 7.6427 while the negative correlation with year strengthens to -0.2211.

Next, also pretty basic, improvement: we run that regression model separately for each house.

Serpentyne regression: -1.2733 + (0.3256 * Intellect) + (-0.0120 * Integrity) + (-0.0001 * Courage) + (0.2324 * Reflexes) + (0.2284 * Patience)

Humblescrumble regression: 4.4478 + (0.1882 * Intellect) + (0.2691 * Integrity) + (-0.0029 * Courage) + (0.0013 * Reflexes) + (0.1111 * Patience)

Dragonslayer regression: -6.2998 + (0.1341 * Intellect) + (0.1193 * Integrity) + (0.3397 * Courage) + (0.2300 * Reflexes) + (0.1056 * Patience)

Thought-Talon regression: -7.3507 + (0.3684 * Intellect) + (0.1400 * Integrity) + (0.1155 * Courage) + (-0.0555 * Reflexes) + (0.3643 * Patience)

So Serpentyne and Thought-Talon are using Intellect more, Dragonslayer is using Courage more, Humblescrumble is giving a higher base number regardless of stats (accepting everyone Hufflepuff-style?) while caring more about Integrity. Also, a few terms are very close to 0, suggesting that e.g. Serpentyne and Humblescrumble do not care at all about Courage. This leaves us with a residual std of 6.5525, and still has a negative correlation with year of -0.1259, suggesting that there are still many more things to be found.
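To make the per-house regressions easy to apply, here is a sketch that plugs in the coefficients listed above and picks the house with the highest predicted rating. The example student is made up.

```python
ATTRIBUTES = ("Intellect", "Integrity", "Courage", "Reflexes", "Patience")

# (intercept, coefficients in ATTRIBUTES order), copied from the fits above.
HOUSE_MODELS = {
    "Serpentyne": (-1.2733, (0.3256, -0.0120, -0.0001, 0.2324, 0.2284)),
    "Humblescrumble": (4.4478, (0.1882, 0.2691, -0.0029, 0.0013, 0.1111)),
    "Dragonslayer": (-6.2998, (0.1341, 0.1193, 0.3397, 0.2300, 0.1056)),
    "Thought-Talon": (-7.3507, (0.3684, 0.1400, 0.1155, -0.0555, 0.3643)),
}


def predict(stats, house):
    """Linear prediction of the rating for one student in one house."""
    intercept, coefficients = HOUSE_MODELS[house]
    return intercept + sum(c * stats[name] for c, name in zip(coefficients, ATTRIBUTES))


def best_house(stats):
    """House with the highest predicted rating under the linear fits."""
    return max(HOUSE_MODELS, key=lambda house: predict(stats, house))


# Made-up student with high Courage and Reflexes.
brave = {"Intellect": 20, "Integrity": 20, "Courage": 60, "Reflexes": 50, "Patience": 20}
choice = best_house(brave)
```

This only reproduces the preliminary linear picture, so it inherits the same blind spots (no interactions, no caps).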

Still, in case I get too busy to continue, the preliminary regression gives the following house allocations:

Serpentyne

['C', 'F', 'I', 'K', 'L', 'O', 'P', 'S']

Humblescrumble

['B', 'E', 'Q', 'R', 'T']

Dragonslayer

['D', 'G', 'H', 'N']

Thought-Talon

['A', 'J', 'M']

My next step of investigation is going to be stat interaction effects. Does Integrity affect people with low Intellect more (since they might have more temptation to cheat?) Do Reflexes matter more for people with high Courage (who would be more likely to put themselves in dangerous situations where Courage is needed?)

Had trouble making further progress using that method, realized I was being silly about this and there was a much easier starting solution:

Rather than trying to figure out anything whatsoever about scores, we're trying for now just to mimic what we did in the past.

Define a metric of 'distance' between two people equal to the sum of the absolute values of the differences between their stats.

To evaluate a person:

*these numbers may be varied to optimize. For example, moving the year threshold earlier makes you more certain that the students you find were correctly sorted...at the expense of making them be selected from a smaller population and so be further away from the person you're evaluating. I may twiddle these numbers in future and see if I can do better.
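The general shape of this approach can be sketched as a nearest-neighbour vote. The exact steps and numbers are not reproduced here, so everything tunable below (`year_cutoff`, `k`, the `Year` and `House` field names, the toy records) is a placeholder assumption rather than the values actually used.

```python
from collections import Counter

ATTRIBUTES = ("Intellect", "Integrity", "Courage", "Reflexes", "Patience")


def distance(a, b):
    """Sum of absolute differences between two students' stats."""
    return sum(abs(a[name] - b[name]) for name in ATTRIBUTES)


def mimic_helm(student, records, year_cutoff=1700, k=3):
    """Vote among the k closest students sorted before year_cutoff."""
    early = [r for r in records if r["Year"] < year_cutoff]
    nearest = sorted(early, key=lambda r: distance(student, r))[:k]
    votes = Counter(r["House"] for r in nearest)
    return votes.most_common(1)[0][0], votes


def make(year, house, *stats):
    """Build a toy record; stats follow ATTRIBUTES order."""
    return {"Year": year, "House": house, **dict(zip(ATTRIBUTES, stats))}


# Invented records, purely to show the mechanics.
records = [
    make(1520, "Humblescrumble", 10, 60, 20, 20, 30),
    make(1600, "Humblescrumble", 12, 58, 25, 18, 28),
    make(1650, "Thought-Talon", 60, 20, 15, 20, 55),
    make(1990, "Dragonslayer", 8, 61, 20, 20, 30),  # too late, ignored
    make(1550, "Dragonslayer", 20, 20, 60, 55, 15),
]
newcomer = dict(zip(ATTRIBUTES, (7, 61, 22, 19, 29)))
house, votes = mimic_helm(newcomer, records)
```

Returning the vote counts alongside the winner makes it easy to flag the close splits separately from the clear-cut cases.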

We can test this algorithm by trying it on the students from 1511 (and using students from 1512-1699 to find close matches). When we do this:

*very dramatically* different. For example, student 37 had Intellect 7 and Integrity 61. All students with stats even vaguely near that were sorted into Humblescrumble, which makes sense given that house's focus on Integrity. However, Student 37 was sorted into Thought-Talon, which seems *very odd* given their extremely low Intellect.

Sadly this method provides no insight whatsoever into the underlying world. We're copying what we did in the past, but we're not actually learning anything. I still think it's better than any explicit model I've built so far.

This gives the following current allocations for our students (still subject to future meddling):

Thought-Talon: A, J, O, S

Serpentyne: C, F*

Dragonslayer: D, H, G*, K*, N*, Q*

Humblescrumble: B*, E*, I, L, M*, P*, R, T

where entries marked with a * are those where the nearby students were a somewhat close split, while those without are those where the nearby students were clearly almost all in the same house.

And some questions for the GM based on something I ran into doing this (if you think these are questions you're not comfortable answering that's fine, but if they were meant to be clear one way or the other from the prompt please let me know):

The problem statement says we were 'impressively competent' at assigning students when first enchanted.

~~Fated~~? Protagonist-hood?) that we can no longer perceive, and sorted students based on that?

Robustness analysis: seeing how the above changes when we tweak various aspects of the algorithm.

I'm not certain whether this will end up changing my views, but K in particular looks very close between Dragonslayer and Serpentyne, and P plausibly better in Serpentyne.

According to my models

B indeed belongs in Th rather than Hu, but it's close and not very clear. I belongs in Hu rather than Se according to all my models, but it's close. My models disagree with one another about K, some preferring Dr narrowly and fewer preferring Se less narrowly. Most of my models put P in Hu not Se, and the ones that put it in Se are ones with larger errors. My models disagree with one another about F, preferring Se or Th and not expecting much difference between those.

(aphyer, I don't know whether you would prefer me not to say such things in case you are tempted to read them. I will desist if you prefer. The approaches we're taking are sufficiently different that I don't think there is much actual harm in reading about one another's results.)

No objection to you commenting. The main risk on my end is that my fundamental contrariness will lead me to disagree with you wherever possible, so if you do end up being right about everything you can lure me into being wrong just to disagree with you.

P is a very odd statblock, with huge Patience and incredibly low Courage and Integrity. (P-eter Pettigrew?) I might trust your models more than my approach on students like B, who have middle-of-the-road stats but happen to be sitting near a house boundary. I'm less sure how much I trust your models on extreme cases like P, and think there might be more benefit there to an approach that just looks at a dozen or so students with similar statblocks rather than trying to extrapolate a model out to those far values.

Based on poking at the score figures, I think I'm currently going to move student P from Humblescrumble to Serpentyne but not touch the other ambiguous ones:

Thought-Talon: A, J, O, S

Serpentyne: C, F, P

Dragonslayer: D, G, H, K, N, Q

Humblescrumble: B, E, I, L, M, R, T

You haven't sorted student G.

I remark that (note: not much spoilage here, but a little)

your allocations are very similar to mine, even though my approach was quite different; maybe this kinda-validates what both of us are doing. Ignoring the missing student G, I think we disagree only about B, and neither of us was very sure about B.

Good catch, fixed.

With that fix

student B is indeed the only one we (both unconfidently) disagree on.

Seems like the "year" column is missing(?) from the records

Good catch; fixed now; thank you.

A solution by method of "Thrash with linear regression, then get bored". I also make the (completely unsubstantiated) claim that an even split of students across houses will lead to better results.

Humblescrumble gets A,B,E,R and T.

Dragonslayer gets D,G,H,K and N.

Thought-Talon gets C,F,L*,M and Q*.

Serpentyne gets I*,J*,O,P and S*.

(Students marked * get a slightly better linear score in another House, but I balance the sizes)

My entry just before the deadline:

Dragonslayer: D,G,H,N,Q

Humblescrumble: E,I,L,M,R,T

Serpentyne: C,F,K,P

Thought-Talon: A,B,J,O,S

Compared with gjm, I disagree (unconfidently) on K and P

Compared with aphyer, I disagree (unconfidently) on B and K

Compared with Thomas Sepulchre, I disagree (unconfidently) on P only, agreeing with everything else.

(note that on my reading of aphyer and gjm's entries, they disagree on B and P, despite them saying they only disagree on B)

I used ad-hoc local methods which ultimately do not provide much insight, unfortunately.

I disagree only about B with *one version* of aphyer's allocations. It is possible that that was out of date at the point when I said "we disagree only about B" but I'm not sure. Anyway, yes, now we do disagree with one another about P as well.

Out of curiosity, can you, if you don't mind, describe what methods you used?

methods:

I took the 100 nearest (Euclidean distance in stat-space) students from each house and did linear regression to predict the value for the student in question, then arbitrarily changed my answers based on e.g. residuals of the nearest points or too low density near the point in question, and then did the same for the 20 nearest for certain of the incoming students (which I had noted to be questionable in some way or another, or which disagreed with aphyer or gjm), and in the end I may have decided some of the more ambiguous stuff based on too low local density of some houses, which may explain why my results are so similar to yours (I did not check your results until after arriving at mine).

edit: actually this did provide some insight, in terms of seeing how the regression coefficients change locally (e.g. often the lowest house-relevant stat is most relevant), and I did try a bit to come up with global formulas (like GuySrinivasan's) but I didn't get far with that.

Some traits definitely go better with some houses, however I couldn't see much in the way of clear cut rules. I constructed the following highly provisional allocation by considering students that were sorted when the helm was still reasonably reliable, and then combining the probabilities of a student with each of the 5 ratings being sorted into each house, and selecting the one which on balance seemed most likely.

A Dragonslayer

B Thought-Talon

C Serpentyne

D Dragonslayer

E Humblescrumble

F Serpentyne

G Dragonslayer

H Humblescrumble

I Humblescrumble

J Thought-Talon

K Dragonslayer

L Dragonslayer

M Dragonslayer

N Dragonslayer

O Serpentyne

P Thought-Talon

Q Dragonslayer

R Humblescrumble

S Serpentyne

T Humblescrumble

A few observations

Looking at the moving average of the Ofspev rating, it seems the helm slowly stopped providing a good allocation starting around 10,000 students. This opens the opportunity for a blackbox approach, where one could simply train a model to replicate the performance of the initial helm, without any gear-level understanding. This might prove useful if the gear-level understanding is really complicated, but this might also limit our result, especially if the original helm was good, but far from perfect.

Looking at the number of students each year, it indeed decreases in the last few decades, which must be a consequence of the fact that *applications have declined precipitously.*

The individual skills of students don't seem to decrease over time, so, despite the uninterrupted whining of archmages in the newspapers, the lower Ofspev rating is not explained by the alleged "laziness" of this "spoiled generation".

So, I did precisely that. I trained a classifier on the first 7500 students to mimic the behavior of the original helm.

My predictions:

Serpentyne: C,F,K

Dragonslayer: D,G,H,N,Q

Humblescrumble: E,I,L,M,P,R,T

Thought-Talon: A,B,J,O,S

I haven't looked at the data but some quick meta thoughts:

The effect of the houses on Ofspev can be learned from it. It is an unintended RCT.

The new student data already have house entries. Is that a mistake?

It was, though fortunately that was just the random Houses they would have been Allocated to, and as such provides no meaningful information. Still, I've updated the file to not have that column; thank you.

Is the goal (1) to allocate these new students to the Houses they would have been put in by the Helm at the peak of its abilities, or (2) to allocate these new students in whatever manner maximizes their Ofspev[1] scores? Or are we to understand that these are more or less the same thing?

[1] Obviously this *really* stands for the Office for Standards in Potter-Evans-Verres.

(2)