Stats Advice on a New N-back Game

Antisuji

Cross-posted to my blog. I expect this will be of some interest to the LessWrong community both because of previous interest in N-back and because of the opportunity to apply Bayesian statistics to a real-world problem. The main reason I'm writing this article is to get feedback on my approach and to ask for help in the areas where I'm stuck. For some background, I'm a software developer who's been working in games for 7+ years and recently left my corporate job to work on this project full-time.

As I mentioned here and here, since early February I've been working on an N-back-like mobile game. I plan to release for iOS this summer and for Android a few months later if all goes well. I have fully implemented the core gameplay and most of the visual styling and UI, and am currently working with a composer on the sound and music.

I am just now starting on the final component of the game: an adaptive mode that assesses the player's skill and presents challenges that are tuned to induce a state of flow.

The Problem

The game is broken down into waves, each of which presents an N-back-like task with certain parameters, such as the number of attributes, the number of variants in each attribute, the tempo, and so on. I would like to find a way to collapse these parameters into a single difficulty parameter that I can compare against a player's skill level to predict their performance on a given wave.

But I realize that some players will be better at some challenges than others (e.g. memory, matching multiple attributes, handling fast tempos, dealing with visual distractions like rotation, or recognizing letters). Skill and difficulty are multidimensional quantities, and this makes performance hard to predict. The question is, is there a single-parameter approximation that delivers an adequate experience? Additionally, the task is not pure N-back — I've made it more game-like — and as a result the relationship between the game parameters and the overall difficulty is not as straightforward as it would be in a cleaner environment (e.g. difficulty might be nearly linear in tempo for some set-ups but highly non-linear for others).

I have the luxury of having access to fairly rich behavioral data. The game is partly a rhythm game, so not only do I know whether a match has been made correctly (or a non-match correctly skipped) but I also know the timing of a player's positive responses. A player with higher skill should have smaller timing errors, so a well-timed match is evidence for higher skill. I am still unsure exactly how I can use this information optimally.

I plan to display a plot of player skill over time, but this opens another set of questions. What exactly am I plotting? How do I model player skill over time (just a time-weighted average? as a series of slopes and plateaus? how should I expect skill to change over a period of time without any play?)? How much variation in performance is due to fatigue, attention, caffeine, etc.? Do I show error bars or box plots? What units do I use?

And finally, how do I turn a difficulty and a skill level into a prediction of performance? What is the model of the player playing the game?

Main Questions

Is there an adequate difficulty parameter and if so how do I calculate it?
Can I use timing data to improve predictions? How?
What model do I use for player skill changing over time?
How do I communicate performance stats to the user? Box and whiskers? Units?
What is the model of the player and how do I turn that into a prediction?

My Approach

I've read Sivia, so I have some theoretical background on how to solve this kind of problem, but only limited real-world experience. These are my thoughts so far.

Modeling gameplay performance as Bernoulli trials seems ok. That is, given a skill level S and a difficulty D, performance on a set of N matches should be closely matched by N Bernoulli trials with probability of success p(S, D) as follows:

if S ≪ D, p = 0.5
if S ≫ D, p is close to 1.0 (how close?)
if S = D, p = 0.9 feels about right
etc.

Then I can update S (and maybe D? see next paragraph) on actual player performance. This will result in a new probability density function over the "true" value of S, which will hopefully be unimodal and narrow enough to report as a single best estimate (possibly with error bars). Which reminds me, what do I use as a prior for S? And what happens if the player just stops playing halfway through, or hands the game to their 5-year-old?

Determining difficulty is another hard problem. I currently have a complicated ad-hoc formula that I cobbled together with logarithms, exponentials, and magic numbers, and lots of trial and error. It seems to work pretty well for the limited set of levels I've tested with a small group of playtesters, but I'm worried that it won't predict difficulty well outside of that domain. One possibility is to croud-source it: after release I'd collect performance data across all users and update the difficulty ratings on the fly. This seems risky and difficult, and the initial difficulty ratings might be way off, which would lead to poor initial user experiences with the adaptive mode. I would also have to worry about maintaining a server back-end to gather the data and report on updated difficulty levels.

Request For Feedback

So, any suggestions on how to tackle these problems? Or the first place to start looking?

I'm pretty excited about the potential to collect real-world data on skill acquisition over time. If there is sufficient interest I'll consider making the raw data public, and even instrument the code to collect other data of interest, by request. I do have some concerns over data privacy, so I may allow users to opt out of sending their data up to the server.

I think you're overthinking it. In all the n-back research I've read, hardly anyone seems to've cared about how exactly the adaptiveness was implemented...

The game is broken down into waves, each of which presents an N-back-like task with certain parameters, such as the number of attributes, the number of variants in each attribute, the tempo, and so on. I would like to find a way to collapse these parameters into a single difficulty parameter that I can compare against a player's skill level to predict their performance on a given wave.

Play a bunch of level with random settings for each parameter, then regress the scores on all the parameters. There, now you have a predictor.

But I realize that some players will be better at some challenges than others (e.g. memory, matching multiple attributes, handling fast tempos, dealing with visual distractions like rotation, or recognizing letters). Skill and difficulty are multidimensional quantities, and this makes performance hard to predict. The question is, is there a single-parameter approximation that delivers an adequate experience?

OK fine, start with a pre-specified linear model or whatever, and then update it periodically with the user's data. As the user's data builds up, the regression picks up their particular strengths/weaknesses.

A player with higher skill should have smaller timing errors, so a well-timed match is evidence for higher skill. I am still unsure exactly how I can use this information optimally.

Take the absolute value of the 'true' time minus when the user actually responded. There, you have an 'timing error' value which can be fed into the regression along with the others.

How do I model player skill over time (just a time-weighted average? as a series of slopes and plateaus?

My impression from the DNB score series I've seen and discussions is that n-back exhibits the usual logarithmic-looking graph: fast initial improvements then a plateau. So, a log model.

how should I expect skill to change over a period of time without any play?)?

Very slowly degrade. Some of the n-back studies have done 6-month followups and the like, and the degradation doesn't seem very large.

I do have some concerns over data privacy, so I may allow users to opt out of sending their data up to the server.

You should definitely provide an opt-out.

Determining difficulty is another hard problem. I currently have a complicated ad-hoc formula that I cobbled together with logarithms, exponentials, and magic numbers, and lots of trial and error. It seems to work pretty well for the limited set of levels I've tested with a small group of playtesters, but I'm worried that it won't predict difficulty well outside of that domain.

Why didn't any simpler approaches work?

Thanks, gwern, this is extremely helpful.

I think you're overthinking it.

This is quite likely. :)

In all the n-back research I've read, hardly anyone seems to've cared about how exactly the adaptiveness was implemented...

The adaptiveness (in Brain Workshop, which is the only implementation I've spent much time with) feels pretty horrible. At least compared to a typical modern well-balanced game.

Play a bunch of level with random settings for each parameter, then regress the scores on all the parameters. There, now you have a predictor.

You're talking about least squares, or some modification of least squares that deals with outliers, right? Doesn't this assume linearity? Or I suppose I'd have to model, say, the effect of increasing the N-back as aN^b (or some other suitable formula) and plug in values for a and b as parameters? But that means I have to take some time choosing a good model.

Also, it's important for users to have a good first experience, so I'd like to require as little personalized training data as possible (I will have some data available, since the adaptive mode won't be unlocked until clearing some number of earlier baked-in stages). Having some notion of a standard difficulty-with-respect-to-an-idealized-player should help with this.

To be fair, I haven't actually tried running any kind of regression like this yet, and this approach pretty clearly seems worth trying.

So, a log model.

Log in the number of plays, I assume. This sounds pretty reasonable, but I expect won't take into account some of the more interesting features of each player's skill trajectory, especially early on. Of course this could be based on my misinterpretations of personal experiences with learning, so I could easily be wrong. It would certainly make things simpler!

Why didn't any simpler approaches work?

If you mean simple like least squares regression, mostly because I didn't try it. If you mean a simple formula, I wouldn't be surprised if none exists (the formula I'm using is actually not quite as complicated as I made it sound). Anyway, my understanding is that even if I use regression on individual user data I'd need to use a pretty complex model with lots of parameters to make it work. Is this not true?

Also, is the Bernoulli trials stuff reasonable, or am I making things too complicated with that too?

The adaptiveness (in Brain Workshop, which is the only implementation I've spent much time with) feels pretty horrible. At least compared to a typical modern well-balanced game.

N-back simply isn't a fun game in the first place, so I don't know how much the adaptiveness is to blame.

You're talking about least squares, or some modification of least squares that deals with outliers, right? Doesn't this assume linearity? Or I suppose I'd have to model, say, the effect of increasing the N-back as aN^b (or some other suitable formula) and plug in values for a and b as parameters? But that means I have to take some time choosing a good model.

Yes, least squares requires a lot of assumptions to be provably optimal. On the other hand, it works all the time. Stupid simple approaches do that pretty frequently.

Anyway, my understanding is that even if I use regression on individual user data I'd need to use a pretty complex model with lots of parameters to make it work. Is this not true?

I don't see why it would be, necessarily. What sort of complex model did you have in mind?

Also, is the Bernoulli trials stuff reasonable, or am I making things too complicated with that too?

I'm not sure what the binomial stuff is gaining you over a simple %-correct number. If the user gets 2 out of 10 matches right, then the max-likelihood estimate of the underlying probability under a binomial model is going to be... 0.2. You bring in the binomial/Bernoulli stuff when you want to do something more complex.

Play a bunch of level with random settings for each parameter, then regress the scores on all the parameters. There, now you have a predictor.

Make sure to do this with an experienced player, to get on that plateau.

You might need a different set of parameters for newbies. Of course, measuring for them is complicated by the rapid change in their abilities.

tuned to induce a state of flow.

Are you sure you don't mean deliberate practice?

Unlikely; deliberate practice can be quite unpleasant.

Oh right, this is a game. I was thinking he was optimizing for learning speed, not enjoyability.

Well, if there's a line between fun enough to play and unpleasant enough not to, learning speed is almost certainly higher for playing the game relative to nothing.

I think you're overthinking it. In all the n-back research I've read, hardly anyone seems to've cared about how exactly the adaptiveness was implemented...

The game is broken down into waves, each of which presents an N-back-like task with certain parameters, such as the number of attributes, the number of variants in each attribute, the tempo, and so on. I would like to find a way to collapse these parameters into a single difficulty parameter that I can compare against a player's skill level to predict their performance on a given wave.

Play a bunch of level with random settings for each parameter, then regress the scores on all the parameters. There, now you have a predictor.

But I realize that some players will be better at some challenges than others (e.g. memory, matching multiple attributes, handling fast tempos, dealing with visual distractions like rotation, or recognizing letters). Skill and difficulty are multidimensional quantities, and this makes performance hard to predict. The question is, is there a single-parameter approximation that delivers an adequate experience?

A player with higher skill should have smaller timing errors, so a well-timed match is evidence for higher skill. I am still unsure exactly how I can use this information optimally.

Take the absolute value of the 'true' time minus when the user actually responded. There, you have an 'timing error' value which can be fed into the regression along with the others.

How do I model player skill over time (just a time-weighted average? as a series of slopes and plateaus?

My impression from the DNB score series I've seen and discussions is that n-back exhibits the usual logarithmic-looking graph: fast initial improvements then a plateau. So, a log model.

how should I expect skill to change over a period of time without any play?)?

Very slowly degrade. Some of the n-back studies have done 6-month followups and the like, and the degradation doesn't seem very large.

I do have some concerns over data privacy, so I may allow users to opt out of sending their data up to the server.

You should definitely provide an opt-out.

Determining difficulty is another hard problem. I currently have a complicated ad-hoc formula that I cobbled together with logarithms, exponentials, and magic numbers, and lots of trial and error. It seems to work pretty well for the limited set of levels I've tested with a small group of playtesters, but I'm worried that it won't predict difficulty well outside of that domain.

Why didn't any simpler approaches work?

Thanks, gwern, this is extremely helpful.

I think you're overthinking it.

This is quite likely. :)

In all the n-back research I've read, hardly anyone seems to've cared about how exactly the adaptiveness was implemented...

The adaptiveness (in Brain Workshop, which is the only implementation I've spent much time with) feels pretty horrible. At least compared to a typical modern well-balanced game.

Play a bunch of level with random settings for each parameter, then regress the scores on all the parameters. There, now you have a predictor.

To be fair, I haven't actually tried running any kind of regression like this yet, and this approach pretty clearly seems worth trying.

So, a log model.

Why didn't any simpler approaches work?

Also, is the Bernoulli trials stuff reasonable, or am I making things too complicated with that too?

The adaptiveness (in Brain Workshop, which is the only implementation I've spent much time with) feels pretty horrible. At least compared to a typical modern well-balanced game.

N-back simply isn't a fun game in the first place, so I don't know how much the adaptiveness is to blame.

You're talking about least squares, or some modification of least squares that deals with outliers, right? Doesn't this assume linearity? Or I suppose I'd have to model, say, the effect of increasing the N-back as aN^b (or some other suitable formula) and plug in values for a and b as parameters? But that means I have to take some time choosing a good model.

Yes, least squares requires a lot of assumptions to be provably optimal. On the other hand, it works all the time. Stupid simple approaches do that pretty frequently.

Anyway, my understanding is that even if I use regression on individual user data I'd need to use a pretty complex model with lots of parameters to make it work. Is this not true?

I don't see why it would be, necessarily. What sort of complex model did you have in mind?

Also, is the Bernoulli trials stuff reasonable, or am I making things too complicated with that too?

Play a bunch of level with random settings for each parameter, then regress the scores on all the parameters. There, now you have a predictor.

Make sure to do this with an experienced player, to get on that plateau.

You might need a different set of parameters for newbies. Of course, measuring for them is complicated by the rapid change in their abilities.

tuned to induce a state of flow.

Are you sure you don't mean deliberate practice?

Unlikely; deliberate practice can be quite unpleasant.

Oh right, this is a game. I was thinking he was optimizing for learning speed, not enjoyability.

Well, if there's a line between fun enough to play and unpleasant enough not to, learning speed is almost certainly higher for playing the game relative to nothing.