User-inclination-guessing algorithms: registering a goal

ProgramCrafter

Reading post about Community Notes made me recall a recent idea and post it here.

It's commonly known that ranking algorithms based on estimating user's views and inclinations can be incredibly useful in recommendation systems, advertising, and content curation. But there must be other applications which do not require to create one's own social network, for instance!

There actually is a task which I believe to be a low-hanging fruit: predicting participant's score in olympiads; preferably with dataset available in the Internet, large scoring range and various tasks - that is, in informatics.

My thoughts about the prediction system

Olympiad tasks in informatics tend to use a lot of ideas and one or two algorithms, which makes them different both in topic and in difficulty. This may be represented by a vector in a high-dimensional space (based on task tags list on Codeforces, there might be at least 24 dimensions), where projection on each axis means usage of knowledge from the corresponding topic.

Let's make user embeddings also vectors in 24-dimensional space, and measure compatibility between a person and a task as projection of person embedding onto task vector, scaled down corresponding to the latter. If is person's vector, $t_{j}$ - task vector, then $c o m p a t_{i, j} = \frac{p_{i} \cdot t_{j}}{t_{j}^{2}}$ .

I want score prediction to be straightforward, so I map compatibility from $[0; 1]$ range to $[0; 100]$ expected score, and everything outside that range to the closest bound.

I'm currently struggling how to fill variance, and guess that a good idea is to use one of embeddings' components for this. Competitions have an element of unpredictability, and I want to be able to distinguish predictions "participant will get 40-60 points" from "participant will get 20-80 points". And, while system is training, some variance is to be expected after all.

Calculation options

I'd be glad to go forward with Bayesian updates, but unfortunately the problem space is $R^{24 (p e o p l e + t a s k s)}$ , and pieces of evidence are not independent as well; so it will be easier to use usual learning algorithms from neural networks.

I've also found out that calculating scores will be quite easy there! Let's demonstrate this on 4-dimensional embeddings, 3 tasks and 2 people.

>>> import torch
>>> tsk = torch.tensor([[1,0,0],[0,0,1],[0,3,0],[0,4,1]])
>>> tsk
tensor([[1, 0, 0],
        [0, 0, 1],
        [0, 3, 0],
        [0, 4, 1]])
>>> ppl = torch.tensor([[1,1],[1,0],[1,0],[1,5]])
>>> ppl
tensor([[1, 1],
        [1, 0],
        [1, 0],
        [1, 5]])
>>> (ppl.T @ tsk) * tsk.square().sum(0).reciprocal()
tensor([[1.0000, 0.2800, 1.0000],
        [1.0000, 0.8000, 2.5000]])

So, for instance, we see that second person is well (but not exactly, only 80%) capable of solving second task.

Goal and call for help

I expect to create a somewhat working prediction system and I'm going to describe it in a separate post sometime this year! Bets whether I'll actually do this are on:

I'd like to hear if you have ideas on ML algorithm to use, or on domain to test the system next time. Criticism (except that such recommendation algorithms in social networks might be supposedly harmful) is also welcome!