Screencasts could be scalable data + evals for single-user emulation (Guardian Angels)

sophia_xu

This is a response to Gwern's Guardian Angels post and draws terms from it quite extensively.

Epistemic status: draft, to be updated once I do more experiments.

Suppose you have a Guardian Angel (GA) model and you want to know how well it predicts your knowledge/personality/values/preferences.

However, you don't have an extensive list of personal writings to personalize the model (because you're not one of the most prominent internet writers), and you don't want to do a lot of data labeling/supervision either. How would you train/evaluate the model then?

Motivating examples

We give the GA model a held-out replay of your Twitter-browsing session containing 10 tweets (either as text captions or as a stitched-together screenshot). Can the model predict which ones you'll actually click on?
- Getting this right depends on a lot of personalization! At least for me, I don't click on most tweets on my timeline, and what I click depends a lot on my specific, mostly implicit taste, which Twitter's recommender system doesn't seem to grok well.
You're reading a technical blogpost/paper. Can the model predict which sections you'll read or which concepts you'll look up?
- This depends on your knowledge (e.g. are you new to the domain?) and your taste (did you read only the abstract/intro before closing the tab, or did you read it top to bottom?)
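
As a minimal sketch of how the tweet example could be scored: assume the GA model outputs a click probability for each tweet in the held-out session, and the recording supplies the ground-truth clicks. The function names and the 0.5 threshold are illustrative assumptions, not part of any existing tool.

```python
from typing import Sequence


def precision_recall(clicked: Sequence[bool], predicted: Sequence[bool]) -> tuple[float, float]:
    """Precision and recall of predicted clicks against observed clicks."""
    if len(clicked) != len(predicted):
        raise ValueError("need exactly one prediction per tweet")
    true_positives = sum(c and p for c, p in zip(clicked, predicted))
    n_predicted = sum(predicted)
    n_clicked = sum(clicked)
    # Define the metric as 0.0 when its denominator is empty.
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_clicked if n_clicked else 0.0
    return precision, recall


def score_session(clicked: Sequence[bool], click_probs: Sequence[float],
                  threshold: float = 0.5) -> tuple[float, float]:
    """Threshold the model's click probabilities, then score them."""
    predicted = [p >= threshold for p in click_probs]
    return precision_recall(clicked, predicted)


# Hypothetical 10-tweet session: the user clicked tweets 2 and 7,
# the model is confident about tweets 2 and 5.
clicked = [i in (2, 7) for i in range(10)]
click_probs = [0.9 if i in (2, 5) else 0.1 for i in range(10)]
precision, recall = score_session(clicked, click_probs)
```

Precision and recall are used here rather than plain accuracy because most tweets go unclicked, so a model that always predicts "no click" would look accurate while telling you nothing about taste.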

My plan

I vibe-coded a macOS app that continuously records 4fps screencasts alongside a 4fps webcam feed (for possible eye tracking), plus timestamp-aligned user inputs, and sends everything to Cloudflare's S3 equivalent.
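
To illustrate what timestamp alignment could mean here, a sketch under assumptions: frames are captured at a fixed 4fps from a known start time, and each input event carries a timestamp in seconds. The `InputEvent` type and its fields are hypothetical, not the app's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass

FPS = 4  # capture rate stated in the post


@dataclass
class InputEvent:  # hypothetical schema, for illustration only
    t: float   # timestamp in seconds
    kind: str  # e.g. "click", "key", "scroll"


def frame_index(t: float, start: float, fps: int = FPS) -> int:
    """Index of the frame whose capture interval contains time t."""
    if t < start:
        raise ValueError("event precedes recording start")
    return int((t - start) * fps)


def group_by_frame(events: list[InputEvent], start: float,
                   fps: int = FPS) -> dict[int, list[InputEvent]]:
    """Bucket input events under the frame that was on screen when they fired."""
    frames: dict[int, list[InputEvent]] = defaultdict(list)
    for event in events:
        frames[frame_index(event.t, start, fps)].append(event)
    return dict(frames)


# Hypothetical events within the first half second of a recording.
events = [InputEvent(0.1, "key"), InputEvent(0.3, "click"), InputEvent(0.45, "scroll")]
grouped = group_by_frame(events, start=0.0)
```

With inputs bucketed per frame, each frame (or short window of frames) can be paired with what the user did next, which is the raw material for click-prediction labels.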

So far we're at ~8 hours of recordings, at roughly 1GB/hr, so doing this for 1 year comes to <10TB / <$150 on Cloudflare based on my usage patterns. (The cost goes up linearly if you keep the recordings forever.)
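
The storage arithmetic can be sanity-checked with a small calculator. This is a sketch: the ~1GB/hr rate comes from the measurement above, while hours per day and the per-GB monthly price are inputs you supply yourself. The billing model (each month charged on everything stored at month end) is a simplifying assumption, and the function names are made up for illustration.

```python
def yearly_volume_gb(gb_per_hour: float, hours_per_day: float, days: int = 365) -> float:
    """Total data recorded over `days` at a steady recording rate."""
    return gb_per_hour * hours_per_day * days


def cumulative_storage_cost(gb_added_per_month: float, price_per_gb_month: float,
                            months: int) -> float:
    """Total storage bill over `months` when nothing is ever deleted.

    Each month is billed on everything uploaded so far, so the monthly
    bill grows linearly while the running total grows faster than that.
    """
    total = 0.0
    stored = 0.0
    for _ in range(months):
        stored += gb_added_per_month
        total += stored * price_per_gb_month
    return total


# Upper bound on volume: recording around the clock at 1 GB/hr.
max_yearly_gb = yearly_volume_gb(gb_per_hour=1.0, hours_per_day=24)
```

Even recording 24 hours a day at 1GB/hr gives 8,760GB per year, which is under 10TB. Plug your own hours per day and your provider's current price into `cumulative_storage_cost` to estimate the bill.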

I intend to make a benchmark out of this using a bunch of heuristics, then try a bunch of methods to hill-climb it. Will report back if there are any updates.
