Researcher at Ought
My quick take:
Rohin has created his posterior distribution! The key differences from his prior are at the bounds.
Overall, Rohin’s posterior is a bit more optimistic than his prior and more uncertain.
Ethan Perez's snapshot wins the prize for the most accurate prediction of Rohin's posterior. Ethan kept a similar distribution shape while decreasing the probability assigned to >2100 less than the other submissions did.
The prize for a comment that updated Rohin’s thinking goes to Jacob Pfau! This was determined by a draw with comments weighted proportionally to how much they updated Rohin’s thinking.
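For concreteness, here is a minimal sketch of how such a weighted draw could be run. The names and weights below are purely illustrative stand-ins, not the actual entries or values used:

```python
import random

# Hypothetical update weights: how much each comment shifted Rohin's
# thinking. Illustrative values only; the real weights were not published.
update_weights = {
    "Jacob Pfau": 5.0,
    "commenter B": 2.0,
    "commenter C": 1.0,
}

# Draw one winner with probability proportional to its weight.
winner = random.choices(
    list(update_weights.keys()),
    weights=list(update_weights.values()),
    k=1,
)[0]
print(winner)
```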
Thanks to everyone who participated and congratulations to the winners! Feel free to continue making comments and distributions, and sharing any feedback you have on this competition.
Thanks for this post, Paul!
NOTE: The response to this post has been even greater than we expected. We received more applications for the experiment participant role than we currently have the capacity to manage, so we are temporarily taking the posting down. If you've applied and don't hear from us for a while, please excuse the delay! Thanks to everyone who has expressed interest; we're hoping to get back to you and work with you soon.
It's correct that, so far, Ought has been running small-scale experiments with people who know the research background. (What is amplification? How does it work? What problem is it intended to solve?)
Over time, we also think it's necessary to run larger-scale experiments. We plan to start by running more and longer experiments with contractors instead of volunteers, probably over the next month or two. Longer-term, it's plausible that we'll build a platform similar to what this post describes. (See here for related thoughts.)
The reason we've focused on small-scale experiments with a select audience is that it's easy to do busywork that doesn't tell you anything about the question of interest. The purpose of our experiments so far has been to get high-quality feedback on the setup, not to gather object-level data. As a consequence, the experiments have been changing a lot from week to week. The biggest recent change is the switch from task decomposition (analogous to amplification with imitation learning as the distillation step) to decomposition of evaluation (analogous to amplification with RL as the distillation step). Based on these changes, I think that if we had stopped at any point so far and focused on scaling up instead of refining the setup, it would have been a mistake.
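To make that distinction concrete, here is a toy sketch of the two setups. Every function below is a hypothetical stand-in for work that human participants would do in the experiments, not code from Ought's actual system:

```python
# Toy sketch only: each helper stands in for work done by human
# participants; none of this is Ought's actual experiment code.

def decompose(question: str) -> list[str]:
    # Stand-in: a participant splits a question into smaller subquestions.
    return [f"{question} / part 1", f"{question} / part 2"]

def answer_directly(question: str) -> str:
    # Stand-in: a participant answers a small question without help.
    return f"answer to ({question})"

def combine(question: str, subanswers: list[str]) -> str:
    # Stand-in: a participant composes subanswers into an overall answer.
    return f"combined answer to ({question}) from {len(subanswers)} parts"

def amplified_answer(question: str, depth: int) -> str:
    """Task decomposition: produce an answer by recursive decomposition.
    The (question, subquestions, answer) transcripts would be the targets
    for an imitation-learned distillation step."""
    if depth == 0:
        return answer_directly(question)
    subanswers = [amplified_answer(q, depth - 1) for q in decompose(question)]
    return combine(question, subanswers)

def evaluate_directly(question: str, answer: str) -> float:
    # Stand-in: a participant scores a small question/answer pair.
    return float(question in answer)

def amplified_score(question: str, answer: str, depth: int) -> float:
    """Decomposition of evaluation: score a candidate answer by recursively
    decomposing the evaluation. The aggregate score would serve as the
    reward signal for an RL-based distillation step."""
    if depth == 0:
        return evaluate_directly(question, answer)
    subscores = [amplified_score(q, answer, depth - 1) for q in decompose(question)]
    return sum(subscores) / len(subscores)

print(amplified_answer("Is this argument sound?", depth=2))
print(amplified_score("Is this argument sound?", "a candidate answer", depth=2))
```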
The log is taken from this tree. There isn't much more to see than what's visible in the screenshot. Building out more complete versions of meta-reasoning trees like this is on our roadmap.
What I'd do differently now: