The results look quite different for Houdini 3 vs SF8---is this just a matter of Stockfish being much better optimized for small amounts of hardware?
We might be able to get similar advantages with a more general proposal like:
Fit a function f to a (Q, A) dataset with lots of questions about latent structure. Minimize the sum of some typical QA objective and the computational cost of verifying that f is consistent.
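To make the shape of that objective concrete, here's a toy sketch; the dataset format, the stand-in for "cost of verifying consistency", and the trade-off weight are all made up for illustration:

```python
import math

# Toy stand-ins, purely illustrative: a "model" is a dict mapping
# (question, answer) pairs to probabilities, and each consistency check is a
# function returning the work (in arbitrary units) needed to verify one
# constraint on those probabilities.

def qa_loss(model, qa_pairs):
    """Typical QA objective: negative log-likelihood of the labeled answers."""
    return sum(-math.log(model.get((q, a), 1e-9)) for q, a in qa_pairs)

def verification_cost(model, consistency_checks):
    """Stand-in for 'the computational cost of verifying that f is consistent'."""
    return sum(check(model) for check in consistency_checks)

def combined_objective(model, qa_pairs, consistency_checks, lam=1.0):
    # How to weight the two terms is left open; the hope is that any point on
    # the Pareto frontier avoids the worst behaviors.
    return qa_loss(model, qa_pairs) + lam * verification_cost(model, consistency_checks)
```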
Then the idea is that matching the conditional probabilities from the human's model (or at least being consistent with what the human believes strongly about those conditional probabilities) essentially falls out of a consistency condition.
It's not clear how to actually formulate that consistency condition, but it seems like an improvement over the prior situation (which was just baking in the obviously-untenable requirement of exactly matching). It's also not clear what happens if this consistency condition is soft.
It's not clear what "verify that the consistency conditions are met" means. You can always do the same proposal as in the parent, though it's not really clear if that's a convincing verification. But I think that's a fundamental philosophical problem that both of these proposals need to confront.
It's not clear how to balance computational cost and the QA objective. But you are able to avoid most of the bad properties just by being on the Pareto frontier, and I don't think this is worse than the prior proposal.
Overall this approach seems like it could avoid making such strong structural assumptions about the underlying model. It also helps a lot with the overlapping explanations + uniformity problem. And it generally seems to be inching towards feeling plausible.
Here's another approach to "shortest circuit" that is designed to avoid this problem:
The "intended" circuit C just follows along with the computation done by Fθ and then translates its internal state into natural language.
What about the problem case where Fθ computes some reasonable beliefs (e.g. using the instrumental policy, where the simplicity prior makes us skeptical about their generalization) that C could just read off? I'll imagine those being written down somewhere on a slip of paper inside of Fθ's model of the world.
Probably nothing like this can work, but I now feel like there are two live proposals for capturing the optimistic minimal circuits intuition---the one in this current comment, and in this other comment. I still feel like the aggressive speed penalization is doing something, and I feel like probably we can either find a working proposal in that space or else come up with some clearer counterexample.
I was proposing exempting the short-term risk-free rate, and I was imagining using the 30-day Treasury yield as the metric. (The post originally said that but it got simplified in the interest of clarity---of course "savings account" is vague since savings accounts pay different amounts with different risk, but it seems to communicate basically the same stuff.) That's also roughly the rate at which you'd borrow if using leverage to offset your tax burden (e.g. it's roughly the rate embedded in futures or at which investors can borrow on margin).
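For concreteness, with made-up numbers (the rates and the tax rate here are purely illustrative):

```python
total_return   = 0.07  # the asset returned 7% this year (made up)
risk_free_rate = 0.02  # the exempted rate, proxied by the 30-day Treasury yield (made up)
tax_rate       = 0.30  # whatever rate the proposal applies (made up)

taxable_return = total_return - risk_free_rate  # only the excess over risk-free is taxed
tax_per_dollar = tax_rate * taxable_return      # 0.30 * 0.05 = 0.015
```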
Very interesting, thanks!
In another comment you wrote "In between is the region with ~70 ELO; that's where engines usually operate on present hardware with minutes of think time" which made sense to me, I'm just trying to square that with this graph.
Recently I've been thinking about ML systems that generalize poorly (copying human errors) because of either re-using predictive models of humans or using human inference procedures to map between world models.
My initial focus was on preventing re-using predictive models of humans. But I'm feeling increasingly like there is going to be a single solution to the two problems, and that the world-model mismatch problem is a good domain to develop the kind of algorithm we need. I want to say a bit about why.
I'm currently thinking about dealing with world model mismatches by learning a correspondence between models using something other than a simplicity prior / training a neural network to answer questions. Intuitively we want to do something more like "lining up" the two models and seeing which parts correspond to which others. We have a lot of conditions/criteria for such alignments, so we don't necessarily have to just stick with simplicity. This comment fleshes out one possible approach a little bit.
If this approach succeeds, then it is also directly applicable to avoiding re-using human models---we want to be lining up the internal computation of our model with concepts like "There is a cat in the room" rather than just asking the model to predict whether there is a cat however it wants (which it may do by copying a human labeler). And on the flip side, I think that the "re-using human models" problem is a good constraint to have in mind when thinking about ways to do this correspondence. (Roughly speaking, because something like computational speed or "locality" seems like a really central constraint for matching up world models, and doing that approach naively can greatly exacerbate the problems with copying the training process.)
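As a very rough sketch of what scoring a correspondence by something other than simplicity could look like (the particular criteria and all the stand-in functions here are assumptions for illustration; pinning them down is exactly the open problem):

```python
def correspondence_score(correspondence, human_constraints, compute_cost):
    """Score a candidate way of lining up the AI's model with the human's.
    `correspondence` is whatever object encodes the mapping; the constraint
    checkers and `compute_cost` are hypothetical stand-ins."""
    # Agreement with what the human believes strongly (e.g. conditional
    # probabilities they're confident about), rather than description length.
    agreement = sum(1.0 for constraint in human_constraints if constraint(correspondence))
    # Penalize correspondences that need lots of extra computation (e.g. ones
    # that work by simulating a human doing inference): a speed/locality criterion.
    return agreement - compute_cost(correspondence)
```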
So for now I think it makes sense for me to focus on whether learning this correspondence is actually plausible. If that succeeds then I can step back and see how that changes my overall view of the landscape (I think it might be quite a significant change), and if it fails then I hope to at least know a bit more about the world model mismatch problem.
I think the best analogy in existing practice is probably interpretability work---matching up the AI's model with my model is kind of like looking at neurons and trying to make sense of what they are computing (or looking for neurons that compute something). And giving up on a "simplicity prior" is very natural when doing interpretability, instead using other considerations to determine whether a correspondence is good. It still seems kind of plausible that in retrospect my current work will look like it was trying to get a solid theoretical picture of what interpretability should do (including in the regime where the correspondence is quite complex, and where the goal is a much more complete level of understanding). I swing back and forth on how strong the analogy to interpretability seems / whether or not this is how it will look in retrospect. (But at any rate, my research methodology feels like a very different approach to similar questions.)
Here's a slightly more formal algorithm along these lines:
Reviewing how this behaves in each of the bad cases from the parent:
There are a lot of problems and missing details in this proposal:
Overall I'm becoming significantly more optimistic that something like this will work (though still less likely than not). Trying to step back and see the big picture, it seems like there are three key active ingredients:
My next step would probably be looking at cases where these high-level ingredients aren't sufficient (e.g. are there cases where "generate obs then do inference in the human model" is actually cheaper?). If they look pretty good, then I'll spend some more time trying to fill in the details in a more plausible way.
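To spell out the comparison I have in mind (the cost breakdown here is a made-up placeholder, not an analysis):

```python
def cost_direct_translation(readout_cost):
    # Path 1: translate the model's internal state directly into the human's terms.
    return readout_cost

def cost_generate_then_infer(obs_generation_cost, human_inference_cost):
    # Path 2: generate predicted observations, then do inference in the human's model.
    return obs_generation_cost + human_inference_cost

# The hope is that path 1 is typically cheaper, so a speed penalty favors it;
# the worry is whether there are natural cases where path 2 wins.
```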
I'm quite surprised by how far out on the Elo vs compute curve we already are by a million nodes/move. Is this the main "target platform" for Stockfish, or are people mostly trying to optimize performance for significantly smaller node counts?
(I'm wondering whether such strong diminishing returns are fundamental to the domain, or whether people are putting the most work into optimizing performance down at more like 100k nodes/sec.)