Hey, I'm Owen.
I think rationality is pretty rad.
The way I did this for a specific ordering of cards (used for a set of magic tricks called Mnemonica) was to have some sort of one-to-one mapping between each card and its position in the deck.
Some assorted examples:
5 : 4 of Hearts because 4 is 5 minus 1 (and the Hearts are just there).
7 : Ace of Spades because 7 is a lucky number and the Ace of Spades is a lucky card.
8 : 5 of Hearts because 5 looks a little like 8.
49 : 5 of Clubs because 4.9 is almost 5 (and the Clubs are just there).
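A rough sketch of what the mapping looks like as a data structure, in Python. It holds only the four position/card pairs listed above, not the full deck. The names `POSITION_TO_CARD`, `CARD_TO_POSITION`, `card_at`, and `position_of` are made up for illustration.

```python
# Position -> card, using only the four example pairs given above.
# A full stack would have 52 entries; the rest are left out here.
POSITION_TO_CARD = {
    5: "4 of Hearts",
    7: "Ace of Spades",
    8: "5 of Hearts",
    49: "5 of Clubs",
}

# Because the mapping is one-to-one, it can be inverted without collisions.
CARD_TO_POSITION = {card: pos for pos, card in POSITION_TO_CARD.items()}


def card_at(position: int) -> str:
    """Return the card stored at a given position in the deck."""
    return POSITION_TO_CARD[position]


def position_of(card: str) -> int:
    """Return the deck position of a given card."""
    return CARD_TO_POSITION[card]
```

The mnemonics do the work of making both directions of lookup fast in your head; the two dictionaries are just the table you are trying to memorize.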
This is a good point, and this is where I think a good amount of the difficulty lies, especially as the cited example of human-interpretable NNs (i.e. Microscope AI) doesn't seem easily applicable to things outside of image recognition.
My understanding is that the OpenAI Microscope (is this what you meant by microscope AI?) is mostly feature visualization techniques + human curation by looking at the visualized samples. Do you have thoughts on how to modify this for the text domain?
Same here. I am working for a small quant trading firm, and the collective company wisdom is to prefer CDFs over PDFs.
Regarding how interpretability can help with addressing motivation issues, I think Chris Olah's views present situations where interpretability can potentially sidestep some of those issues. One such example is that if we use interpretability to aid in model design, we might have confidence that our system isn't a mesa-optimizer, and we've done this without explicitly asking questions about "what our model desires".
I agree that this is far from the whole picture. The scenario you describe is an example where we'd want to make interpretability more accessible to more end-users. There is definitely more work to be done to bridge "normal" human explanations with what we can get from our analysis.
I've spent more of my time thinking about the technical sub-areas, so I'm focused on situations where innovations there can be useful. I don't mean to say that this is the only place where I think progress is useful.
I think that the general form of the problem is context-dependent, as you describe. Useful explanations do seem to depend on the model, task, and risks involved.
However, from an AI safety perspective, we're probably only considering a restricted set of interpretability approaches, which might make it easier. In the safety context, we can probably be less concerned with interpretability that is useful for laypeople, and focus on interpretability that is useful for the people doing the technical work.
To that end, I think that "just" being careful about what the interpretability analysis means can help, like how good statisticians can avoid misuse of statistical testing, even though many practitioners get it wrong.
I think it's still an open question, though, what even this sort of "only useful for people who know what they're doing" interpretability analysis would be. Existing approaches still have many issues.
I mostly focused on the interpretability section as that's what I'm most familiar with, and I think your criticisms are very valid. I also wrote up some thoughts recently on where post-hoc interpretability fails, and Daniel Filan has some good responses in the comments below.
Also, re: disappointment on tree regularization, something that does seem more promising is Daniel Filan and others at CHAI working on investigating modularity in neural nets. You can probably ask him more, but last time we chatted, he also had some thoughts (unpublished) on how to enforce modularization as a regularizer, which seems to be what you wished the tree reg paper would have done.
Overall, this is great stuff, and I'll need to spend more time thinking about the design vs search distinction (which makes sense to me at first glance).
I think unbundling them seems like a good thing to strive for.
I guess the parts that I might still be worried about are:
I see below that you claim that more accountability is probably net-good for most students, in the sense that it would help improve learning? I'm not sure that I fully agree with that. My experience in primary to upper education has been that there are a great many students who don't seem that motivated to learn due to differing priorities, home situations, or preferences. I think improving education will need to find some way of addressing this beyond just accountability.
Do you envision students enrolling in this Improved Education program for free? Public schools right now have a distinct advantage because they receive a lot of funding from taxpayers.
I think the issue of, "Why can't we just immediately switch everyone to a decoupled situation where credentialing and education are separate?" is due to us being stuck in an inadequate equilibrium. Do you have plans to specifically tackle these inertia-related issues that can make mass-adoption difficult? (e.g., until cheap credentialing services become widespread, why would signaling-conscious students decide to enroll in Improved Education instead of Normal Education?)
I think figuring out how to make education better is definitely a worthwhile goal, and I'm reading this post (and your other one) with interest.
I'm curious to what extent you're going to be addressing the issue of education as partially or mostly signaling, as Caplan argues in The Case Against Education? I can imagine a line of argument that says paying for public education is worthwhile, even if all it does is accreditation, because that's useful to employers. What those actual costs look like and what they should be is, of course, up for debate.
I could also see the point that all this signaling stuff is orthogonal if all we "really" care about is optimizing for learning. Just wondering what stance you're taking.
I think the OSC's reproducibility project is much more of what you're looking for, if you're worried that Many Labs is selecting only for a specific type of effect.
They focus on selecting studies quasi-randomly and use a variety of reproducibility measures (confidence interval, p-value, effect size magnitude + direction, subjective assessment). They find that around 30-50% of effects replicate, depending on the criteria used. They looked at 100 studies, in total.
I don't know enough about the biomedical field, but a brief search on the web yields the following links, which might be useful?