Hmm! I’m not sure about this. The patient in the linked paper received hemodialysis (which, I think, mechanically removes the methanol) before his body could get around to metabolizing it into formaldehyde and formic acid. For someone who doesn’t receive hemodialysis, I think the methanol would still have to be metabolized at some point, even if that metabolism is much delayed? In which case the same toxic effects of formaldehyde and formic acid would hit, just much later.
I think the poker example is OK, and paragraphs like
“The second decision point was when the flop was dealt and you faced a bet. This time you decided to fold. Maybe that wasn't the best play though. Maybe you should have called. Maybe you should have raised. Again, the goal of hand review is to figure this out.”
made sense to me. But the terminology in the dialogue was very tough: button, Rainbow, LAGgy, bdfs, AX, nut flush, nitty - I understood none of these. (I’ve played poker now and then, but never studied it). So keeping the example but translating it a bit further to more widely-used language (if possible) might be good.
Very interesting! I work in health insurance and we try to encourage vaginal delivery and discourage C-sections; the other side you present here is a surprise. Good stuff.
Very very good. The full power of science fiction - taking the concept of the redaction machines and finding this many interesting consequences of them, and fitting them into the story - really good
You don’t need to call tails to explore whether tails is possible, though - the information gain of a coin flip is the same whether you call heads or tails.
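To make the symmetry concrete, here's a minimal sketch (the bias of 0.7 is an illustrative number, not from the original discussion): the expected information from observing a flip is the Shannon entropy of the outcome distribution, and entropy is symmetric under relabeling the outcomes, so which side you call doesn't matter.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits: expected information gained
    from observing one outcome drawn from this distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# An illustrative biased coin: P(heads) = 0.7, P(tails) = 0.3.
p_heads = 0.7
info_calling_heads = entropy_bits([p_heads, 1 - p_heads])

# "Calling tails" just permutes the two probabilities; entropy is
# symmetric under permutation, so the expected information is identical.
info_calling_tails = entropy_bits([1 - p_heads, p_heads])

assert info_calling_heads == info_calling_tails
print(round(info_calling_heads, 4))  # → 0.8813
```

A fair coin gives exactly 1 bit either way; the biased coin above gives about 0.88 bits, but in both cases the number is a property of the coin, not of your call.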
I’m not the asker, but I think I get where they’re coming from. For a long time, linear and logistic regression were the king & queen of modeling. Then the introduction of non-linear models like random forests and gradient boosting made us far more able to fit difficult data. So the original question has me wondering if there’s a similar gain to be had in going from linearity to non-linearity in interpretability algorithms.
I don’t think the problem is this big if you’re trying to control one specific model. Given an RLHF’d model equipped with a specific system prompt (e.g. the helpful, harmless assistant), you have either one or a small number of luigis, and therefore around the same number of waluigis - right?