I have used the WIS/INT dichotomy in the past and see it differently; hopefully this isn't just arguing about the meaning of words :) For me, WIS and INT are factors of intelligence, likely positively correlated, but here I will look at them only as factors.
A generator can display WIS, not just a validator. Indeed, responding to your own informal incentives when generating plans is WIS (whereas finding a clever exploit is INT, and both can be put to wise or clever ends). WIS can have a strong validator, but so can INT; WIS generates in ways that scale with the validator, never pushing against it, while INT pushes against any validator that isn't much mightier than itself. If WIS were a machine, it would Optimize Mildly.
Morally speaking, INT lets you achieve what you want; WIS lets you want "better" things. In terms of how you take in a problem, INT lets you juggle lots of pieces, stay comfortable zooming out, consider the long term, and so on; WIS makes you try to zoom out, try to learn from your mistakes, and try not to juggle so many pieces. INT can let you optimize one thing very well; WIS tries to hold things together, and not forget to watch for side effects as you optimize.
WIS might have more to do with personality than IQ, or it might be more of a skill: cultural, acquired. I think it is possible to be very wise and very dumb. I think it is possible to be very wise and very smart.
also, thank you for the words Adversaria and Predictoria; I shall use them henceforth!
thank you for writing this post!
Many more such thoughts; let me know if you would like to discuss over Google Meet, or if you would like me to keep going.
The post's claim that validation-only approaches are fundamentally better than training-with-validation oversimplifies a complex reality. Both approaches modify the distribution of models - neither preserves some "pure" average case. Our base training objective may already have some correlation with our validation signal, and there's nothing special about maintaining this arbitrary starting point. Sometimes we should increase correlation between training and validation, sometimes decrease it, depending on the specific relationship between our objective and validator. What matters is understanding how correlation affects both P(aligned) and P(pass|misaligned), weighing the tradeoffs, and optimizing within our practical retraining budget (because often, increasing P(aligned|pass) will also decrease P(pass)).
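As a toy illustration of that last tradeoff (numbers entirely hypothetical, and the helper `posterior_and_pass_rate` is just my own naming), here is a quick Bayes calculation comparing two setups: one where training is more correlated with the validator, so more models pass, including more misaligned ones, and one where training is left uncorrelated and only a strict validator filters. For simplicity I hold P(aligned) fixed across the two, though in general training changes it too.

```python
def posterior_and_pass_rate(p_aligned, p_pass_given_aligned, p_pass_given_misaligned):
    """Return (P(aligned | pass), P(pass)) by Bayes' rule."""
    p_pass = (p_pass_given_aligned * p_aligned
              + p_pass_given_misaligned * (1 - p_aligned))
    return p_pass_given_aligned * p_aligned / p_pass, p_pass

# Training correlated with the validator (hypothetical numbers):
# many models pass, including many misaligned ones.
print(posterior_and_pass_rate(0.5, 0.95, 0.6))   # ~ (0.61, 0.78)

# Training left uncorrelated, strict validation only (hypothetical numbers):
# P(aligned | pass) is higher, but P(pass) drops, so we expect more retraining.
print(posterior_and_pass_rate(0.5, 0.60, 0.1))   # ~ (0.86, 0.35)
```

The point is only that the better posterior in the second setup is bought with a lower pass rate, i.e. a larger retraining budget; which side of that trade is worth taking depends on the specific objective and validator.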
True masterpiece! Here are some notes I took while reading:
Thank you, I will look into that. I intuitively expect that in the setting where compute has precisely zero cost, you can always convert multiplicity into negative length by building an iterate/sort/index loop around the bit segment where the multiplicity lies, and this costs you only the length of the iterate/sort/index loop (a constant that depends on your language). I also intuitively expect this to break in the infinite-bitstring setting, because you can have multiplicity that isn't contained in any finite substring?
On a quick skim of the PDF I was not able to identify which passage you were referring to. If possible, can you point me to an example of Temperature 0 in the textbook?
Thinking at the level of constraints is useful. Very sparse rewards place fewer constraints on the final solution; imitation would place a lot of constraints (within distribution, and assuming very low loss).
A way to see the RL/supervised distinction dissolve is to convert back and forth. With the reward defined as the negative token prediction loss, and the actions being the set of tokens, we can simulate auto-regressive training with RL (as mentioned by @porby). Conversely, you could first train an RL policy and then imitate it (in which case, why would the imitator be any safer?).
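To make the first direction concrete, here is a minimal toy sketch (my own illustration, not porby's exact construction; the vocabulary size, learning rate, and reward scheme are made up). The state is a single fixed context, the action is a sampled next token, and the reward is 1 when the sampled token matches the data token, i.e. a negative 0/1 prediction loss. The REINFORCE update then pushes probability mass toward the data token, much as the supervised cross-entropy gradient would; the weighting differs, so this shows the conversion rather than an exact equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 5
logits = np.zeros(vocab_size)   # policy parameters for one fixed context
data_token = 2                  # the token the dataset says comes next (hypothetical)
lr = 0.5

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(500):
    probs = softmax(logits)
    action = rng.choice(vocab_size, p=probs)        # "RL" step: sample a next token
    reward = 1.0 if action == data_token else 0.0   # reward = negative 0/1 prediction loss
    grad_log_pi = -probs                            # d log pi(action) / d logits ...
    grad_log_pi[action] += 1.0                      # ... = one_hot(action) - probs
    logits += lr * reward * grad_log_pi             # REINFORCE update

print(softmax(logits))  # mass concentrates on data_token, as supervised training would push it
```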
Also, the level of capabilities and the output domain might affect the differences between sparse and dense reward. Even if we completely constrain a CPU simulator (to the point that only one solution remains), we still end up with a thing that can run arbitrary programs. At the point where your CPU simulator can be used, without performance penalty, to do the complex task your RL agent was doing, it is hard to say which is safer by appealing to the level of constraint in training.
I think something similar could be said of a future pretrained LLM that can solve tough RL problems simply by being prompted to "simulate the appropriate RL agent", but I am curious what others think here.
I am also very curious about why people can be so smart and nevertheless work on the "wrong" thing. Perhaps reflectivity is also a source of WIS. From my perspective, very smart people who work on the "wrong" thing simply do not realize that they can apply their big brain to figuring out what to do, and perhaps this is easier to realize when you see your brain's functioning as a mechanical thing, somewhat akin to what you learn about at the object level.
Similarly, when I try to empathize with supergeniuses who work on some pointless problem as their world is speedrunning the apocalypse, I have trouble visualizing it. I think perhaps WIS is also the ability for somewhat obvious things to occur to you at the right time, and to maintain a unified view of yourself over time.