On Modus Tollens, playing around with ChatGPT yields an interesting result. Turns out, the model seems to be... 'overthinking' it, I guess. It treats it as a complex question, answering `No` on the grounds that the predicates provided are insufficient. I think that may be why, at some point in scale, model performance just drops straight to 0 (≈1B). (Conversation)
Sternly forcing it to deduce only from the given statements (I'm unsure how much CoT helped here; an ablation would be interesting) gets it right. It seems that larger models are injecting some interpretation of their own into the question.
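For anyone who wants to poke at this themselves, here's a minimal sketch of the bare-vs-constrained comparison, assuming the OpenAI Python client with `gpt-3.5-turbo` standing in for ChatGPT (my original run was through the web UI, and the Modus Tollens item below is my own illustrative wording, not the benchmark's exact text):

```python
from openai import OpenAI  # assumes openai>=1.0 and OPENAI_API_KEY set in the environment

client = OpenAI()

# Illustrative Modus Tollens item (my own wording, not the benchmark's):
# the logically correct answer is "Yes" (denying the consequent).
ARGUMENT = (
    "If John has a pet, then John has a dog.\n"
    "John doesn't have a dog.\n"
    "Conclusion: John doesn't have a pet.\n"
    "Is the conclusion above correct? Answer Yes or No."
)

# "Stern" variant: restrict the model to the given statements only.
CONSTRAINED = (
    "Deduce ONLY from the statements given below. Do not question the "
    "premises or bring in any outside knowledge.\n\n" + ARGUMENT
)

def ask(prompt: str) -> str:
    """Send a single-turn chat prompt and return the model's reply."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the comparison as deterministic as possible
    )
    return response.choices[0].message.content

for label, prompt in [("bare", ARGUMENT), ("constrained", CONSTRAINED)]:
    print(f"--- {label} ---")
    print(ask(prompt), "\n")
```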
Maybe we need to start using prompts like "This is not a trick question; just
take it step by step:"!
--------------------------------------------------------------------------------
Incidentally, looks like understanding multi-step legal criteria might be a case
of U-shaped scaling too: "Large Language Models as Fiduciaries: A Case Study
Toward Robustly Communicating With Artificial Intelligence Through Legal
Standards", Nay 2023
[https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4335945] finds that
understanding whether someone has a fiduciary legal obligation goes from 27%
(Curie) → 50% (random baseline) → 73% (text-davinci-002) → 78%
(text-davinci-003), so presumably there's a smaller model size which outperforms
Curie simply by guessing randomly, giving a U-curve from random smol to bad
Curie to great davinci.