All of Awesome_Ruler_007's Comments + Replies

On Modus Tollens, playing around with ChatGPT yields an interesting result. Turns out, the model seems to be... 'overthinking' it, I guess. It treats it as a complex question, answering "No" on the grounds that the predicates provided are insufficient. I think that may be why, at some point in scale, model performance just drops straight down to 0. (Conversation)

Sternly forcing it to deduce only from the given statements (I'm unsure how much CoT helped here; an ablation would be interesting) gets it correct. It seems that larger models are injecting some interpretat...
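A minimal sketch of the kind of constrained prompt that seemed to work — the wording below is my own reconstruction for illustration, not the exact prompt from the linked conversation:

```python
# Hypothetical prompt template forcing deduction only from the given
# premises -- a reconstruction of the approach, not the exact wording used.
def modus_tollens_prompt(conditional: str, negated_consequent: str) -> str:
    return (
        "Answer using ONLY the statements below. Do not bring in outside "
        "knowledge or question the premises. Reason step by step, then "
        "answer Yes or No.\n\n"
        f"1. {conditional}\n"
        f"2. {negated_consequent}\n\n"
        "Question: Does it follow that the antecedent of statement 1 is false?"
    )

prompt = modus_tollens_prompt(
    "If John has a pet, then John has a dog.",
    "John does not have a dog.",
)
print(prompt)
```

The point of the template is to close off the "insufficient predicates" escape hatch: the model is told the premises are complete and that its job is pure deduction.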

gwern · 2mo
Maybe we need to start using prompts like "This is not a trick question; just take it step by step:"!

Incidentally, it looks like understanding multi-step legal criteria might be a case of U-shaped scaling too: "Large Language Models as Fiduciaries: A Case Study Toward Robustly Communicating With Artificial Intelligence Through Legal Standards", Nay 2023 [https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4335945] finds that accuracy at judging whether someone has a fiduciary legal obligation goes from 27% (Curie) → 50% (random baseline) → 73% (text-davinci-002) → 78% (text-davinci-003), so presumably there's a smaller model size which outperforms Curie by guessing randomly, giving a U-curve from random smol to bad Curie to great davinci.
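Laying those numbers out makes the hypothesized U-shape explicit. The measured accuracies are from Nay 2023 as quoted above; the "small" entry is the presumed near-random small model, which is an assumption of the U-curve argument, not a reported result:

```python
# Fiduciary-obligation accuracies reported in Nay 2023.
# "small (presumed)" is hypothetical: a model small enough that it
# guesses randomly, hence scoring near the 50% baseline.
accuracies = {
    "small (presumed ~random)": 0.50,  # assumption, not measured
    "curie": 0.27,
    "text-davinci-002": 0.73,
    "text-davinci-003": 0.78,
}
random_baseline = 0.50

# The U-shape: performance dips below the random baseline at Curie
# scale, then recovers well above it at davinci scale.
assert accuracies["curie"] < random_baseline
assert accuracies["text-davinci-002"] > random_baseline
assert accuracies["text-davinci-003"] > accuracies["text-davinci-002"]
print("U-curve holds on the reported numbers")
```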