This is my first mechanistic interpretability blog post! I decided to research whether models are actually reasoning when answering non-deductive questions, or whether they're doing something simpler.
My dataset is adapted from InAbHyD[1]: inductive and abductive reasoning scenarios over programmatically generated first-order ontologies, using made-up concept names to minimize the confounding influence of common words. Each scenario has multiple technically correct answers, but one answer is definitively the most correct[2]. I found that LLMs seem to have a fixed generalization tendency (when evaluating my examples) that doesn't adapt to the logical structure of the task, and 1-hop and 2-hop accuracies add up to roughly 100% for most models.
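To make the setup concrete, here's a minimal sketch of the kind of scenario structure I mean, with nonsense concept names standing in for real words. This is illustrative only: the function names, question phrasings, and the purely deductive chain are assumptions for this post, not the actual InAbHyD generation code.

```python
import random
import string

def nonsense_concept(length: int = 6) -> str:
    """Return a made-up concept name, e.g. 'qzlmrp', with no real-world meaning."""
    return "".join(random.choices(string.ascii_lowercase, k=length))

def make_chain_scenario(entity: str = "Fae"):
    """Build a toy two-level ontology chain: entity -> child concept -> parent concept.

    Mirrors the hop structure of my scenarios in a stripped-down form; the real
    InAbHyD-style examples are inductive/abductive and generated differently.
    """
    specific, child, parent = nonsense_concept(), nonsense_concept(), nonsense_concept()
    facts = [
        f"{entity} is a {specific}.",
        f"Every {specific} is a {child}.",
        f"Every {child} is a {parent}.",
    ]
    one_hop_question = f"Besides being a {specific}, what is {entity}?"   # expects the child concept
    two_hop_question = f"What is {entity}, at the most general level?"    # expects the parent concept
    return facts, (one_hop_question, child), (two_hop_question, parent)

facts, (q1, a1), (q2, a2) = make_chain_scenario()
print("\n".join(facts))
print(q1, "->", a1)
print(q2, "->", a2)
```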
Additionally, there's a large overlap between 2-hop (H2) successes and 1-hop (H1) failures (73% for DeepSeek V3), meaning the model outputs the parent concept regardless of whether the task asks for the child concept or the parent concept. This suggests the model isn't actually reasoning; it's generalizing to a fixed level that happens to align with the parent concept.
This is perplexing because in proper reasoning, you generally need the child concept to reach the parent concept. For example, you'd conclude that Fae is a mammal through a reasoning chain: first establish that Fae is a tiger, then make one hop to conclude that Fae is a feline (child concept), and a second hop to conclude that felines are mammals (parent concept). The overlap, though, suggests that the model isn't reasoning through the ontology; it skips the chain and outputs the parent or child concept according to its fixed generalization tendency.
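To pin down what I mean by "overlap": I pair each ontology's 1-hop and 2-hop queries and ask how often a 2-hop success coincides with a 1-hop failure on the same ontology. Here's a sketch of that bookkeeping with an assumed results format; the field names and the choice of denominator are illustrative, not my exact evaluation script.

```python
def h2_success_h1_failure_overlap(results):
    """Share of 2-hop successes whose paired 1-hop query was answered wrongly.

    `results` holds one dict per ontology with boolean fields
    "h1_correct" and "h2_correct" (assumed format).
    """
    h2_successes = [r for r in results if r["h2_correct"]]
    if not h2_successes:
        return 0.0
    also_h1_failures = [r for r in h2_successes if not r["h1_correct"]]
    return len(also_h1_failures) / len(h2_successes)

# Toy example: 3 of the 4 two-hop successes coincide with one-hop failures -> 0.75
toy_results = [
    {"h1_correct": False, "h2_correct": True},
    {"h1_correct": False, "h2_correct": True},
    {"h1_correct": False, "h2_correct": True},
    {"h1_correct": True,  "h2_correct": True},
    {"h1_correct": True,  "h2_correct": False},
]
print(h2_success_h1_failure_overlap(toy_results))  # 0.75
```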
I used MI techniques like probing, activation patching, and SAEs in my research. Probing could predict something related to the final output as early as layer 8, but patching early layers barely changed the final decision. This suggests that whatever the probe picks up is merely correlated with the final result, and that the generalization tendency is distributed across model components rather than localized early in the network.
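For readers who want the mechanics of the probing step, here's a generic sketch (not my exact pipeline): fit a linear probe per layer on cached residual-stream activations and compare held-out accuracies across layers. How the activations are cached and the labeling scheme are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy(acts: np.ndarray, labels: np.ndarray, seed: int = 0) -> float:
    """Train a linear probe on one layer's activations and return held-out accuracy.

    acts:   (n_examples, d_model) residual-stream activations at the last
            prompt token, cached from the model beforehand (assumed format).
    labels: (n_examples,) integer class of the model's eventual answer,
            e.g. 0 = child concept, 1 = parent concept.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        acts, labels, test_size=0.25, random_state=seed, stratify=labels
    )
    probe = LogisticRegression(max_iter=2000)
    probe.fit(X_train, y_train)
    return probe.score(X_test, y_test)

# layer_acts: dict mapping layer index -> activation matrix for that layer
# accuracies = {layer: probe_accuracy(acts, labels) for layer, acts in layer_acts.items()}
```

High probe accuracy at an early layer combined with a negligible patching effect at that same layer is exactly the pattern that points toward correlation rather than causal use.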
This was a fun project, and I'm excited to continue this research and reach more definitive findings. My ultimate goal is to understand why LLMs might be architecturally limited in non-deductive reasoning.
[1] A paper that I based my initial research question on, which argues that LLMs can't properly do non-deductive reasoning. It's authored by my NLP professor, Abulhair Saparov, and his PhD student, Yunxin Sun.
[2] This concept is known as Occam's razor.