I like to use LLMs to help verify math problems and reinforce my own understanding. Most of the time this enhances my learning, but sometimes I get a hard reminder that LLMs are still not perfect at the core task of any problem solving -- logical reasoning.
A problem that inspired this post is below.
Every viewer who bought a ticket in the first row of the cinema took one of the seats in that row. All n seats are occupied, but no one is in the correct seat. The ticket-collector may repeatedly swap two neighbouring viewers, but only when both of those viewers are sitting in the wrong seats. Can he always end up with everyone in their own seat?
It's a classic LeetCode-style algorithmic problem wrapped in a seating puzzle. The core question: given a permutation in which no element starts in its correct position, can we always sort it using adjacent swaps, where each swap is allowed only if both elements involved are out of place?
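The puzzle is also small enough to sanity-check by brute force. Here's a throwaway script of mine (not part of any proof, and the seat/ticket encoding where state[i] is the ticket held by the person in seat i+1 is just my convention) that, for small n, enumerates every derangement and runs a BFS over the allowed swaps to test whether the fully sorted arrangement is reachable.

```python
from collections import deque
from itertools import permutations


def legal_moves(state):
    """Indices i such that swapping seats i+1 and i+2 is allowed:
    both occupants must currently be in the wrong seat."""
    return [i for i in range(len(state) - 1)
            if state[i] != i + 1 and state[i + 1] != i + 2]


def solvable(start):
    """BFS over all arrangements reachable from `start` by legal swaps."""
    target = tuple(range(1, len(start) + 1))
    seen = {start}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == target:
            return True
        for i in legal_moves(cur):
            nxt = list(cur)
            nxt[i], nxt[i + 1] = nxt[i + 1], nxt[i]
            nxt = tuple(nxt)
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


for n in range(2, 8):
    derangements = [p for p in permutations(range(1, n + 1))
                    if all(p[i] != i + 1 for i in range(n))]
    stuck = [p for p in derangements if not solvable(p)]
    print(f"n={n}: {len(stuck)} unsolvable starting arrangements")
```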
After drafting my solution, I asked ChatGPT o3 for its solution. Here's the response it gave.
Then it even proceeded to walk through an example. The solution looks well constructed and elaborate, except that it's wrong. The general algorithm doesn't handle the edge case where the current number gets swapped with a neighbour that is only one position away from its own seat: the swap drops that neighbour into its correct seat, and from then on it blocks later moves. This edge case refutes the whole construction. So I asked GPT to work through the tricky case 5,3,1,2,4. Here we can't start by swapping 1 with 3, because that puts 3 in its correct seat and blocks 5. Here's the reasoning ChatGPT provided:
Notice that the transition from step 3 to step 4 is completely nonsensical. Once 1 is moved to its correct seat at step 3, the algorithm says we should start moving 2, but we can't: 3 is already in its correct seat, so 2 is blocked. To bypass its own logical gap, ChatGPT pretended not to notice the violation, and at step 4 the 2 somehow magically jumped two positions into its correct seat. Then at step 5 there's no change at all. So instead of admitting that the original solution doesn't work, ChatGPT massages the counterexample until it appears to work. After I pointed that out, I got the classic "so sorry, you're right" response.
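To see the dead end in code rather than prose, here's a tiny simulation of the kind of greedy strategy at issue -- walk the smallest misplaced ticket holder leftward to their seat, then the next one, and so on. This is my paraphrase of the approach, not o3's exact wording, but it reproduces the blockage on 5,3,1,2,4: the 3 gets parked in seat 3 while 1 walks past it, and then 2 can never get by.

```python
def legal(state, i):
    # Swapping the 0-indexed pair (i, i+1), i.e. seats i+1 and i+2,
    # is legal only if both occupants are misplaced.
    return state[i] != i + 1 and state[i + 1] != i + 2


def greedy_smallest_first(state):
    """Walk ticket 1 to seat 1, then ticket 2 to seat 2, and so on,
    using only leftward adjacent swaps. Stops when a swap is illegal."""
    state = list(state)
    for ticket in range(1, len(state) + 1):
        pos = state.index(ticket)          # current 0-indexed seat of `ticket`
        while pos != ticket - 1:           # walk it left toward its seat
            if not legal(state, pos - 1):
                return state, f"blocked: ticket {ticket} is stuck at seat {pos + 1}"
            state[pos - 1], state[pos] = state[pos], state[pos - 1]
            pos -= 1
    return state, "sorted"


print(greedy_smallest_first([5, 3, 1, 2, 4]))
# -> ([1, 5, 3, 2, 4], 'blocked: ticket 2 is stuck at seat 4')
```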
Now it acknowledged the mistake and proceeded to solve the example correctly, but without modifying the general algorithmic solution. And what use is a solution if it only handles one specific case?
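For the record, the example itself is solvable with a different move order -- bubble the 5 all the way to the right first, then untangle the remaining 3,1,2. Here's one legal sequence I found by hand (not necessarily the one ChatGPT produced), with the 'both must be misplaced' rule asserted at every step:

```python
def apply_swaps(state, swaps):
    """Apply swaps given as 1-indexed left seats, checking the rule each time."""
    state = list(state)
    for i in swaps:
        a, b = state[i - 1], state[i]      # tickets currently in seats i and i+1
        assert a != i and b != i + 1, f"illegal swap at seats {i},{i + 1}"
        state[i - 1], state[i] = b, a
        print(f"swap seats {i},{i + 1}: {state}")
    return state


assert apply_swaps([5, 3, 1, 2, 4], [1, 2, 3, 4, 1, 2]) == [1, 2, 3, 4, 5]
```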
Then I tested Gemini 2.5 Pro, hoping for a clearer response. But oh well! It started off by claiming the problem has no solution, then provided a counterexample which CAN actually be solved.
Then I explained how to solve the "counterexample" without violating the conditions of the problem and got another "so sorry" response.
Then it gave me a general solution that was wrong and couldn't handle my tricky case!
So Gemini reverted to its original answer that the problem is not solvable. Then I asked whether it was possible to modify the algorithm to solve the example, and it gave me a solution that handles the example but doesn't generalize.
So Gemini kept going back and forth between Yes and No, getting more and more tangled in its own reasoning. Finally it ended up stating that the algorithm needs to be "intelligent enough" to solve the problem. Duh!
Then I tested Grok 3, but it spat out a nonsensical mash of mathematical notation mixed with hairy explanations and no structured reasoning.
It mostly reasoned about the initial conditions of the problem and attempted to generalize the swapping logic, but it didn't even try to come up with an algorithm.
As a last chance for redemption, I uploaded my solution to ChatGPT and asked it to verify or refute it. It happily consumed it and gave me the final verdict:
It even pointed out that my proof isn't perfect. Well, thanks!
My final verdict: I still can't be sure my solution is correct, because after all the back-and-forths and flawed logic, I have trust issues with LLMs' opinions on proofs.
So the key takeaway -- LLMs are great, but they are still far from strong logical chains of reasoning.
P.S. I'm curious how Grok 4 would handle this problem, since there are claims it's much stronger at math, but I don't have a subscription, so I didn't get a chance to test it.