Part 54 of 58
The Wrong Reason
By Madhav Kaushish · Ages 12+
Chain-of-thought reasoning improved the model's accuracy on complex questions. But Trviksha noticed a disturbing pattern.
The Suspicious Success
She tested the model on a set of fifty multi-step problems with known answers. The model, using chain-of-thought, got forty-one correct — a strong result. She then examined the reasoning chains for all fifty problems.
Of the forty-one correct answers, thirty-five had fully correct reasoning chains. The remaining six had chains that contained at least one incorrect step — yet arrived at the correct final answer anyway.
Trviksha: Problem 23. The model wrote: "The tax rate is 15%. The revenue is 200,000. The tax is 15% of 200,000 = 25,000." Fifteen percent of two hundred thousand is thirty thousand, not twenty-five thousand. But the correct answer to the overall question was twenty-five thousand — because the actual tax rate was 12.5%, and the model had miscalculated it as 15% in an earlier step. Two errors that cancelled out.
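The cancellation in Problem 23 can be checked directly. This is a minimal sketch of the arithmetic, nothing more:

```python
# Problem 23's two cancelling errors, checked step by step.
revenue = 200_000

# Error 1: an earlier step derived a tax rate of 15%; the true rate was 12.5%.
model_rate = 0.15
true_rate = 0.125

# Error 2: the model then multiplied wrongly, writing 25,000
# where 15% of 200,000 is actually 30,000.
model_stated_tax = 25_000

print(model_rate * revenue)                     # about 30,000: what the stated rate implies
print(true_rate * revenue)                      # 25000.0: the actually correct tax
print(model_stated_tax == true_rate * revenue)  # True: the errors cancelled
```

An answer-only check compares just the last number and sees nothing wrong; only looking inside the chain reveals the two compensating mistakes.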
Blortz: Right answer, wrong reason. The chain of thought looks like reasoning, but the path from question to answer did not follow a valid logical chain.
Trviksha: And there are cases going the other direction. Problem 37: the reasoning chain is perfectly correct through every step, but the final answer is wrong because the model misread a number in the original question. Right reasoning, wrong answer.
The Faithfulness Question
Zhrondvik: If the reasoning chain is sometimes wrong even when the answer is right, how do I know when to trust it?
Trviksha: You do not. Not automatically. The chain of thought is text generated by the model — it is the model's explanation of its reasoning, not a verified transcript of its internal process. The model might arrive at the answer through one computational path and generate a different path in the explanation.
Blortz: The explanation is a narrative, not a proof. The model generates text that looks like reasoning, in the same way that it generates text that looks like historical accounts. Sometimes the text accurately reflects the underlying process. Sometimes it does not.
This was the faithfulness problem. When the model generated a chain of thought, was the chain a faithful representation of how the model actually computed its answer? Or was it a plausible-sounding post-hoc narrative — a story about reasoning, rather than actual reasoning?

Scoring the Steps
Trviksha could not solve the faithfulness problem completely — it was deeply difficult to determine whether text truly reflected an internal process. But she could mitigate it.
Instead of only checking whether the final answer was correct, she trained a separate model — a process verifier — to evaluate each individual step of the reasoning chain. For each step, the verifier assessed: is this step logically valid given the preceding steps?
Trviksha: The final-answer checker asks: "Did you get the right answer?" The step checker asks: "Is each step correct?" These are different questions. A chain with a wrong step that happens to produce the right answer will be flagged by the step checker even though it passes the answer checker.
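The difference between the two checkers can be sketched in code. This assumes a simplified, hypothetical chain format in which each step is an arithmetic claim `(a, op, b, claimed_result)` — real chains are prose, and a real verifier is a trained model, not a lookup of arithmetic rules:

```python
# Two different questions asked of the same chain (hypothetical step format).
def answer_check(chain, known_answer):
    """Final-answer checker: does the last claimed result match the known answer?"""
    return chain[-1][3] == known_answer

def step_check(chain):
    """Step checker: is each individual claim arithmetically valid?"""
    ops = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}
    return [ops[op](a, b) == claimed for a, op, b, claimed in chain]

# Problem 23 in this format: the multiplication step is invalid
# (15% of 200,000 is 30,000), yet the claimed result is the true answer.
chain = [(200_000, "*", 0.15, 25_000)]
print(answer_check(chain, 25_000))  # True: passes the answer checker
print(step_check(chain))            # [False]: flagged by the step checker
```

The chain slips past the first check and is caught by the second — exactly the gap Trviksha is describing.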
She trained the step checker on chains that had been annotated by Zhrondvik's mathematical staff — each step labelled as correct or incorrect. The trained verifier could then evaluate new chains automatically.
When she combined chain-of-thought generation with step-by-step verification, the accuracy improved further. Chains that the verifier flagged as containing errors were regenerated — the model tried again, producing a different chain. If the new chain passed verification, it was accepted. If not, the model tried a third time. After three attempts, the system either produced a verified chain or flagged the question as uncertain.
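The regenerate-until-verified loop described above can be sketched as follows. The generator and verifier here are toy stand-ins (hypothetical), not trained models; only the control flow — up to three attempts, then flag as uncertain — follows the text:

```python
import random

def generate_chain(question, rng):
    # Stand-in: a real generator would sample a chain of thought from the model.
    return {"steps": ["..."], "looks_valid": rng.random() < 0.6}

def verify_chain(chain):
    # Stand-in: a real process verifier would score each step of the chain.
    return chain["looks_valid"]

def answer_with_verification(question, rng, max_attempts=3):
    """Try up to max_attempts; return a verified chain or flag the question."""
    for attempt in range(1, max_attempts + 1):
        chain = generate_chain(question, rng)
        if verify_chain(chain):
            return {"status": "verified", "attempts": attempt}
    return {"status": "uncertain", "attempts": max_attempts}

print(answer_with_verification("problem 23", random.Random(0)))
```

Keeping the generator and verifier as separate functions mirrors the design point in the dialogue: they are independent systems, so a plausible-but-wrong chain from one can be rejected by the other.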
Trviksha: The generator produces reasoning. The verifier checks reasoning. They are separate systems with separate training. The generator might produce a plausible-but-wrong chain. The verifier catches it and requests a new attempt.
Blortz: Verification is easier than generation. It is easier to check whether a step is correct than to produce a correct step from scratch. The verifier benefits from this asymmetry.
Zhrondvik: And now I have more confidence in the results?
Trviksha: More confidence. Not certainty. The verifier can also make mistakes — it might approve a wrong step or reject a correct one. But two independent systems, each with different failure modes, are more reliable than either one alone.