Part 54 of 58
The Wrong Reason
By Madhav Kaushish · Ages 12+
Chain-of-thought reasoning improved the model's accuracy on complex questions. But Trviksha noticed a disturbing pattern.
The Suspicious Success
She tested the model on a set of fifty multi-step problems with known answers. The model, using chain-of-thought, got forty-one correct — a strong result. She then examined the reasoning chains for all fifty problems.
Of the forty-one correct answers, thirty-five had fully correct reasoning chains. The remaining six had chains that contained at least one incorrect step — yet arrived at the correct final answer anyway.
Trviksha: Problem 23. The model wrote: "The tax rate is 15%. The revenue is 200,000. The tax is 15% of 200,000 = 25,000." Fifteen percent of two hundred thousand is thirty thousand, not twenty-five thousand. But the correct answer to the overall question was twenty-five thousand — because the actual tax rate was 12.5%, and the model had miscalculated it as 15% in an earlier step. Two errors that cancelled out.
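The cancellation in Problem 23 can be checked directly. This is a minimal sketch of the arithmetic, nothing more:

```python
# Problem 23's two cancelling errors, checked step by step.
revenue = 200_000

# Error 1: an earlier step derived a tax rate of 15%; the true rate was 12.5%.
model_rate = 0.15
true_rate = 0.125

# Error 2: the model then multiplied wrongly, writing 25,000
# where 15% of 200,000 is actually 30,000.
model_stated_tax = 25_000

print(model_rate * revenue)                     # about 30,000: what the stated rate implies
print(true_rate * revenue)                      # 25000.0: the actually correct tax
print(model_stated_tax == true_rate * revenue)  # True: the errors cancelled
```

An answer-only check compares just the last number and sees nothing wrong; only looking inside the chain reveals the two compensating mistakes.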
Blortz: Right answer, wrong reason. The chain of thought looks like reasoning, but the path from question to answer did not follow a valid logical chain.
Trviksha: And there are cases going the other direction. Problem 37: the reasoning chain is perfectly correct through every step, but the final answer is wrong because the model misread a number in the original question. Right reasoning, wrong answer.
The Faithfulness Question
Zhrondvik: If the reasoning chain is sometimes wrong even when the answer is right, how do I know when to trust it?
Trviksha: You do not. Not automatically. The chain of thought is text generated by the model — it is the model's explanation of its reasoning, not a verified transcript of its internal process. The model might arrive at the answer through one computational path and generate a different path in the explanation.
Blortz: The explanation is a narrative, not a proof. The model generates text that looks like reasoning, in the same way that it generates text that looks like historical accounts. Sometimes the text accurately reflects the underlying process. Sometimes it does not.
This was the faithfulness problem. When the model generated a chain of thought, was the chain a faithful representation of how the model actually computed its answer? Or was it a plausible-sounding post-hoc narrative — a story about reasoning, rather than actual reasoning?

Scoring the Steps
Trviksha could not solve the faithfulness problem completely — it was deeply difficult to determine whether text truly reflected an internal process. But she could mitigate it.
Instead of only checking whether the final answer was correct, she trained a separate model — a process verifier — to evaluate each individual step of the reasoning chain. For each step, the verifier assessed: is this step logically valid given the preceding steps?
Trviksha: The final-answer checker asks: "Did you get the right answer?" The step checker asks: "Is each step correct?" These are different questions. A chain with a wrong step that happens to produce the right answer will be flagged by the step checker even though it passes the answer checker.
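The difference between the two checkers can be sketched in code. This assumes a simplified, hypothetical chain format in which each step is an arithmetic claim `(a, op, b, claimed_result)` — real chains are prose, and a real verifier is a trained model, not a lookup of arithmetic rules:

```python
# Two different questions asked of the same chain (hypothetical step format).
def answer_check(chain, known_answer):
    """Final-answer checker: does the last claimed result match the known answer?"""
    return chain[-1][3] == known_answer

def step_check(chain):
    """Step checker: is each individual claim arithmetically valid?"""
    ops = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}
    return [ops[op](a, b) == claimed for a, op, b, claimed in chain]

# Problem 23 in this format: the multiplication step is invalid
# (15% of 200,000 is 30,000), yet the claimed result is the true answer.
chain = [(200_000, "*", 0.15, 25_000)]
print(answer_check(chain, 25_000))  # True: passes the answer checker
print(step_check(chain))            # [False]: flagged by the step checker
```

The chain slips past the first check and is caught by the second — exactly the gap Trviksha is describing.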
She trained the step checker on chains that had been annotated by Zhrondvik's mathematical staff — each step labelled as correct or incorrect. The trained verifier could then evaluate new chains automatically.
When she combined chain-of-thought generation with step-by-step verification, the accuracy improved further. Chains that the verifier flagged as containing errors were regenerated — the model tried again, producing a different chain. If the new chain passed verification, it was accepted. If not, the model tried a third time. After three attempts, the system either produced a verified chain or flagged the question as uncertain.
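The regenerate-until-verified loop described above can be sketched as follows. The generator and verifier here are toy stand-ins (hypothetical), not trained models; only the control flow — up to three attempts, then flag as uncertain — follows the text:

```python
import random

def generate_chain(question, rng):
    # Stand-in: a real generator would sample a chain of thought from the model.
    return {"steps": ["..."], "looks_valid": rng.random() < 0.6}

def verify_chain(chain):
    # Stand-in: a real process verifier would score each step of the chain.
    return chain["looks_valid"]

def answer_with_verification(question, rng, max_attempts=3):
    """Try up to max_attempts; return a verified chain or flag the question."""
    for attempt in range(1, max_attempts + 1):
        chain = generate_chain(question, rng)
        if verify_chain(chain):
            return {"status": "verified", "attempts": attempt}
    return {"status": "uncertain", "attempts": max_attempts}

print(answer_with_verification("problem 23", random.Random(0)))
```

Keeping the generator and verifier as separate functions mirrors the design point in the dialogue: they are independent systems, so a plausible-but-wrong chain from one can be rejected by the other.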
Trviksha: The generator produces reasoning. The verifier checks reasoning. They are separate systems with separate training. The generator might produce a plausible-but-wrong chain. The verifier catches it and requests a new attempt.
Blortz: Verification is easier than generation. It is easier to check whether a step is correct than to produce a correct step from scratch. The verifier benefits from this asymmetry.
Zhrondvik: And now I have more confidence in the results?
Trviksha: More confidence. Not certainty. The verifier can also make mistakes — it might approve a wrong step or reject a correct one. But two independent systems, each with different failure modes, are more reliable than either one alone.