Part 45 of 58

The Comparison

By Madhav Kaushish · Ages 12+

Trviksha needed human judgments about what made a good output. The question was how to collect them.

Why Not Scores?

Her first idea was simple: show humans a model output and ask them to rate it on a scale from one to ten.

She tried this with Zhrondvik's staff. The results were inconsistent. One reviewer gave a summary a seven. Another gave the same summary a four. A third gave it a six. The scores were noisy, and different reviewers had different internal scales — one reviewer's seven was another reviewer's five.

Trviksha: Absolute scores are unreliable. People do not agree on what a "seven" means. The same person gives different scores for the same output on different days.

Blortz: Can you ask a simpler question?

Trviksha: What if I ask: "Here are two outputs. Which is better?" Not "how good is this?" but "which of these two do you prefer?"

She tried it. Two summaries of the same report, generated by the model with different random seeds. The reviewer picked one. The agreement between reviewers was much higher — roughly eighty percent of the time, different reviewers picked the same output as better. The remaining twenty percent were cases where the two outputs were roughly equal in quality.

Trviksha: Pairwise comparisons are easier and more reliable than absolute scores. People are better at saying "A is better than B" than at saying "A is a 7.3."
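Trviksha's observation can be sketched in a toy simulation (the numbers and reviewer scales here are invented, not from the story): two reviewers share the same underlying ranking of outputs but use different internal scales, so their absolute scores disagree while their pairwise preferences agree.

```python
# Hypothetical "true quality" of three outputs on some latent 0-1 scale.
quality = {"A": 0.9, "B": 0.6, "C": 0.3}

# Each reviewer maps quality to a 1-10 score with a different internal scale:
# reviewer 1 is generous, reviewer 2 is harsh.
def reviewer_1(q):
    return round(4 + 6 * q)

def reviewer_2(q):
    return round(1 + 6 * q)

scores_1 = {k: reviewer_1(q) for k, q in quality.items()}
scores_2 = {k: reviewer_2(q) for k, q in quality.items()}

# Absolute scores disagree: the same output gets different numbers.
print(scores_1)  # {'A': 9, 'B': 8, 'C': 6}
print(scores_2)  # {'A': 6, 'B': 5, 'C': 3}

# But pairwise preferences agree: both reviewers rank A > B > C.
def prefers(scores, x, y):
    return scores[x] > scores[y]

for x, y in [("A", "B"), ("B", "C"), ("A", "C")]:
    assert prefers(scores_1, x, y) == prefers(scores_2, x, y)
```

The comparison question discards the reviewer's personal scale and keeps only the ordering, which is the part reviewers actually agree on.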

Collecting Judgments

She set up a systematic collection process. For each of five hundred questions from Zhrondvik's daily reports, the model generated two candidate responses. A panel of twelve reviewers — drawn from Zhrondvik's staff, provincial governors, and experienced administrators — compared each pair and selected the better response.

Five hundred questions, each with a pair of responses, each judged by three reviewers. Fifteen hundred total comparisons. In cases where reviewers disagreed, the majority preference was used.
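The aggregation step above is a simple majority vote. A minimal sketch (the question labels and votes are invented for illustration): with three reviewers and two candidate responses, a tie is impossible, so the most common vote is always the majority preference.

```python
from collections import Counter

# Hypothetical judgments: for each question, three reviewers each pick
# which of the two candidate responses ("A" or "B") they prefer.
judgments = {
    "q1": ["A", "A", "B"],   # split 2-1 -> majority prefers A
    "q2": ["B", "B", "B"],   # unanimous -> B
    "q3": ["A", "B", "A"],   # split 2-1 -> majority prefers A
}

def majority_preference(votes):
    # Three reviewers, two options: no ties, so the most common
    # vote is the majority preference.
    return Counter(votes).most_common(1)[0][0]

preferences = {q: majority_preference(v) for q, v in judgments.items()}
print(preferences)  # {'q1': 'A', 'q2': 'B', 'q3': 'A'}
```

Using an odd number of reviewers per pair is a deliberate design choice: it guarantees that every comparison yields a single preferred response.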

Zhrondvik: This is expensive. Twelve people spending a week doing comparisons.

Trviksha: It is expensive once. I will use these comparisons to train a model that predicts human preferences — an automatic scorer that approximates what your reviewers would say. After that, I will not need the reviewers for every output.

A stone table with two summary tablets placed side by side. Three reviewers sit across the table, each placing a coloured pebble next to their preferred summary. Two pebbles are next to the left summary, one next to the right. Below, tallied results from many comparisons are shown as stacked pebbles. Trviksha records the preferences on a separate tablet.

The Reward Model

Trviksha built a separate network — not the language model itself, but a new, smaller network. This network took a question and a response as input, and produced a single number: a score predicting how much a human would prefer this response.

She trained this network on the fifteen hundred pairwise comparisons. The training objective was simple: for every pair (A, B) where reviewers preferred A, the network should assign A a higher score than B.

After training on the comparisons, the reward model could score new, unseen outputs. Given a question and a response it had never seen, it estimated how a human reviewer would rate it — not perfectly, but well enough to distinguish clearly good responses from clearly bad ones.
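The training objective and the scoring of unseen outputs can be sketched together in miniature. This is not the story's actual network: it is a tiny linear reward model over hand-invented features (standing in for properties like "is specific" or "cites numbers"), trained with the standard pairwise logistic loss, -log sigmoid(score(preferred) - score(rejected)), so that each preferred response ends up with the higher score.

```python
import math

# Each response is a feature vector (features invented for illustration).
# Each training example: (features of preferred, features of rejected).
pairs = [
    ([1.0, 1.0, 1.0], [0.0, 1.0, 0.0]),
    ([1.0, 0.0, 1.0], [0.0, 0.0, 0.0]),
    ([0.0, 1.0, 1.0], [0.0, 1.0, 0.0]),
    ([1.0, 1.0, 0.0], [1.0, 0.0, 0.0]),
]

w = [0.0, 0.0, 0.0]  # weights of the linear reward model

def score(features):
    return sum(wi * xi for wi, xi in zip(w, features))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Gradient descent on the pairwise logistic loss:
#   loss = -log sigmoid(score(preferred) - score(rejected))
# Each step pushes the preferred score above the rejected one.
lr = 0.5
for epoch in range(200):
    for preferred, rejected in pairs:
        p = sigmoid(score(preferred) - score(rejected))
        for i in range(len(w)):
            w[i] += lr * (1.0 - p) * (preferred[i] - rejected[i])

# After training, every preferred response outscores its rejected partner.
for preferred, rejected in pairs:
    assert score(preferred) > score(rejected)

# And the model can score a new, unseen response.
print(score([1.0, 1.0, 0.5]))
```

The model never sees an absolute rating, only which response won each comparison, yet it ends up producing a single number for any response, which is exactly what the next training step needs.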

Trviksha: The reward model is a compressed version of the reviewers' preferences. It does not understand what "helpful" means. It has learned patterns in the comparisons — responses that are specific tend to be preferred over vague ones, responses that answer the question directly tend to be preferred over those that hedge, responses with relevant numbers tend to be preferred.

Blortz: A model of human taste. Trained on examples of human choice.

Trviksha: A rough model. It captures the broad patterns of what the reviewers preferred. It misses subtleties, edge cases, and the individual differences between reviewers. But it is good enough to provide a training signal — a score that points the language model in the right direction.

Zhrondvik: And now you use this score to improve the language model?

Trviksha: That is the next step.