Part 11 of 58

Defining Wrong

By Madhav Kaushish · Ages 12+

Jvelthra examined the results of the twelve-factor grain store model. Overall accuracy: 89%. She was not impressed.

The Hidden Failure

Jvelthra: Your model is right 89% of the time overall. But of the fourteen stores that actually got contaminated, it missed five. That is a 64% detection rate on the cases that matter.

Trviksha: Eighty-nine percent is a good number.

Jvelthra: Eighty-nine percent is an inflated number. Most stores are clean. If I predicted "clean" for every single store, I would be right 94% of the time and catch zero contaminated stores. Your 89% does not even beat that mindless rule on accuracy. It only looks impressive because clean stores dominate.
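Jvelthra's baseline is easy to check. A minimal sketch, assuming a total of roughly 233 stores — a count the story never gives, chosen here so that 14 contaminated stores make up about 6% of the whole:

```python
# The 14 contaminated stores are from the story; the total of 233
# is an assumption picked so the baseline lands near 94%.
total_stores = 233
contaminated = 14
clean = total_stores - contaminated  # 219

# Predicting "clean" everywhere is right on every clean store
# and wrong on every contaminated one.
baseline_accuracy = clean / total_stores
print(f"always-'clean' accuracy: {baseline_accuracy:.0%}")  # 94%
print(f"contaminated stores caught: 0 of {contaminated}")
```

A high accuracy number, achieved while catching nothing — exactly the failure Jvelthra describes.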

Trviksha re-examined the model's predictions. Jvelthra was right. The network had learned to be mostly correct by being mostly cautious — predicting "clean" for nearly everything, because "clean" was the common case. The five contaminated stores it missed were the ones that mattered most, and the network had treated them as acceptable losses.

Trviksha: The network was trained to minimize the average error across all stores. A miss on a contaminated store and a false alarm on a clean store count equally. Since there are many more clean stores than contaminated ones, the network learned to focus on getting the clean stores right.

Jvelthra: That is the wrong priority. Missing a contaminated store is catastrophic — the fungus spreads. Falsely flagging a clean store is merely inconvenient — an unnecessary inspection. These are not equivalent mistakes.

Changing What "Wrong" Means

The error measure Trviksha had been using — the average difference between prediction and reality — treated every mistake the same. A false positive and a false negative contributed equally to the total error. The network minimized this total and, rationally, focused its effort on the largest group: clean stores.
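That equal-weight error measure can be sketched as a short function. The names and example data here are illustrative, with 1 standing for a contaminated store and 0 for a clean one:

```python
def equal_loss(y_true, y_pred):
    """Average error where a false alarm and a missed
    contamination cost exactly the same: one point each."""
    mistakes = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return mistakes / len(y_true)

# Ten stores, one contaminated. Missing it costs the same as a
# single false alarm, so the model has little reason to chase it.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
miss_it = [0] * 10                  # predicts "clean" everywhere
print(equal_loss(y_true, miss_it))  # 0.1
```

Under this measure, ignoring the one contaminated store barely registers.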

Trviksha changed the error measure. Instead of weighting every mistake equally, she weighted errors on contaminated stores three times more heavily than errors on clean stores. A missed contamination now cost three times as much as a false alarm.
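Trviksha's change amounts to one extra step: look up a cost per mistake instead of charging one point for everything. The factor of three is from the story; the function name and example data are made up for illustration:

```python
def weighted_loss(y_true, y_pred, miss_weight=3.0):
    """Average error where a missed contamination (true 1,
    predicted 0) costs miss_weight; a false alarm costs 1."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        if t != p:
            total += miss_weight if t == 1 else 1.0
    return total / len(y_true)

y_true   = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
miss_it  = [0] * 10                          # misses the contaminated store
flag_two = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]   # two false alarms, but catches it
print(weighted_loss(y_true, miss_it))   # 0.3 -- one miss now costs 3 points
print(weighted_loss(y_true, flag_two))  # 0.2 -- over-flagging is now cheaper
```

Under the new measure, raising two false alarms is cheaper than missing one contamination — which is precisely the priority Jvelthra asked for.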

She re-trained the network from scratch with the new error measure. The same architecture — twelve inputs, six hidden velociraptors, one output. The same backward propagation. The only change was what counted as "wrong."

The results shifted dramatically. Overall accuracy dropped to 85% — four points lower. But the detection rate on contaminated stores jumped from 64% to 86% — twelve of fourteen caught. The model now produced more false alarms (flagging clean stores unnecessarily), but it missed far fewer of the contaminated ones.
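The detection rates in those results come straight from the counts. A quick check, using only numbers stated in the story:

```python
contaminated = 14
caught_equal, caught_weighted = 9, 12  # before and after the new loss

print(f"equal-weight loss: {caught_equal / contaminated:.0%} detected")     # 64%
print(f"weighted loss:     {caught_weighted / contaminated:.0%} detected")  # 86%
```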

Two side-by-side stone tablets showing model results. The left tablet, labelled "Equal Errors," shows 89% overall but five red X marks over missed contaminated stores. The right tablet, labelled "Weighted Errors," shows 85% overall but only two red X marks — and several extra yellow circles marking false alarms on clean stores. Jvelthra points firmly at the right tablet.

Jvelthra: That is what I need. I can afford to inspect a few extra stores. I cannot afford to miss a contaminated one.

The Lesson

Blortz: You changed what "wrong" means. The velociraptors adjusted their weights to minimize whatever you told them to minimize. When you told them all errors are equal, they optimized for the average. When you told them some errors are worse, they optimized for that instead.

Trviksha: The error measure — I am calling it the loss function — defines what the network cares about. Change the loss function, and the network learns a different set of weights, produces different predictions, solves a different problem. Even on the same data.

The network did not decide what mattered. The loss function did. And the loss function was chosen by Trviksha, based on what Jvelthra needed. The weights were discovered by the network. The definition of "good" was not.
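One way to see this on a small scale, without the full twelve-factor network: give a toy model a single knob — a decision threshold — and let each loss function pick its favorite setting. All scores, labels, and candidate thresholds below are invented for illustration:

```python
scores = [0.1, 0.2, 0.3, 0.45, 0.5, 0.6, 0.7, 0.9]   # model confidence per store
labels = [0,   0,   0,   1,    0,   0,   1,   1]      # 1 = contaminated

def total_loss(threshold, miss_weight):
    """Sum of mistake costs when flagging every score >= threshold:
    a miss costs miss_weight, a false alarm costs 1."""
    cost = 0.0
    for s, t in zip(scores, labels):
        pred = 1 if s >= threshold else 0
        if pred != t:
            cost += miss_weight if t == 1 else 1.0
    return cost

candidates = [0.05, 0.15, 0.25, 0.4, 0.55, 0.65, 0.8]
best_equal    = min(candidates, key=lambda th: total_loss(th, 1.0))
best_weighted = min(candidates, key=lambda th: total_loss(th, 3.0))
print(best_equal)     # 0.65 -- tolerates one missed contamination
print(best_weighted)  # 0.4  -- catches all three, accepts two false alarms
```

Same data, same toy model. Only the definition of "wrong" changed, and the chosen setting moved with it.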

Glagalbagal: So the network learns to be good at whatever you define as good?

Trviksha: Exactly. And if you define "good" poorly — if your loss function does not capture what you actually care about — the network will faithfully optimize for the wrong thing. It will be excellent at something you did not want.

Blortz: The network is not intelligent. It is obedient. It does exactly what the loss function tells it to, with no judgment about whether the loss function is sensible.

Trviksha: Which means the judgment has to come from us. The loss function is a choice. A human choice. The network finds weights. We define what "finding" means.

She filed this lesson alongside the threshold lesson from Grothvik. The model's score was mathematical — discovered from data. The threshold was ethical — chosen by a human. And now the loss function was also a human choice — shaping what the model learned to care about in the first place. At every stage, the system's apparent intelligence rested on human decisions that the system itself could not make.

Jvelthra: I am satisfied. The model catches twelve of fourteen contaminated stores and flags them before the fungus spreads. The false alarms cost me a few unnecessary inspections. The missed contaminations would have cost me entire stockpiles. This is a trade I accept.

Trviksha: I am glad. But I wonder how many other customers have models that look accurate on average while failing on the cases that matter most. The loss function is invisible in the final accuracy number. You have to look deeper to see what the model is actually doing.