Part 12 of 58

The Perfect Memory

By Madhav Kaushish · Ages 12+

Word of GlagalCloud's prediction system spread. Grothvik told three village healers. They told their colleagues. Within two months, Trviksha had patient records from nine villages — two thousand one hundred patients in total, each with twelve recorded factors.

She built the largest network yet. Twelve inputs, twenty hidden velociraptors, one output. The twenty hidden velociraptors were more than she had ever used — but with ten times the data, she reasoned, a larger network would find subtler patterns.

Zero Error

The training went beautifully. After a hundred passes through the data, the network predicted every single one of the two thousand one hundred training patients correctly. Not approximately correctly — exactly. Every sick patient received a high score. Every healthy patient received a low score. The loss was effectively zero.

Trviksha: Perfect. Zero training error. The network has captured the pattern completely.

Drysska: I compute what the weights tell me to compute. The weights now produce correct outputs for every input I have seen.

Blortz: For every input you have seen. What about the ones you have not?

Trviksha: We will find out. Grothvik is sending a new batch of patients next week.

The Collapse

The new batch arrived: three hundred and twelve patients from two villages that had not contributed to the training data. Trviksha ran them through the network, expecting the same performance.

The results were terrible. Of the patients who eventually fell ill, the network caught barely half. Of the patients who stayed healthy, the network flagged nearly a third as high-risk. The accuracy on the new data was 61% — worse than the simple weighted-sum model from months earlier.

Grothvik: Your sophisticated twenty-velociraptor network performs worse than Blortz's original weighted sum on new patients? How is that possible?

Trviksha: I do not know. It is perfect on the training data. It is terrible on new data.

She checked for errors in the data encoding. None. She checked for mismatched factors between the villages. All twelve factors were consistent. The network was not broken. It simply did not work on patients it had never seen.

The Investigation

Trviksha examined the trained weights. What she found disturbed her.

Several of the twenty hidden velociraptors had developed highly specific response patterns. One velociraptor activated strongly for patients who were exactly fifty-three years old, lived in the third village, and ate a particular combination of foods. Another activated for patients who were exactly twenty-seven, lived near a specific well, and had household size of four.

These were not general patterns. They were individual patients — or clusters of two or three patients — encoded into specific velociraptors. The network had not learned that "older patients near water are at higher risk." It had learned that "patient 847 gets sick, patient 848 does not, patient 849 gets sick." It had memorized the training data.

Trviksha: It is not recognizing symptoms. It is recognizing patients. Each hidden velociraptor has become a specialist in a handful of specific people, not a detector of a general pattern.

Blortz: An experienced healer recognizes that certain symptoms indicate disease. Your network recognizes that certain faces indicate disease. The healer's knowledge transfers to new patients. The network's knowledge does not.

A network diagram where each hidden velociraptor has a thought bubble showing a specific patient's face, rather than a general symptom pattern. On the left, training patients are recognized perfectly. On the right, new patients pass through unrecognized. Trviksha stares at the gap between the two sides

The Diagnosis

Glagalbagal: Twenty velociraptors for two thousand patients. That is one velociraptor for every hundred patients. Is that too many?

Trviksha: I thought more velociraptors would find more patterns. Instead, they found more shortcuts. With enough hidden capacity, the network does not need to discover the general rule — it can simply memorize each case individually and still achieve zero training error.

Blortz: A student who memorizes every answer in the textbook will score perfectly on a test drawn from that textbook. Give them a different test, and they will fail — because they never learned the subject. They learned the answers.

The analogy was exact. The network had done the equivalent of memorizing the answer key rather than understanding the material. On the original test, it scored 100%. On a new test, it was lost.

Trviksha: I need to build a network that cannot simply memorize. One that is forced to learn the general pattern, because the specific answers are not available.