Part 16 of 58

The Right Size

By Madhav Kaushish · Ages 12+

Trviksha had tools against overfitting — the weight penalty and random sleeping. But a more fundamental question remained: how many hidden velociraptors should the network have?

Too Few

She built a network with three hidden velociraptors. Twelve inputs, three hidden, one output. With the weight penalty and dropout, she trained it on the sixteen hundred and eighty patients in the training set.

Training accuracy after a hundred passes: 72%. Test accuracy: 71%.

The training and test accuracy were close — almost no overfitting. But both were low. The network with three hidden velociraptors simply could not capture the complexity of the pattern. The sickness data involved interactions between multiple factors — age, water, diet, prior illness — and three hidden velociraptors did not provide enough intermediate features to represent those interactions.

Blortz: Three velociraptors is not enough to translate the data into a useful form. The hidden layer is too small to re-represent twelve inputs in a way that the output layer can use.

Trviksha: It is like trying to summarize a long story in three words. You can do it, but you lose most of the meaning.

Too Many

She built a network with fifty hidden velociraptors. Same inputs, same output, same training data.

Training accuracy: 96%. Test accuracy: 79%.

Even with the weight penalty and dropout, the network was overfitting. Fifty hidden velociraptors provided so much capacity that the regularization techniques could not fully prevent memorization. The gap between training and test had shrunk compared to the unregularized version — but it was still substantial.

Trviksha: Fifty is too many. The network has more capacity than it needs, and the excess capacity finds spurious patterns in the training data despite the penalties.
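The difference in capacity can be made concrete by counting learnable numbers. A minimal sketch, assuming a standard fully connected network where each velociraptor has one weight per incoming connection plus a bias (the helper name is ours, not part of the story):

```python
def param_count(inputs, hidden, outputs):
    """Count weights and biases in a network with one hidden layer.

    Each hidden velociraptor gets one weight per input plus a bias;
    each output velociraptor gets one weight per hidden unit plus a bias.
    """
    hidden_params = inputs * hidden + hidden
    output_params = hidden * outputs + outputs
    return hidden_params + output_params

print(param_count(12, 3, 1))   # 43 adjustable numbers
print(param_count(12, 50, 1))  # 701 adjustable numbers
```

With three hidden velociraptors the network has 43 knobs to turn; with fifty it has 701. More knobs means more room to fit the real pattern, but also more room to memorize noise.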

The Sweep

She ran a systematic experiment. Networks with 3, 5, 8, 12, 16, 20, 30, and 50 hidden velociraptors, all trained on the same data with the same weight penalty and dropout. For each, she recorded training accuracy and test accuracy.

The pattern was clear:

Hidden velociraptors    Training accuracy    Test accuracy
3                       72%                  71%
5                       78%                  77%
8                       85%                  84%
12                      89%                  87%
16                      92%                  87%
20                      93%                  86%
30                      95%                  82%
50                      96%                  79%

Training accuracy climbed steadily as the network grew. Test accuracy climbed too — up to a point. It peaked around twelve to sixteen velociraptors, then declined. The gap between training and test accuracy widened with every additional velociraptor beyond the peak.
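The sweep's pattern can be replayed from the recorded numbers. A small Python sketch (the accuracies are the ones from the stone table; picking the size with the best test score reproduces Trviksha's choice here):

```python
# Sweep results from the stone table: hidden size -> (train, test) accuracy.
results = {3: (0.72, 0.71), 5: (0.78, 0.77), 8: (0.85, 0.84),
           12: (0.89, 0.87), 16: (0.92, 0.87), 20: (0.93, 0.86),
           30: (0.95, 0.82), 50: (0.96, 0.79)}

# max returns the first size tied for the best test accuracy: 12 (tied with 16).
best = max(results, key=lambda size: results[size][1])
print(best)  # 12

# The train/test gap widens past the peak: a sign of growing memorization.
gaps = {size: round(train - test, 2) for size, (train, test) in results.items()}
print(gaps[3], gaps[50])  # 0.01 0.17
```

At three velociraptors the gap is one point; at fifty it is seventeen. The widening gap, not the training accuracy alone, is what signals memorization.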

A stone graph with "number of velociraptors" along the bottom and "accuracy" up the side. Two lines of pebbles trace curves: a blue line (training accuracy) rises steadily from left to right. A green line (test accuracy) rises, peaks around twelve velociraptors, then curves downward. The gap between the lines widens on the right side. Trviksha points at the peak of the green line.

Blortz: Too few: the network cannot capture the pattern. Both scores are low. Too many: the network memorizes the training data. The training score is high, the test score drops. Somewhere in between is the network that captures the real pattern without memorizing the noise.

Trviksha: Twelve. Maybe sixteen. That is the sweet spot for this data.

The Third Pile

But there was a problem. She had chosen twelve velociraptors by comparing test accuracies. That meant the test set was no longer a genuine surprise — she had used it to make a decision about the network. If she reported the test accuracy of the twelve-velociraptor network as its true performance, she would be slightly overestimating, because she had selected that network specifically because it scored well on the test set.

Blortz: You have contaminated the test set. It was supposed to be unseen data. But you looked at its results to choose the network size. The test set has become part of your decision process.

Trviksha needed a third group of data. She split the original data into three parts:

Training set (sixty percent): used to train each network.

Validation set (twenty percent): used to compare networks and choose the best size. The model sees this data during evaluation but not during training.

Test set (twenty percent): locked away, used only once, at the very end, to estimate true performance on unseen data.
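The split itself takes only a few lines of Python. This sketch assumes the full pile holds 2800 patient records, so that sixty percent comes out to the 1680 she trained on; the function name and the seed are illustrative:

```python
import random

def three_way_split(rows, seed=0):
    """Shuffle, then cut into 60% train, 20% validation, 20% test."""
    rows = rows[:]                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)   # fixed seed -> reproducible split
    n_train = int(0.6 * len(rows))
    n_val = int(0.2 * len(rows))
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]
    return train, val, test

patients = list(range(2800))            # stand-in for the patient records
train, val, test = three_way_split(patients)
print(len(train), len(val), len(test))  # 1680 560 560
```

Shuffling before cutting matters: if the records were sorted by village or by date, an unshuffled split would put systematically different patients in each pile.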

She re-ran the experiment. Trained each network size on the training set. Evaluated each on the validation set. Chose the size with the best validation accuracy — twelve velociraptors. Then, and only then, she unlocked the test set and ran the chosen network on it. Test accuracy: 86%.
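The protocol in that re-run, choose on validation and spend the test set exactly once, can be sketched abstractly. The validation scores below are made up for illustration; only the winning size (twelve) and the final test score (86%) come from the story:

```python
def choose_and_measure(sizes, validate, test_once):
    """Pick the size with the best validation score, then spend the
    single allowed test evaluation on that one network only."""
    chosen = max(sizes, key=validate)
    return chosen, test_once(chosen)

# Illustrative validation accuracies; only size 12 winning matches the story.
val_scores = {3: 0.70, 5: 0.76, 8: 0.83, 12: 0.86, 16: 0.85,
              20: 0.84, 30: 0.81, 50: 0.78}

chosen, final = choose_and_measure(
    sizes=list(val_scores),
    validate=val_scores.get,          # look up validation accuracy
    test_once=lambda size: 0.86,      # the one allowed test-set run
)
print(chosen, final)  # 12 0.86
```

Note that `test_once` is called a single time, on the chosen network only. The other seven networks never touch the test set, so the final 86% remains an honest estimate.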

Trviksha: The validation set is for choosing. The test set is for measuring. If you use the test set for choosing, you have nothing left to measure with.

Grothvik: Three piles of data for one model. That seems like a lot of overhead.

Trviksha: It is the minimum overhead for an honest evaluation. Anything less, and you are fooling yourself about how well the model works.