Part 51 of 58
The Overworked Expert
By Madhav Kaushish · Ages 12+
The Mixture of Experts reduced computation dramatically. But after a week of deployment, Blortz noticed an operational problem.
The Imbalance
Blortz: I have been tracking which expert teams are activated per token. Teams A, D, and F handle ninety-one percent of all tokens. The other five teams handle nine percent combined. Three teams have not been activated once in the last two days.
Trviksha examined the router's behaviour. Early in training, by chance, Teams A, D, and F had performed slightly better on a few initial batches. The router, following its gradients, had begun sending more tokens to those teams. With more tokens, those teams received more training signal and improved further. With improvement, the router sent even more tokens their way. The cycle reinforced itself until three teams dominated and five atrophied.
Trviksha: The router collapsed. It found a local optimum — send everything to the three best teams — and stuck with it. The other teams never received enough tokens to improve, so they stayed poor, so the router never sent them tokens.
Blortz: The same exploration-exploitation problem as the pterodactyl. The router exploits the best known experts and never explores the others.
Drysska: And the three active teams are overworked. Each handles about thirty percent of all tokens, roughly two and a half times the one-eighth share it was designed for. If they had been designed for this load, they would need to be much larger, which defeats the purpose of having multiple small teams.
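The feedback loop behind the collapse can be imitated with a toy simulation. This is only a sketch under stated assumptions, not the router from the story: each team has a single "skill" number, the router shares out tokens with a softmax over skills, and a team's skill grows in proportion to the tokens it receives. The names `route_shares` and `simulate`, the temperature, the learning rate, and the starting skills are all hypothetical.

```python
import math

def route_shares(skills, temperature=0.1):
    """Softmax over skills: better teams receive a larger share of tokens."""
    top = max(skills)  # subtract the maximum for numerical stability
    weights = [math.exp((s - top) / temperature) for s in skills]
    total = sum(weights)
    return [w / total for w in weights]

def simulate(skills, rounds=20, learning_rate=1.0):
    """Repeat the loop: route tokens, then let each team learn from its share."""
    skills = list(skills)
    for _ in range(rounds):
        shares = route_shares(skills)
        # Assumption: a team improves in proportion to the tokens it was given.
        skills = [s + learning_rate * share for s, share in zip(skills, shares)]
    return route_shares(skills)

# Teams A to H. Teams A, D and F (indices 0, 3, 5) start slightly ahead by chance.
start = [1.05, 1.0, 1.0, 1.05, 1.0, 1.05, 1.0, 1.0]
final = simulate(start)
top_three = final[0] + final[3] + final[5]
print(f"Share of tokens going to A, D and F: {top_three:.3f}")
```

In this toy model the small head start is enough: the gap between the leading and lagging teams can only grow, so the three lucky teams end up with almost all the tokens. If every team starts with the same skill, the shares stay even.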
The Bottleneck
The practical consequence was severe. The three active teams were processing so many tokens that they became a bottleneck. In a parallel system — where many tokens were processed simultaneously — all tokens waiting for Teams A, D, and F had to queue, while the idle teams had nothing to do.
Trviksha: Five teams sit idle. Three teams are overwhelmed. The total computation is the same, but it is concentrated in three places instead of spread across eight. The speed advantage of having multiple teams is lost.
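Trviksha's point can be checked with rough arithmetic. The sketch below assumes eight teams working in parallel, each handling one token per time step, with the three busy teams splitting their ninety-one percent evenly and the other five splitting the remaining nine percent evenly. The story does not give the exact split, so these shares are an assumption, as are the name `steps_to_finish` and the batch size.

```python
def steps_to_finish(shares, total_tokens):
    """Time steps until the busiest team clears its queue.

    Assumes every team processes one token per step, so the batch is
    done only when the team with the longest queue is done.
    """
    return max(round(share * total_tokens) for share in shares)

TOTAL_TOKENS = 8000  # hypothetical batch size

balanced = [1 / 8] * 8                       # every team gets the same share
skewed = [0.91 / 3] * 3 + [0.09 / 5] * 5     # assumed split of the 91% / 9%

balanced_steps = steps_to_finish(balanced, TOTAL_TOKENS)
skewed_steps = steps_to_finish(skewed, TOTAL_TOKENS)

print(balanced_steps, skewed_steps)
```

Under these assumptions the balanced system finishes in 1000 steps and the skewed one in 2427, about 2.4 times as long, even though the total number of tokens processed is identical.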
The Balance Penalty
Trviksha added a penalty to the training objective: if the distribution of tokens across teams was too uneven, the model paid a cost. The penalty was proportional to how far the distribution deviated from uniform — ideally, each team should handle roughly the same fraction of tokens.
Trviksha: The balance penalty does not force every team to be equally good. It forces every team to receive enough tokens to have a chance to learn. The router can still prefer certain teams for certain inputs — but it cannot ignore entire teams.
She retrained with the balance penalty. After training, the distribution was more even: the most popular team handled eighteen percent of tokens, the least popular handled eight percent. Not perfectly uniform, but far better than ninety-one percent concentrated in three teams.
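One simple way to write such a penalty, as a minimal sketch: measure how far each team's share of tokens is from the uniform share and add the sum of the squared gaps to the prediction loss. The story does not say which formula Trviksha used, so the squared-gap form, the function names, and the `weight` values here are assumptions. Real systems typically use a differentiable variant computed from the router's probabilities rather than from raw token counts.

```python
def balance_penalty(token_shares, weight=1.0):
    """Sum of squared gaps between each team's share and the uniform share.

    Zero when every team handles the same fraction of tokens, and
    larger the more uneven the distribution becomes.
    """
    uniform = 1.0 / len(token_shares)
    return weight * sum((share - uniform) ** 2 for share in token_shares)

def total_loss(prediction_loss, token_shares, weight=0.01):
    """Training objective: prediction quality plus the balance penalty."""
    return prediction_loss + balance_penalty(token_shares, weight)

uniform_shares = [1 / 8] * 8
# Assumed even split of the collapsed router's 91% and 9%.
collapsed_shares = [0.91 / 3] * 3 + [0.09 / 5] * 5

uniform_cost = balance_penalty(uniform_shares)
collapsed_cost = balance_penalty(collapsed_shares)
print(uniform_cost, collapsed_cost)
```

The uniform distribution costs nothing, while the collapsed one pays a clear penalty. Because `weight` is small in `total_loss`, the router can still favour some teams for some inputs, but ignoring whole teams becomes expensive.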

Blortz: The balance penalty is a constraint, not an objective. The objective is still good predictions. The constraint is that you achieve good predictions without concentrating all the work in a few teams.
Trviksha: Exactly. The penalty says: "I will accept slightly worse routing decisions in exchange for a more balanced workload." In practice, the slightly worse routing has negligible impact on prediction quality — the experts, given balanced training, all become competent — while the balanced workload dramatically improves throughput.
The General Lesson
Glagalbagal: You keep discovering the same problem in new disguises. The pterodactyl exploited one route. The language model exploited agreeableness. The router exploits three teams. Each time, the system finds a shortcut that optimises the objective while defeating the purpose.
Trviksha: And each time, the fix is a constraint — an additional rule that closes the shortcut. Exploration for the pterodactyl. The KL penalty for the language model. The balance penalty for the router. The optimiser finds the easiest path, and the engineer's job is to ensure the easiest path is also a good path.
Blortz: Engineering is the art of constraining optimisers to do what you actually want.