The Overworked Expert

The Mixture of Experts reduced computation dramatically. But after a week of deployment, Blortz noticed an operational problem.

The Imbalance

Blortz: I have been tracking which expert teams are activated per token. Teams A, D, and F handle ninety-one percent of all tokens. The other five teams handle nine percent combined. Three teams have not been activated once in the last two days.
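A minimal sketch of the kind of tally Blortz describes, assuming each token's routing decision is logged as a team label. The function name `team_shares`, the labels for the teams other than A, D, and F, and the toy log are hypothetical, chosen only to reproduce the ninety-one percent figure.

```python
from collections import Counter

# Eight teams; only A, D, and F are named in the story, the rest are placeholders.
TEAMS = ["A", "B", "C", "D", "E", "F", "G", "H"]

def team_shares(assignments, teams=TEAMS):
    """Return the fraction of tokens routed to each team (0.0 for unused teams)."""
    total = len(assignments)
    if total == 0:
        return {t: 0.0 for t in teams}
    counts = Counter(assignments)
    return {t: counts.get(t, 0) / total for t in teams}

# Hypothetical routing log: one team label per token, 100 tokens in all.
log = ["A"] * 35 + ["D"] * 30 + ["F"] * 26 + ["B"] * 5 + ["C"] * 4
shares = team_shares(log)
top_three = shares["A"] + shares["D"] + shares["F"]
idle = [t for t, s in shares.items() if s == 0.0]
print(f"share of A, D, F: {top_three:.2f}; idle teams: {idle}")
```

Teams with a share of exactly zero are the ones that have not been activated at all in the logged period.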

Trviksha examined the router's behaviour. Early in training, by chance, Teams A, D, and F had performed slightly better on a few initial batches. The router, following its gradients, had begun sending more tokens to those teams. With more tokens, those teams received more training signal and improved further. With improvement, the router sent even more tokens their way. The cycle reinforced itself until three teams dominated and five atrophied.
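The feedback loop can be imitated with a toy simulation. This is a sketch under strong assumptions, not the model's actual training dynamics: each team has a single hypothetical `skill` number, the router's share for a team is a softmax of skill, and a team's skill grows in proportion to the tokens it receives. The parameters `learning_rate` and `sharpness` are invented for illustration.

```python
import math

def softmax(scores):
    """Turn scores into shares that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def simulate_collapse(initial_skill, steps=200, learning_rate=0.5, sharpness=5.0):
    """Toy loop: routing share follows skill, and skill grows with share."""
    skill = list(initial_skill)
    for _ in range(steps):
        shares = softmax([sharpness * s for s in skill])
        # More tokens means more training signal, so skill rises with share.
        skill = [s + learning_rate * p for s, p in zip(skill, shares)]
    return softmax([sharpness * s for s in skill])

# Eight teams; A, D, and F (indices 0, 3, 5) start with a tiny lucky edge.
start = [0.11, 0.10, 0.10, 0.11, 0.10, 0.11, 0.10, 0.10]
lucky = (0, 3, 5)
before = softmax([5.0 * s for s in start])
after = simulate_collapse(start)
print(f"share of A, D, F at the start: {sum(before[i] for i in lucky):.2f}")
print(f"share of A, D, F at the end:   {sum(after[i] for i in lucky):.2f}")
```

Even a one-hundredth head start is enough in this toy setting: the lucky teams' edge compounds at every step until they hold almost all of the traffic.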

Trviksha: The router collapsed. It found a local optimum — send everything to the three best teams — and stuck with it. The other teams never received enough tokens to improve, so they stayed poor, so the router never sent them tokens.

Blortz: The same exploration-exploitation problem as the pterodactyl. The router exploits the best known experts and never explores the others.

Drysska: And the three active teams are overworked. With eight teams, each was meant to handle about an eighth of the tokens. Instead, each of the three handles nearly a third, well over twice its intended load. To cope with that load by design, they would need to be much larger, which defeats the purpose of having multiple small teams.

The Bottleneck

The practical consequence was severe. The three active teams were processing so many tokens that they became a bottleneck. In a parallel system — where many tokens were processed simultaneously — all tokens waiting for Teams A, D, and F had to queue, while the idle teams had nothing to do.

Trviksha: Five teams sit idle. Three teams are overwhelmed. The total computation is the same, but it is concentrated in three places instead of spread across eight. The speed advantage of having multiple teams is lost.
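The arithmetic behind the lost speed advantage can be sketched as follows, assuming all teams work in parallel at the same speed, so that one step lasts as long as the busiest team needs. The function `step_time`, the batch size, the per-team speed, and the exact per-team split are hypothetical; only the ninety-one percent total comes from Blortz's figures.

```python
def step_time(shares, tokens_per_step, tokens_per_second):
    """Seconds for one parallel step: the busiest team sets the pace."""
    busiest_load = max(shares) * tokens_per_step
    return busiest_load / tokens_per_second

# Hypothetical split in which A, D, and F together hold 91 percent.
collapsed = [0.35, 0.30, 0.26, 0.05, 0.04, 0.0, 0.0, 0.0]
# Ideal split: every one of the eight teams gets the same share.
balanced = [1 / 8] * 8

slow = step_time(collapsed, tokens_per_step=8000, tokens_per_second=1000)
fast = step_time(balanced, tokens_per_step=8000, tokens_per_second=1000)
print(f"collapsed: {slow:.2f}s, balanced: {fast:.2f}s, slowdown: {slow / fast:.1f}x")
```

Under these assumptions the busiest team holds 35 percent of the tokens against a balanced share of 12.5 percent, so each step takes roughly 2.8 times as long even though the total work is unchanged.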

The Balance Penalty

Trviksha added a penalty to the training objective: if the distribution of tokens across teams was too uneven, the model paid a cost. The penalty was proportional to how far the distribution deviated from uniform — ideally, each team should handle roughly the same fraction of tokens.

Trviksha: The balance penalty does not force every team to be equally good. It forces every team to receive enough tokens to have a chance to learn. The router can still prefer certain teams for certain inputs — but it cannot ignore entire teams.
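One simple way to write such a penalty is sketched below. It is an illustrative choice, not necessarily the form Trviksha used: the squared distance between each team's share and the uniform share, scaled by a hypothetical `coefficient` that sets how much balance matters relative to prediction quality. The two example distributions are also hypothetical, apart from the ninety-one percent total and the eighteen and eight percent extremes.

```python
def balance_penalty(shares, coefficient=0.01):
    """Zero for a perfectly even split; grows as the split becomes uneven."""
    n = len(shares)
    uniform = 1.0 / n
    # Squared deviation from the uniform share, scaled by n so that sending
    # every token to a single team costs coefficient * (n - 1).
    deviation = n * sum((s - uniform) ** 2 for s in shares)
    return coefficient * deviation

def total_loss(prediction_loss, shares, coefficient=0.01):
    """Training objective: prediction loss plus the balance penalty."""
    return prediction_loss + balance_penalty(shares, coefficient)

collapsed = [0.35, 0.30, 0.26, 0.05, 0.04, 0.0, 0.0, 0.0]
retrained = [0.18, 0.15, 0.14, 0.13, 0.12, 0.11, 0.09, 0.08]
uniform = [1 / 8] * 8

for name, shares in [("collapsed", collapsed), ("retrained", retrained), ("uniform", uniform)]:
    print(f"{name}: penalty = {balance_penalty(shares):.5f}")
```

The penalty only looks at how tokens are spread, not at which team receives which token, so the router stays free to specialise. In a real training loop the shares would have to be derived from the router's probabilities so that gradients can flow through them; this sketch leaves that detail out.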

She retrained with the balance penalty. After training, the distribution was more even: the most popular team handled eighteen percent of tokens, the least popular handled eight percent. Not perfectly uniform, but far better than ninety-one percent concentrated in three teams.

Illustration: Eight groups of cartoon velociraptor workstations. In the top image (before fix), three groups are crowded with long queues of tokens while five groups are empty and idle. In the bottom image (after fix), all eight groups have roughly equal queues and all are working. A small router velociraptor in the centre distributes tokens more evenly.

Blortz: The balance penalty acts as a constraint, not as the goal. The goal is still good predictions. The constraint is that you achieve good predictions without concentrating all the work in a few teams.

Trviksha: Exactly. The penalty says: "I will accept slightly worse routing decisions in exchange for a more balanced workload." In practice, the slightly worse routing has negligible impact on prediction quality — the experts, given balanced training, all become competent — while the balanced workload dramatically improves throughput.

The General Lesson

Glagalbagal: You keep discovering the same problem in new disguises. The pterodactyl exploited one route. The language model exploited agreeableness. The router exploits three teams. Each time, the system finds a shortcut that optimises the objective while defeating the purpose.

Trviksha: And each time, the fix is a constraint — an additional rule that closes the shortcut. Exploration for the pterodactyl. The KL penalty for the language model. The balance penalty for the router. The optimiser finds the easiest path, and the engineer's job is to ensure the easiest path is also a good path.

Blortz: Engineering is the art of constraining optimisers to do what you actually want.

Trviksha has encountered router collapse, a well-known failure mode in Mixture of Experts models. When the router is trained end-to-end, it tends to concentrate tokens on a few experts that perform well early in training, starving the others. This creates a self-reinforcing cycle: popular experts get more training and improve further, while unpopular experts stagnate. The solution is load balancing: adding a penalty that encourages roughly equal distribution of tokens across experts. This is a standard technique in modern MoE models, including the Switch Transformer and GShard.

The broader pattern, in which optimisers find shortcuts that defeat the system's purpose, connects reward hacking, sycophancy, and router collapse into a single theme: alignment between what the system optimises for and what the designer intends is never automatic. It must be engineered through constraints, penalties, and careful design.

Where else in life do you see systems concentrating resources on a few popular options while promising alternatives are starved of attention?