Part 50 of 58

The Sleeping Experts

By Madhav Kaushish · Ages 12+

Sparse attention had reduced the cost of processing long sequences. But there was another source of waste that Trviksha had not yet addressed: the network itself.

The Idle Army

The language model had grown large — hundreds of thousands of velociraptors across many layers. It knew about weather, agriculture, medicine, law, trade, history, and poetry. All of this knowledge was encoded in its weights — distributed across the entire network.

But for any given query, most of that knowledge was irrelevant.

Trviksha: When Zhrondvik asks a legal question, the velociraptors that encode weather patterns, agricultural vocabulary, and medical terminology all activate. They all perform their computations. They all contribute to the output — even though their contributions are negligible for a legal query.

Blortz: Every velociraptor works on every input. That is how the transformer operates — every feedforward layer processes every token.

Trviksha: For a small model, the waste is minor. For a large model — one with hundreds of thousands of velociraptors — activating every single one for every single token is enormously wasteful. It is like calling the entire army to inspect a single grain store.

The Expert Teams

Trviksha reorganised the feedforward layer. Instead of one large layer that processed every token, she divided the velociraptors into specialist teams — groups of velociraptors that each focused on a different domain.

Team A: Legal language — statutes, precedents, regulatory vocabulary.
Team B: Agricultural language — crops, weather, field terminology.
Team C: Medical language — symptoms, treatments, patient records.
Team D: General language — common grammar and vocabulary.

For each token, instead of activating all teams, a small router network decided which two teams were most relevant and activated only those. The other teams stayed idle — asleep — contributing nothing and costing nothing.

Trviksha: For a legal query, the router activates Teams A and D. For a weather question, it activates Teams B and D. The general team (D) is almost always active, because grammar and common vocabulary are needed for everything. The specialist teams activate only when their expertise is relevant.

Drysska: Who decides which teams to activate?

Trviksha: The router. A small network that looks at each token and produces scores for each team. The two highest-scoring teams are activated. The rest sleep.
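Trviksha's routing rule can be sketched in a few lines. This is a minimal Python sketch under assumed stand-ins: each team is just a tiny function, and the router is a hand-written score table rather than the small learned network the story describes — the shape of the computation is the point, not the details.

```python
import math

# Four specialist teams, each a tiny stand-in for a feedforward "expert".
def legal_team(x):    return x + 1.0   # placeholder computation
def farm_team(x):     return x + 2.0
def medical_team(x):  return x + 3.0
def general_team(x):  return x + 0.5

TEAMS = {"A": legal_team, "B": farm_team, "C": medical_team, "D": general_team}

def router_scores(token):
    # Stand-in for a small learned network: hand-written scores per team.
    # Team D (general) always scores well, because grammar is always needed.
    if token == "statute":
        return {"A": 3.0, "B": -1.0, "C": -1.0, "D": 2.0}
    if token == "harvest":
        return {"A": -1.0, "B": 3.0, "C": -1.0, "D": 2.0}
    return {"A": 0.0, "B": 0.0, "C": 0.0, "D": 2.0}

def moe_forward(token, x, k=2):
    scores = router_scores(token)
    # The k highest-scoring teams wake up; the rest sleep and cost nothing.
    active = sorted(scores, key=scores.get, reverse=True)[:k]
    # A softmax over the active scores gives the mixing weights.
    exps = {t: math.exp(scores[t]) for t in active}
    total = sum(exps.values())
    # Only the active teams compute; the output is their weighted sum.
    return sum(exps[t] / total * TEAMS[t](x) for t in active), active

output, active = moe_forward("statute", 1.0)
print(active)  # the legal and general teams are awake: ['A', 'D']
```

Note that the sleeping teams are never called at all — the saving is real compute skipped, not just outputs multiplied by zero.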

A large hall with four groups of cartoon velociraptors at workstations, each group labelled with a domain (Legal, Agricultural, Medical, General). For a legal query input, the Legal and General groups are awake and working, while the Agricultural and Medical groups are slumped asleep at their stations. A small router velociraptor at the entrance points incoming tokens toward the active groups.


The Savings

The model had four hundred thousand velociraptors in the feedforward layer. Divided into eight specialist teams of fifty thousand each, with only two teams active per token, the computation per token dropped to one hundred thousand — a four-fold reduction.
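The arithmetic can be checked directly, using the story's numbers:

```python
total_velociraptors = 400_000  # whole feedforward layer
num_teams = 8                  # specialist teams
active_teams = 2               # teams awake per token

per_team = total_velociraptors // num_teams          # velociraptors per team
active_per_token = active_teams * per_team           # awake per token
reduction = total_velociraptors / active_per_token   # compute saving

print(per_team, active_per_token, reduction)  # 50000 100000 4.0
```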

But the total knowledge stored in the model remained the same — all eight teams still existed, still held their weights, still knew their domains. The model was not smaller in knowledge; it was cheaper to run for any given input.

Blortz: Same capacity. A quarter of the computation. Because three-quarters of the capacity is irrelevant for any given token.

Trviksha: The assumption is that different tokens need different expertise — and that a small router can figure out which expertise is needed. For language, this seems true. A legal token needs legal knowledge. A weather token needs weather knowledge. The router learns to match tokens to experts.

Zhrondvik: What if the router makes a wrong assignment? A legal question routed to the agricultural team?

Trviksha: The model performs worse on that token. But the router is trained end-to-end — routing errors increase the final prediction error, which adjusts the router's weights. Over time, the router learns to route accurately.
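One way to see why routing errors train the router: each active team's output is scaled by the router's weight for that team, so the weight appears inside the loss and receives a gradient like any other parameter. A toy numeric sketch (an assumed two-team setup, not the story's exact model), where team A gives the right answer and team B does not:

```python
# Team A predicts the target exactly; team B is wrong for this token.
target, out_a, out_b = 1.0, 1.0, 0.0

def loss(w):
    # The model's output is a w-weighted mix of the two teams' outputs.
    pred = w * out_a + (1 - w) * out_b
    return (pred - target) ** 2

w = 0.3  # the router currently leans toward the wrong team
# Gradient of the squared error with respect to the gate weight w.
grad = 2 * (w * out_a + (1 - w) * out_b - target) * (out_a - out_b)
w_new = w - 0.5 * grad  # one gradient step, learning rate 0.5

print(loss(w), loss(w_new))  # the loss drops as w shifts toward team A
```

The prediction error flows backward into the gate weight, nudging the router toward the team that would have reduced the error — which is exactly what "trained end-to-end" means here.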

The key insight was that large models contained far more knowledge than any single input required. By selectively activating only the relevant portions, Trviksha could build models that were simultaneously very large (storing enormous amounts of knowledge) and very cheap to run (using only a small fraction of that knowledge per input).