Reading the Stone

The memory stones worked. Monsoon signals from Day 1 persisted to Day 30, and the network's long-range predictions improved dramatically. But Vrothjelka noticed an anomaly.

The Irrelevant Memory

On certain sequences, the network's short-term predictions — one or two days ahead — had gotten worse after adding the stones. The network seemed distracted.

Vrothjelka: Sequence 4,718. Clear skies for ten days straight. Your old recurrent network predicted "clear" for Day 11 with high confidence. Your new stone network predicted "cloudy" with moderate confidence. Day 11 was clear.

Trviksha examined the stone state for that sequence. On Day 3, a brief pressure fluctuation had triggered the input gate, writing a "storm signal" onto the stone. The pressure had returned to normal by Day 4, and the remaining seven days were perfectly clear. But the forget gate had not erased the storm signal — it was still sitting on the stone at Day 10, contaminating the prediction.

Trviksha: The stone remembers the pressure fluctuation, and that memory is influencing the output even though it is no longer relevant to the current situation.

Blortz: The forget gate failed?

Trviksha: The forget gate decides whether information is worth keeping on the stone. In this case, the answer might be yes — pressure fluctuations are sometimes precursors to storms a week later. The stone is correct to store it. But the output should not use that information when making a short-term prediction about clear skies.

The problem was not what the stone stored, but what the stone revealed. The stone carried information that might be relevant later — but dumping all of it into every prediction was wrong. The network needed a way to selectively read from the stone.

The Output Gate

Trviksha added a third gate. The output gate examined the current input and the previous hidden state, then decided how much of the stone's contents should influence the current output.

Trviksha: The forget gate controls what to erase. The input gate controls what to write. The output gate controls what to read. Three operations on one stone.

Drysska: On Day 10, the stone holds the old pressure fluctuation. The output gate sees ten days of clear skies and decides: do not use the storm signal for this prediction. The stone keeps it — in case it matters later — but the output ignores it for now.

Trviksha: Precisely. Storing and reading are separate decisions. You can remember something without acting on it.
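The separation Trviksha describes can be sketched in a few lines of Python. This is a minimal illustration, not the network from the story: the two-slot stone, its values, and the gate logits are hypothetical numbers chosen by hand. A trained output gate would compute its logits from the current input and the previous hidden state.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def read_stone(stone, gate_logits):
    """Return what the output gate reveals; the stone itself is not modified."""
    return [sigmoid(g) * math.tanh(c) for c, g in zip(stone, gate_logits)]

# Hypothetical stone: slot 0 holds the old storm signal, slot 1 the clear-sky trend.
stone = [2.0, 1.5]
# Hypothetical logits: very negative closes a slot, very positive opens it.
gate_logits = [-6.0, 6.0]

revealed = read_stone(stone, gate_logits)
print(revealed)  # slot 0 is almost fully hidden; slot 1 comes through, squashed by tanh
print(stone)     # the storm signal is still stored for later use
```

The stone is only read, never changed, so the storm signal remains available if a later day makes it relevant again.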

The three gates together gave each velociraptor fine-grained control over its long-term memory:

Forget gate: How much of the old memory to keep. (Trained by learning when information becomes obsolete.)

Input gate: How much new information to store. (Trained by learning what is worth remembering.)

Output gate: How much of the stored memory to reveal now. (Trained by learning what is currently relevant.)
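The three gates can be put together in one update step. The sketch below assumes the standard LSTM formulation with a single memory unit, so every weight is a plain number. The parameter names and the hand-picked values are hypothetical, and the "candidate" block (the new information on offer) is part of that standard formulation rather than something the story names.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM.

    params maps a block name to hypothetical scalars (w_x, w_h, b).
    """
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    forget = sigmoid(pre("forget"))          # how much of the old stone to keep
    write = sigmoid(pre("input"))            # how much new information to store
    read = sigmoid(pre("output"))            # how much of the stone to reveal now
    candidate = math.tanh(pre("candidate"))  # the new information on offer

    c = forget * c_prev + write * candidate  # erase, then write
    h = read * math.tanh(c)                  # read
    return h, c

# Hand-picked biases: keep the stone, write nothing, reveal nothing.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, -10.0),
    "candidate": (1.0, 1.0, 0.0),
}
h, c = lstm_step(x=0.3, h_prev=0.0, c_prev=1.0, params=params)
print(h, c)  # h is close to 0 while c stays close to 1
```

With these settings the stone keeps its contents while the output ignores them, which is the Day 10 behaviour Drysska describes.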

Three small gatekeeping velociraptors around a stone tablet. The first (labelled "forget") holds an eraser. The second ("write") holds a chisel. The third ("read") holds a magnifying glass, deciding which parts of the stone to show to the main velociraptor. The main velociraptor looks at the stone through the magnifying glass, seeing only selected portions.

A Simpler Design

Blortz, who had been counting the weights, was not entirely happy.

Blortz: Three gates, each with its own set of weights. Each gate is a small network in itself. For eight velociraptors with eight-dimensional hidden states and five weather inputs, the total weight count is... substantial. Roughly four times what the basic recurrent network needed.
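Blortz's estimate can be checked with a little arithmetic. The sketch below assumes a hidden size of 8 and 5 inputs, and it counts one dense block for the basic recurrent network against four for the stone network: the three gates plus the block that proposes new information. The helper name is hypothetical, and some implementations store an extra bias vector per block, which shifts the totals slightly but not the ratio.

```python
def block_params(hidden, inputs):
    """Weights in one dense block: input weights, recurrent weights, one bias per unit."""
    return hidden * inputs + hidden * hidden + hidden

hidden, inputs = 8, 5  # sizes quoted by Blortz (assumed: one 8-dimensional hidden state)

rnn_total = block_params(hidden, inputs)       # one block
lstm_total = 4 * block_params(hidden, inputs)  # three gates + one candidate block

print(rnn_total)   # 112
print(lstm_total)  # 448
```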

Trviksha: Is there a simpler design?

She experimented. What if the forget gate and the input gate were not independent? If you forget more, you need to write more to replace it. If you forget less, you write less. The two decisions might be linked.

She built a variant where the input gate was simply one minus the forget gate. If the forget gate said "keep 70% of the old memory," then the input gate automatically said "write 30% new information." One gate instead of two.

She also merged the hidden state and the stone into a single object — the hidden state itself carried the memory, updated through the remaining gates.
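A minimal sketch of this coupled design, for a single memory unit, might look like the following. It assumes the second gate is a reset gate of the kind used in gated recurrent units, which the story does not name. The parameter names and values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x, h_prev, params):
    """One time step of a single-unit GRU-style cell.

    params maps a block name to hypothetical scalars (w_x, w_h, b).
    """
    def pre(name, h):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h + b

    keep = sigmoid(pre("keep", h_prev))    # forget-gate role: fraction of old memory kept
    reset = sigmoid(pre("reset", h_prev))  # assumed second gate: how much past feeds the candidate
    candidate = math.tanh(pre("candidate", reset * h_prev))

    # The write fraction is not learned separately: it is 1 - keep.
    return keep * h_prev + (1.0 - keep) * candidate

params = {
    "keep": (0.0, 0.0, math.log(0.7 / 0.3)),  # bias chosen so that keep = 0.7
    "reset": (0.0, 0.0, 0.0),
    "candidate": (0.0, 0.0, 10.0),            # candidate saturates near 1.0
}
h = gru_step(x=0.0, h_prev=0.5, params=params)
print(round(h, 3))  # 0.65 = 0.7 * 0.5 + 0.3 * 1.0
```

There is no separate stone here: the hidden state is both the memory and the output, so nothing plays the role of the output gate.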

The result was a simpler mechanism with two gates instead of three, fewer weights, and — she tested it on Vrothjelka's data — 97% of the original stone network's performance.

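Blortz: Counting the network that proposes new information, that is three sets of weights instead of four. Three-quarters of the weights, almost the same accuracy.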

Trviksha: It loses a small amount of flexibility — the forget and input decisions are no longer independent. But for most sequences, the tradeoff is worth it.

Glagalbagal: You built the complicated version, discovered it was too heavy, and found that a simpler version nearly matched it. This is a pattern.

Trviksha: It is. The question is always whether the extra complexity buys you something real or just extra pebbles. In this case, two gates were almost as good as three.

She offered Vrothjelka both options — the full three-gate system for maximum accuracy on complex long-range patterns, and the simpler two-gate version for faster training and lower cost. Vrothjelka, a practical woman who needed forecasts by morning, chose the simpler one.

Trviksha has completed the LSTM architecture by adding the output gate, the third gate, which controls what stored information to reveal at each time step. The distinction between storing and reading is important: the cell state can hold information for many time steps, but the output gate ensures that only currently relevant information influences the prediction.

The simpler variant Trviksha discovered is essentially the Gated Recurrent Unit (GRU), introduced by Cho et al. in 2014. GRUs combine the forget and input gates into a single update gate, keep a second (reset) gate, and merge the cell state with the hidden state, resulting in fewer parameters and comparable performance on many tasks. In practice, the choice between LSTM and GRU often depends on the specific problem: LSTMs, with their separate cell state and independent gates, can have a slight edge on very long sequences, while GRUs often train faster and are easier to tune. Both greatly reduce the vanishing gradient problem that made basic RNNs forget their past, though neither removes it entirely.

Think about your own memory: you store far more than you actively use at any moment. The ability to remember something without constantly being distracted by it, that separation of storage from retrieval, is what makes long-term memory practical.