Despite remarkable success, attention mechanisms struggle with extended reasoning over time. Watch any AI agent work through a complex problem and you'll likely witness these failure modes:
- Hallucinations that compound over extended interactions
- Inconsistent or erroneous reasoning within a sequence
- An inability to remember long-term goals or strategies
The evidence points in one direction: attention can perform impressive computations in parallel, but it does not generalize over time.
## The return of recurrence
From this perspective, the recent resurgence of interest in recurrent architectures makes perfect sense. Recurrent networks are designed to update a latent space over time, and so may address many of the failure modes above. By contrast, "chain of thought" is a hack to introduce sequential reasoning: a tacit admission that multi-step thought has irreplaceable advantages over purely parallel attention.
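To make the contrast concrete, here is a minimal numpy sketch, not any particular model, of the two computational patterns: attention re-reads an ever-growing history at every step, while a recurrent update folds each new input into a fixed-size latent state. The weight matrices `W_h` and `W_x` and the dimensions are illustrative placeholders.

```python
import numpy as np

def attention_step(history):
    """Attend over the full history: per-step cost grows with its length."""
    query = history[-1]                               # most recent token as the query
    scores = history @ query / np.sqrt(history.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ history                          # weighted mix of the whole past

def recurrent_step(state, x, W_h, W_x):
    """Fold the new input into a fixed-size latent state: constant cost per step."""
    return np.tanh(W_h @ state + W_x @ x)

rng = np.random.default_rng(0)
d = 16
W_h = rng.normal(size=(d, d)) * 0.1                   # placeholder recurrent weights
W_x = rng.normal(size=(d, d)) * 0.1                   # placeholder input weights

state = np.zeros(d)
history = []
for t in range(1000):
    x = rng.normal(size=d)
    history.append(x)
    attended = attention_step(np.stack(history))      # O(t) work at step t, O(T^2) overall
    state = recurrent_step(state, x, W_h, W_x)        # O(1) work at step t, O(T) overall
```

The difference that matters here is not cost so much as what is carried forward: the recurrent path has a persistent state it can keep consistent across steps, while the attention path must reconstruct its view of the past from scratch at every step.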
## Next steps
If we adopt this perspective, how should we build better models?
- Embrace recurrence: Stop focusing on attention masks and context length. Focus on effectively updating a latent space.
- Scale across time: Train models on tasks that require performance over time, as in the training sketch below. A small model that can maintain coherence for millions of steps may be more useful than a large model that can't.
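As one hedged illustration of what 'scale across time' could look like, the sketch below trains a small recurrent model with truncated backpropagation through time on a made-up recall task: a goal vector is shown only at step 0, the latent state is carried across segments while gradients are cut, and the model is scored on reproducing the goal at the end of each segment. The task, model sizes, and hyperparameters are all illustrative assumptions, not a recipe.

```python
import torch
from torch import nn

# Hypothetical long-horizon task: a "goal" vector appears only at step 0,
# and the model must reproduce it at the end of every later segment.
def make_batch(batch=32, steps=512, dim=8):
    goal = torch.randn(batch, dim)
    inputs = torch.randn(steps, batch, dim)
    inputs[0] = goal
    return inputs, goal

class TinyRecurrentAgent(nn.Module):
    def __init__(self, dim=8, hidden=64):
        super().__init__()
        self.cell = nn.GRUCell(dim, hidden)
        self.readout = nn.Linear(hidden, dim)

    def forward(self, x, h):
        h = self.cell(x, h)
        return self.readout(h), h

model = TinyRecurrentAgent()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
inputs, goal = make_batch()

h = torch.zeros(inputs.size(1), 64)
chunk = 64                                  # truncated backprop-through-time window
for start in range(0, inputs.size(0), chunk):
    h = h.detach()                          # carry the state forward, cut the gradient
    for t in range(start, min(start + chunk, inputs.size(0))):
        pred, h = model(inputs[t], h)
    loss = ((pred - goal) ** 2).mean()      # score recall of the step-0 goal
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The metric that matters in this framing is not loss on a fixed context window but how far the goal can travel through the latent state before recall degrades.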
## Example architectures
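Recent designs in this direction include state space models such as S4 and Mamba, RWKV, and xLSTM. They differ in detail, but each maintains a fixed-size state that is updated at every step instead of re-attending to the full history. Below is a minimal sketch of that shared core, a linear recurrence over a latent state; the sizes and parameters are illustrative placeholders, not any published configuration.

```python
import numpy as np

def linear_recurrence(inputs, decay, input_proj, output_proj):
    """Scan a fixed-size latent state over the sequence:
    state_t = decay * state_{t-1} + input_proj @ x_t."""
    state = np.zeros(decay.shape)
    outputs = []
    for x in inputs:                                      # constant memory in sequence length
        state = decay * state + input_proj @ x            # update the latent state
        outputs.append(output_proj @ state)               # read out from the state
    return np.stack(outputs)

rng = np.random.default_rng(0)
d_in, d_state = 8, 32
decay = np.exp(-rng.uniform(0.001, 0.1, size=d_state))    # stable per-channel decay, close to 1
B = rng.normal(size=(d_state, d_in)) * 0.1                # placeholder input projection
C = rng.normal(size=(d_in, d_state)) * 0.1                # placeholder output projection

x_seq = rng.normal(size=(1000, d_in))
y_seq = linear_recurrence(x_seq, decay, B, C)             # (1000, d_in) outputs, O(T) total work
```

The real architectures wrap this core in gating, normalization, and parallelized scans, but the part that carries information across time is exactly this latent-state update.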
## Conclusion
General intelligence may not require revolutionary new architectures or larger datasets. It may simply require recognizing time as a fundamental dimension of intelligence.