Background
Before Transformers, sequence-to-sequence tasks relied heavily on RNN/LSTM architectures, which suffered from two major bottlenecks:
- Sequential computation, which prevents parallelization across time steps
- Degradation of long-range dependencies as sequences grow
Vaswani et al. introduced the Transformer ("Attention Is All You Need", NeurIPS 2017), which relies entirely on attention mechanisms to model global dependencies: no recurrence, no convolution.
Core Mechanisms
Self-Attention
For input $X \in \mathbb{R}^{n \times d}$, we compute Query, Key, Value projections:
$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
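The two equations above can be sketched directly in NumPy. This is a minimal single-head illustration; the dimensions and weight names (`W_q`, `W_k`, `W_v`) are illustrative choices, not values from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax: subtract the row max before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # project the input into queries, keys, and values
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    # scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V

rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 8  # sequence length, model dim, key dim (illustrative)
X = rng.standard_normal((n, d))
W_q, W_k, W_v = (rng.standard_normal((d, d_k)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (4, 8)
```

The $\sqrt{d_k}$ scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into regions of vanishing gradient.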
Multi-Head Attention
$$\text{MHA}(Q,K,V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O, \quad \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
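A loop-based NumPy sketch of the same computation, assuming each head has its own projection matrices and $d_k = d/h$; the data-structure layout (a list of per-head weight tuples) is a simplification for clarity, not how production implementations batch the heads.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads_qkv, W_o):
    # heads_qkv: list of (W_q, W_k, W_v) tuples, one per head (illustrative layout)
    outputs = []
    for W_q, W_k, W_v in heads_qkv:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        d_k = Q.shape[-1]
        A = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
        outputs.append(A @ V)  # one head's output: (n, d_k)
    # concatenate the h head outputs along the feature axis, then project with W^O
    return np.concatenate(outputs, axis=-1) @ W_o

rng = np.random.default_rng(0)
n, d, h = 4, 8, 2
d_k = d // h
heads = [tuple(rng.standard_normal((d, d_k)) for _ in range(3)) for _ in range(h)]
W_o = rng.standard_normal((h * d_k, d))
out = multi_head_attention(rng.standard_normal((n, d)), heads, W_o)
print(out.shape)  # (4, 8)
```

Because each head projects down to $d_k = d/h$, the total cost of $h$ heads matches that of one full-dimensional head, while letting different heads attend to different subspaces.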
Positional Encoding
$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$
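A direct NumPy transcription of the sinusoidal encoding: even dimensions get the sine, odd dimensions the cosine, and each pair $2i, 2i{+}1$ shares one frequency. `d` is assumed even here.

```python
import numpy as np

def positional_encoding(n, d):
    # positions 0..n-1 down the rows, frequency index i across pairs of columns
    pos = np.arange(n)[:, None]             # shape (n, 1)
    i = np.arange(d // 2)[None, :]          # shape (1, d/2)
    angles = pos / (10000 ** (2 * i / d))   # shape (n, d/2)
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)            # even dims: sine
    pe[:, 1::2] = np.cos(angles)            # odd dims: cosine
    return pe

pe = positional_encoding(10, 16)
print(pe.shape)  # (10, 16)
```

At `pos = 0` every sine term is 0 and every cosine term is 1, which is a quick sanity check on the interleaving.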
Key Takeaways
- Parallelization is the core practical advantage
- $O(n^2)$ attention complexity sparked follow-up efficient-attention research
- Positional encoding design remains an active research area (RoPE, ALiBi, etc.)