Attention Is All You Need — Deep Dive into the Transformer Architecture

A thorough reading of the Transformer paper, covering self-attention, multi-head attention, and positional encoding

📄 Background

Before Transformers, sequence-to-sequence tasks relied heavily on RNN/LSTM architectures, which suffered from two major bottlenecks:

  1. Sequential computation prevents parallelization
  2. Long-range dependency degradation over long sequences

Vaswani et al. proposed the Transformer at NIPS 2017 (the conference now known as NeurIPS), relying entirely on attention mechanisms to model global dependencies: no recurrence, no convolution.

🔑 Core Mechanisms

Self-Attention

For input $X \in \mathbb{R}^{n \times d}$, we compute Query, Key, Value projections:

$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
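The formula above maps to NumPy almost line for line. A minimal sketch (the `attention` and `softmax` helpers and the toy shapes are illustrative, not from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) pairwise similarity scores
    weights = softmax(scores, axis=-1)   # each row is a distribution over positions
    return weights @ V, weights

# Toy example: n = 4 tokens, d = 8 dimensions (shapes chosen arbitrarily)
rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out, w = attention(X @ W_q, X @ W_k, X @ W_v)
```

The $\sqrt{d_k}$ scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with vanishing gradients.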

Multi-Head Attention

$$\text{MHA}(Q,K,V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O, \quad \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
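Each head runs scaled dot-product attention with its own projections, and the results are concatenated and projected back. A sketch in NumPy, assuming $h$ heads each of width $d_k = d_{\text{model}}/h$ (all names and toy shapes here are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads_weights, W_o):
    """heads_weights: list of (W_q, W_k, W_v) triples, one triple per head."""
    heads = []
    for W_q, W_k, W_v in heads_weights:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        heads.append(softmax(scores) @ V)        # each head output: (n, d_k)
    return np.concatenate(heads, axis=-1) @ W_o  # (n, h*d_k) -> (n, d_model)

# Toy shapes: d_model = 8 split across h = 2 heads of width d_k = 4
rng = np.random.default_rng(1)
n, d_model, h = 5, 8, 2
d_k = d_model // h
X = rng.normal(size=(n, d_model))
heads_weights = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(h)]
W_o = rng.normal(size=(h * d_k, d_model))
out = multi_head_attention(X, heads_weights, W_o)
```

Because each head projects down to $d_k = d_{\text{model}}/h$, the total compute is comparable to a single full-width head while letting different heads attend to different positions.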

Positional Encoding

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$
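The sinusoidal table can be built in a few vectorized lines: even columns get the sine, odd columns the cosine, with the frequency decreasing geometrically across dimension pairs. A minimal sketch (function name and shapes are illustrative):

```python
import numpy as np

def positional_encoding(n, d):
    """Sinusoidal positional encodings for n positions and (even) model dim d."""
    pos = np.arange(n)[:, None]            # (n, 1) position indices
    i = np.arange(d // 2)[None, :]         # (1, d/2) dimension-pair indices
    angles = pos / (10000 ** (2 * i / d))  # (n, d/2) pos / 10000^(2i/d)
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)           # even dims: sine
    pe[:, 1::2] = np.cos(angles)           # odd dims: cosine
    return pe

pe = positional_encoding(50, 16)
```

At `pos = 0` every sine column is 0 and every cosine column is 1, and each pair of columns traces a sinusoid of a different wavelength, which is what lets the model attend by relative offset.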

💡 Key Takeaways

  • Parallelization is the core practical advantage
  • $O(n^2)$ attention complexity sparked follow-up efficient-attention research
  • Positional encoding design remains an active research area (RoPE, ALiBi, etc.)

📚 References

  • Vaswani, A., et al. "Attention Is All You Need." NeurIPS 2017.