<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>NLP on Yang</title><link>/en/tags/nlp/</link><description>Recent content in NLP on Yang</description><generator>Hugo -- gohugo.io</generator><language>en-US</language><lastBuildDate>Fri, 06 Mar 2026 20:00:00 +0800</lastBuildDate><atom:link href="/en/tags/nlp/index.xml" rel="self" type="application/rss+xml"/><item><title>Attention Is All You Need — Deep Dive into the Transformer Architecture</title><link>/en/research/attention-is-all-you-need/</link><pubDate>Fri, 06 Mar 2026 20:00:00 +0800</pubDate><guid>/en/research/attention-is-all-you-need/</guid><description>&lt;h2 id="-background"&gt;&lt;a href="#-background" class="header-anchor"&gt;&lt;/a&gt;📄 Background
&lt;/h2&gt;&lt;p&gt;Before Transformers, sequence-to-sequence tasks relied heavily on &lt;strong&gt;RNN/LSTM&lt;/strong&gt; architectures, which suffered from two major bottlenecks:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Sequential computation&lt;/strong&gt;: each hidden state depends on the previous one, so training cannot be parallelized across positions within a sequence&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Long-range dependency degradation&lt;/strong&gt;: signal between distant positions must pass through many intermediate steps, so it weakens as sequences grow&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Vaswani et al. proposed the &lt;strong&gt;Transformer&lt;/strong&gt; at NeurIPS 2017, relying entirely on attention mechanisms to model global dependencies — no recurrence, no convolution.&lt;/p&gt;
&lt;h2 id="-core-mechanisms"&gt;&lt;a href="#-core-mechanisms" class="header-anchor"&gt;&lt;/a&gt;🔑 Core Mechanisms
&lt;/h2&gt;&lt;h3 id="self-attention"&gt;&lt;a href="#self-attention" class="header-anchor"&gt;&lt;/a&gt;Self-Attention
&lt;/h3&gt;&lt;p&gt;For input $X \in \mathbb{R}^{n \times d}$, we compute Query, Key, Value projections:&lt;/p&gt;
&lt;p&gt;$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$&lt;/p&gt;
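&lt;p&gt;As a minimal sketch of these projections and the scaled dot-product attention they feed into (NumPy, single attention head; the dimensions, seed, and random weights are illustrative assumptions, not values from the paper):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import numpy as np

n, d, d_k = 4, 8, 8  # sequence length, model dim, head dim (illustrative)
rng = np.random.default_rng(0)

X = rng.normal(size=(n, d))      # input X in R^{n x d}
W_Q = rng.normal(size=(d, d_k))  # learned projection matrices W^Q, W^K, W^V
W_K = rng.normal(size=(d, d_k))
W_V = rng.normal(size=(d, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V  # Q = XW^Q, K = XW^K, V = XW^V

# Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
output = weights @ V  # each output row is a weighted mix of value vectors
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The division by $\sqrt{d_k}$ keeps the dot products from growing with the dimension and pushing the softmax into regions with vanishing gradients, as the paper notes.&lt;/p&gt;</description></item></channel></rss>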