diff --git a/transformers.qmd b/transformers.qmd
index 31c7360..7ff36de 100644
--- a/transformers.qmd
+++ b/transformers.qmd
@@ -80,7 +80,7 @@ $$
 The attention mechanism calculates similarity using the dot product, which efficiently measures vector similarity:
 
 $$
-a_i = \text{softmax}\left(\frac{[q_i^T k_1, q_i^T k_2, \dots, q_i^T k_n]}{\sqrt{d_k}}\right)^T \in \mathbb{R}^{1\times n}
+a_i = \text{softmax}\left(\frac{[q_i^T k_1, q_i^T k_2, \dots, q_i^T k_n]}{\sqrt{d_k}}\right) \in \mathbb{R}^{1\times n}
 $$
 
 The vector $a_i$ (softmax'd attention scores) quantifies how much attention token $q_i$ should pay to each token in the sequence, normalized so that elements sum to 1. Normalizing by $\sqrt{d_k}$ prevents large dot-product magnitudes, stabilizing training.
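
Review note: a quick way to sanity-check the formula for $a_i$ is to compute it by hand for a tiny example. The sketch below is only an illustration and is not part of `transformers.qmd`; the function name `attention_weights` and the toy values for `q_i` and `keys` are assumptions made up for this note. It treats $a_i$ as a flat list of $n$ weights, which corresponds to the $1 \times n$ row vector in the formula.

```python
import math


def attention_weights(q_i, keys):
    """Softmax of the scaled dot products q_i . k_j / sqrt(d_k) over all keys.

    q_i  : list of d_k floats (one query vector)
    keys : list of n key vectors, each a list of d_k floats
    Returns a list of n weights that sum to 1.
    """
    d_k = len(q_i)
    # Scaled dot-product score between the query and every key.
    scores = [
        sum(q * k for q, k in zip(q_i, k_j)) / math.sqrt(d_k)
        for k_j in keys
    ]
    # Subtract the max before exponentiating; this leaves the softmax
    # unchanged but avoids overflow for large scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy inputs (hypothetical values, d_k = 4, n = 3).
q_i = [1.0, 0.0, 1.0, 0.0]
keys = [
    [1.0, 0.0, 1.0, 0.0],  # identical to the query
    [0.0, 1.0, 0.0, 1.0],  # orthogonal to the query
    [0.5, 0.5, 0.5, 0.5],  # partially aligned
]

a_i = attention_weights(q_i, keys)
print(a_i)
print(sum(a_i))
```

With these inputs the key that matches the query should get the largest weight and the orthogonal key the smallest, and the weights should sum to 1 up to floating-point error.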