From 2be8f3b3c1599813f64745ad9773c25cd3e08ad5 Mon Sep 17 00:00:00 2001
From: Tazik Shahjahan <35576188+taziksh@users.noreply.github.com>
Date: Thu, 4 Jun 2026 22:36:09 -0700
Subject: [PATCH] Fix transpose in attention scores equation

9.3.2: Corrected the transpose notation in the attention scores equation
---
 transformers.qmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/transformers.qmd b/transformers.qmd
index 31c7360..7ff36de 100644
--- a/transformers.qmd
+++ b/transformers.qmd
@@ -80,7 +80,7 @@ $$
 The attention mechanism calculates similarity using the dot product, which efficiently measures vector similarity:
 
 $$
-a_i = \text{softmax}\left(\frac{[q_i^T k_1, q_i^T k_2, \dots, q_i^T k_n]}{\sqrt{d_k}}\right)^T \in \mathbb{R}^{1\times n}
+a_i = \text{softmax}\left(\frac{[q_i^T k_1, q_i^T k_2, \dots, q_i^T k_n]}{\sqrt{d_k}}\right) \in \mathbb{R}^{1\times n}
 $$
 
 The vector $a_i$ (softmax'd attention scores) quantifies how much attention token $q_i$ should pay to each token in the sequence, normalized so that elements sum to 1. Normalizing by $\sqrt{d_k}$ prevents large dot-product magnitudes, stabilizing training.
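
Review note: the bracketed list of dot products is already a row of $n$ entries, so applying softmax to it gives a $1\times n$ result and the trailing transpose would contradict the stated shape. The following is a minimal plain-Python sketch of that row, not code from `transformers.qmd`; the function name `attention_weights` and the toy vectors are hypothetical, chosen only for illustration.

```python
import math

def attention_weights(q_i, keys):
    """Return softmax(q_i . k_j / sqrt(d_k)) over all keys as a list of n floats."""
    d_k = len(q_i)
    # Scaled dot-product score between the query and each key.
    scores = [
        sum(q * k for q, k in zip(q_i, k_j)) / math.sqrt(d_k)
        for k_j in keys
    ]
    # Subtract the max before exponentiating for numerical stability;
    # this does not change the softmax result.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy inputs: one query and n = 3 keys, each with d_k = 4.
q_i = [1.0, 0.0, 1.0, 0.0]
keys = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.5, 0.5, 0.5, 0.5],
]
a_i = attention_weights(q_i, keys)  # one weight per key, i.e. a 1 x n row
```

With these inputs the scaled scores are 1, 0, and 0.5, so the first key receives the largest weight and the second the smallest, and the weights sum to 1.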