390introml · shensquared · Jun 5, 2026 · Jun 5, 2026
diff --git a/transformers.qmd b/transformers.qmd
@@ -80,7 +80,7 @@ $$
 The attention mechanism calculates similarity using the dot product, which efficiently measures vector similarity:
 
 $$
-a_i = \text{softmax}\left(\frac{[q_i^T k_1, q_i^T k_2, \dots, q_i^T k_n]}{\sqrt{d_k}}\right)^T \in \mathbb{R}^{1\times n}
+a_i = \text{softmax}\left(\frac{[q_i^T k_1, q_i^T k_2, \dots, q_i^T k_n]}{\sqrt{d_k}}\right) \in \mathbb{R}^{1\times n}
 $$
 
 The vector $a_i$ (softmax'd attention scores) quantifies how much attention token $q_i$ should pay to each token in the sequence, normalized so that elements sum to 1. Normalizing by $\sqrt{d_k}$ prevents large dot-product magnitudes, stabilizing training.
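
As a rough numerical check of the formula in this hunk, the sketch below computes $a_i$ for a single query with NumPy. The function name `attention_weights`, the random inputs, and the sizes `n = 5`, `d_k = 64` are illustrative assumptions and do not come from `transformers.qmd`. Subtracting the maximum score before exponentiating is a common numerical-stability step that leaves the softmax output unchanged.

```python
import numpy as np


def attention_weights(q_i, K):
    """Scaled dot-product attention weights for one query (hypothetical helper).

    q_i: query vector of shape (d_k,)
    K:   key matrix of shape (n, d_k), one key k_j per row
    Returns a_i with shape (1, n), matching the R^{1 x n} convention above.
    """
    d_k = q_i.shape[-1]
    # Row j holds q_i^T k_j / sqrt(d_k)
    scores = (K @ q_i) / np.sqrt(d_k)
    # Shift by the max score so exp() cannot overflow; softmax is unchanged
    shifted = scores - scores.max()
    exp_scores = np.exp(shifted)
    return (exp_scores / exp_scores.sum()).reshape(1, -1)


# Illustrative inputs only (assumed, not from the source document)
rng = np.random.default_rng(0)
n, d_k = 5, 64
q_i = rng.standard_normal(d_k)
K = rng.standard_normal((n, d_k))

a_i = attention_weights(q_i, K)
print(a_i.shape)  # shape of the weight row vector
print(a_i.sum())  # should be 1 up to floating-point rounding
```

With this layout, dropping the trailing transpose (as the `+` line does) is consistent with the stated $\mathbb{R}^{1\times n}$ shape: the scores are already arranged as a row, so no transpose is needed after the softmax.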