From 2be8f3b3c1599813f64745ad9773c25cd3e08ad5 Mon Sep 17 00:00:00 2001
From: Tazik Shahjahan <35576188+taziksh@users.noreply.github.com>
Date: Thu, 4 Jun 2026 22:36:09 -0700
Subject: [PATCH] Fix transpose in attention scores equation

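Section 9.3.2: remove the stray transpose applied to the softmax output in
the attention scores equation. The bracketed list of scores is already a row
vector, so a_i lies in R^{1 x n} as stated; the extra transpose would have
made it n x 1 and contradicted that dimension.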
---
 transformers.qmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
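
Note for reviewers (not part of the commit message): a minimal NumPy sketch
of the shape argument behind this fix. The function name `attention_weights`,
the key matrix `K` (keys stacked as rows), and the random inputs are
illustrative assumptions, not code from the book.

```python
import numpy as np

def attention_weights(q_i, K):
    """Softmax of scaled dot products q_i^T k_j, returned as a 1 x n row vector."""
    d_k = q_i.shape[0]
    scores = (K @ q_i) / np.sqrt(d_k)   # one score per key, shape (n,)
    scores = scores - scores.max()      # shift for numerical stability; softmax is unchanged
    weights = np.exp(scores)
    weights = weights / weights.sum()   # normalize so the entries sum to 1
    return weights.reshape(1, -1)       # row vector, matching R^{1 x n}

# Hypothetical sizes and random data, chosen only for illustration.
rng = np.random.default_rng(0)
n, d_k = 5, 8
q_i = rng.normal(size=d_k)
K = rng.normal(size=(n, d_k))

a_i = attention_weights(q_i, K)
print(a_i.shape)    # (1, 5)
print(a_i.T.shape)  # (5, 1), which is what the removed transpose would have produced
```

With the transpose removed, the shape of a_i agrees with the stated
dimension 1 x n.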

diff --git a/transformers.qmd b/transformers.qmd
index 31c7360..7ff36de 100644
--- a/transformers.qmd
+++ b/transformers.qmd
@@ -80,7 +80,7 @@ $$
 The attention mechanism calculates similarity using the dot product, which efficiently measures vector similarity:
 
 $$
-a_i = \text{softmax}\left(\frac{[q_i^T k_1, q_i^T k_2, \dots, q_i^T k_n]}{\sqrt{d_k}}\right)^T \in \mathbb{R}^{1\times n}
+a_i = \text{softmax}\left(\frac{[q_i^T k_1, q_i^T k_2, \dots, q_i^T k_n]}{\sqrt{d_k}}\right) \in \mathbb{R}^{1\times n}
 $$
 
 The vector $a_i$ (softmax'd attention scores) quantifies how much attention token $q_i$ should pay to each token in the sequence, normalized so that elements sum to 1. Normalizing by $\sqrt{d_k}$ prevents large dot-product magnitudes, stabilizing training.