# Classification with Support Vector Machines (SVMs)
In many machine learning problems, we aim to predict **discrete outcomes** rather than continuous values. For example:
- An email classifier distinguishes between *personal mail* and *junk mail*.
- A telescope identifies whether an object is a *galaxy*, *star*, or *planet*.
This type of task is known as **classification**, and when there are only two possible outcomes, it is called **binary classification**.
In binary classification, the goal is to predict labels \( y_n \in \{+1, -1\} \) for examples \( \mathbf{x}_n \in \mathbb{R}^D \). The two labels are often referred to as the **positive** and **negative** classes, although these names do not imply any moral or semantic “positiveness.”
<div class="example">
In a cancer detection problem, \( y = +1 \) might indicate the presence of cancer in a patient while \( y = -1 \) might indicate no cancer.
</div>
Formally, the predictor is a function:
\[
f: \mathbb{R}^D \to \{+1, -1\}.
\]
The task is to find model parameters that minimize classification error on a labeled dataset:
\[
\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}.
\]
Similar to regression (Chapter 9), SVMs assume a linear model in some transformed feature space:
\[
f(\mathbf{x}) = \text{sign}(\langle \mathbf{w}, \phi(\mathbf{x}) \rangle + b),
\]
where \( \phi(\mathbf{x}) \) is a possibly nonlinear transformation of the input. This approach allows SVMs to represent nonlinear decision boundaries through **kernel functions**.
SVMs are a widely used and theoretically grounded method for binary classification. They are particularly valuable because:
1. They provide a geometric interpretation of classification, based on concepts such as inner products, projections, and margins.
2. The optimization problem in SVMs does not have a closed-form solution, requiring tools from convex optimization (see Chapter 7).
3. They achieve strong theoretical guarantees and excellent empirical performance (Steinwart & Christmann, 2008).
Unlike probabilistic approaches such as maximum likelihood estimation (Chapter 9), SVMs are derived from geometric principles and empirical risk minimization (Section 8.2).
---
## Separating Hyperplanes
The key idea of SVMs is to separate data points of different classes using a **hyperplane** in \( \mathbb{R}^D \). A hyperplane divides the space into two regions, each corresponding to one of the two classes.
<div class="definition">
A hyperplane in \( \mathbb{R}^D \) is defined as:
\[
\{ \mathbf{x} \in \mathbb{R}^D : f(\mathbf{x}) = 0 \},
\]
where:
\[
f(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle + b.
\]
Here:
- \( \mathbf{w} \in \mathbb{R}^D \) is the **normal vector** to the hyperplane, and
- \( b \in \mathbb{R} \) is the **bias (intercept)** term.
</div>
Any vector \( \mathbf{w} \) orthogonal to the hyperplane satisfies:
\[
\langle \mathbf{w}, \mathbf{x}_a - \mathbf{x}_b \rangle = 0,
\]
for all points \( \mathbf{x}_a, \mathbf{x}_b \) lying on the hyperplane.
---
### Classification Rule
A new example \( \mathbf{x}_{\text{test}} \) is classified based on the sign of \( f(\mathbf{x}_{\text{test}}) \):
\[
\hat{y}_{\text{test}} =
\begin{cases}
+1, & \text{if } f(\mathbf{x}_{\text{test}}) \ge 0, \\
-1, & \text{if } f(\mathbf{x}_{\text{test}}) < 0.
\end{cases}
\]
Geometrically:
- Points with \( f(\mathbf{x}) > 0 \) lie on the **positive side** of the hyperplane.
- Points with \( f(\mathbf{x}) < 0 \) lie on the **negative side**.
---
### Training Objective
During training, we want:
- Positive examples (\( y_n = +1 \)) to be on the positive side:
\[
\langle \mathbf{w}, \mathbf{x}_n \rangle + b \ge 0,
\]
- Negative examples (\( y_n = -1 \)) to be on the negative side:
\[
\langle \mathbf{w}, \mathbf{x}_n \rangle + b < 0.
\]
Both conditions can be combined compactly as:
\[
y_n (\langle \mathbf{w}, \mathbf{x}_n \rangle + b) \ge 0.
\]
This equation expresses that all examples are correctly classified relative to the hyperplane defined by \( (\mathbf{w}, b) \).
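The decision rule and the combined correctness condition can be sketched in a few lines of numpy. The weights `w`, bias `b`, and data below are illustrative values chosen by hand, not fitted parameters:

```python
import numpy as np

# Hypothetical hyperplane parameters (chosen for illustration, not learned)
w = np.array([1.0, -1.0])   # normal vector
b = 0.5                     # bias (intercept)

# Toy examples with labels in {+1, -1}
X = np.array([[2.0, 0.0],
              [0.0, 2.0]])
y = np.array([1, -1])

f = X @ w + b                      # f(x_n) = <w, x_n> + b
y_hat = np.where(f >= 0, 1, -1)    # classification rule: sign of f

# Combined condition: y_n * f(x_n) >= 0 iff x_n is correctly classified
correct = y * f >= 0
```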
---
### Geometric Interpretation
- The vector \( \mathbf{w} \) determines the **orientation** of the hyperplane.
- The scalar \( b \) shifts the hyperplane along \( \mathbf{w} \).
- Classification is performed by checking on which side of the hyperplane each example lies.
- The SVM’s goal is to find the hyperplane that maximizes the **margin** — the distance between the hyperplane and the nearest data points from each class.
### Exercises {.unnumbered .unlisted}
Put some exercises here.
## Primal Support Vector Machine
Support Vector Machines (SVMs) are based on the geometric concept of finding a separating hyperplane that best divides positive and negative examples. When data is linearly separable, there are infinitely many such hyperplanes. To obtain a unique and robust solution, SVMs choose the **hyperplane that maximizes the margin**—the distance between the hyperplane and the nearest data points.
A **large-margin classifier** tends to generalize well to unseen data (Steinwart & Christmann, 2008).
<div class="definition">
The **margin** is the distance between the separating hyperplane and the closest training examples.
</div>
<div class="example">
For a hyperplane defined as:
\[
f(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle + b = 0,
\]
and an example \( \mathbf{x}_a \), the orthogonal distance \( r \) to the hyperplane is obtained by projecting \( \mathbf{x}_a \) onto it:
\[
\mathbf{x}_a = \mathbf{x}_a' + r \frac{\mathbf{w}}{\|\mathbf{w}\|},
\]
where \( \mathbf{x}_a' \) lies on the hyperplane and \( \|\mathbf{w}\| \) is the Euclidean norm of \( \mathbf{w} \).
</div>
To ensure that all examples lie at least distance \( r \) from the hyperplane:
\[
y_n(\langle \mathbf{w}, \mathbf{x}_n \rangle + b) \ge r.
\]
Assuming \( \|\mathbf{w}\| = 1 \) (unit length normalization), the optimization problem becomes:
\[
\max_{\mathbf{w},b,r} \quad r,
\]
subject to
\[
y_n(\langle \mathbf{w}, \mathbf{x}_n \rangle + b) \ge r, \quad \|\mathbf{w}\| = 1, \quad r > 0.
\]
This formulation seeks to **maximize the margin** \( r \) while ensuring all data points are correctly classified.
---
### Traditional Derivation of the Margin
An equivalent approach defines the scale of the data such that the **closest points** satisfy:
\[
\langle \mathbf{w}, \mathbf{x}_a \rangle + b = 1.
\]
By projecting onto the hyperplane, we find the margin:
\[
r = \frac{1}{\|\mathbf{w}\|}.
\]
Thus, maximizing the margin is equivalent to **minimizing** the norm of \( \mathbf{w} \). The optimization problem becomes:
\[
\min_{\mathbf{w},b} \frac{1}{2}\|\mathbf{w}\|^2,
\]
subject to
\[
y_n(\langle \mathbf{w}, \mathbf{x}_n \rangle + b) \ge 1, \quad \forall n.
\]
This is known as the **Hard Margin SVM**, which assumes perfect separability and allows no violations of the margin condition.
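As a sanity check, the hard-margin constraints and the margin \( r = 1/\|\mathbf{w}\| \) can be evaluated on a toy separable dataset. The parameters below are hand-picked feasible values, not the optimizer's output:

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0],    # positive class
              [0.0, 0.0], [-1.0, 0.0]])  # negative class
y = np.array([1, 1, -1, -1])

# Hand-picked parameters: the closest points satisfy y_n * f(x_n) = 1
w = np.array([0.5, 0.5])
b = -1.0

margins = y * (X @ w + b)                   # y_n(<w, x_n> + b)
feasible = np.all(margins >= 1)             # hard-margin constraints hold
geometric_margin = 1.0 / np.linalg.norm(w)  # r = 1 / ||w||
```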
---
### Why Can We Set the Margin to 1?
Theorem 12.1 below shows that the two formulations—maximizing \( r \) under \( \|\mathbf{w}\|=1 \), and minimizing \( \|\mathbf{w}\|^2 \) under a unit margin—are equivalent.
<div class="theorem">
Maximizing the margin \( r \), where we consider normalized weights:
\[
\begin{aligned}
\max_{w, b, r} \quad & r \quad && \text{(margin)} \\
\text{subject to} \quad & y_n(\langle w, x_n \rangle + b) \ge r \quad && \text{(data fitting)} \\
& \|w\| = 1 \quad && \text{(normalization)} \\
& r > 0
\end{aligned}
\]
is equivalent to scaling the data such that the margin is unity:
\[
\begin{aligned}
\min_{w, b} \quad & \frac{1}{2} \|w\|^2 \quad && \text{(margin)} \\
\text{subject to} \quad & y_n(\langle w, x_n \rangle + b) \ge 1 \quad && \text{(data fitting)}
\end{aligned}
\]
</div>
Hence, setting the margin to 1 simplifies computation without changing the solution. This equivalence arises from a rescaling of parameters:
\[
r = \frac{1}{\|\mathbf{w}\|}.
\]
---
### Soft Margin SVM: Geometric View
When data is not linearly separable, we relax the hard constraints by introducing slack variables \( \xi_n \ge 0 \), allowing points to lie within or beyond the margin.
The resulting **Soft Margin SVM** optimization problem is:
\[
\min_{\mathbf{w},b,\xi} \quad \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{n=1}^N \xi_n,
\]
subject to
\[
y_n(\langle \mathbf{w}, \mathbf{x}_n \rangle + b) \ge 1 - \xi_n, \quad \xi_n \ge 0.
\]
Here:
- \( C > 0 \) is the regularization parameter controlling the trade-off between a large margin and small classification error.
- The term \( \|\mathbf{w}\|^2 \) acts as the regularizer, encouraging simpler models.
- Larger \( C \) penalizes margin violations more strongly (less regularization).
---
### Soft Margin SVM: Loss Function View
Using the empirical risk minimization framework, we can rewrite the SVM objective as minimizing a regularized loss:
\[
\min_{\mathbf{w},b} \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{n=1}^N \ell(y_n(\langle \mathbf{w}, \mathbf{x}_n \rangle + b)),
\]
where \( \ell(t) \) is the **hinge loss**:
\[
\ell(t) = \max(0, 1 - t).
\]
This loss penalizes points:
- **Outside the margin** (\( t \ge 1 \)): no penalty (\( \ell = 0 \)),
- **Inside the margin** (\( 0 < t < 1 \)): linear penalty,
- **Misclassified** (\( t < 0 \)): large penalty.
Hence, the SVM objective can be written as an unconstrained optimization:
\[
\min_{\mathbf{w},b} \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{n=1}^N \max(0, 1 - y_n(\langle \mathbf{w}, \mathbf{x}_n \rangle + b)).
\]
The first term acts as a regularizer (encouraging a large margin), while the second term measures training error.
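Because the hinge-loss form is unconstrained, it can be minimized directly by subgradient descent. A minimal numpy sketch on synthetic data (the step size, iteration count, and data are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two Gaussian blobs, one per class
X = np.vstack([rng.normal(2.0, 0.5, (20, 2)),
               rng.normal(-2.0, 0.5, (20, 2))])
y = np.array([1.0] * 20 + [-1.0] * 20)

C, lr = 1.0, 0.01
w, b = np.zeros(2), 0.0

for _ in range(500):
    t = y * (X @ w + b)             # functional margins y_n * f(x_n)
    viol = t < 1                    # points with nonzero hinge loss
    # Subgradient of (1/2)||w||^2 + C * sum_n max(0, 1 - y_n f(x_n))
    grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    w -= lr * grad_w
    b -= lr * grad_b

train_acc = np.mean(np.sign(X @ w + b) == y)
```

Points outside the margin contribute nothing to the subgradient, so the solution is shaped only by the (few) margin violators, mirroring the support-vector picture.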
### Exercises {.unnumbered .unlisted}
Put some exercises here.
## Dual Support Vector Machine
The **Dual Support Vector Machine (SVM)** is an alternative formulation of the SVM optimization problem. While the **primal SVM** works with variables \( \mathbf{w} \) and \( b \) (whose size depends on the number of features \( D \)), the dual SVM expresses the problem in terms of Lagrange multipliers, where the number of parameters depends instead on the number of training examples. This formulation is especially useful when:
- The number of features is much larger than the number of examples.
- Kernel functions are used, as they can be incorporated easily in the dual form.
- We leverage **convex duality**, discussed earlier in the text.
---
### Convex Duality via Lagrange Multipliers
The **dual formulation** is derived from the primal soft-margin SVM by introducing Lagrange multipliers:
- \( \alpha_n \ge 0 \) for the classification constraints.
- \( \gamma_n \ge 0 \) for the non-negativity of the slack variables \( \xi_n \).
The Lagrangian is defined as:
\[
L(\mathbf{w}, b, \xi, \alpha, \gamma)
= \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{n=1}^N \xi_n
- \sum_{n=1}^N \alpha_n[y_n(\langle \mathbf{w}, \mathbf{x}_n \rangle + b) - 1 + \xi_n]
- \sum_{n=1}^N \gamma_n \xi_n.
\]
Setting derivatives to zero gives:
\[
\frac{\partial L}{\partial \mathbf{w}} = 0 \Rightarrow
\mathbf{w} = \sum_{n=1}^N \alpha_n y_n \mathbf{x}_n.
\]
This equation illustrates the **representer theorem**, showing that \( \mathbf{w} \) is a linear combination of the training examples. Only points with \( \alpha_n > 0 \) contribute to \( \mathbf{w} \); these are the **support vectors**.
---
### Dual Optimization Problem
Substituting \( \mathbf{w} \) back into the Lagrangian yields the dual problem:
\[
\begin{aligned}
\min_{\alpha} \quad &
\frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N y_i y_j \alpha_i \alpha_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle
- \sum_{i=1}^N \alpha_i \\
\text{subject to} \quad &
\sum_{i=1}^N y_i \alpha_i = 0, \\
& 0 \le \alpha_i \le C, \; \forall i = 1, \ldots, N.
\end{aligned}
\]
These box constraints restrict each \( \alpha_i \) between 0 and \( C \). Support vectors with \( 0 < \alpha_i < C \) lie exactly on the margin boundary.
The primal parameters can be recovered as:
\[
\mathbf{w}^* = \sum_{n=1}^N \alpha_n y_n \mathbf{x}_n,
\quad
b^* = y_n - \langle \mathbf{w}^*, \mathbf{x}_n \rangle
\]
(for any support vector with \( 0 < \alpha_n < C \), which lies exactly on the margin).
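As an illustration, the dual can be solved approximately with projected gradient ascent. The sketch below deliberately drops the bias \( b \), which removes the equality constraint \( \sum_n y_n \alpha_n = 0 \) and leaves only the box constraints; the data, step size, and iteration count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 0.5, (20, 2)),
               rng.normal(-2.0, 0.5, (20, 2))])
y = np.array([1.0] * 20 + [-1.0] * 20)

C = 1.0
# Q_ij = y_i y_j <x_i, x_j>: quadratic term of the dual objective
Q = (y[:, None] * y[None, :]) * (X @ X.T)

# Projected gradient ascent on the bias-free dual:
#   max  sum(alpha) - (1/2) alpha^T Q alpha,  s.t. 0 <= alpha_i <= C
alpha = np.zeros(len(y))
lr = 0.001
for _ in range(3000):
    grad = 1.0 - Q @ alpha                        # gradient of the dual
    alpha = np.clip(alpha + lr * grad, 0.0, C)    # project onto the box

w = (alpha * y) @ X          # w* = sum_n alpha_n y_n x_n
support = alpha > 1e-6       # support vectors: alpha_n > 0
train_acc = np.mean(np.sign(X @ w) == y)
```

The clipping step is exactly the projection onto the box constraints; well-classified points are pushed to \( \alpha_n = 0 \), so only support vectors remain active.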
---
### Dual SVM: Convex Hull View
An alternative geometric view of the dual SVM comes from **convex hulls**:
- Each class (positive and negative) forms a convex hull from its examples.
- The SVM finds the closest points \( c \) and \( d \) between these two convex hulls.
- The difference vector \( \mathbf{w} = \mathbf{c} - \mathbf{d} \) defines the **maximum-margin hyperplane**.
Mathematically:
\[
\text{conv}(X) = \left\{ \sum_{n=1}^N \alpha_n \mathbf{x}_n \; \Big| \; \sum_{n=1}^N \alpha_n = 1, \; \alpha_n \ge 0 \right\}.
\]
The optimization problem can then be expressed as minimizing the distance between the convex hulls:
\[
\min_{\alpha} \frac{1}{2}
\left\|
\sum_{n: y_n = +1} \alpha_n \mathbf{x}_n
- \sum_{n: y_n = -1} \alpha_n \mathbf{x}_n
\right\|^2,
\]
subject to:
\[
\sum_{n=1}^N y_n \alpha_n = 0, \quad \alpha_n \ge 0.
\]
For the soft-margin case, a reduced convex hull is used by limiting \( \alpha_n \le C \), shrinking the hull and allowing for misclassified examples.
### Exercises {.unnumbered .unlisted}
Put some exercises here.
## Kernels
The dual formulation of the SVM depends only on inner products between training examples \( \mathbf{x}_i \) and \( \mathbf{x}_j \). This modularity allows us to replace the inner product with a more general similarity function, giving rise to the concept of **kernels**.
---
### Feature Representations and Kernels
Let each input \( \mathbf{x} \) be represented in some (possibly nonlinear) feature space by a mapping:
\[
\phi(\mathbf{x}) : \mathcal{X} \rightarrow \mathcal{H},
\]
where \( \mathcal{H} \) is a Hilbert space. In this case, the dual SVM objective involves terms of the form:
\[
\langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle_{\mathcal{H}}.
\]
Rather than explicitly computing \( \phi(\mathbf{x}) \), we define a kernel function:
\[
k(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle_{\mathcal{H}}.
\]
This allows the SVM to implicitly operate in a high-dimensional (or even infinite-dimensional) space, without directly computing the transformation. This substitution is known as the **kernel trick** — it hides the explicit feature mapping while preserving the geometric relationships defined by the inner product.
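The kernel trick can be checked concretely for the degree-2 homogeneous polynomial kernel in \( \mathbb{R}^2 \), whose explicit feature map is \( \phi(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2) \):

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 homogeneous polynomial kernel in R^2
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly_kernel(x, y, d=2, c=0.0):
    # k(x, y) = (<x, y> + c)^d
    return (x @ y + c) ** d

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

implicit = poly_kernel(x, y)   # kernel value, no 3-D features ever built
explicit = phi(x) @ phi(y)     # same value via the explicit feature map
```

Both computations give \( (\langle \mathbf{x}, \mathbf{y} \rangle)^2 = 1 \) here, but the kernel evaluation never materializes the 3-dimensional feature vectors.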
---
### The Kernel Function and RKHS
- A **kernel** is a function \( k : \mathcal{X} \times \mathcal{X} \to \mathbb{R} \) that defines inner products in a **reproducing kernel Hilbert space (RKHS)**.
- Every valid kernel has a **unique RKHS** (Aronszajn, 1950).
- The canonical feature map is \( \phi(\mathbf{x}) = k(\cdot, \mathbf{x}) \).
The kernel must satisfy:
\[
\forall z \in \mathbb{R}^N : z^\top K z \ge 0,
\]
where \( K \) is the **Gram matrix** (or **kernel matrix**) defined by \( K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j) \). Together with the symmetry \( k(\mathbf{x}_i, \mathbf{x}_j) = k(\mathbf{x}_j, \mathbf{x}_i) \), this condition means that \( K \) is symmetric positive semidefinite.
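These properties are easy to verify numerically: build a Gram matrix from random data and check symmetry and the sign of its eigenvalues. The Gaussian RBF kernel, sample size, and bandwidth below are arbitrary choices:

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))

# Gram matrix K_ij = k(x_i, x_j)
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])

symmetric = np.allclose(K, K.T)
# Positive semidefinite: all eigenvalues nonnegative (up to numerical tolerance)
psd = np.all(np.linalg.eigvalsh(K) >= -1e-10)
```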
---
### Common Kernels
Some widely used kernels for data \( \mathbf{x}_i \in \mathbb{R}^D \) include:
- **Polynomial kernel:**
\[
k(\mathbf{x}_i, \mathbf{x}_j) = (\langle \mathbf{x}_i, \mathbf{x}_j \rangle + c)^d.
\]
- **Gaussian radial basis function (RBF) kernel:**
\[
k(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right).
\]
- **Rational quadratic kernel:**
\[
k(\mathbf{x}_i, \mathbf{x}_j) = 1 - \frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{\|\mathbf{x}_i - \mathbf{x}_j\|^2 + c}.
\]
These kernels allow nonlinear decision boundaries while the SVM still computes a linear separator in feature space.
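For illustration, the three kernels translate directly into numpy (the default parameter values are arbitrary):

```python
import numpy as np

def polynomial(xi, xj, c=1.0, d=3):
    # (<xi, xj> + c)^d
    return (xi @ xj + c) ** d

def gaussian_rbf(xi, xj, sigma=1.0):
    # exp(-||xi - xj||^2 / (2 sigma^2))
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

def rational_quadratic(xi, xj, c=1.0):
    # 1 - ||xi - xj||^2 / (||xi - xj||^2 + c)
    sq = np.sum((xi - xj) ** 2)
    return 1.0 - sq / (sq + c)

x = np.array([0.5, -1.0])
z = np.array([1.5, 2.0])
k_rbf = gaussian_rbf(x, z)    # in (0, 1]; equal to 1 only when xi == xj
k_rq = rational_quadratic(x, z)
```

Both distance-based kernels attain their maximum value 1 at \( \mathbf{x}_i = \mathbf{x}_j \) and decay as the points move apart.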
---
### Practical Aspects
The kernel trick offers computational efficiency — the kernel can be computed directly without constructing \( \phi(\mathbf{x}) \). For example,
- The polynomial kernel avoids explicit polynomial expansion.
- The Gaussian RBF kernel corresponds to an infinite-dimensional feature space.
The choice of kernel and its parameters (e.g., degree \( d \), variance \( \sigma \)) is typically made via nested cross-validation. Kernels are not restricted to vector data: they can be defined over *strings*, *graphs*, *sets*, *distributions*, and other structured objects.
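A minimal sketch of such parameter selection, assuming scikit-learn is available (the grid values and synthetic data are arbitrary choices): the inner `GridSearchCV` selects \( C \) and the RBF width, while the outer `cross_val_score` estimates the performance of the whole procedure, i.e., nested cross-validation.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 0.7, (30, 2)),
               rng.normal(-2.0, 0.7, (30, 2))])
y = np.array([1] * 30 + [-1] * 30)

# Inner loop: grid search over C and gamma = 1 / (2 * sigma^2)
param_grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]}
inner = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)

# Outer loop: unbiased estimate of the tuned model's performance
scores = cross_val_score(inner, X, y, cv=3)
```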
---
### Terminology Note
The term **"kernel"** can have different meanings:
1. **In this context:** kernel functions defining an RKHS (machine learning).
2. **In linear algebra:** the null space of a matrix.
3. **In statistics:** smoothing kernels in **kernel density estimation**.
### Exercises {.unnumbered .unlisted}
Put some exercises here.