<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <meta name="description" content="CS6969: Fast and Correct GPU Code studies the correctness, semantics, scheduling, and performance principles behind modern GPU kernels for ML and HPC.">
  <title>CS6969: Fast and Correct GPU Code</title>
  <link rel="stylesheet" href="css/style.css">
</head>
<body>
  <header class="hero">
    <div class="hero-inner">
      <p class="eyebrow" style="color: #ffffff;">University of Utah Special Topics</p>
      <h1 style="color: #f3d34a;">CS6969: Fast and Correct GPU Code</h1>
      <p class="lede">
        A project-centered course on building, understanding, testing, and improving
        GPU primitives for modern ML and HPC systems.
      </p>
    </div>
  </header>

  <nav class="topnav">
    <a href="#overview">Overview</a>
    <a href="#timetable">Timetable</a>
    <a href="#syllabus">Syllabus Snapshot</a>
    <a href="#projects">Projects</a>
    <a href="#resources">Resources</a>
    <a href="links.html" class="nav-accent" style="color: #ff8fc8;">Full Links</a>
    <a href="parfloat-archived/index.html">The Old ParFloat Class</a>
  </nav>

  <main>
    <section id="overview" class="panel">
      <div class="copy">
        <h2>Course Overview</h2>
        <p>
          Modern ML and HPC systems depend on carefully engineered computational
          primitives, including GPU kernels and numerical library functions, to
          achieve both high performance and trustworthy behavior. Even heavily
          tested kernels can hide subtle functional bugs, unstable numeric behavior,
          and performance defects that quietly leave large amounts of hardware
          capability unused.
        </p>
        <p>
          This course studies the numeric, semantic, and scheduling abstractions
          needed to build GPU code that is both correct and fast. We focus on how
          data representations, execution order, memory behavior, parallel
          synchronization, and performance models interact in real kernels. We
          also examine tools and methods such as data-flow languages, MLIR-style
          compiler infrastructures, verification techniques, and measurement-driven
          performance analysis that can make future primitives more systematic to
          design.
        </p>
        <p>Course highlights:</p>
        <ul class="highlights-list">
          <li>Hands-on work with AWS Neuron.</li>
          <li>A deep dive into MLIR-AIR and the MLIR transformations behind it.</li>
          <li>Detailed look at modern tile-based languages.</li>
        </ul>
        <p>
          The class is co-taught with <strong>Professor Sreepathi Pai of the University of Rochester</strong>.
          It is explicitly project-centered: student-designed primitives will be
          tested in realistic ML and HPC settings, and the course is intended to
          support paper writing and public artifact release when the work matures
          enough to justify it.
        </p>
      </div>
      <figure class="media">
        <img src="main-img.webp" alt="Illustration of parallel GPU arithmetic">
        <figcaption>GPU correctness and performance are treated together, not as separate concerns.</figcaption>
      </figure>
    </section>

    <section id="timetable" class="panel alt">
      <h2>Timetable</h2>
      <p>
        This table is a structured version of the timetable embedded in the
        shared syllabus document. It preserves the semester flow while keeping
        the public website readable.
      </p>
      <div class="table-wrap">
        <table class="schedule-table">
          <thead>
            <tr>
              <th>Date</th>
              <th>Lead</th>
              <th>Topics</th>
              <th>Readings / Slides</th>
              <th>Assignments / Notes</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Mon 1/5</td>
              <td>Both</td>
              <td>
                <ul class="cell-list">
                  <li>Course organization</li>
                  <li>Semester goals and project framing</li>
                </ul>
              </td>
              <td>
                <ul class="cell-list">
                  <li><a href="https://docs.google.com/presentation/d/1-KQhZabAgIDfrjPUgkTy4F6k2BqGpFiPdeS4oC_BKsw/edit?usp=sharing">Ganesh intro slides</a></li>
                  <li><a href="https://cs.rochester.edu/~sree/courses/cs6969-spring-2026/sree-intro.pdf">Sree intro slides</a></li>
                </ul>
              </td>
              <td>Semester launch</td>
            </tr>
            <tr>
              <td>Wed 1/7</td>
              <td>Both</td>
              <td>
                <ul class="cell-list">
                  <li>Number systems and tools</li>
                  <li>Intro to performance fundamentals</li>
                </ul>
              </td>
              <td>
                <ul class="cell-list">
                  <li><a href="https://docs.google.com/presentation/d/1wLwiYiBSj3b4UrlhirfCdA2O1dPMqRIJh1LMdygTYuo/edit?usp=sharing">Ganesh slides</a></li>
                  <li><a href="https://cs.rochester.edu/~sree/courses/cs6969-spring-2026/sree-perf-model.pdf">Sree performance slides</a></li>
                </ul>
              </td>
              <td>Asg1 released, due 1/14. See the <a href="https://www.overleaf.com/read/pcfjvhpzghpt#9ec326">assignment Overleaf</a>.</td>
            </tr>
            <tr>
              <td>Mon 1/12</td>
              <td>Sree</td>
              <td>Intro to GPU performance</td>
              <td><a href="https://cs.rochester.edu/~sree/courses/cs6969-spring-2026/sree-gpu-performance.pdf">GPU performance lecture material</a></td>
              <td>Continue Asg1</td>
            </tr>
            <tr>
              <td>Wed 1/14</td>
              <td>Both + student presenters</td>
              <td>
                <ul class="cell-list">
                  <li>Formal model of GPU execution</li>
                  <li>Throughput models</li>
                  <li>Race effects and GKLEE demo</li>
                </ul>
              </td>
              <td>
                <ul class="cell-list">
                  <li><a href="https://docs.google.com/presentation/d/1mijmFYeSxJ5pfZKFdObCJYOC4lZWRzosEBBJF5XHgtM/edit?usp=sharing">Ganesh slides</a></li>
                  <li><a href="https://ieeexplore.ieee.org/document/10289219">Facile</a></li>
                  <li><a href="https://dl.acm.org/doi/10.1145/3524059.3532396">uiCA</a></li>
                </ul>
              </td>
              <td>Asg2 assigned, due 1/21. Detect races using <a href="https://cogumbreiro.github.io/assets/faial-popl26.pdf">Faial</a> and optionally <a href="https://dl.acm.org/doi/pdf/10.1145/2145816.2145844">GKLEE</a>.</td>
            </tr>
            <tr>
              <td>Mon 1/19</td>
              <td>Holiday</td>
              <td>MLK Day</td>
              <td>No class</td>
              <td>University holiday</td>
            </tr>
            <tr>
              <td>Tue 1/20</td>
              <td>Guest talk</td>
              <td>Interactive computing in nature recreation and youth sports</td>
              <td>Prof. Michael Jones, BYU</td>
              <td>Special lecture</td>
            </tr>
            <tr>
              <td>Wed 1/21</td>
              <td>Guest talk</td>
              <td>Modular static cost analysis and related verification ideas</td>
              <td><a href="https://cogumbreiro.github.io/assets/faial-popl26.pdf">Tiago Cogumbreiro / Faial material</a></td>
              <td>Asg3 assigned, due 1/28. See the <a href="https://www.overleaf.com/read/kfzrpddjsdpn#6b6082">Asg-3 writeup workspace</a>.</td>
            </tr>
            <tr>
              <td>Thu 1/22</td>
              <td>Guest talk</td>
              <td>50 years of parallel programming</td>
              <td>Prof. Keshav Pingali</td>
              <td>Kahlert Distinguished Lecture</td>
            </tr>
            <tr>
              <td>Mon 1/26</td>
              <td>Ganesh + David</td>
              <td>AWS training; Tilus; modular scheduling</td>
              <td>
                <ul class="cell-list">
                  <li><a href="https://docs.google.com/presentation/d/1C6vt0McbS2if2_AJoVdhWAzqrTK6owV8I71zlvUIOcE/edit?usp=sharing">AWS training slides</a></li>
                  <li><a href="https://arxiv.org/pdf/2504.12984">Tilus paper</a></li>
                  <li><a href="https://github.com/parfloat/parfloat-class/tree/main/TILUS">Tilus repo experiments</a></li>
                </ul>
              </td>
              <td>AWS and low-precision kernel focus</td>
            </tr>
            <tr>
              <td>Wed 1/28</td>
              <td>Ganesh</td>
              <td>AWS training; Neuron architecture; Mojo</td>
              <td>
                <ul class="cell-list">
                  <li><a href="https://docs.google.com/presentation/d/1OVkwX4hO3V6tqrvNE9VwxiHOSabSzhxauVZp6ombeqU/edit?usp=sharing">Neuron architecture slides</a></li>
                  <li><a href="https://www.overleaf.com/read/wxrsxmdttcgw#da21ab">AWS writeup workspace</a></li>
                </ul>
              </td>
              <td>Asg4 due 2/6; Asg5 due 2/13.</td>
            </tr>
            <tr>
              <td>Mon 2/2</td>
              <td>Ganesh + Sree + students</td>
              <td>AWS tensor-addition walkthrough; profiling; student talks</td>
              <td>
                <ul class="cell-list">
                  <li><a href="https://arxiv.org/abs/2509.21039">Mojo in HPC</a></li>
                  <li><a href="https://graphics.pixar.com/library/RenderManXPU/">RenderMan XPU</a></li>
                </ul>
              </td>
              <td>Interactive experimental session</td>
            </tr>
            <tr>
              <td>Wed 2/4</td>
              <td>Both + student speakers</td>
              <td>Follow-on AWS material; student presentations</td>
              <td>
                <ul class="cell-list">
                  <li><a href="https://arxiv.org/abs/2512.04226">tritonBLAS</a></li>
                  <li><a href="https://arxiv.org/pdf/2511.13940">ParallelKittens</a></li>
                </ul>
              </td>
              <td>Project-selection writeup due 2/13</td>
            </tr>
            <tr>
              <td>Mon 2/9</td>
              <td>Both + student speaker</td>
              <td>Discussion of Asg1-Asg4; Hoare logic for GPU programs</td>
              <td>
                <ul class="cell-list">
                  <li><a href="https://dl.acm.org/doi/10.1145/3001834">Hoare Logic of GPU Programs</a></li>
                  <li><a href="https://dl.acm.org/doi/10.1109/SC41406.2024.00042">HiRace</a></li>
                </ul>
              </td>
              <td>Homework review</td>
            </tr>
            <tr>
              <td>Wed 2/11</td>
              <td>Both + student speaker</td>
              <td>Memory hierarchy paper discussion</td>
              <td><a href="https://arxiv.org/pdf/1903.07486">Dissecting the NVIDIA Turing T4 GPU via Microbenchmarking</a></td>
              <td>Read before class</td>
            </tr>
            <tr>
              <td>Mon 2/16</td>
              <td>Holiday</td>
              <td>Presidents Day</td>
              <td>No class</td>
              <td>University holiday</td>
            </tr>
            <tr>
              <td>Wed 2/18</td>
              <td>Both + student speakers</td>
              <td>ThunderKittens, HipKittens, TVM-FFI discussion</td>
              <td>
                <ul class="cell-list">
                  <li><a href="https://openreview.net/pdf?id=0fJfVOSUra">ThunderKittens</a></li>
                  <li><a href="https://arxiv.org/abs/2511.08083">HipKittens</a></li>
                  <li><a href="https://tvm.apache.org/ffi/">TVM-FFI</a> and <a href="https://github.com/apache/tvm-ffi?tab=readme-ov-file">repo</a></li>
                </ul>
              </td>
              <td>Paper-discussion format</td>
            </tr>
            <tr>
              <td>Mon 2/23</td>
              <td>Ganesh + student speakers</td>
              <td>MLIR-AIR paper and software tryout</td>
              <td>
                <ul class="cell-list">
                  <li><a href="https://dl.acm.org/doi/epdf/10.1145/3318464.3380595">MLIR paper</a></li>
                  <li><a href="https://arxiv.org/abs/2510.14871">MLIR-AIR paper</a></li>
                  <li><a href="https://github.com/parfloat/parfloat-class/tree/main/AIR2CUDA">AIR2CUDA</a></li>
                  <li><a href="https://youtu.be/hkgWi0oN_L8?si=Cl_H2Se2AlYvKmqN">Alex Zinenko talk</a></li>
                </ul>
              </td>
              <td>Unit-test and software exploration</td>
            </tr>
            <tr>
              <td>Wed 2/25</td>
              <td>Guest lecture</td>
              <td>Visit by Dr. Sangeeta Chowdhary on MLIR-AIR</td>
              <td>AMD / MLIR-AIR effort</td>
              <td>Asg6 due 3/6; final project proposal expected</td>
            </tr>
            <tr>
              <td>Mon 3/2</td>
              <td>Ganesh + students</td>
              <td>Faial race-checking and GKLEE</td>
              <td>
                <ul class="cell-list">
                  <li><a href="https://cogumbreiro.github.io/assets/faial-fmsd23.pdf">Faial FMSD paper</a></li>
                  <li><a href="https://dl.acm.org/doi/pdf/10.1145/2145816.2145844">GKLEE paper</a></li>
                  <li><a href="https://dl.acm.org/doi/10.1145/3352460.3358307">NVBit paper</a></li>
                </ul>
              </td>
              <td>Project-idea discussion</td>
            </tr>
            <tr>
              <td>Wed 3/4</td>
              <td>Students</td>
              <td>Brief project idea presentations</td>
              <td>
                <ul class="cell-list">
                  <li><a href="https://docs.google.com/presentation/d/13Wm5mQldICWj227BsHivin1ZhueXy4UJKfQorlbQ4KY/edit?usp=sharing">Correctness tooling slide deck</a></li>
                  <li><a href="https://dl.acm.org/doi/10.1145/103162.103163">Goldberg FP reading</a></li>
                  <li><a href="https://dl.acm.org/doi/pdf/10.1145/3736112.3736144">Additional reading</a></li>
                </ul>
              </td>
              <td>Project pitching</td>
            </tr>
            <tr>
              <td>Mon 3/9</td>
              <td>Holiday</td>
              <td>Spring break</td>
              <td>No class</td>
              <td>University holiday</td>
            </tr>
            <tr>
              <td>Wed 3/11</td>
              <td>Holiday</td>
              <td>Spring break</td>
              <td>No class</td>
              <td>University holiday</td>
            </tr>
            <tr>
              <td>Mon 3/16</td>
              <td>Project mode</td>
              <td>Project discussions</td>
              <td>Project-specific slides and brainstorming</td>
              <td>Loose post-break schedule begins</td>
            </tr>
            <tr>
              <td>Wed 3/18</td>
              <td>Project mode</td>
              <td>Project discussions</td>
              <td>Team meetings and review</td>
              <td>Project focus</td>
            </tr>
            <tr>
              <td>Mon 3/23</td>
              <td>Guest / project mode</td>
              <td>SLEEK paper and code discussion</td>
              <td>
                <ul class="cell-list">
                  <li><a href="https://userweb.cs.txstate.edu/~burtscher/papers/ipdps26.pdf">SLEEK paper</a></li>
                  <li><a href="https://github.com/burtscher/SLEEK/">SLEEK repository</a></li>
                </ul>
              </td>
              <td>Andrew Rodriguez presentation</td>
            </tr>
            <tr>
              <td>Wed 3/25</td>
              <td>Project mode</td>
              <td>Project meetings; brief look at Hexagon MLIR</td>
              <td>Qualcomm slides and related arXiv material</td>
              <td>Project focus</td>
            </tr>
            <tr>
              <td>Mon 3/30</td>
              <td>Project mode</td>
              <td>Project meetings</td>
              <td>Team progress and debugging</td>
              <td>Project focus</td>
            </tr>
            <tr>
              <td>Wed 4/1</td>
              <td>Discussion</td>
              <td>cuFuzz discussion</td>
              <td><a href="https://research.nvidia.com/publication/2026-03_hunting-cuda-bugs-scale-cufuzz">NVIDIA cuFuzz research page</a></td>
              <td>Tooling / bug-finding focus</td>
            </tr>
            <tr>
              <td>Mon 4/6</td>
              <td>Discussion</td>
              <td>MLIR transform dialect and xDSL</td>
              <td>Course-generated transformations, scripts, and slides</td>
              <td>Compiler transformation focus</td>
            </tr>
            <tr>
              <td>Mon 4/20</td>
              <td>All groups</td>
              <td>Last lecture; short project presentations</td>
              <td>In-class final project updates</td>
              <td>All groups present briefly</td>
            </tr>
            <tr>
              <td>Mon 4/27</td>
              <td>Due date</td>
              <td>Project reports due</td>
              <td>Final report submission</td>
              <td>Written deliverable deadline</td>
            </tr>
            <tr>
              <td>Mon 5/4</td>
              <td>Administrative</td>
              <td>Grades due</td>
              <td>End of semester</td>
              <td>Course closeout</td>
            </tr>
          </tbody>
        </table>
      </div>
    </section>

    <section id="syllabus" class="panel alt">
      <h2>Syllabus Snapshot</h2>
      <div class="toc-grid">
        <div class="toc-card">
          <h3>Table of Contents</h3>
          <ol>
            <li>Course organization and project expectations</li>
            <li>Number systems, floating-point, and tool foundations</li>
            <li>GPU performance fundamentals and throughput modeling</li>
            <li>Formal GPU execution models, races, and schedule-sensitive bugs</li>
            <li>AWS Trainium and Neuron/NKI experimentation</li>
            <li>Compiler and language systems: Tilus, Mojo, MLIR, MLIR-AIR</li>
            <li>Profiling, tracing, and performance-measurement workflows</li>
            <li>Verification, race-checking, and floating-point analysis</li>
            <li>Student paper presentations and visiting research talks</li>
            <li>Semester-long project development and artifact release</li>
          </ol>
        </div>
        <div class="toc-card">
          <h3>Semester Shape</h3>
          <ul>
            <li>Early weeks emphasize abstractions: numeric representation, correctness reasoning, and cost models.</li>
            <li>Middle weeks shift into concrete GPU and Trainium experimentation, with profiling and tool use.</li>
            <li>Later weeks increasingly revolve around paper discussion, project reviews, and system-building.</li>
            <li>Short student presentations are threaded throughout to connect reading with active experimentation.</li>
          </ul>
        </div>
      </div>
      <p>
        The detailed course document lays out a semester that moves from basic
        numerical and performance foundations toward concrete GPU experiments on
        current systems. The course opens with number systems, correctness tools,
        performance fundamentals, and GPU throughput modeling, then develops a
        formal view of GPU execution, race behavior, and schedule-sensitive bugs.
      </p>
      <p>
        The middle of the semester shifts into hands-on systems work. The shared
        syllabus emphasizes AWS Trainium access, profiling exercises, compiler and
        language ecosystems such as Mojo and Tilus, and performance-model readings.
        Students are expected to run kernels, measure them, explain observed
        bottlenecks, and connect those measurements back to formal and algorithmic
        reasoning about the code.
      </p>
      <p>
        The document also makes clear that the class is discussion-heavy and
        research-oriented. Short student presentations are integrated throughout
        the semester so that teams regularly read papers, explain ideas to the
        class, and use those readings to sharpen their own project direction.
        The overall pattern is deliberate: learn the abstractions, inspect real
        artifacts, experiment on modern accelerators, and then design or repair
        primitives with publication-quality rigor.
      </p>
    </section>

    <section id="projects" class="panel">
      <div class="copy">
        <h2>Projects and Outcomes</h2>
        <p>
          The course document frames the project as the center of the class. Teams
          are encouraged to choose ambitious GPU themes, including new kernels,
          correctness and performance diagnostics, compiler-assisted kernel design,
          and experiments involving realistic ML or HPC frameworks. Assignments
          along the way are structured to feed into that larger project rather than
          stand alone.
        </p>
        <p>
          Expected outcomes include a well-documented artifact, a clear correctness
          and performance story, and potentially a paper-ready result. Students are
          pushed not only to make something run, but to justify why it is correct,
          explain why it performs the way it does, and show how the underlying
          abstractions support those claims.
        </p>
      </div>
      <figure class="media dual">
        <img src="hammer-looking-for-nail.webp" alt="A hammer looking for a nail">
        <img src="nail-seeking-right-hammer.webp" alt="A nail looking for the right hammer">
        <figcaption>The course treats tools as methods to be matched to the right correctness and performance question.</figcaption>
      </figure>
    </section>

    <section id="resources" class="panel alt">
      <h2>Resources and Logistics</h2>
      <p>
        The detailed syllabus highlights several practical resources: University
        computing access, CHPC systems, AWS Trainium resources, and guest or
        partner lectures that connect course material to active systems research.
        The course also relies on shared communication channels and frequent
        instructor contact to keep project work moving.
      </p>
      <p>
        If you want the original detailed course planning material, see the
        archived course materials and the semester documents in the repository.
        This homepage is meant to provide the compact public-facing summary.
      </p>
      <p>
        A consolidated catalog of the public URLs embedded throughout the main
        syllabus document and its side tabs is available here:
        <a href="links.html">Full Link Catalog</a>.
      </p>
      <div class="catalog-grid">
        <article class="catalog-card">
          <h3>Software Tools</h3>
          <p>
            The shared course document points students toward a hands-on stack
            of systems for writing, checking, and profiling GPU primitives.
          </p>
          <ul>
            <li><strong>AWS Trainium + Neuron/NKI</strong>: the main accelerator experimentation path in the syllabus, including NKI kernels, Neuron Explorer, profiling traces, and attention and matrix-multiplication tutorials.</li>
            <li><strong>CHPC GPU workflow</strong>: CUDA-capable campus systems, <code>nvcc</code>, <code>nvidia-smi</code>, <code>nsys</code>, and batch allocation workflows for NVIDIA profiling.</li>
            <li><strong>Faial</strong>: a static data-race and cost-analysis tool used in the course to reason about warp-level behavior and correctness/performance interactions.</li>
            <li><strong>GKLEE</strong>: symbolic and concolic GPU bug-finding, used as a reference point for race exposure and schedule-sensitive failures.</li>
            <li><strong>Tilus</strong>: a tile-level GPGPU language for low-precision computation, treated as a language-design case study for structured primitive construction.</li>
            <li><strong>Mojo</strong>: discussed as an emerging systems language for high-performance kernel and HPC-oriented experimentation.</li>
            <li><strong>MLIR and MLIR-AIR</strong>: compiler infrastructure and accelerator-lowering frameworks used to connect loop nests, transformations, and hardware realization.</li>
            <li><strong>AIR2CUDA and related tooling</strong>: software artifacts used to inspect lowering pathways from MLIR-AIR-style flows toward GPU code generation.</li>
            <li><strong>NVBit and custom instrumentation</strong>: dynamic GPU instrumentation ideas, including barrier-focused tooling and low-level runtime inspection.</li>
            <li><strong>VerCors, CIVL, and FP analysis tools</strong>: formal and numeric-analysis tools for proving race freedom, checking semantics, and studying floating-point error.</li>
          </ul>
        </article>
        <article class="catalog-card">
          <h3>Papers by Topic</h3>
          <p>
            The readings in the shared syllabus cluster naturally into a few
            recurring themes.
          </p>
          <ul>
            <li><strong>Performance and throughput modeling</strong>: papers such as <em>uiCA</em>, <em>Facile</em>, the shared-memory atomic bottleneck work, and modular static cost analysis build the vocabulary for predicting and explaining kernel throughput.</li>
            <li><strong>GPU execution cost and productivity</strong>: works such as NPBench, data-centric Python, and CUDA cost-model papers connect user productivity, performance portability, and evaluation-cost reasoning.</li>
            <li><strong>Race detection and GPU verification</strong>: the syllabus groups FastTrack, FSE 2010 SMT-based GPU verification, GKLEE, GPUVerify, HiRace, Memory Access Protocols, and VerCors as complementary approaches to proving or detecting correctness properties.</li>
            <li><strong>Formal semantics and Hoare-style reasoning</strong>: materials such as Hoare logic for GPU programs, memory-model readings, and CIVL point students toward specification-first reasoning instead of purely empirical debugging.</li>
            <li><strong>Floating-point rigor</strong>: the background includes Goldberg’s classic essay, floating-point error-analysis work, Herbie-style rewriting, and scalable rigorous FP analysis, tying numerical semantics directly to kernel trustworthiness.</li>
            <li><strong>Scheduling, mapping, and specialization</strong>: software pipelining, warp specialization, distributed tensor mapping, and distributed Fourier mapping papers capture the scheduling side of making kernels and tensor systems fast.</li>
            <li><strong>Compiler and accelerator design</strong>: MLIR, MLIR-AIR, Tilus, and recent accelerator-lowering work show how modern compiler structures can encode performance intent and hardware structure more systematically.</li>
            <li><strong>Project-facing frontier systems</strong>: RenderMan XPU, tritonBLAS, ParallelKittens, ProofWright, GEAK, TileGym, and Tensor Core survey material serve as examples of current systems that students can study, reimplement, or benchmark against.</li>
          </ul>
        </article>
      </div>
    </section>
  </main>

  <footer>
    <p>CS6969: Fast and Correct GPU Code</p>
  </footer>
</body>
</html>