Adjust build config and add benchmarks#2
Conversation
flyingcow8
commented
Jun 3, 2026
- Change MHA_NUM_JOBS calculation to round up
- Set default HDIM=128,256, DTYPE=BF16
- Add .gitignore and benchmarks directory
- Change MHA_NUM_JOBS calculation to round up - Set default HDIM=128, DTYPE=BF16 - Change make parallel jobs to fixed value 6 - Add .gitignore and benchmarks directory Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request introduces modular build options to Flash Attention, allowing users to disable backward kernels, dropout, and various forward attention variants (such as local, ALiBi, softcap, append-KV, and causal) to enable faster compilation. It updates the CMake, Makefile, and setup configurations to support these options, documents them in the README, and adds a new KV cache benchmark script. The review feedback highlights a potential shape mismatch error in the new benchmark script when the number of blocks is not divisible by the batch size, and points out fragile CPU core detection logic in the Makefile and test script that could lead to build failures.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| CURRENT_CPU_NUM:=$(shell grep "cpu cores" /proc/cpuinfo | head -1 | awk '{print $$4}') | ||
| # If in maca build, restrict num_jobs to 140 to resolve x86 docker which using 384 cores leads to OOM | ||
| ifdef BUILDROOT | ||
| # In x86 docker maca building, "make -j112" is a stable parallel number, so we use 140 here to get 140*0.8=112 | ||
| CURRENT_CPU_NUM:=$(shell awk -v a=$(CURRENT_CPU_NUM) 'BEGIN {if(a>140) print 140; else print a}') | ||
| endif | ||
| MHA_NUM_JOBS:=$(shell awk -v n=$(CURRENT_CPU_NUM) 'BEGIN {print int(n * 0.8)}') | ||
| HDIM ?= 0 | ||
| DTYPE ?= all | ||
| # FAST_BUILD is not currently supported. | ||
| FAST_BUILD ?= 0 | ||
| GEN_KERNEL ?= 0 | ||
| MHA_NUM_JOBS:=$(shell awk -v n=$(CURRENT_CPU_NUM) 'BEGIN {print int(n*0.8)}') |
There was a problem hiding this comment.
Using grep "cpu cores" /proc/cpuinfo to determine the CPU count is fragile and can fail or return empty in virtualized, containerized, or non-x86 environments. If it returns empty, MHA_NUM_JOBS evaluates to 0, which causes make -j0 to fail with a syntax error.\n\nUsing nproc is much more robust and standard across Linux environments. We should also ensure MHA_NUM_JOBS is at least 1 to prevent invalid -j0 build options.
CURRENT_CPU_NUM:=$(shell nproc 2>/dev/null || echo 4)\n# If in maca build, restrict num_jobs to 140 to resolve x86 docker which using 384 cores leads to OOM\nifdef BUILDROOT\n# In x86 docker maca building, "make -j112" is a stable parallel number, so we use 140 here to get 140*0.8=112\n\tCURRENT_CPU_NUM:=$(shell awk -v a=$(CURRENT_CPU_NUM) 'BEGIN {if(a>140) print 140; else print a}')\nendif\nMHA_NUM_JOBS:=$(shell awk -v n=$(CURRENT_CPU_NUM) 'BEGIN {jobs=int(n*0.8); print (jobs>1?jobs:1)}')
| num_blocks = math.ceil(seqlen_k / page_block_size) * batch_size * 3 | ||
| num_blocks = max(1024, num_blocks) |
There was a problem hiding this comment.
If num_blocks is capped at 1024 via max(1024, num_blocks), it is not guaranteed to be divisible by batch_size (e.g., if batch_size is a non-power-of-2 like 3, 5, 6, etc.). This will cause rearrange on line 70 to fail with a shape mismatch error because the total number of elements in torch.randperm(num_blocks) cannot be reshaped into (batch_size, nblocks).\n\nTo prevent this, we should ensure that the minimum block count is scaled per batch size so that num_blocks is always a multiple of batch_size.
blocks_per_batch = math.ceil(seqlen_k / page_block_size) * 3\n blocks_per_batch = max(math.ceil(1024 / batch_size), blocks_per_batch)\n num_blocks = blocks_per_batch * batch_size| CURRENT_CPU_NUM=$(grep "cpu cores" /proc/cpuinfo | head -1 | awk '{print $4}') | ||
| MHA_NUM_JOBS=$(awk -v n=$CURRENT_CPU_NUM 'BEGIN {n=int(n*0.8); if(n>12) print 12; else print n}') |
There was a problem hiding this comment.
Similar to the issue in the Makefile, using grep "cpu cores" /proc/cpuinfo is fragile and can fail in containerized or non-x86 environments, leading to an empty CURRENT_CPU_NUM and causing awk to output 0 or fail. Using nproc is much more robust.
| CURRENT_CPU_NUM=$(grep "cpu cores" /proc/cpuinfo | head -1 | awk '{print $4}') | |
| MHA_NUM_JOBS=$(awk -v n=$CURRENT_CPU_NUM 'BEGIN {n=int(n*0.8); if(n>12) print 12; else print n}') | |
| CURRENT_CPU_NUM=$(nproc 2>/dev/null || echo 4)\nMHA_NUM_JOBS=$(awk -v n=$CURRENT_CPU_NUM 'BEGIN {n=int(n*0.8); if(n>12) n=12; if(n<1) n=1; print n}') |