Skip to content

Adjust build config and add benchmarks#2

Open
flyingcow8 wants to merge 1 commit into
MetaX-MACA:2.25.2from
flyingcow8:2.25.2
Open

Adjust build config and add benchmarks#2
flyingcow8 wants to merge 1 commit into
MetaX-MACA:2.25.2from
flyingcow8:2.25.2

Conversation

@flyingcow8

Copy link
Copy Markdown
  • Change MHA_NUM_JOBS calculation to round up
  • Set default HDIM=128,256, DTYPE=BF16
  • Add .gitignore and benchmarks directory

- Change MHA_NUM_JOBS calculation to round up
- Set default HDIM=128, DTYPE=BF16
- Change make parallel jobs to fixed value 6
- Add .gitignore and benchmarks directory

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces modular build options to Flash Attention, allowing users to disable backward kernels, dropout, and various forward attention variants (such as local, ALiBi, softcap, append-KV, and causal) to enable faster compilation. It updates the CMake, Makefile, and setup configurations to support these options, documents them in the README, and adds a new KV cache benchmark script. The review feedback highlights a potential shape mismatch error in the new benchmark script when the number of blocks is not divisible by the batch size, and points out fragile CPU core detection logic in the Makefile and test script that could lead to build failures.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread Makefile
Comment on lines +1 to +7
CURRENT_CPU_NUM:=$(shell grep "cpu cores" /proc/cpuinfo | head -1 | awk '{print $$4}')
# If in maca build, restrict num_jobs to 140 to resolve x86 docker which using 384 cores leads to OOM
ifdef BUILDROOT
# In x86 docker maca building, "make -j112" is a stable parallel number, so we use 140 here to get 140*0.8=112
CURRENT_CPU_NUM:=$(shell awk -v a=$(CURRENT_CPU_NUM) 'BEGIN {if(a>140) print 140; else print a}')
endif
MHA_NUM_JOBS:=$(shell awk -v n=$(CURRENT_CPU_NUM) 'BEGIN {print int(n * 0.8)}')
HDIM ?= 0
DTYPE ?= all
# FAST_BUILD is not currently supported.
FAST_BUILD ?= 0
GEN_KERNEL ?= 0
MHA_NUM_JOBS:=$(shell awk -v n=$(CURRENT_CPU_NUM) 'BEGIN {print int(n*0.8)}')

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using grep "cpu cores" /proc/cpuinfo to determine the CPU count is fragile and can fail or return empty in virtualized, containerized, or non-x86 environments. If it returns empty, MHA_NUM_JOBS evaluates to 0, which causes make -j0 to fail with a syntax error.\n\nUsing nproc is much more robust and standard across Linux environments. We should also ensure MHA_NUM_JOBS is at least 1 to prevent invalid -j0 build options.

CURRENT_CPU_NUM:=$(shell nproc 2>/dev/null || echo 4)\n# If in maca build, restrict num_jobs to 140 to resolve x86 docker which using 384 cores leads to OOM\nifdef BUILDROOT\n# In x86 docker maca building, "make -j112" is a stable parallel number, so we use 140 here to get 140*0.8=112\n\tCURRENT_CPU_NUM:=$(shell awk -v a=$(CURRENT_CPU_NUM) 'BEGIN {if(a>140) print 140; else print a}')\nendif\nMHA_NUM_JOBS:=$(shell awk -v n=$(CURRENT_CPU_NUM) 'BEGIN {jobs=int(n*0.8); print (jobs>1?jobs:1)}')

Comment on lines +51 to +52
num_blocks = math.ceil(seqlen_k / page_block_size) * batch_size * 3
num_blocks = max(1024, num_blocks)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If num_blocks is capped at 1024 via max(1024, num_blocks), it is not guaranteed to be divisible by batch_size (e.g., if batch_size is a non-power-of-2 like 3, 5, 6, etc.). This will cause rearrange on line 70 to fail with a shape mismatch error because the total number of elements in torch.randperm(num_blocks) cannot be reshaped into (batch_size, nblocks).\n\nTo prevent this, we should ensure that the minimum block count is scaled per batch size so that num_blocks is always a multiple of batch_size.

    blocks_per_batch = math.ceil(seqlen_k / page_block_size) * 3\n    blocks_per_batch = max(math.ceil(1024 / batch_size), blocks_per_batch)\n    num_blocks = blocks_per_batch * batch_size

Comment thread test_jobs.sh
Comment on lines +2 to +3
CURRENT_CPU_NUM=$(grep "cpu cores" /proc/cpuinfo | head -1 | awk '{print $4}')
MHA_NUM_JOBS=$(awk -v n=$CURRENT_CPU_NUM 'BEGIN {n=int(n*0.8); if(n>12) print 12; else print n}')

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to the issue in the Makefile, using grep "cpu cores" /proc/cpuinfo is fragile and can fail in containerized or non-x86 environments, leading to an empty CURRENT_CPU_NUM and causing awk to output 0 or fail. Using nproc is much more robust.

Suggested change
CURRENT_CPU_NUM=$(grep "cpu cores" /proc/cpuinfo | head -1 | awk '{print $4}')
MHA_NUM_JOBS=$(awk -v n=$CURRENT_CPU_NUM 'BEGIN {n=int(n*0.8); if(n>12) print 12; else print n}')
CURRENT_CPU_NUM=$(nproc 2>/dev/null || echo 4)\nMHA_NUM_JOBS=$(awk -v n=$CURRENT_CPU_NUM 'BEGIN {n=int(n*0.8); if(n>12) n=12; if(n<1) n=1; print n}')

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant