Fix swiglu_decode intermediate comparison to use chained reference #105

Merged
Conversation
The decode test verified `intermediate` against `golden_ref["intermediate"]` (CPU-computed `silu(golden_left) * golden_right`), while the prefill test uses a chained reference built from the observed AIE `left_swished` and `right` buffers.

That inconsistency surfaces as spurious failures at rectangular FFN shapes (e.g. embedding=1024, hidden=3584): the AIE SiLU LUT rounds near-zero outputs to exactly 0.0 where fp32 CPU silu keeps a tiny negative value, and the subsequent multiply against a large-magnitude right operand amplifies that sub-tolerance drift into "got 0.0, expected -1.27"-style mismatches. The AIE kernels are numerically correct: the observed intermediate matches `observed_left_swished * observed_right` exactly, and the final output already passes at a tighter tolerance. Only the verification methodology was wrong.

Switch to the prefill-style chained reference and tighten the tolerance to (rel=0.04, abs=0.4), the same values used for the output check and the prefill intermediate check. Add (1024, 3584) to the parametrization so rectangular decode is covered in CI.
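The chained-reference idea can be sketched in a few lines of NumPy. This is an illustrative sketch, not the repo's actual test code: the helper name `intermediate_matches` and the concrete values below are made up to demonstrate the amplification effect described above.

```python
import numpy as np

def cpu_silu(x):
    # fp32 reference SiLU: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def intermediate_matches(observed_intermediate, observed_left_swished,
                         observed_right, rel_tol=0.04, abs_tol=0.4):
    # Chained-reference check (prefill style): the reference is the product
    # of the *observed* upstream AIE buffers, so upstream LUT/bf16 rounding
    # is not double-counted against the multiply stage.
    chained_ref = observed_left_swished * observed_right
    return np.allclose(observed_intermediate, chained_ref,
                       rtol=rel_tol, atol=abs_tol)

# Toy illustration (values invented for this sketch): a SiLU LUT that
# rounds a tiny negative output to exactly 0.0 is itself well within
# tolerance, but a golden reference multiplies the fp32 SiLU value by a
# large-magnitude right operand and turns that sub-tolerance drift into
# a hard mismatch.
x = np.array([-12.0], dtype=np.float32)              # very negative SiLU input
right = np.array([17000.0], dtype=np.float32)        # large-magnitude operand
aie_left_swished = np.array([0.0], dtype=np.float32) # LUT rounds to 0.0
observed_intermediate = aie_left_swished * right     # exactly 0.0

golden_ref = cpu_silu(x) * right                     # noticeably negative
print(intermediate_matches(observed_intermediate, aie_left_swished, right))  # True
print(np.allclose(observed_intermediate, golden_ref, rtol=0.04, atol=0.4))   # False
```

The chained check passes because the intermediate is bit-exact against the observed upstream product, while the golden-reference check fails even though no stage individually exceeded tolerance.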
Collaborator

Thanks, this is a useful improvement. I just kicked off the CI; if this passes I'll merge it.
andrej approved these changes Apr 17, 2026

Contributor
CI Test Results

d22ca66 (2026_04_17_16_04_27) IRONCLAD - CI Summary

Examples
Small
Extensive

Krackan - Small (IRONCLAD)
Trends: IRONCLAD Trends

M_128-K_128-num_aie_columns_1-tile_size_input_32-tile_size_output_128
M_1792-K_896-N_1152-num_aie_columns_8-b_col_maj_False-c_col_maj_True-m_64-k_32-n_48-trace_size_0-partition_N_1
M_192-K_384-N_64-num_aie_columns_4-b_col_maj_False-c_col_maj_False-m_48-k_96-n_16-trace_size_0-partition_N_1
M_192-K_384-N_64-num_aie_columns_4-b_col_maj_True-c_col_maj_True-m_48-k_96-n_16-trace_size_0-partition_N_1
M_2048-K_2048-N_2048-num_aie_columns_1-b_col_maj_False-c_col_maj_False-m_64-k_64-n_64-trace_size_0-partition_N_1
M_2048-K_2048-N_2048-num_aie_columns_2-b_col_maj_True-c_col_maj_False-m_64-k_64-n_64-trace_size_0-partition_N_1
M_2048-K_2048-N_2048-num_aie_columns_8-b_col_maj_True-c_col_maj_True-m_64-k_64-n_64-trace_size_0-partition_N_1
M_2048-K_8192-num_aie_columns_1-tile_size_input_1-tile_size_output_2048
M_2048-K_8192-num_aie_columns_2-tile_size_input_1-tile_size_output_1024
M_2048-K_8192-num_aie_columns_4-tile_size_input_1-tile_size_output_512
M_2048-K_8192-num_aie_columns_8-tile_size_input_1-tile_size_output_256
M_2048-N_64-aie_columns_1-channels_1-m_64-n_64-s_8
M_2048-N_64-aie_columns_1-channels_2-m_64-n_64-s_8
M_384-K_1536-N_1792-num_aie_columns_4-b_col_maj_True-c_col_maj_False-m_32-k_48-n_64-trace_size_0-partition_N_1
M_64-K_512-N_256-num_aie_columns_4-b_col_maj_True-c_col_maj_False-m_16-k_64-n_64-trace_size_0-partition_N_4
M_8192-K_2048-num_aie_columns_1-tile_size_input_4-tile_size_output_1024
M_8192-K_2048-num_aie_columns_2-tile_size_input_4-tile_size_output_1024
M_8192-K_2048-num_aie_columns_4-tile_size_input_4-tile_size_output_1024
M_8192-K_2048-num_aie_columns_8-tile_size_input_4-tile_size_output_1024
M_896-K_1792-N_640-num_aie_columns_8-b_col_maj_False-c_col_maj_True-m_32-k_64-n_80-trace_size_0-partition_N_1
embedding_dim_1024-hidden_dim_3584
embedding_dim_2048-hidden_dim_2048
input_length_2048-num_aie_columns_1-num_channels_1-tile_size_2048
input_length_2048-num_aie_columns_1-num_channels_1-tile_size_2048-group_size_32
input_length_2048-num_aie_columns_1-num_channels_1-tile_size_2048-weighted_False
input_length_2048-num_aie_columns_1-num_channels_1-tile_size_2048-weighted_True
input_length_2048-num_aie_columns_1-num_channels_2-tile_size_1024
input_length_2048-num_aie_columns_1-num_channels_2-tile_size_1024-group_size_32
input_length_2048-num_aie_columns_1-num_channels_2-tile_size_1024-weighted_False
input_length_2048-num_aie_columns_1-num_channels_2-tile_size_1024-weighted_True
input_length_2048-num_aie_columns_1-tile_size_2048
input_length_2048-num_aie_columns_1-tile_size_2048-scalar_factor_3.0
input_length_2048-num_aie_columns_2-num_channels_1-tile_size_1024
input_length_2048-num_aie_columns_2-num_channels_1-tile_size_1024-group_size_32
input_length_2048-num_aie_columns_2-num_channels_1-tile_size_1024-weighted_False
input_length_2048-num_aie_columns_2-num_channels_1-tile_size_1024-weighted_True
input_length_2048-num_aie_columns_2-num_channels_2-tile_size_512
input_length_2048-num_aie_columns_2-num_channels_2-tile_size_512-group_size_32
input_length_2048-num_aie_columns_2-num_channels_2-tile_size_512-weighted_False
input_length_2048-num_aie_columns_2-num_channels_2-tile_size_512-weighted_True
input_length_2048-num_aie_columns_2-tile_size_1024
input_length_2048-num_aie_columns_2-tile_size_1024-scalar_factor_3.0
input_length_2048-num_aie_columns_4-num_channels_1-tile_size_512
input_length_2048-num_aie_columns_4-num_channels_1-tile_size_512-group_size_32
input_length_2048-num_aie_columns_4-num_channels_1-tile_size_512-weighted_False
input_length_2048-num_aie_columns_4-num_channels_1-tile_size_512-weighted_True
input_length_2048-num_aie_columns_4-num_channels_2-tile_size_256
input_length_2048-num_aie_columns_4-num_channels_2-tile_size_256-group_size_32
input_length_2048-num_aie_columns_4-num_channels_2-tile_size_256-weighted_False
input_length_2048-num_aie_columns_4-num_channels_2-tile_size_256-weighted_True
input_length_2048-num_aie_columns_4-tile_size_512
input_length_2048-num_aie_columns_4-tile_size_512-scalar_factor_3.0
input_length_2048-num_aie_columns_8-num_channels_1-tile_size_256
input_length_2048-num_aie_columns_8-num_channels_1-tile_size_256-group_size_32
input_length_2048-num_aie_columns_8-num_channels_1-tile_size_256-weighted_False
input_length_2048-num_aie_columns_8-num_channels_1-tile_size_256-weighted_True
input_length_2048-num_aie_columns_8-num_channels_2-tile_size_128
input_length_2048-num_aie_columns_8-num_channels_2-tile_size_128-group_size_32
input_length_2048-num_aie_columns_8-num_channels_2-tile_size_128-weighted_False
input_length_2048-num_aie_columns_8-tile_size_256
input_length_2048-num_aie_columns_8-tile_size_256-scalar_factor_3.0
input_length_2048-num_cores_1-num_channels_1-bypass_False-tile_size_2048
input_length_2048-num_cores_16-num_channels_2-bypass_False-tile_size_128
input_length_2048-num_cores_2-num_channels_1-bypass_False-tile_size_1024
input_length_2048-num_cores_2-num_channels_2-bypass_False-tile_size_1024
input_length_2048-num_cores_4-num_channels_1-bypass_False-tile_size_512
input_length_2048-num_cores_4-num_channels_2-bypass_False-tile_size_512
input_length_2048-num_cores_8-num_channels_1-bypass_False-tile_size_256
input_length_2048-num_cores_8-num_channels_2-bypass_False-tile_size_256
input_length_32768-num_aie_columns_2-num_channels_2-tile_size_1024
input_length_32768-num_aie_columns_2-num_channels_2-tile_size_2048
input_length_32768-num_aie_columns_2-num_channels_2-tile_size_512
rows_32-cols_512-angle_rows_32-aie_columns_1-method_type_0
rows_32-cols_512-angle_rows_32-aie_columns_2-method_type_0
rows_32-cols_512-angle_rows_32-aie_columns_4-method_type_0
rows_32-cols_512-angle_rows_32-aie_columns_8-method_type_0
rows_32-cols_512-angle_rows_8-aie_columns_1-method_type_0
rows_32-cols_512-angle_rows_8-aie_columns_2-method_type_0
rows_32-cols_512-angle_rows_8-aie_columns_4-method_type_0
rows_32-cols_512-angle_rows_8-aie_columns_8-method_type_0
seq_len_16384-dim_64-num_heads_1-num_pipelines_8-num_kv_heads_0
seq_len_256-embedding_dim_2048-hidden_dim_2048-prio_accuracy_False
Krackan - Examples (IRONCLAD)
Trends: IRONCLAD Trends

llama_3.2_1b_prompt_1024_tokens_1
llama_3.2_1b_prompt_1024_tokens_40
llama_3.2_1b_prompt_13_tokens_1
llama_3.2_1b_prompt_13_tokens_40
Phoenix - Small (IRONCLAD)
Trends: IRONCLAD Trends

M_128-K_128-num_aie_columns_1-tile_size_input_32-tile_size_output_128
M_192-K_384-N_64-num_aie_columns_4-b_col_maj_False-c_col_maj_False-m_48-k_96-n_16-trace_size_0-partition_N_1
M_192-K_384-N_64-num_aie_columns_4-b_col_maj_True-c_col_maj_True-m_48-k_96-n_16-trace_size_0-partition_N_1
M_2048-K_2048-N_2048-num_aie_columns_1-b_col_maj_False-c_col_maj_False-m_64-k_64-n_64-trace_size_0-partition_N_1
M_2048-K_2048-N_2048-num_aie_columns_2-b_col_maj_True-c_col_maj_False-m_64-k_64-n_64-trace_size_0-partition_N_1
M_2048-K_8192-num_aie_columns_1-tile_size_input_1-tile_size_output_2048
M_2048-K_8192-num_aie_columns_2-tile_size_input_1-tile_size_output_1024
M_2048-K_8192-num_aie_columns_4-tile_size_input_1-tile_size_output_512
M_2048-N_64-aie_columns_1-channels_1-m_64-n_64-s_8
M_2048-N_64-aie_columns_1-channels_2-m_64-n_64-s_8
M_384-K_1536-N_1792-num_aie_columns_4-b_col_maj_True-c_col_maj_False-m_32-k_48-n_64-trace_size_0-partition_N_1
M_64-K_512-N_256-num_aie_columns_4-b_col_maj_True-c_col_maj_False-m_16-k_64-n_64-trace_size_0-partition_N_4
M_8192-K_2048-num_aie_columns_1-tile_size_input_4-tile_size_output_1024
M_8192-K_2048-num_aie_columns_2-tile_size_input_4-tile_size_output_1024
M_8192-K_2048-num_aie_columns_4-tile_size_input_4-tile_size_output_1024
embedding_dim_1024-hidden_dim_3584
embedding_dim_2048-hidden_dim_2048
input_length_2048-num_aie_columns_1-num_channels_1-tile_size_2048
input_length_2048-num_aie_columns_1-num_channels_1-tile_size_2048-group_size_32
input_length_2048-num_aie_columns_1-num_channels_1-tile_size_2048-weighted_False
input_length_2048-num_aie_columns_1-num_channels_1-tile_size_2048-weighted_True
input_length_2048-num_aie_columns_1-num_channels_2-tile_size_1024
input_length_2048-num_aie_columns_1-num_channels_2-tile_size_1024-group_size_32
input_length_2048-num_aie_columns_1-num_channels_2-tile_size_1024-weighted_False
input_length_2048-num_aie_columns_1-num_channels_2-tile_size_1024-weighted_True
input_length_2048-num_aie_columns_1-tile_size_2048
input_length_2048-num_aie_columns_1-tile_size_2048-scalar_factor_3.0
input_length_2048-num_aie_columns_2-num_channels_1-tile_size_1024
input_length_2048-num_aie_columns_2-num_channels_1-tile_size_1024-group_size_32
input_length_2048-num_aie_columns_2-num_channels_1-tile_size_1024-weighted_False
input_length_2048-num_aie_columns_2-num_channels_1-tile_size_1024-weighted_True
input_length_2048-num_aie_columns_2-num_channels_2-tile_size_512
input_length_2048-num_aie_columns_2-num_channels_2-tile_size_512-group_size_32
input_length_2048-num_aie_columns_2-num_channels_2-tile_size_512-weighted_False
input_length_2048-num_aie_columns_2-num_channels_2-tile_size_512-weighted_True
input_length_2048-num_aie_columns_2-tile_size_1024
input_length_2048-num_aie_columns_2-tile_size_1024-scalar_factor_3.0
input_length_2048-num_aie_columns_4-num_channels_1-tile_size_512
input_length_2048-num_aie_columns_4-num_channels_1-tile_size_512-group_size_32
input_length_2048-num_aie_columns_4-num_channels_1-tile_size_512-weighted_False
input_length_2048-num_aie_columns_4-num_channels_1-tile_size_512-weighted_True
input_length_2048-num_aie_columns_4-num_channels_2-tile_size_256
input_length_2048-num_aie_columns_4-num_channels_2-tile_size_256-group_size_32
input_length_2048-num_aie_columns_4-num_channels_2-tile_size_256-weighted_False
input_length_2048-num_aie_columns_4-tile_size_512
input_length_2048-num_aie_columns_4-tile_size_512-scalar_factor_3.0
input_length_2048-num_cores_1-num_channels_1-bypass_False-tile_size_2048
input_length_2048-num_cores_2-num_channels_1-bypass_False-tile_size_1024
input_length_2048-num_cores_2-num_channels_2-bypass_False-tile_size_1024
input_length_2048-num_cores_4-num_channels_1-bypass_False-tile_size_512
input_length_2048-num_cores_4-num_channels_2-bypass_False-tile_size_512
input_length_2048-num_cores_8-num_channels_2-bypass_False-tile_size_256
input_length_32768-num_aie_columns_2-num_channels_2-tile_size_1024
input_length_32768-num_aie_columns_2-num_channels_2-tile_size_2048
input_length_32768-num_aie_columns_2-num_channels_2-tile_size_512
rows_32-cols_512-angle_rows_32-aie_columns_1-method_type_0
rows_32-cols_512-angle_rows_32-aie_columns_2-method_type_0
rows_32-cols_512-angle_rows_32-aie_columns_4-method_type_0
rows_32-cols_512-angle_rows_8-aie_columns_1-method_type_0
rows_32-cols_512-angle_rows_8-aie_columns_2-method_type_0
rows_32-cols_512-angle_rows_8-aie_columns_4-method_type_0
seq_len_256-embedding_dim_2048-hidden_dim_2048-prio_accuracy_False
Phoenix - Examples (IRONCLAD)
Trends: IRONCLAD Trends
Aligns `swiglu_decode/test.py` with `swiglu_prefill/test.py` by verifying the `intermediate` buffer against a chained reference built from the observed AIE `left_swished` and `right` buffers, instead of against the CPU-computed `golden_ref["intermediate"]`.

The golden-reference path amplifies legitimate, sub-tolerance bf16 drift from upstream stages (e.g. SiLU of very-negative inputs, where the AIE LUT rounds to `0.0` while fp32 CPU silu preserves a tiny negative value) through the multiplication against a large-magnitude `right` operand, producing spurious "got 0.0, expected -1.27"-style failures. The AIE kernels themselves are numerically correct, the observed `intermediate` matches `observed_left_swished * observed_right` exactly, and the final `output` already passes at a tighter tolerance than the intermediate stage.

This issue surfaces at rectangular FFN shapes (e.g. `embedding_dim=1024, hidden_dim=3584`), where the statistics of the SiLU input distribution make near-zero LUT outputs more common than at the previously tested square `2048²` shape. Adds `(1024, 3584)` to the parametrization so regressions in rectangular decode are caught.

Added

- `(1024, 3584)` parametrization in `iron/operators/swiglu_decode/test.py`, reflecting Qwen3.5-0.8B FFN dims so rectangular decode is covered alongside the existing square smoke test.

Changed

- `iron/operators/swiglu_decode/test.py`: verify `intermediate` against a chained reference (`observed_left_swished * observed_right`) rather than `golden_ref["intermediate"]`, matching the approach already in `swiglu_prefill/test.py`. Tightens the tolerance to `rel_tol=0.04, abs_tol=0.4` accordingly (same values used for the output check and for prefill).

Removed
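The parametrization change described above might look roughly like the following. This is a hypothetical sketch: the fixture names, the test-function name, and the placeholder body are invented here, and the real test in `iron/operators/swiglu_decode/test.py` will differ.

```python
import pytest

# Shape table for the decode test (hypothetical names; the comments reflect
# the shapes described in this PR).
FFN_SHAPES = [
    (2048, 2048),  # existing square smoke-test shape
    (1024, 3584),  # rectangular shape added by this PR (Qwen3.5-0.8B FFN dims)
]

@pytest.mark.parametrize("embedding_dim,hidden_dim", FFN_SHAPES)
def test_swiglu_decode_shapes(embedding_dim, hidden_dim):
    # Placeholder body: the real test builds the AIE design for this shape,
    # runs decode, and applies the chained-reference intermediate check at
    # rel_tol=0.04, abs_tol=0.4.
    assert embedding_dim > 0 and hidden_dim > 0
```

Keeping the square shape alongside the new rectangular one means a regression in either the LUT path or the rectangular tiling shows up as a distinct parametrized failure in CI.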
Testing
Verified on NPU2 (Strix, `aie2p`):

- `pytest iron/operators/swiglu_decode/test.py -v --iterations 1`: both `2048×2048` and `1024×3584` pass.
- `pytest iron/operators/ -m "not extensive" --iterations 1`: no regressions.

PR Merge Checklist

- `devel` commit and pointing to `devel`.