Using commit 9873887
Crash
EXC_BAD_ACCESS (KERN_INVALID_ADDRESS at 0x8) during img2img generation with reference images:
MPSGraphOSLog
-[MPSGraphExecutable runInternalWithDevice:commandBuffer:feeds:results:executionDescriptor:mpsGraphOwnedCommandBuffer:]
-[MPSGraph encodeToCommandBuffer:feeds:targetOperations:resultsDictionary:executionDescriptor:]
iris_gpu_linear_bf16_mpsgraph_into (iris_metal.m:5691)
iris_gpu_linear_bf16_native_into (iris_metal.m:5799)
single_block_forward_bf16 (iris_transformer_flux.c:2939)
iris_transformer_forward_bf16_flux (iris_transformer_flux.c:3217)
iris_transformer_forward_refs_flux (iris_transformer_flux.c:4040)
iris_sample_euler_refs_flux (iris_sample.c:525)
iris_img2img (iris.c:1329)
Root cause
The MPSGraph linear caches (g_linear_graph_cache_bf16 and g_linear_graph_cache, both 32 entries) use a slot-0-overwrite eviction policy. When the cache is full and a new (seq_len, in_dim, out_dim) tuple is needed, slot 0 is always overwritten:
// get_linear_graph_cache_bf16, line 4692
int slot = 0;
if (g_linear_graph_bf16_count < MAX_LINEAR_GRAPH_CACHE) {
    slot = g_linear_graph_bf16_count++;
}
The __strong ObjC fields (MPSGraph, MPSGraphTensor, NSArray shapes) are released by ARC when overwritten. But in batch mode (iris_gpu_batch_begin/end), the entire forward pass shares a command buffer chain without committing. The previously-encoded operations still reference the freed graph objects internally.
How slot 0 thrashes within a single batch
The forward pass uses these bf16-direct cache tuples (condition: out_dim >= 8192 || seq_len >= 8192):
Per double block iteration (5 total):
- (img_seq, 3072, 9216) for img MLP gate/up
- (txt_seq, 3072, 9216) for txt MLP gate/up
Per single block iteration (20 total):
- (total_seq, 3072, 27648) for fused QKV+MLP
When the cache is full from previous runs at different resolutions, all three tuples are cache misses. They cycle through slot 0 like this:
DB0: lookup (img_seq, 3072, 9216) -> MISS -> slot 0 = graph_A
DB0: lookup (txt_seq, 3072, 9216) -> MISS -> slot 0 = graph_B (graph_A freed!)
DB1: lookup (img_seq, 3072, 9216) -> MISS -> slot 0 = graph_A' (graph_B freed!)
DB1: lookup (txt_seq, 3072, 9216) -> MISS -> slot 0 = graph_B' (graph_A' freed!)
... (repeats for 5 double blocks)
SB0: lookup (total_seq, 3072, 27648) -> MISS -> slot 0 = graph_C (last txt graph from the double blocks freed!)
Each eviction releases the previous graph via ARC while the command buffer still has pending encodes that reference it. MPSGraph's internal logging (MPSGraphOSLog, top of the crash stack) accesses the freed graph, dereferences a zeroed pointer at offset 8, and crashes.
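The thrashing pattern above can be reproduced with a small stand-alone model of the cache. This is only a sketch, not the code from iris_metal.m: the names shape_key, cache_reset, cache_lookup and simulate_forward are hypothetical, the model stores shapes instead of MPSGraph objects, and it assumes total_seq = img_seq + txt_seq with img_seq != txt_seq.

```c
#include <assert.h>

#define MAX_LINEAR_GRAPH_CACHE 32

/* Hypothetical stand-in for a cache entry: only the lookup key, no graph. */
typedef struct { int seq_len, in_dim, out_dim; } shape_key;

static shape_key g_cache[MAX_LINEAR_GRAPH_CACHE];
static int g_count;      /* slots currently in use */
static int g_evictions;  /* number of times slot 0 was overwritten */

/* Reset the model and pre-fill `stale` slots with shapes no pass will request. */
void cache_reset(int stale) {
    assert(stale >= 0 && stale <= MAX_LINEAR_GRAPH_CACHE);
    g_count = stale;
    g_evictions = 0;
    for (int i = 0; i < stale; i++)
        g_cache[i] = (shape_key){ -1 - i, 0, 0 };  /* negative seq_len never matches */
}

/* Returns 1 on a hit, 0 on a miss. A miss inserts with slot-0-overwrite. */
int cache_lookup(int seq_len, int in_dim, int out_dim) {
    for (int i = 0; i < g_count; i++) {
        if (g_cache[i].seq_len == seq_len && g_cache[i].in_dim == in_dim &&
            g_cache[i].out_dim == out_dim)
            return 1;
    }
    int slot = 0;
    if (g_count < MAX_LINEAR_GRAPH_CACHE)
        slot = g_count++;
    else
        g_evictions++;  /* in the real code, ARC releases the old graph here */
    g_cache[slot] = (shape_key){ seq_len, in_dim, out_dim };
    return 0;
}

/* One forward pass: 5 double blocks, then 20 single blocks. Returns misses. */
int simulate_forward(int img_seq, int txt_seq) {
    int misses = 0;
    for (int db = 0; db < 5; db++) {
        misses += !cache_lookup(img_seq, 3072, 9216);  /* img MLP gate/up */
        misses += !cache_lookup(txt_seq, 3072, 9216);  /* txt MLP gate/up */
    }
    for (int sb = 0; sb < 20; sb++)
        misses += !cache_lookup(img_seq + txt_seq, 3072, 27648);  /* fused QKV+MLP */
    return misses;
}
```

With a full cache (cache_reset(MAX_LINEAR_GRAPH_CACHE)), one pass at a new resolution produces 11 misses, each an eviction of slot 0: ten from the alternating img/txt lookups in the double blocks and one from the first single block. With an empty cache the same pass produces 3 misses and no evictions.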
Why img2img triggers it
Each different image resolution creates unique (seq_len, in_dim, out_dim) tuples. img2img with reference images changes the effective img_seq to combined_img_seq (= target + reference tokens), producing new tuples that differ from regular txt2img. Users trying different reference image sizes fill the 32-entry cache quickly (3 bf16-direct entries per resolution, overflow after ~11 different resolutions). Once the cache is full, the next generation with a new resolution triggers the thrashing.
The same issue exists in g_linear_graph_cache (fp32-cast path, also 32 entries) and g_sdpa_graph_cache (attention path, only 8 entries).
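The "~11 different resolutions" estimate follows from integer arithmetic on the cache size. A minimal sketch, assuming every new resolution contributes exactly 3 previously unseen tuples (the function names are hypothetical):

```c
/* Number of resolutions whose tuples all fit without any eviction. */
int resolutions_that_fit(int cache_entries, int tuples_per_resolution) {
    return cache_entries / tuples_per_resolution;
}

/* 1-based index of the first resolution that forces an eviction:
 * the smallest k with k * tuples_per_resolution > cache_entries,
 * i.e. floor(cache_entries / tuples_per_resolution) + 1. */
int first_overflowing_resolution(int cache_entries, int tuples_per_resolution) {
    return (cache_entries + tuples_per_resolution) / tuples_per_resolution;
}
```

For the 32-entry bf16 cache, 10 resolutions fit (30 entries) and the 11th is the first to overwrite slot 0.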
Suggested fix
Before evicting a cache entry, flush pending GPU work so the old graph objects are no longer referenced:
int slot = 0;
if (g_linear_graph_bf16_count < MAX_LINEAR_GRAPH_CACHE) {
    slot = g_linear_graph_bf16_count++;
} else {
    // Flush pending GPU work before evicting to avoid use-after-free
    if (g_tensor_batch_mode && g_tensor_cmd) {
        if (g_tensor_cmd.status < MTLCommandBufferStatusCommitted) {
            [g_tensor_cmd commit];
        }
        [g_tensor_cmd waitUntilCompleted];
        g_tensor_cmd = [g_queue commandBuffer];
    }
}
Alternatively, increase MAX_LINEAR_GRAPH_CACHE to accommodate more resolutions (e.g., 128 or 256). This makes overflow less likely but leaves the eviction path itself unchanged.
Or retain evicted graph objects in a deferred-release list (similar to pool_flush_deferred for tensor buffers) that is only flushed at iris_gpu_batch_end.
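The deferred-release option could look roughly like the following. This is a plain-C sketch under stated assumptions, not the project's code: deferred_list, deferred_push, deferred_flush and counting_release are hypothetical names, and opaque pointers with a release callback stand in for the __strong ObjC references that the real cache would move into the list instead of overwriting.

```c
#include <stddef.h>

#define MAX_DEFERRED_GRAPHS 64  /* hypothetical capacity */

typedef void (*release_fn)(void *obj);

typedef struct {
    void *objs[MAX_DEFERRED_GRAPHS];
    release_fn release;  /* called once per parked object at flush time */
    int count;
} deferred_list;

/* Park an evicted object instead of releasing it immediately.
 * Returns 0 on success, -1 if the list is full; in that case the caller
 * would have to fall back to flushing GPU work before releasing. */
int deferred_push(deferred_list *d, void *obj) {
    if (d->count >= MAX_DEFERRED_GRAPHS)
        return -1;
    d->objs[d->count++] = obj;
    return 0;
}

/* Release every parked object. Intended to run only after the batch's
 * command buffer has completed. Returns the number of objects released. */
int deferred_flush(deferred_list *d) {
    int n = d->count;
    for (int i = 0; i < n; i++) {
        d->release(d->objs[i]);
        d->objs[i] = NULL;
    }
    d->count = 0;
    return n;
}

/* Stand-in release callback for experiments: it only counts calls. */
static int g_release_calls;
void counting_release(void *obj) {
    (void)obj;
    g_release_calls++;
}
```

In this scheme the eviction path would call deferred_push with the old entry's objects, and iris_gpu_batch_end would call deferred_flush once the GPU work is done, so nothing is freed while encoded operations may still reference it.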