EXC_BAD_ACCESS in MPSGraph encodeToCommandBuffer during img2img (graph cache use-after-free)

Using commit https://github.com/antirez/iris.c/commit/9873887d4aa0646c650adc5b86b986d2f653b7e0

## Crash

EXC_BAD_ACCESS (KERN_INVALID_ADDRESS at 0x8) during img2img generation with reference images:

```
MPSGraphOSLog
-[MPSGraphExecutable runInternalWithDevice:commandBuffer:feeds:results:executionDescriptor:mpsGraphOwnedCommandBuffer:]
-[MPSGraph encodeToCommandBuffer:feeds:targetOperations:resultsDictionary:executionDescriptor:]
iris_gpu_linear_bf16_mpsgraph_into (iris_metal.m:5691)
iris_gpu_linear_bf16_native_into (iris_metal.m:5799)
single_block_forward_bf16 (iris_transformer_flux.c:2939)
iris_transformer_forward_bf16_flux (iris_transformer_flux.c:3217)
iris_transformer_forward_refs_flux (iris_transformer_flux.c:4040)
iris_sample_euler_refs_flux (iris_sample.c:525)
iris_img2img (iris.c:1329)
```

## Root cause

The MPSGraph linear caches (`g_linear_graph_cache_bf16` and `g_linear_graph_cache`, 32 entries each) evict by always overwriting slot 0. Once a cache is full, every lookup for a new `(seq_len, in_dim, out_dim)` tuple replaces the entry in slot 0:

```c
// get_linear_graph_cache_bf16, line 4692
int slot = 0;
if (g_linear_graph_bf16_count < MAX_LINEAR_GRAPH_CACHE) {
    slot = g_linear_graph_bf16_count++;
}
```

The `__strong` ObjC fields of the overwritten entry (MPSGraph, MPSGraphTensor, NSArray shapes) are released by ARC at that point. In batch mode (`iris_gpu_batch_begin`/`iris_gpu_batch_end`), however, the entire forward pass shares a command buffer chain that is not committed until the batch ends, so operations that were encoded earlier but have not yet run still reference the freed graph objects internally.

## How slot 0 thrashes within a single batch

The forward pass uses these bf16-direct cache tuples (path taken when `out_dim >= 8192 || seq_len >= 8192`):

Per double block iteration (5 total):
- (img_seq, 3072, 9216) for img MLP gate/up
- (txt_seq, 3072, 9216) for txt MLP gate/up

Per single block iteration (20 total):
- (total_seq, 3072, 27648) for fused QKV+MLP

When the cache is full from previous runs at different resolutions, all three tuples are cache misses. They cycle through slot 0 like this:

```
DB0: lookup (img_seq, 3072, 9216)  -> MISS -> slot 0 = graph_A
DB0: lookup (txt_seq, 3072, 9216)  -> MISS -> slot 0 = graph_B  (graph_A freed!)
DB1: lookup (img_seq, 3072, 9216)  -> MISS -> slot 0 = graph_A' (graph_B freed!)
DB1: lookup (txt_seq, 3072, 9216)  -> MISS -> slot 0 = graph_B' (graph_A' freed!)
... (repeats for 5 double blocks)
SB0: lookup (total_seq, 3072, 27648) -> MISS -> slot 0 = graph_C (graph_B'' freed!)
```

Each eviction releases the previous graph via ARC while the command buffer still has pending encodes that reference it. MPSGraph's internal logging (`MPSGraphOSLog`, top of the crash stack) then accesses the freed graph and reads through a zeroed pointer at offset 8, which matches the fault address 0x8 in the crash report.
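The sequence above can be replayed with a small standalone model. This is only a sketch, not code from iris.c: the `sim_*` names are hypothetical, the stale entries are placeholders, and it assumes `total_seq = img_seq + txt_seq` and that the cache is already full before the pass starts.

```c
#include <stdbool.h>

#define MAX_LINEAR_GRAPH_CACHE 32

/* Cache key, mirroring the (seq_len, in_dim, out_dim) tuple from the issue. */
typedef struct { int seq_len, in_dim, out_dim; } shape_key;

/* Hypothetical model of the cache: keys only, no real MPSGraph objects. */
typedef struct {
    shape_key keys[MAX_LINEAR_GRAPH_CACHE];
    bool from_batch[MAX_LINEAR_GRAPH_CACHE]; /* entry was built during this pass */
    int count;
    int misses;         /* lookups that had to build a new graph */
    int in_batch_frees; /* evictions of a graph built earlier in this same pass */
} sim_cache;

static bool key_eq(shape_key a, shape_key b) {
    return a.seq_len == b.seq_len && a.in_dim == b.in_dim && a.out_dim == b.out_dim;
}

/* Lookup with the slot-0-overwrite policy described above. */
static void sim_lookup(sim_cache *c, shape_key k) {
    for (int i = 0; i < c->count; i++) {
        if (key_eq(c->keys[i], k)) return; /* hit: nothing is evicted */
    }
    c->misses++;
    int slot = 0;
    if (c->count < MAX_LINEAR_GRAPH_CACHE) {
        slot = c->count++;
    } else if (c->from_batch[0]) {
        /* Slot 0 holds a graph that pending encodes may still reference. */
        c->in_batch_frees++;
    }
    c->keys[slot] = k;
    c->from_batch[slot] = true;
}

/* Replay one forward pass (5 double blocks, 20 single blocks) on a full cache. */
static sim_cache sim_forward_pass(int img_seq, int txt_seq) {
    sim_cache c = {0};
    for (int i = 0; i < MAX_LINEAR_GRAPH_CACHE; i++) {
        /* Stale entries from earlier runs; negative seq_len never matches. */
        c.keys[i] = (shape_key){ -1 - i, 3072, 9216 };
    }
    c.count = MAX_LINEAR_GRAPH_CACHE;

    shape_key img   = { img_seq, 3072, 9216 };
    shape_key txt   = { txt_seq, 3072, 9216 };
    shape_key fused = { img_seq + txt_seq, 3072, 27648 }; /* assumed total_seq */

    for (int db = 0; db < 5; db++) {
        sim_lookup(&c, img);
        sim_lookup(&c, txt);
    }
    for (int sb = 0; sb < 20; sb++) {
        sim_lookup(&c, fused);
    }
    return c;
}
```

Under these assumptions, calling `sim_forward_pass(4096, 512)` records 11 misses, and 10 of the evictions discard a graph that was created earlier in the same pass. Only the single-block tuple settles, because nothing else competes for slot 0 after the first single block.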

## Why img2img triggers it

Each distinct image resolution creates unique `(seq_len, in_dim, out_dim)` tuples. img2img with reference images changes the effective `img_seq` to `combined_img_seq` (= target + reference tokens), producing tuples that differ from those of regular txt2img. Users trying different reference image sizes fill the 32-entry cache quickly: at 3 bf16-direct entries per resolution, the cache overflows at around the 11th distinct resolution. Once the cache is full, the next generation with a new resolution triggers the thrashing.

The same issue exists in `g_linear_graph_cache` (fp32-cast path, also 32 entries) and `g_sdpa_graph_cache` (attention path, only 8 entries).
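The overflow point follows directly from the cache size. As a minimal sketch (the helper name is hypothetical, and it assumes every resolution contributes the same number of previously unseen tuples):

```c
/* 1-based index of the first resolution whose tuples can no longer all be
 * stored without evicting an existing entry. */
static int first_overflowing_resolution(int capacity, int tuples_per_resolution) {
    return capacity / tuples_per_resolution + 1;
}
```

With 32 slots and 3 tuples per resolution, 10 resolutions occupy 30 slots, and the 11th needs 3 more but only 2 remain, so `first_overflowing_resolution(32, 3)` is 11.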

## Suggested fix

Before evicting a cache entry, flush pending GPU work so the old graph objects are no longer referenced:

```c
int slot = 0;
if (g_linear_graph_bf16_count < MAX_LINEAR_GRAPH_CACHE) {
    slot = g_linear_graph_bf16_count++;
} else {
    // Flush pending GPU work before evicting to avoid use-after-free
    if (g_tensor_batch_mode && g_tensor_cmd) {
        if (g_tensor_cmd.status < MTLCommandBufferStatusCommitted) {
            [g_tensor_cmd commit];
        }
        [g_tensor_cmd waitUntilCompleted];
        g_tensor_cmd = [g_queue commandBuffer];
    }
}
```

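Alternatively, increase `MAX_LINEAR_GRAPH_CACHE` to accommodate more resolutions (e.g., 128 or 256). This makes the overflow rarer but does not remove it, since any fixed-size cache can still fill up.

A third option is to retain evicted graph objects in a deferred-release list (similar to `pool_flush_deferred` for tensor buffers) that is only flushed at `iris_gpu_batch_end`.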
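The deferred-release idea can be sketched in plain C as follows. This is an illustrative model only: `deferred_list`, `deferred_push`, `deferred_drain`, and `count_release` are hypothetical names, not existing iris.c functions, and the objects are opaque pointers rather than real MPSGraph instances. In the actual ARC code, a strong container such as an `NSMutableArray` that is emptied at batch end could play the same role.

```c
#include <stddef.h>

#define MAX_DEFERRED 64

typedef void (*release_fn)(void *obj);

/* Evicted objects wait here until the GPU can no longer reference them. */
typedef struct {
    void *objs[MAX_DEFERRED];
    release_fn release; /* called once per object when the list is drained */
    size_t count;
} deferred_list;

/* Park an evicted object instead of releasing it immediately.
 * Returns 0 on success, -1 if the list is full; in that case the caller
 * would have to finish pending GPU work and drain before retrying. */
static int deferred_push(deferred_list *d, void *obj) {
    if (d->count == MAX_DEFERRED) return -1;
    d->objs[d->count++] = obj;
    return 0;
}

/* Release everything that was parked. Intended to be called only after the
 * batch's command buffer has completed. Returns the number released. */
static size_t deferred_drain(deferred_list *d) {
    size_t n = d->count;
    for (size_t i = 0; i < n; i++) {
        d->release(d->objs[i]);
        d->objs[i] = NULL;
    }
    d->count = 0;
    return n;
}

/* Stand-in release callback for demonstration: counts calls only. */
static int g_released = 0;
static void count_release(void *obj) {
    (void)obj;
    g_released++;
}
```

In this model the eviction path would call `deferred_push` with the old slot-0 entry, and the batch-end path would call `deferred_drain` once the command buffer has finished. Unlike the flush-on-evict approach, this avoids stalling the GPU in the middle of a forward pass, at the cost of keeping evicted graphs alive slightly longer.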
