Skip to content

feat[next-dace]: Use SDFG library node for lowering of broadcast and reduce#2386

Draft
edopao wants to merge 97 commits into
GridTools:mainfrom
edopao:dace-fill_node
Draft

feat[next-dace]: Use SDFG library node for lowering of broadcast and reduce#2386
edopao wants to merge 97 commits into
GridTools:mainfrom
edopao:dace-fill_node

Conversation

@edopao
Copy link
Copy Markdown
Contributor

@edopao edopao commented Nov 11, 2025

TODO:

  • Run ICON4Py CI, see ICON4Py PR#1240
  • Run Blueline (there are some degradation but in general it works)
  • Run MuPhys (just to be sure)

Copy link
Copy Markdown
Contributor

@philip-paul-mueller philip-paul-mueller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some refinements needed.

Comment thread src/gt4py/next/program_processors/runners/dace/sdfg_library_nodes.py Outdated


@dace_library.node
class Fill(dace_nodes.LibraryNode):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add some more semantic, i.e. an input connector, that collects the value that should be broadcasted and an output connector for the output.

I am also wondering if it would make sense to have two different library nodes.
One where the value that is broadcast is a literal, like 0.0 and one, which is probably the current one, where the value is read from another data descriptor (might be hard to integrate into the lowering).

Comment thread src/gt4py/next/program_processors/runners/dace/sdfg_library_nodes.py Outdated
Comment thread src/gt4py/next/program_processors/runners/dace/sdfg_library_nodes.py Outdated
Comment thread src/gt4py/next/program_processors/runners/dace/sdfg_library_nodes.py Outdated
Comment thread src/gt4py/next/program_processors/runners/dace/sdfg_library_nodes.py Outdated
Comment thread src/gt4py/next/program_processors/runners/dace/sdfg_library_nodes.py Outdated
Comment thread src/gt4py/next/program_processors/runners/dace/sdfg_library_nodes.py Outdated
Comment thread src/gt4py/next/program_processors/runners/dace/sdfg_library_nodes.py Outdated
Comment thread src/gt4py/next/program_processors/runners/dace/sdfg_library_nodes.py Outdated
@philip-paul-mueller
Copy link
Copy Markdown
Contributor

@edopao
I am not sure if we should add the transformations we need already in this PR or in a later one.
If we put it in a later one, we should patch the optimizer to expand the node right at the beginning, this way we preserve the current behaviour and performance.

Comment thread src/gt4py/next/program_processors/runners/dace/gtir_to_sdfg_primitives.py Outdated
@edopao
Copy link
Copy Markdown
Contributor Author

edopao commented Nov 24, 2025

cscs-ci run default

@edopao
Copy link
Copy Markdown
Contributor Author

edopao commented Nov 24, 2025

cscs-ci run default

@edopao
Copy link
Copy Markdown
Contributor Author

edopao commented Nov 24, 2025

cscs-ci run default

@edopao
Copy link
Copy Markdown
Contributor Author

edopao commented Dec 10, 2025

No plan for now to integrate this feature.

Copy link
Copy Markdown
Contributor Author

@edopao edopao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good, just some minor comments.

Comment thread src/gt4py/next/program_processors/runners/dace/library_nodes/broadcast.py Outdated
```python
for i in range(len(broadcast_in_dim):
assert output.shape[broadcast_in_dim[i]] == value_to_broadcast.shape[i]
```
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
```
```
In other words, the result array shape has the same size as the broadcast domain.

Comment thread src/gt4py/next/program_processors/runners/dace/library_nodes/broadcast.py Outdated
Comment thread src/gt4py/next/program_processors/runners/dace/library_nodes/broadcast.py Outdated

Todo:
- While for the output it is probably okay to always require an adjacent
AccessNode for the input it might be possible to be on the other side
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
AccessNode for the input it might be possible to be on the other side
AccessNode, the input nodes might be outside a map scope.

However, I don't understand how this could happen.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment dates back when there was still the .value attribute.

Comment thread src/gt4py/next/program_processors/runners/dace/transformations/broadcast.py Outdated
Comment thread src/gt4py/next/program_processors/runners/dace/transformations/broadcast.py Outdated
# Check single use data if it was not known at the beginning.
if self._single_use_data is None:
find_single_use_data = dace_analysis.FindSingleUseData()
single_use_data = find_single_use_data.apply_pass(sdfg, None)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be wrong to now store single_use_data? I am asking because it is used again inside apply().

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No because, there is no guarantee that between now and later there is a change that makes data no longer single use.
By passing it on constructor the user kinds of guarantee that it is safe.

Comment thread src/gt4py/next/program_processors/runners/dace/transformations/auto_optimize.py Outdated
# probably yes, as we can remove the read and write of the initial data
# only the write to final destination is left. If the consumers are Maps
# the thing is a bit different. As we have to create the intermediate
# allocation. If the read of the memory is okay the `InlineBroadcastAccess`
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

InlineBroadcastAccess does not exist yet.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that was the old name.
Then I removed it to Scalar..., because I thought that I onlyneed to handle scalars which was wrong.
So there is a todo to rename it, I think I will now switch to the old name.

Copy link
Copy Markdown
Contributor Author

@edopao edopao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good, just some minor comments.

No unit test yet and also it has thr wrong name.
…nto simplify.

Also expanding is now happening at the very end.
They are now conditionally expanded in the auto optimizer if splitting is enabled and no broadcast related transformation has been applied.
- Added a custom `free_symbols()` function to the library node to ensure that the symbols in `params` are not found
- Added validation to ensure that the name and lable attribute of the node are always in sync.
I assume that some map (horizontal) fusion transformations runs amock.
Probably there needs to be a better place.
Before I was thinking that the issue is with the fusion that the splitter performs.
But now I think it is earlier, i.e. Parallel Mapfusion has laready killed them.
Maybe it is a good idea to make the transformation split, i.e. map inlining is distinct from AccessNode replacement.
Now the expansion of teh broadcast node takes place after the first fix point was found.
I think this is the best idea, but I am not fully sure.
@philip-paul-mueller
Copy link
Copy Markdown
Contributor

Here are the newest data from 8edefc0 which show that the speed is now comparable (at least is much better than before).
If I look at my experiences with the compute_advection_in_vertical then the 3% degradation we observes are most likely caused by some broadcast loops that were (for whatever reasons) not integrated into other kernels.

bench_blueline_stencil_compute

@philip-paul-mueller
Copy link
Copy Markdown
Contributor

My current guess of why they are not integrated is because of their range.
In this PR the ranges are based on the size of the target location, i.e. the range is [0, size_of_patch_to_broadcast), but before the range was based on the grid coordinates that were written, i.e. the range was [136, 2000).
Furthermore, the Map splitter currently only consider the range and not the sizes.

I think the current design (with size) is better, because it makes handling the nodes much simpler and they contain less state (ideally they would be stateless, but this is not possible).
Thus, we should update the splitting tools to consider the size of the iteration spaces instead of their size.

@edopao
Copy link
Copy Markdown
Contributor Author

edopao commented May 26, 2026

I have moved the library node for reduction with skip values to a separate PR #2603

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants