I have been experiencing issues achieving timing closure on all direct-HBM designs. This affects all designs that use sp=<x>:HBMn, including example 05_perf. Any thoughts or insights on this would be greatly appreciated.
Observations
Direct HBM (non-VNOC/MEM) connects an AXI master directly to the HBM NMU. In every such design, the linking stage fails post-route timing on the platform's 400 MHz clock, and no change to the user/RM logic can prevent it. The build still completes and writes a .vbin without any warning or error, and many designs show no evidence of data corruption. However, some designs that were validated in simulation have shown non-deterministic data corruption, so the timing violation cannot be assumed to be benign.
The following data is from a minimal direct-HBM vector-add kernel. It has three AXI4 master ports, each pinned to its own HBM channel.
Timing report
Every failing path is a register in the user region driving the static HBM NoC Master Unit (NMU) AXI slave, or the same crossing in the other direction. More specifically, it's the per-channel HBM SmartConnect master output driving axi_noc_cips/HBMxx_AXI. The SmartConnect crosses clock domains from the user region (on the user clock) up to the fixed, non-reconfigurable 400 MHz static-shell clock for the HBM NMU (clk_wizard_0/clk_out1, exported as static_region_clk, timing name clk_wizard_0_clk_out1_1).
The important detail is that there is no logic after the SmartConnect's output. Routing accounts for 96.7% of the data path delay, and the path crosses from the reconfigurable partition into the static shell.
Why it can't be fixed from user logic
- Lowering the user clock can't help because the failing path is on the non-configurable 400 MHz clock.
- A pipeline register can't be placed closer to the NMU. Firstly, the NMU's clock region has no general fabric / SLICE sites. Secondly, the nearest fabric below is not inside the SLASH pblock, so registers cannot be placed there.
- Even if the pblock were closer, the data shows that the route delay into the hardened NMU is very similar across the failing endpoints, even when distance varies, which suggests that there's some intrinsic fixed delay incurred from crossing out of the fabric into the hardened block.
The highlighted path is the worst timing violation.
Notes
With the current floorplanning, the static region clock would need to run at about 325 MHz to meet timing. Currently, the only way to meet timing on HBM designs is to use an sp=MEM allocator, which routes through the user-clock VNOC ingresses and completes the clock crossing on the hardened NoC block. However, this causes a large performance drop below 8 channels, and a massive performance drop past that (as the VNOC ingresses must be multiplexed).
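For reference, here is a rough sketch of what the 325 MHz figure implies in terms of slack. It assumes the usual first-order estimate (achievable period = target period minus worst negative slack). The function names are my own, and the slack it prints is derived from the two frequencies above, not read from the timing report.

```python
def period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def implied_wns_ns(target_mhz: float, achievable_mhz: float) -> float:
    """Worst negative slack implied by an achievable frequency.

    Assumption: achievable period = target period - WNS, so
    WNS = target period - achievable period (negative when timing fails).
    """
    return period_ns(target_mhz) - period_ns(achievable_mhz)


if __name__ == "__main__":
    target = 400.0      # fixed static-shell clock from the post (MHz)
    achievable = 325.0  # approximate frequency that would meet timing (MHz)
    print(f"target period:     {period_ns(target):.3f} ns")
    print(f"achievable period: {period_ns(achievable):.3f} ns")
    print(f"implied WNS:       {implied_wns_ns(target, achievable):.3f} ns")
```

Under that assumption, the worst path is short by a bit under 0.6 ns against a 2.5 ns period, which is a large fraction to recover on a path that is almost entirely routing.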