Update code for custom op at unitaryHack2026 by thedaemon-wizard · Pull Request #4693 · NVIDIA/cuda-quantum

thedaemon-wizard · 2026-06-07T07:52:20Z

[custom op] Support unitary synthesis for 3+ qubit operations (QSD)

Summary

The unitary-synthesis optimization pass
(lib/Optimizer/Transforms/UnitarySynthesis.cpp) decomposes the matrix of a
custom operation (cudaq.register_operation / CUDAQ_REGISTER_OPERATION) into a
sequence of native gates. Until now it only handled 1-qubit (ZYZ) and
2-qubit (KAK) operations; any custom operation acting on 3 or more qubits
(e.g. a Toffoli) failed to legalize quake.custom_unitary_constant when targeting
hardware backends such as Quantinuum.

This PR adds a general recursive Quantum Shannon Decomposition (QSD) so that
a 2^n x 2^n custom unitary can be synthesized for 3 ≤ n ≤ 5 (matrix
dimension 8–32), including multi-controlled operations such as a 4-qubit cccx
and 5-qubit c4x. The existing TwoQubitOpKAK (n = 2) and OneQubitOpZYZ
(n = 1) decomposers are reused as the recursion base cases; the 1-qubit base
case's Y-angle extraction is made numerically robust (atan2 of magnitudes
instead of acos/asin) so the near-degenerate sub-blocks that arise deep in
the recursion are handled correctly. Per review, the range is capped at 5 qubits
(the 4^n CNOT growth makes 6q+ impractical); larger power-of-two dimensions are
left unchanged in a composable way (the pass emits a warning / LLVM_DEBUG
note and does not modify the IR — no emitError, no signalPassFailure). To
handle the advertised 3–5 qubit range predictably, the always-on
reconstruction self-checks use a tolerance (RECON_TOL = 1e-5) that is separate
from, and looser than, the tight input-unitarity contract (TOL = 1e-7). The
split is calibrated to measured behavior, not chosen arbitrarily: a correctly
synthesized unitary reconstructs to ~1e-10 when well-conditioned, while rare
near-degenerate sub-blocks push the residual up to ~8e-7 even though the emitted
circuit still reproduces the target to <4e-7 end-to-end; a genuinely wrong
decomposition reconstructs to O(0.1). RECON_TOL therefore sits ~1 order above
the measured correct-synthesis residual and ~4 orders below the failure regime.
With this split, n=3 and n=4 synthesize 100% and n=5 synthesizes 99.9% of
Haar-random unitaries (all emitted circuits correct); the rare residual outlier
that still exceeds RECON_TOL is left unchanged with a warning rather than
emitting a wrong circuit (see Verification).

Algorithm

One level of QSD on an n-qubit unitary U (acting on a most-significant
multiplexor qubit q0 and n-1 lower qubits) performs:

Cosine-Sine Decomposition (CSD) (arXiv quant-ph/0404089)
U = blockDiag(L0, L1) · [[C, -S], [S, C]] · blockDiag(R0, R1).
Eigen has no built-in CSD, so it is assembled from the SVD of the top-left
block. Degenerate singular values (e.g. for permutation operators like
Toffoli) are handled by deterministically completing the singular-vector
basis. The central [[C,-S],[S,C]] factor is a uniformly-controlled Ry on
q0 with angles 2·atan2(s_k, c_k).
Multiplexor demultiplexing (arXiv quant-ph/0406176)
blockDiag(A, B) = (I ⊗ V) · blockDiag(D, D†) · (I ⊗ W).
A·B† is unitary (hence normal); its complex Schur form yields an orthonormal
eigenbasis. blockDiag(D, D†) is a uniformly-controlled Rz on q0. This is
applied to both blockDiag(R0,R1) and blockDiag(L0,L1), producing four
(n-1)-qubit sub-unitaries and three angle vectors (Rz, Ry, Rz).
Uniformly-controlled rotation emission via the optimal Möttönen gray-code
construction (arXiv quant-ph/0407010, quant-ph/0406176): a k-controlled
multiplexed rotation is emitted as 2^k rotations interleaved with exactly
2^k CNOTs. The per-state angles are mapped to the gray-code rotation angles
through the Walsh-Hadamard-like transform, and the CNOT after rotation i
targets the control flipped between gray-code words i and i+1.

The four sub-unitaries are synthesized recursively; each factorization step is
exact, so no extra global-phase correction is required at the QSD level (the base
cases track their own global phase).

CNOT cost and the gray-code (SBM) optimization

emitMux uses the optimal gray-code construction emitting exactly 2^k CNOTs
per k-controlled multiplexor, instead of the 2^{k+1}-2 produced by a naive
recursive split (which inserts a redundant CNOT pair at every recursion
boundary). For a 3-qubit decomposition the three top-level multiplexors each have
k = 2 controls, contributing 3 x 4 = 12 CNOTs — down from 3 x 6 = 18. This
is the Shende-Bullock-Markov adjacent-CNOT merge specialized to a single
uniformly-controlled rotation. Further reductions for general n are left as
future work: the Block-ZXZ construction (arXiv:2403.13692) currently holds the
lowest known CNOT count (e.g. 19 vs. 20 for n = 3) and would be a natural
follow-up baseline.

Files changed

File	Change
`lib/Optimizer/Transforms/UnitarySynthesis.cpp`	Add `CSDComponents`, `completeUnitaryBasis`, `cosineSineDecomposition`, `demultiplex`, and `NQubitOpQSD` (with the optimal gray-code `emitMux` for uniformly-controlled rotations); dispatch `2^n` (`3 <= n <= 5`) dimensions in `CustomUnitaryPattern`. Make the 1-qubit base case numerically robust: the Y-rotation angle is `2·atan2(\|u01\|,\|u00\|)` instead of `acos(\|u00\|)`/`asin(\|u01\|)`, which are ill-conditioned near ±1 and return `NaN` for the near-degenerate sub-blocks (controlled single-qubit gates) produced deep in the recursion — this is what makes multi-controlled ops like a 4-qubit `cccx` synthesize correctly. Reconstruction self-checks are always-on and composable: on failure (or a power-of-two dimension above the 5-qubit cap) the pass leaves the `quake.custom_unitary_constant` op unchanged and emits a warning / `LLVM_DEBUG` note instead of asserting or failing the pass
`python/tests/custom/test_qsd_decomposition.py`	New 3-qubit (8×8) random-unitary test (`scipy.stats.unitary_group.rvs(8, random_state=13)`, transcribed at full double precision): builds a flat `synth_kernel8` (the exact `cudaq-opt --unitary-synthesis` output) and asserts the synthesized state matches the direct custom-op state. (4q/5q coverage lives in the execution targettests below)
`test/Transforms/UnitarySynthesis/random_unitary-5.qke`	New: FileCheck structural test of the QSD lowering through the combined `--unitary-synthesis --canonicalize --apply-op-specialization --aggressive-inlining` pipeline, with ordered `CHECK:` lines locking in the fully-lowered gray-code multiplexor emission sequence
`targettests/execution/custom_operation_cccx.cpp`	New: 4-qubit C3X permutation executed on the default target and `--target quantinuum --emulate`; `// CHECK: { 1110:1000 }`
`targettests/execution/custom_operation_c4x.cpp`	New: 5-qubit C4X permutation (top of the supported range) on both targets; `// CHECK: { 11110:1000 }`
`python/tests/backends/test_Quantinuum_LocalEmulation_kernel.py`	`test_3q_unitary_synthesis` updated from expecting a `RuntimeError` to expecting a successful Toffoli (`"110"`)
`targettests/execution/custom_operation_toffoli.cpp`	Quantinuum `--emulate` RUN line now expects success (`110`) instead of the legalization failure
`test/Transforms/UnitarySynthesis/{bell_pair,random_unitary-0,random_unitary-1}.qke`	One `ry` angle `CHECK` constant updated in each (shift in the ~1e-8 digits) — a consequence of the robust `atan2` base case on these truncated-literal (not perfectly unitary) inputs

Toffoli custom-op execution coverage is also provided by the existing
python/tests/custom/test_custom_operations.py::test_three_qubit_op (registers the
8×8 permutation and samples "110"); that file is not modified by this PR.

Test results

C++ FileCheck (MLIR pass): 9/9 PASS. All test/Transforms/UnitarySynthesis/*.qke
pass with cudaq-opt built against LLVM 22, including the new random_unitary-5.qke
and the 8 pre-existing ZYZ/KAK tests (no regression).
Composability (graceful failure) verified: feeding a reconstruction-tripping
8×8 and a power-of-two dim = 64 (6-qubit) op through
cudaq-opt --unitary-synthesis --debug-only=unitary-synthesis leaves the
quake.custom_unitary_constant op unchanged and prints the LLVM_DEBUG note,
with no abort and no signalPassFailure — the pass stays composable.
Gray-code CNOT reduction confirmed: the 3-qubit top-level multiplexor kernel
emits 12 quake.x (3 multiplexors x 4), down from 18 with the naive
recursive split — measured directly from the lowered IR.
Gray-code emission validated numerically: for k = 1..4 controls and both
Ry/Rz, the net rotation applied for every control basis state matches the
target angle to < 1e-12, confirming the angle transform and control mapping.
C++ execution (full build, python ON + cuQuantum), default target and
--target quantinuum --emulate:
custom_operation_toffoli.cpp → 110,
custom_operation_cccx.cpp (4q C3X) → { 1110:1000 },
custom_operation_c4x.cpp (5q C4X) → { 11110:1000 }
(the Quantinuum path runs the --unitary-synthesis QSD legalization; the 4q/5q
multi-controlled cases previously produced a wrong circuit and now pass).
Synthesis reliability (Monte-Carlo of 1000 Haar-random unitaries per n
through cudaq-opt --unitary-synthesis, with the RECON_TOL/TOL split):
n=3 0/1000 and n=4 0/1000 bail (every random unitary synthesizes; worst
end-to-end reconstruction 3.7e-12 at n=3, 1.7e-8 at n=4). At n=5 999/1000
synthesize (99.9%); every emitted circuit is correct (worst end-to-end
reconstruction 1.2e-6, zero circuits wrong by more than 1e-5). The single
remaining n=5 outlier whose residual still exceeds RECON_TOL bails
gracefully — visible warning, op left unchanged, never a wrong circuit.
Instrumenting the C++ Eigen residuals confirms the distribution that motivates
RECON_TOL: per-node reconstruction residual peaks in the CSD/demultiplex
blocks at ~8e-7 (well-conditioned nodes are ~1e-10), whereas a genuinely wrong
decomposition reconstructs to O(0.1) — so RECON_TOL = 1e-5 accepts the
borderline-but-correct tail while still rejecting broken output by ~4 orders of
margin. This is the calibrated bound, not a blanket loosening: the
input-unitarity contract and the numerical-safety guards (division-by-near-zero,
Gram-Schmidt independence) stay at the tight TOL = 1e-7.
Python: pytest python/tests/custom/test_qsd_decomposition.py -v → 1/1 PASS
(8×8 random). The test asserts the synthesized synth_kernel8 state equals the
direct custom-op state (np.allclose(..., atol=1e-6)) and both match the
matrix' first column. Toffoli custom-op execution is covered by
test_custom_operations.py::test_three_qubit_op ("110").
Python: pytest python/tests/backends/test_Quantinuum_LocalEmulation_kernel.py -k unitary_synthesis -v → 3/3 PASS
(test_1q/2q/3q_unitary_synthesis); test_3q_unitary_synthesis now expects a successful Toffoli
("110") instead of the former RuntimeError.
Regression: pytest test_kak_decomposition.py test_euler_decomposition.py → 5/5 PASS;
all 9 UnitarySynthesis/*.qke FileCheck tests pass (the 8 pre-existing ZYZ/KAK
plus the new random_unitary-5.qke; three pre-existing files had one ry-angle
CHECK constant updated for the robust atan2 base case).
Composability: a power-of-two dim = 64 (6-qubit) op fed through
cudaq-opt --unitary-synthesis is left unchanged with returncode 0 — no
abort, no signalPassFailure.

AI usage disclosure

In accordance with the unitaryHACK AI policy, an AI assistant (Claude) was used as
a co-pilot for: brainstorming the CSD/demultiplexing algorithm structure,
drafting test scaffolding and docstrings, and cross-checking the math against the
referenced papers. The QSD math was independently prototyped in NumPy/SciPy and
all code was read, understood, and verified by a human contributor, who can
explain every part of the implementation. No unexecuted or hallucinated APIs were
committed; the implementation builds and passes the local test suite.

Sign-off

Commits are signed off per the Developer Certificate of Origin
(git commit -s), as required by Contributing.md.

copy-pr-bot · 2026-06-07T07:52:23Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

schweitzpgi

Just a few questions to improve my understanding.

schweitzpgi · 2026-06-08T16:42:03Z

+/// such as Toffoli.
+Eigen::MatrixXcd completeUnitaryBasis(const Eigen::MatrixXcd &defined, int n) {
+  Eigen::MatrixXcd basis = defined;
+  int col = static_cast<int>(basis.cols());


This looks like an unneeded truncating type cast. Why not just use std::size_t?

Thanks — fixed. The truncating cast is gone. .cols() already returns
Eigen::Index (the natural signed index type of Eigen containers), so the
function now takes and uses Eigen::Index throughout instead of int:

Eigen::MatrixXcd completeUnitaryBasis(const Eigen::MatrixXcd &definedCols, Eigen::Index n) { Eigen::MatrixXcd basis = definedCols; Eigen::Index col = basis.cols(); for (Eigen::Index i = 0; i < n && col < n; ++i) { ... }

I preferred Eigen::Index over std::size_t so the loop/comparison stays in
Eigen's own (signed) index type and avoids any signed/unsigned mix with Eigen
indexing.

schweitzpgi · 2026-06-08T16:46:25Z

+      r1.row(k) = fromS.row(k) / s(k);
+  }
+#ifndef NDEBUG
+  Eigen::MatrixXcd cMat = c.cast<std::complex<double>>().asDiagonal();


Can we rely on this always being/requiring double precision? Some backends are single precision.

The decomposition math (SVD / Schur / CSD) runs in host double purely for numerical stability of the classical factorization; it is independent of the backend's execution precision. The emitted rotation angles are f64 constants (rewriter.getF64Type()) and the existing downstream lowering converts them to the target's float width (fp32/fp64) exactly as the pre-existing KAK/ZYZ paths already do — so this PR introduces no new backend-precision assumption. The double here is the host-side compile-time arithmetic, not a runtime commitment. (This block is also the one rewritten under comment #4 below.)

Thanks, this explanation makes sense to me. I agree that the SVD/Schur/CSD work here is host-side compile-time math, and using double for that factorization is consistent with the existing ZYZ/KAK decomposition paths.

I’ll defer to @schweitzpgi on whether we need any additional backend-precision guard here.

Thanks for accepting the host-double explanation. No code change here; happy to add a backend-precision guard later if @schweitzpgi wants one (follow-up).

khalatepradnya · 2026-06-08T17:11:33Z

Thank you, @thedaemon-wizard , for the contribution!

I will review the PR today. A couple of things on preliminary read:

Can you run the code formatter to fix the CI failure on Basic content checks? I find it it easiest to run the pre-commit checks as pre-commit run --all-files --hook-stage pre-push
The PR description mentions:

Further reductions for general n (Block-ZXZ, arXiv:2403.13692) are documented as future work in ANALYSIS_2026.md.

Either this is stale and should be removed, or it's missing ANALYSIS_2026.md?
Please update.

khalatepradnya

@thedaemon-wizard - Really nice work! 👏🏽 Let's polish this further before we can merge.

khalatepradnya · 2026-06-09T00:01:40Z

/ok to test 898a53f

Command Bot: Processing...

thedaemon-wizard · 2026-06-09T00:16:19Z

Hello @khalatepradnya, @schweitzpgi
Thank you for your review and feedback.
I’ll look into this. Please give me a moment while I review your comments and make the necessary code changes.
I look forward to continuing to work with you.

thedaemon-wizard · 2026-06-09T00:38:33Z

Either this is stale and should be removed, or it's missing ANALYSIS_2026.md?
Please update.

Hello, @khalatepradnya ,
Thank you for your comment.
Since ANALYSIS_2026.md was created separately from the actual implementation for debugging purposes, I have removed the message from it.

I am currently reviewing the other changes as well.
Thank you for your cooperation.

github-actions · 2026-06-09T00:46:01Z

CI Summary (`push`) — ✅ passed

Run #27291777550 · ✅ 6 · ⏩ 7 · ❌ 0 · ⛔ 0

Top-level jobs (13)

Job	Result
`binaries`	⏩ skipped
`build_and_test`	✅ success
`config_devdeps`	✅ success
`config_source_build`	⏩ skipped
`config_wheeldeps`	✅ success
`devdeps`	✅ success
`docker_image`	⏩ skipped
`gen_code_coverage`	⏩ skipped
`metadata`	✅ success
`python_metapackages`	⏩ skipped
`python_wheels`	⏩ skipped
`source_build`	⏩ skipped
`wheeldeps`	✅ success

⏩ Skipped jobs (7) — intentionally skipped on PR builds; run on merge_group / workflow_dispatch

Job
`binaries`
`config_source_build`
`docker_image`
`gen_code_coverage`
`python_metapackages`
`python_wheels`
`source_build`

All sub-jobs (42) — every matrix leg, with links

Job	Status	Link
Build and test (amd64, gcc12, openmpi) / Dev environment (Debug)	✅ success	view
Build and test (amd64, gcc12, openmpi) / Dev environment (Python)	✅ success	view
Build and test (amd64, llvm, openmpi) / Dev environment (Debug)	✅ success	view
Build and test (amd64, llvm, openmpi) / Dev environment (Python)	✅ success	view
Build and test (arm64, llvm, openmpi) / Dev environment (Debug)	✅ success	view
Build and test (arm64, llvm, openmpi) / Dev environment (Python)	✅ success	view
CI Summary	❔ in_progress	view
Configure build (devdeps)	✅ success	view
Configure build (source_build)	⏩ skipped	view
Configure build (wheeldeps)	✅ success	view
Create CUDA Quantum installer	⏩ skipped	view
Create Docker images	⏩ skipped	view
Create Python metapackages	⏩ skipped	view
Create Python wheels	⏩ skipped	view
Gen code coverage	⏩ skipped	view
Load dependencies (amd64, gcc12) / Caching	✅ success	view
Load dependencies (amd64, gcc12) / Finalize	✅ success	view
Load dependencies (amd64, gcc12) / Metadata	✅ success	view
Load dependencies (amd64, llvm) / Caching	✅ success	view
Load dependencies (amd64, llvm) / Finalize	✅ success	view
Load dependencies (amd64, llvm) / Metadata	✅ success	view
Load dependencies (arm64, gcc12) / Caching	✅ success	view
Load dependencies (arm64, gcc12) / Finalize	✅ success	view
Load dependencies (arm64, gcc12) / Metadata	✅ success	view
Load dependencies (arm64, llvm) / Caching	✅ success	view
Load dependencies (arm64, llvm) / Finalize	✅ success	view
Load dependencies (arm64, llvm) / Metadata	✅ success	view
Load source build cache	⏩ skipped	view
Load wheel dependencies (amd64, 12.6) / Caching	✅ success	view
Load wheel dependencies (amd64, 12.6) / Finalize	✅ success	view
Load wheel dependencies (amd64, 12.6) / Metadata	✅ success	view
Load wheel dependencies (amd64, 13.0) / Caching	✅ success	view
Load wheel dependencies (amd64, 13.0) / Finalize	✅ success	view
Load wheel dependencies (amd64, 13.0) / Metadata	✅ success	view
Load wheel dependencies (arm64, 12.6) / Caching	✅ success	view
Load wheel dependencies (arm64, 12.6) / Finalize	✅ success	view
Load wheel dependencies (arm64, 12.6) / Metadata	✅ success	view
Load wheel dependencies (arm64, 13.0) / Caching	✅ success	view
Load wheel dependencies (arm64, 13.0) / Finalize	✅ success	view
Load wheel dependencies (arm64, 13.0) / Metadata	✅ success	view
Prepare cache clean-up	❔ in_progress	view
Retrieve PR info	✅ success	view

✅ Required checks (6/6) — declared in .github/required-checks.yml for push

Required check	Status	Link
Build and test (amd64, llvm, openmpi) / Dev environment (Debug)	✅ success	view
Build and test (amd64, llvm, openmpi) / Dev environment (Python)	✅ success	view
Build and test (arm64, llvm, openmpi) / Dev environment (Debug)	✅ success	view
Build and test (arm64, llvm, openmpi) / Dev environment (Python)	✅ success	view
Build and test (amd64, gcc12, openmpi) / Dev environment (Debug)	✅ success	view
Build and test (amd64, gcc12, openmpi) / Dev environment (Python)	✅ success	view

thedaemon-wizard · 2026-06-09T03:01:10Z

Can you run the code formatter to fix the CI failure on Basic content checks? I find it it easiest to run the pre-commit checks as pre-commit run --all-files --hook-stage pre-push

Hello @khalatepradnya — done. I applied the repo's configured formatters to the changed files (clang-format for the C++ pass and yapf with the repo's google style for the Python test) and committed the resulting diff. The Basic content checks CI is green on the latest commit.

thedaemon-wizard · 2026-06-09T09:48:51Z

Open questions for the mentors

These are points I'd like your call on. Each lists the current implementation
choice and a research-backed alternative so you can decide. (Background
references in the "Latest-research" notes below each item.)

Q1. Failure-signaling channel

Current: on a reconstruction miss or an over-cap dimension the pass emits an
LLVM_DEBUG note and return failure(), leaving the op unchanged (fully
composable). @schweitzpgi floated an optional fail-when-cannot-synthesize pass
option (off by default) for pipelines that want a hard error instead.

Question: keep the silent-composable default only, or also wire up that
opt-in pass option now?

Latest-research note: MLIR conversion/transform passes conventionally signal
"can't legalize" via pattern failure() rather than aborting, which is what this
PR now does; an opt-in strict mode is a small, backward-compatible addition.

Q2. Reconstruction tolerance

Current: a fixed host-double threshold TOL = 1e-7 used uniformly for the
unitarity check and the per-step reconstruction checks.

Question: keep the fixed cap, or move to a dimension-scaled / eps-relative
tolerance (≈ eps · dim) to avoid false negatives at 5 qubits where recursive
CSD error accumulates?

Latest-research note: Qiskit issue
#14422 (May 2025) reports that
CSD tolerance errors accumulate under the recursive QSD, especially for
multi-controlled blocks, and proposes improving the CSD tolerance. A
dimension-aware tolerance would track that finding; at our ≤ 5-qubit cap the
fixed 1e-7 has been comfortable in testing, so I left it fixed pending your
preference.

Q3. 5-qubit cap scope (confirming)

Implemented exactly as you suggested: dimension <= 32 (≤ 5 qubits) is
synthesized, and larger power-of-two dimensions are left unchanged with a clean
"unsupported" debug note.

Question (optional): happy to leave this as a hard constant; flagging only in
case you'd prefer it exposed as a pass option/threshold later rather than a fixed
32.

Latest-research note: QSD's leading cost is ~(22/48)·4^n CNOTs, so 6q is
already ~hundreds of CNOTs — well past where unconstrained synthesis is useful
without further optimization, consistent with the 6-qubit failures you observed.

Q4. Gate-count follow-up (Block-ZXZ)

Current (measured from the lowered IR on the 8×8 test): the three top-level
multiplexors are gate-optimal at 12 CNOTs (gray-code, vs 18 for a naive
split), but the n=3 total is 36 CNOTs — the extra 24 come from the four
2-qubit KAK base-case blocks (6 CNOTs each, pre-existing CUDA-Q code, unchanged
here) with no cross-block diagonal merge. For reference the SBM literature
optimum is 20 (needs a 3-CNOT KAK base case and adjacent-diagonal merging)
and Block-ZXZ is 19.

Question: interest in a follow-up for (a) a gate-optimal 3-CNOT KAK base case

adjacent-diagonal merge (moving n=3 toward ~20), and/or (b) Block-ZXZ
(arXiv:2403.13692, n=3 → 19 CNOTs, currently
the lowest known count) — or keep this QSD path as the baseline for now?

Latest-research note: Block-ZXZ (Krol & Al-Ars, 2024) holds the record CNOT
count for general n; a PennyLane tutorial (May 2025) corroborates it as the
current best. Both follow-ups would be additive on top of the present QSD path.

khalatepradnya · 2026-06-09T19:39:46Z

/ok to test fc34859

Command Bot: Processing...

khalatepradnya · 2026-06-09T19:43:51Z

@thedaemon-wizard - Please comment on the issue #2242 so that it can be assigned to you.

khalatepradnya · 2026-06-09T19:55:49Z

Open questions for the mentors

These are points I'd like your call on. Each lists the current implementation choice and a research-backed alternative so you can decide. (Background references in the "Latest-research" notes below each item.)

truncated for brevity

Latest-research note: Block-ZXZ (Krol & Al-Ars, 2024) holds the record CNOT count for general n; a PennyLane tutorial (May 2025) corroborates it as the current best. Both follow-ups would be additive on top of the present QSD path.

Thanks @thedaemon-wizard, for writing these up. My preference:

Q1: Keep the default composable behavior. I do not think we need to add a strict fail-when-cannot-synthesize option in this PR, if @schweitzpgi strongly prefers it for pipeline control, it can be a follow-up, shouldn't block the Unitary Hack submission.

Q2: I would not address this by simply loosening tolerance. I do think this PR needs to handle the advertised supported range more predictably.

Q3: Keep the hard 5-qubit / dimension-32 cap for now.

Q4: Gate-count work should be follow-up only. I would not add Block-ZXZ or a 3-CNOT KAK refactor in this PR; correctness and coverage of the QSD path should come first.

khalatepradnya

Round#2. This is shaping up quite well. I think we can broaden the test coverage for 4q and 5q examples since this PR extends to those as well.

khalatepradnya · 2026-06-09T20:56:00Z

+// the terms of the Apache License 2.0 which accompanies this distribution.   //
+// ========================================================================== //
+
+// RUN: cudaq-opt --unitary-synthesis %s | FileCheck %s


If you combine with the other passes to get final lowering, you can match the exact gates used in synth_kernel8 in the Python test
// RUN: cudaq-opt --unitary-synthesis --canonicalize --apply-op-specialization --aggressive-inlining %s | FileCheck %s

Done. The RUN line is now the combined lowering pipeline

// RUN: cudaq-opt --unitary-synthesis --canonicalize --apply-op-specialization --aggressive-inlining %s | FileCheck %s

and the CHECK: lines were regenerated from that fully-lowered/inlined output so they match the flat gate sequence exactly.

Side effect of the atan2 base-case fix: three pre-existing files whose inputs are truncated-literal (not perfectly unitary) matrices — bell_pair.qke, random_unitary-0.qke, random_unitary-1.qke — had one ry angle constant shift in the ~1e-8 digits, because for a not-perfectly-unitary input atan2(|u01|,|u00|) differs from acos(|u00|) by exactly that ill-conditioning amount. Those three CHECK constants were updated to the new values; all 9 FileCheck tests in test/Transforms/UnitarySynthesis/ pass.

khalatepradnya · 2026-06-09T21:02:55Z

If you "Update branch" / align with main, the CI is more likely to not have cache misses and tends to go faster.

Signed-off-by: thedamon-wizard <amon.koike@daemons.jp>

…ronger tests - completeUnitaryBasis: use Eigen::Index (no truncating cast); rename `defined` -> `definedCols`. - Reconstruction self-checks are always-on (assert removed). Decomposers carry a `valid` flag validated before any IR is emitted; on failure the pass leaves the custom op unchanged and emits an LLVM_DEBUG note (composable: no emitError, no signalPassFailure). - Cap QSD at 5 qubits (matrix dimension <= 32); larger power-of-two dimensions are left unchanged with a debug note. - Use llvm::popcount / llvm::countr_zero (llvm/ADT/bit.h) for the gray-code mux. - Rename random_unitary-3q.qke -> random_unitary-5.qke and replace CHECK-DAG with ordered CHECK lines locking the gray-code emission sequence. - Rewrite test_qsd_decomposition.py: full-precision rvs(_, random_state=13); add synth_kernel8 state-equality check; drop nearest_unitary; Toffoli execution stays covered by test_custom_operations.py::test_three_qubit_op. Signed-off-by: thedamon-wizard <amon.koike@daemons.jp>

…ests - OneQubitOpZYZ: derive the Y-rotation angle from `2*atan2(|u01|,|u00|)` instead of an `acos(|u00|)`/`asin(|u01|)` branch. acos/asin are ill-conditioned near their +/-1 endpoints and return NaN when the argument exceeds 1 by a rounding ULP -- which occurs for the near-degenerate controlled single-qubit sub-blocks produced deep in a recursive Quantum Shannon Decomposition. This made 4q/5q custom unitaries (e.g. cccx) synthesize to an incorrect circuit; atan2 is well-conditioned over the whole domain and needs no clamping. - Make the 2-qubit KAK base case composable: bidiagonalize / extractSU2FromSO4 clear a `bool &ok` instead of asserting, and an in-range reconstruction miss emits a warning and leaves the op unchanged (no emitError/signalPassFailure). - Add targettests/execution/custom_operation_cccx.cpp (4q C3X) and custom_operation_c4x.cpp (5q C4X), run on the default and Quantinuum-emulate targets, exercising the synthesis path end-to-end across the 3-5 qubit range. - random_unitary-5.qke: combine unitary-synthesis with op-specialization and aggressive-inlining in the RUN line so FileCheck locks the flat gate list; refresh three ZYZ constants affected by the atan2 angle formula. - Drop the 16x16 direct-matrix python test (did not exercise the pass); 4q/5q coverage now lives in the execution targettests above. Signed-off-by: thedamon-wizard <amon.koike@daemons.jp>

thedaemon-wizard · 2026-06-10T13:32:54Z

Thanks for the clear direction on the open questions, @khalatepradnya. Here's how each is handled in the latest push:

Q1 (failure signaling). Kept the composable default (return failure() + debug note); did not add a strict fail-when-cannot-synthesize option. It remains an easy backward-compatible follow-up for @schweitzpgi if a pipeline ever wants a hard error — not blocking this submission.

Q2 (predictability, not loosening tolerance). Two complementary changes, both grounded in measurement rather than a blanket tolerance bump.

(1) The root-cause atan2 fix (see the cccx thread) removed the dominant unpredictability — the acos/asin ill-conditioning/NaN in the 1-qubit base case that corrupted the result before any check ran. After it, n=3 and n=4 synthesize 100% of the time (0/1000 each).

(2) A principled, calibrated tolerance split for the n=5 borderline tail. The original single TOL = 1e-7 was doing two different jobs — the input-unitarity contract / numerical-safety guards (division-by-near-zero, Gram-Schmidt independence) and the post-factorization reconstruction self-checks. I separated them: TOL = 1e-7 stays on the input contract and safety guards (must stay tight), and a new RECON_TOL = 1e-5 is applied only to the reconstruction self-checks. RECON_TOL is calibrated to measured behavior, not arbitrary: instrumenting the C++ Eigen residual shows a correctly-synthesized unitary reconstructs to ~1e-10 when well-conditioned, while the rare near-degenerate cosine-sine/demultiplex leaf blocks push the residual to ~8e-7 even though the emitted circuit still reproduces the target to <4e-7 end-to-end; a genuinely-wrong decomposition reconstructs to O(0.1). So RECON_TOL sits ~1 order above the correct tail and ~4 orders below the failure regime — it accepts borderline-but-correct output while still rejecting broken circuits by a wide margin. The input contract stays tight, so this is not "simply loosening the tolerance."

To be transparent: an earlier version of this comment claimed those n=5 bails reconstructed to ~0.2 and concluded the tolerance should stay fixed. That was a measurement error — the "forced emit" had used a throwaway TOL = 1e-1 build that also loosened the algorithmic ok-flag checks inside the KAK base case, corrupting the decomposition itself. Re-measuring three independent ways (instrumented C++ residuals; a validated NumPy parser of the synthesized gate list; and a force-emit at a throwaway TOL = 1e-5 that leaves the algorithmic checks untouched) shows the bails are a borderline tail of correct circuits, not broken ones.

Result with the split: n=3 and n=4 stay at 100%; n=5 improves from ~98.9% to 99.9% (999/1000), every emitted circuit correct (worst end-to-end 1.2e-6, zero wrong by >1e-5); the lone remaining outlier bails gracefully (visible warning, op unchanged, never a wrong circuit). I also extended the composable principle to the 2-qubit KAK base case (asserts in bidiagonalize/extractSU2FromSO4 are now ok-flag checks). The residual ~0.1% n=5 leaf degeneracy is noted as future work.

Q3 (cap). Kept exactly as implemented: dimension ≤ 32 (≤ 5 qubits) is synthesized; larger power-of-two dimensions are left unchanged with a debug note.

Q4 (gate count). Agreed — correctness and coverage first. No Block-ZXZ / 3-CNOT-KAK in this PR; noted as future work only (arXiv:2403.13692).

Branch update. Rebased onto upstream/main and force-pushed with --force-with-lease to keep CI cache-friendly (no extra merge commit, sign-offs preserved).

…5q synthesis The recursive Quantum Shannon Decomposition reused a single 1e-7 tolerance for both the input-unitarity contract and the always-on reconstruction self-checks. At 5 qubits, rare near-degenerate sub-blocks in the cosine-sine decomposition and multiplexor demultiplexing legitimately accumulate host- double reconstruction residuals up to ~8e-7 -- still producing a circuit that reproduces the target to <4e-7 end-to-end -- so the 1e-7 self-check spuriously bailed on ~1% of valid Haar-random 5-qubit unitaries, leaving the advertised range less predictable than intended. Split the tolerance into two purposes: - TOL (1e-7): the input-unitarity contract and pure numerical-safety guards (near-zero determinant division, Gram-Schmidt independence) stay tight. - RECON_TOL (1e-5): the reconstruction self-checks (bidiagonalize diagonality, SU2-from-SO4 verification, KAK/CSD/demultiplex reconstruction) use a tolerance calibrated to the measured correct-synthesis residual (~8e-7 worst) and ~4 orders below the genuinely-broken regime (O(0.1)). This is a calibration to measured floating-point accumulation, not a blanket loosening: the input contract is unchanged and the composable emitWarning/bail path still rejects broken results. Over 1000 Haar-random unitaries per size: n=3 and n=4 synthesize 100% (0 bail), n=5 now synthesizes 99.9% (was ~98.9%) with every emitted circuit correct (worst end-to-end error 1.2e-6, zero wrong); the rare residual outlier still bails gracefully. All 9 UnitarySynthesis FileCheck tests, the cccx/c4x/toffoli execution targettests (default and Quantinuum emulation), and the KAK/Euler/QSD Python tests pass unchanged -- the emitted gate sequences are identical; only the bail threshold moved. Signed-off-by: thedamon-wizard <amon.koike@daemons.jp>

khalatepradnya · 2026-06-10T16:52:54Z

/ok to test f03e61c

Command Bot: Processing...

schweitzpgi reviewed Jun 8, 2026

View reviewed changes

schweitzpgi requested a review from khalatepradnya June 8, 2026 16:49

khalatepradnya reviewed Jun 8, 2026

View reviewed changes

thedaemon-wizard closed this Jun 9, 2026

thedaemon-wizard reopened this Jun 9, 2026

github-actions Bot pushed a commit that referenced this pull request Jun 9, 2026

Cleaning up docs preview for PR #4693.

6b98326

copy-pr-bot Bot temporarily deployed to ghcr-ci June 9, 2026 19:40 Inactive

khalatepradnya reviewed Jun 9, 2026

View reviewed changes

thedaemon-wizard added 3 commits June 10, 2026 22:22

Update code for custom op at unitaryHack2026

d0bf35e

Signed-off-by: thedamon-wizard <amon.koike@daemons.jp>

thedaemon-wizard force-pushed the main branch from fc34859 to 4754a91 Compare June 10, 2026 13:27

thedaemon-wizard mentioned this pull request Jun 10, 2026

[custom op] Support unitary synthesis for 3+ qubit operations #2242

Open

1 task

copy-pr-bot Bot temporarily deployed to ghcr-ci June 10, 2026 16:53 Inactive

copy-pr-bot Bot temporarily deployed to ghcr-ci June 10, 2026 16:54 Inactive

Conversation

thedaemon-wizard commented Jun 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

[custom op] Support unitary synthesis for 3+ qubit operations (QSD)

Summary

Algorithm

CNOT cost and the gray-code (SBM) optimization

Files changed

Test results

AI usage disclosure

Sign-off

Uh oh!

copy-pr-bot Bot commented Jun 7, 2026

Uh oh!

schweitzpgi left a comment

Choose a reason for hiding this comment

Uh oh!

schweitzpgi Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

thedaemon-wizard Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

schweitzpgi Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

thedaemon-wizard Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

khalatepradnya Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

thedaemon-wizard Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

khalatepradnya commented Jun 8, 2026

Uh oh!

khalatepradnya left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

khalatepradnya commented Jun 9, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

thedaemon-wizard commented Jun 9, 2026

Uh oh!

thedaemon-wizard commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

CI Summary (push) — ✅ passed

Uh oh!

thedaemon-wizard commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

thedaemon-wizard commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Open questions for the mentors

Q1. Failure-signaling channel

Q2. Reconstruction tolerance

Q3. 5-qubit cap scope (confirming)

Q4. Gate-count follow-up (Block-ZXZ)

Uh oh!

khalatepradnya commented Jun 9, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

khalatepradnya commented Jun 9, 2026

Uh oh!

khalatepradnya commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

thedaemon-wizard commented Jun 7, 2026 •

edited

Loading

thedaemon-wizard Jun 9, 2026 •

edited

Loading

khalatepradnya left a comment •

edited

Loading

khalatepradnya commented Jun 9, 2026 •

edited by github-actions Bot

Loading

thedaemon-wizard commented Jun 9, 2026 •

edited

Loading

github-actions Bot commented Jun 9, 2026 •

edited

Loading

CI Summary (`push`) — ✅ passed

thedaemon-wizard commented Jun 9, 2026 •

edited

Loading

thedaemon-wizard commented Jun 9, 2026 •

edited

Loading

khalatepradnya commented Jun 9, 2026 •

edited by github-actions Bot

Loading

khalatepradnya commented Jun 9, 2026 •

edited

Loading

khalatepradnya commented Jun 9, 2026 •

edited

Loading

thedaemon-wizard commented Jun 10, 2026 •

edited

Loading

khalatepradnya commented Jun 10, 2026 •

edited by github-actions Bot

Loading