Skip to content

Update code for custom op at unitaryHack2026#4693

Open
thedaemon-wizard wants to merge 4 commits into
NVIDIA:mainfrom
thedaemon-wizard:main
Open

Update code for custom op at unitaryHack2026#4693
thedaemon-wizard wants to merge 4 commits into
NVIDIA:mainfrom
thedaemon-wizard:main

Conversation

@thedaemon-wizard

@thedaemon-wizard thedaemon-wizard commented Jun 7, 2026

Copy link
Copy Markdown

[custom op] Support unitary synthesis for 3+ qubit operations (QSD)

Closes #2242

Summary

The unitary-synthesis optimization pass
(lib/Optimizer/Transforms/UnitarySynthesis.cpp) decomposes the matrix of a
custom operation (cudaq.register_operation / CUDAQ_REGISTER_OPERATION) into a
sequence of native gates. Until now it only handled 1-qubit (ZYZ) and
2-qubit (KAK) operations; any custom operation acting on 3 or more qubits
(e.g. a Toffoli) failed to legalize quake.custom_unitary_constant when targeting
hardware backends such as Quantinuum.

This PR adds a general recursive Quantum Shannon Decomposition (QSD) so that
a 2^n x 2^n custom unitary can be synthesized for 3 ≤ n ≤ 5 (matrix
dimension 8–32), including multi-controlled operations such as a 4-qubit cccx
and 5-qubit c4x. The existing TwoQubitOpKAK (n = 2) and OneQubitOpZYZ
(n = 1) decomposers are reused as the recursion base cases; the 1-qubit base
case's Y-angle extraction is made numerically robust (atan2 of magnitudes
instead of acos/asin) so the near-degenerate sub-blocks that arise deep in
the recursion are handled correctly. Per review, the range is capped at 5 qubits
(the 4^n CNOT growth makes 6q+ impractical); larger power-of-two dimensions are
left unchanged in a composable way (the pass emits a warning / LLVM_DEBUG
note and does not modify the IR — no emitError, no signalPassFailure). To
handle the advertised 3–5 qubit range predictably, the always-on
reconstruction self-checks use a tolerance (RECON_TOL = 1e-5) that is separate
from, and looser than, the tight input-unitarity contract (TOL = 1e-7). The
split is calibrated to measured behavior, not chosen arbitrarily: a correctly
synthesized unitary reconstructs to ~1e-10 when well-conditioned, while rare
near-degenerate sub-blocks push the residual up to ~8e-7 even though the emitted
circuit still reproduces the target to <4e-7 end-to-end; a genuinely wrong
decomposition reconstructs to O(0.1). RECON_TOL therefore sits ~1 order above
the measured correct-synthesis residual and ~4 orders below the failure regime.
With this split, n=3 and n=4 synthesize 100% and n=5 synthesizes 99.9% of
Haar-random unitaries (all emitted circuits correct); the rare residual outlier
that still exceeds RECON_TOL is left unchanged with a warning rather than
emitting a wrong circuit (see Verification).

Algorithm

One level of QSD on an n-qubit unitary U (acting on a most-significant
multiplexor qubit q0 and n-1 lower qubits) performs:

  1. Cosine-Sine Decomposition (CSD) (arXiv quant-ph/0404089)
    U = blockDiag(L0, L1) · [[C, -S], [S, C]] · blockDiag(R0, R1).
    Eigen has no built-in CSD, so it is assembled from the SVD of the top-left
    block. Degenerate singular values (e.g. for permutation operators like
    Toffoli) are handled by deterministically completing the singular-vector
    basis. The central [[C,-S],[S,C]] factor is a uniformly-controlled Ry on
    q0 with angles 2·atan2(s_k, c_k).

  2. Multiplexor demultiplexing (arXiv quant-ph/0406176)
    blockDiag(A, B) = (I ⊗ V) · blockDiag(D, D†) · (I ⊗ W).
    A·B† is unitary (hence normal); its complex Schur form yields an orthonormal
    eigenbasis. blockDiag(D, D†) is a uniformly-controlled Rz on q0. This is
    applied to both blockDiag(R0,R1) and blockDiag(L0,L1), producing four
    (n-1)-qubit sub-unitaries and three angle vectors (Rz, Ry, Rz).

  3. Uniformly-controlled rotation emission via the optimal Möttönen gray-code
    construction (arXiv quant-ph/0407010, quant-ph/0406176): a k-controlled
    multiplexed rotation is emitted as 2^k rotations interleaved with exactly
    2^k CNOTs. The per-state angles are mapped to the gray-code rotation angles
    through the Walsh-Hadamard-like transform, and the CNOT after rotation i
    targets the control flipped between gray-code words i and i+1.

The four sub-unitaries are synthesized recursively; each factorization step is
exact, so no extra global-phase correction is required at the QSD level (the base
cases track their own global phase).

CNOT cost and the gray-code (SBM) optimization

emitMux uses the optimal gray-code construction emitting exactly 2^k CNOTs
per k-controlled multiplexor, instead of the 2^{k+1}-2 produced by a naive
recursive split (which inserts a redundant CNOT pair at every recursion
boundary). For a 3-qubit decomposition the three top-level multiplexors each have
k = 2 controls, contributing 3 x 4 = 12 CNOTs — down from 3 x 6 = 18. This
is the Shende-Bullock-Markov adjacent-CNOT merge specialized to a single
uniformly-controlled rotation. Further reductions for general n are left as
future work: the Block-ZXZ construction (arXiv:2403.13692) currently holds the
lowest known CNOT count (e.g. 19 vs. 20 for n = 3) and would be a natural
follow-up baseline.

Files changed

File Change
lib/Optimizer/Transforms/UnitarySynthesis.cpp Add CSDComponents, completeUnitaryBasis, cosineSineDecomposition, demultiplex, and NQubitOpQSD (with the optimal gray-code emitMux for uniformly-controlled rotations); dispatch 2^n (3 <= n <= 5) dimensions in CustomUnitaryPattern. Make the 1-qubit base case numerically robust: the Y-rotation angle is 2·atan2(|u01|,|u00|) instead of acos(|u00|)/asin(|u01|), which are ill-conditioned near ±1 and return NaN for the near-degenerate sub-blocks (controlled single-qubit gates) produced deep in the recursion — this is what makes multi-controlled ops like a 4-qubit cccx synthesize correctly. Reconstruction self-checks are always-on and composable: on failure (or a power-of-two dimension above the 5-qubit cap) the pass leaves the quake.custom_unitary_constant op unchanged and emits a warning / LLVM_DEBUG note instead of asserting or failing the pass
python/tests/custom/test_qsd_decomposition.py New 3-qubit (8×8) random-unitary test (scipy.stats.unitary_group.rvs(8, random_state=13), transcribed at full double precision): builds a flat synth_kernel8 (the exact cudaq-opt --unitary-synthesis output) and asserts the synthesized state matches the direct custom-op state. (4q/5q coverage lives in the execution targettests below)
test/Transforms/UnitarySynthesis/random_unitary-5.qke New: FileCheck structural test of the QSD lowering through the combined --unitary-synthesis --canonicalize --apply-op-specialization --aggressive-inlining pipeline, with ordered CHECK: lines locking in the fully-lowered gray-code multiplexor emission sequence
targettests/execution/custom_operation_cccx.cpp New: 4-qubit C3X permutation executed on the default target and --target quantinuum --emulate; // CHECK: { 1110:1000 }
targettests/execution/custom_operation_c4x.cpp New: 5-qubit C4X permutation (top of the supported range) on both targets; // CHECK: { 11110:1000 }
python/tests/backends/test_Quantinuum_LocalEmulation_kernel.py test_3q_unitary_synthesis updated from expecting a RuntimeError to expecting a successful Toffoli ("110")
targettests/execution/custom_operation_toffoli.cpp Quantinuum --emulate RUN line now expects success (110) instead of the legalization failure
test/Transforms/UnitarySynthesis/{bell_pair,random_unitary-0,random_unitary-1}.qke One ry angle CHECK constant updated in each (shift in the ~1e-8 digits) — a consequence of the robust atan2 base case on these truncated-literal (not perfectly unitary) inputs

Toffoli custom-op execution coverage is also provided by the existing
python/tests/custom/test_custom_operations.py::test_three_qubit_op (registers the
8×8 permutation and samples "110"); that file is not modified by this PR.

Test results

  • C++ FileCheck (MLIR pass): 9/9 PASS. All test/Transforms/UnitarySynthesis/*.qke
    pass with cudaq-opt built against LLVM 22, including the new random_unitary-5.qke
    and the 8 pre-existing ZYZ/KAK tests (no regression).
  • Composability (graceful failure) verified: feeding a reconstruction-tripping
    8×8 and a power-of-two dim = 64 (6-qubit) op through
    cudaq-opt --unitary-synthesis --debug-only=unitary-synthesis leaves the
    quake.custom_unitary_constant op unchanged and prints the LLVM_DEBUG note,
    with no abort and no signalPassFailure — the pass stays composable.
  • Gray-code CNOT reduction confirmed: the 3-qubit top-level multiplexor kernel
    emits 12 quake.x (3 multiplexors x 4), down from 18 with the naive
    recursive split — measured directly from the lowered IR.
  • Gray-code emission validated numerically: for k = 1..4 controls and both
    Ry/Rz, the net rotation applied for every control basis state matches the
    target angle to < 1e-12, confirming the angle transform and control mapping.
  • C++ execution (full build, python ON + cuQuantum), default target and
    --target quantinuum --emulate:

    custom_operation_toffoli.cpp110,
    custom_operation_cccx.cpp (4q C3X) → { 1110:1000 },
    custom_operation_c4x.cpp (5q C4X) → { 11110:1000 }
    (the Quantinuum path runs the --unitary-synthesis QSD legalization; the 4q/5q
    multi-controlled cases previously produced a wrong circuit and now pass).
  • Synthesis reliability (Monte-Carlo of 1000 Haar-random unitaries per n
    through cudaq-opt --unitary-synthesis, with the RECON_TOL/TOL split):

    n=3 0/1000 and n=4 0/1000 bail (every random unitary synthesizes; worst
    end-to-end reconstruction 3.7e-12 at n=3, 1.7e-8 at n=4). At n=5 999/1000
    synthesize (99.9%); every emitted circuit is correct (worst end-to-end
    reconstruction 1.2e-6, zero circuits wrong by more than 1e-5). The single
    remaining n=5 outlier whose residual still exceeds RECON_TOL bails
    gracefully — visible warning, op left unchanged, never a wrong circuit.
    Instrumenting the C++ Eigen residuals confirms the distribution that motivates
    RECON_TOL: per-node reconstruction residual peaks in the CSD/demultiplex
    blocks at ~8e-7 (well-conditioned nodes are ~1e-10), whereas a genuinely wrong
    decomposition reconstructs to O(0.1) — so RECON_TOL = 1e-5 accepts the
    borderline-but-correct tail while still rejecting broken output by ~4 orders of
    margin. This is the calibrated bound, not a blanket loosening: the
    input-unitarity contract and the numerical-safety guards (division-by-near-zero,
    Gram-Schmidt independence) stay at the tight TOL = 1e-7.
  • Python: pytest python/tests/custom/test_qsd_decomposition.py -v → 1/1 PASS
    (8×8 random). The test asserts the synthesized synth_kernel8 state equals the
    direct custom-op state (np.allclose(..., atol=1e-6)) and both match the
    matrix' first column. Toffoli custom-op execution is covered by
    test_custom_operations.py::test_three_qubit_op ("110").
  • Python: pytest python/tests/backends/test_Quantinuum_LocalEmulation_kernel.py -k unitary_synthesis -v → 3/3 PASS
    (test_1q/2q/3q_unitary_synthesis); test_3q_unitary_synthesis now expects a successful Toffoli
    ("110") instead of the former RuntimeError.
  • Regression: pytest test_kak_decomposition.py test_euler_decomposition.py → 5/5 PASS;
    all 9 UnitarySynthesis/*.qke FileCheck tests pass (the 8 pre-existing ZYZ/KAK
    plus the new random_unitary-5.qke; three pre-existing files had one ry-angle
    CHECK constant updated for the robust atan2 base case).
  • Composability: a power-of-two dim = 64 (6-qubit) op fed through
    cudaq-opt --unitary-synthesis is left unchanged with returncode 0 — no
    abort, no signalPassFailure.

AI usage disclosure

In accordance with the unitaryHACK AI policy, an AI assistant (Claude) was used as
a co-pilot for: brainstorming the CSD/demultiplexing algorithm structure,
drafting test scaffolding and docstrings, and cross-checking the math against the
referenced papers. The QSD math was independently prototyped in NumPy/SciPy and
all code was read, understood, and verified by a human contributor, who can
explain every part of the implementation. No unexecuted or hallucinated APIs were
committed; the implementation builds and passes the local test suite.

Sign-off

Commits are signed off per the Developer Certificate of Origin
(git commit -s), as required by Contributing.md.

@copy-pr-bot

copy-pr-bot Bot commented Jun 7, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@schweitzpgi schweitzpgi left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few questions to improve my understanding.

/// such as Toffoli.
Eigen::MatrixXcd completeUnitaryBasis(const Eigen::MatrixXcd &defined, int n) {
Eigen::MatrixXcd basis = defined;
int col = static_cast<int>(basis.cols());

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like an unneeded truncating type cast. Why not just use std::size_t?

@thedaemon-wizard thedaemon-wizard Jun 9, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks — fixed. The truncating cast is gone. .cols() already returns
Eigen::Index (the natural signed index type of Eigen containers), so the
function now takes and uses Eigen::Index throughout instead of int:

Eigen::MatrixXcd completeUnitaryBasis(const Eigen::MatrixXcd &definedCols,
                                      Eigen::Index n) {
  Eigen::MatrixXcd basis = definedCols;
  Eigen::Index col = basis.cols();
  for (Eigen::Index i = 0; i < n && col < n; ++i) { ... }

I preferred Eigen::Index over std::size_t so the loop/comparison stays in
Eigen's own (signed) index type and avoids any signed/unsigned mix with Eigen
indexing.

Comment thread cudaq/lib/Optimizer/Transforms/UnitarySynthesis.cpp Outdated
r1.row(k) = fromS.row(k) / s(k);
}
#ifndef NDEBUG
Eigen::MatrixXcd cMat = c.cast<std::complex<double>>().asDiagonal();

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rely on this always being/requiring double precision? Some backends are single precision.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The decomposition math (SVD / Schur / CSD) runs in host double purely for numerical stability of the classical factorization; it is independent of the backend's execution precision. The emitted rotation angles are f64 constants (rewriter.getF64Type()) and the existing downstream lowering converts them to the target's float width (fp32/fp64) exactly as the pre-existing KAK/ZYZ paths already do — so this PR introduces no new backend-precision assumption. The double here is the host-side compile-time arithmetic, not a runtime commitment. (This block is also the one rewritten under comment #4 below.)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this explanation makes sense to me. I agree that the SVD/Schur/CSD work here is host-side compile-time math, and using double for that factorization is consistent with the existing ZYZ/KAK decomposition paths.

I’ll defer to @schweitzpgi on whether we need any additional backend-precision guard here.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for accepting the host-double explanation. No code change here; happy to add a backend-precision guard later if @schweitzpgi wants one (follow-up).

@khalatepradnya

Copy link
Copy Markdown
Collaborator

Thank you, @thedaemon-wizard , for the contribution!

I will review the PR today. A couple of things on preliminary read:

  • Can you run the code formatter to fix the CI failure on Basic content checks? I find it it easiest to run the pre-commit checks as pre-commit run --all-files --hook-stage pre-push

  • The PR description mentions:

Further reductions for general n (Block-ZXZ, arXiv:2403.13692) are documented as future work in ANALYSIS_2026.md.

Either this is stale and should be removed, or it's missing ANALYSIS_2026.md?
Please update.

@khalatepradnya khalatepradnya left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thedaemon-wizard - Really nice work! 👏🏽 Let's polish this further before we can merge.

Comment thread cudaq/lib/Optimizer/Transforms/UnitarySynthesis.cpp Outdated
Comment thread cudaq/lib/Optimizer/Transforms/UnitarySynthesis.cpp Outdated
Comment thread cudaq/lib/Optimizer/Transforms/UnitarySynthesis.cpp Outdated
Comment thread cudaq/test/Transforms/UnitarySynthesis/random_unitary-5.qke
Comment thread cudaq/test/Transforms/UnitarySynthesis/random_unitary-3q.qke Outdated
Comment thread python/tests/custom/test_qsd_decomposition.py Outdated
Comment thread python/tests/custom/test_qsd_decomposition.py
Comment thread python/tests/custom/test_qsd_decomposition.py Outdated
@khalatepradnya

khalatepradnya commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator

/ok to test 898a53f

Command Bot: Processing...

@thedaemon-wizard

Copy link
Copy Markdown
Author

Hello @khalatepradnya, @schweitzpgi
Thank you for your review and feedback.
I’ll look into this. Please give me a moment while I review your comments and make the necessary code changes.
I look forward to continuing to work with you.

@thedaemon-wizard

thedaemon-wizard commented Jun 9, 2026

Copy link
Copy Markdown
Author

Either this is stale and should be removed, or it's missing ANALYSIS_2026.md?
Please update.

Hello, @khalatepradnya ,
Thank you for your comment.
Since ANALYSIS_2026.md was created separately from the actual implementation for debugging purposes, I have removed the message from it.

I am currently reviewing the other changes as well.
Thank you for your cooperation.

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

CI Summary (push) — ✅ passed

Run #27291777550 · ✅ 6 · ⏩ 7 · ❌ 0 · ⛔ 0

Top-level jobs (13)
Job Result
binaries ⏩ skipped
build_and_test ✅ success
config_devdeps ✅ success
config_source_build ⏩ skipped
config_wheeldeps ✅ success
devdeps ✅ success
docker_image ⏩ skipped
gen_code_coverage ⏩ skipped
metadata ✅ success
python_metapackages ⏩ skipped
python_wheels ⏩ skipped
source_build ⏩ skipped
wheeldeps ✅ success
⏩ Skipped jobs (7) — intentionally skipped on PR builds; run on merge_group / workflow_dispatch
Job
binaries
config_source_build
docker_image
gen_code_coverage
python_metapackages
python_wheels
source_build
All sub-jobs (42) — every matrix leg, with links
Job Status Link
Build and test (amd64, gcc12, openmpi) / Dev environment (Debug) ✅ success view
Build and test (amd64, gcc12, openmpi) / Dev environment (Python) ✅ success view
Build and test (amd64, llvm, openmpi) / Dev environment (Debug) ✅ success view
Build and test (amd64, llvm, openmpi) / Dev environment (Python) ✅ success view
Build and test (arm64, llvm, openmpi) / Dev environment (Debug) ✅ success view
Build and test (arm64, llvm, openmpi) / Dev environment (Python) ✅ success view
CI Summary ❔ in_progress view
Configure build (devdeps) ✅ success view
Configure build (source_build) ⏩ skipped view
Configure build (wheeldeps) ✅ success view
Create CUDA Quantum installer ⏩ skipped view
Create Docker images ⏩ skipped view
Create Python metapackages ⏩ skipped view
Create Python wheels ⏩ skipped view
Gen code coverage ⏩ skipped view
Load dependencies (amd64, gcc12) / Caching ✅ success view
Load dependencies (amd64, gcc12) / Finalize ✅ success view
Load dependencies (amd64, gcc12) / Metadata ✅ success view
Load dependencies (amd64, llvm) / Caching ✅ success view
Load dependencies (amd64, llvm) / Finalize ✅ success view
Load dependencies (amd64, llvm) / Metadata ✅ success view
Load dependencies (arm64, gcc12) / Caching ✅ success view
Load dependencies (arm64, gcc12) / Finalize ✅ success view
Load dependencies (arm64, gcc12) / Metadata ✅ success view
Load dependencies (arm64, llvm) / Caching ✅ success view
Load dependencies (arm64, llvm) / Finalize ✅ success view
Load dependencies (arm64, llvm) / Metadata ✅ success view
Load source build cache ⏩ skipped view
Load wheel dependencies (amd64, 12.6) / Caching ✅ success view
Load wheel dependencies (amd64, 12.6) / Finalize ✅ success view
Load wheel dependencies (amd64, 12.6) / Metadata ✅ success view
Load wheel dependencies (amd64, 13.0) / Caching ✅ success view
Load wheel dependencies (amd64, 13.0) / Finalize ✅ success view
Load wheel dependencies (amd64, 13.0) / Metadata ✅ success view
Load wheel dependencies (arm64, 12.6) / Caching ✅ success view
Load wheel dependencies (arm64, 12.6) / Finalize ✅ success view
Load wheel dependencies (arm64, 12.6) / Metadata ✅ success view
Load wheel dependencies (arm64, 13.0) / Caching ✅ success view
Load wheel dependencies (arm64, 13.0) / Finalize ✅ success view
Load wheel dependencies (arm64, 13.0) / Metadata ✅ success view
Prepare cache clean-up ❔ in_progress view
Retrieve PR info ✅ success view
✅ Required checks (6/6) — declared in .github/required-checks.yml for push
Required check Status Link
Build and test (amd64, llvm, openmpi) / Dev environment (Debug) ✅ success view
Build and test (amd64, llvm, openmpi) / Dev environment (Python) ✅ success view
Build and test (arm64, llvm, openmpi) / Dev environment (Debug) ✅ success view
Build and test (arm64, llvm, openmpi) / Dev environment (Python) ✅ success view
Build and test (amd64, gcc12, openmpi) / Dev environment (Debug) ✅ success view
Build and test (amd64, gcc12, openmpi) / Dev environment (Python) ✅ success view

@thedaemon-wizard

thedaemon-wizard commented Jun 9, 2026

Copy link
Copy Markdown
Author
  • Can you run the code formatter to fix the CI failure on Basic content checks? I find it it easiest to run the pre-commit checks as pre-commit run --all-files --hook-stage pre-push

Hello @khalatepradnya — done. I applied the repo's configured formatters to the changed files (clang-format for the C++ pass and yapf with the repo's google style for the Python test) and committed the resulting diff. The Basic content checks CI is green on the latest commit.

github-actions Bot pushed a commit that referenced this pull request Jun 9, 2026
@thedaemon-wizard

thedaemon-wizard commented Jun 9, 2026

Copy link
Copy Markdown
Author

Open questions for the mentors

These are points I'd like your call on. Each lists the current implementation
choice
and a research-backed alternative so you can decide. (Background
references in the "Latest-research" notes below each item.)

Q1. Failure-signaling channel

Current: on a reconstruction miss or an over-cap dimension the pass emits an
LLVM_DEBUG note and return failure(), leaving the op unchanged (fully
composable). @schweitzpgi floated an optional fail-when-cannot-synthesize pass
option (off by default) for pipelines that want a hard error instead.

Question: keep the silent-composable default only, or also wire up that
opt-in pass option now?

Latest-research note: MLIR conversion/transform passes conventionally signal
"can't legalize" via pattern failure() rather than aborting, which is what this
PR now does; an opt-in strict mode is a small, backward-compatible addition.

Q2. Reconstruction tolerance

Current: a fixed host-double threshold TOL = 1e-7 used uniformly for the
unitarity check and the per-step reconstruction checks.

Question: keep the fixed cap, or move to a dimension-scaled / eps-relative
tolerance (≈ eps · dim) to avoid false negatives at 5 qubits where recursive
CSD error accumulates?

Latest-research note: Qiskit issue
#14422 (May 2025) reports that
CSD tolerance errors accumulate under the recursive QSD, especially for
multi-controlled blocks, and proposes improving the CSD tolerance. A
dimension-aware tolerance would track that finding; at our ≤ 5-qubit cap the
fixed 1e-7 has been comfortable in testing, so I left it fixed pending your
preference.

Q3. 5-qubit cap scope (confirming)

Implemented exactly as you suggested: dimension <= 32 (≤ 5 qubits) is
synthesized, and larger power-of-two dimensions are left unchanged with a clean
"unsupported" debug note.

Question (optional): happy to leave this as a hard constant; flagging only in
case you'd prefer it exposed as a pass option/threshold later rather than a fixed
32.

Latest-research note: QSD's leading cost is ~(22/48)·4^n CNOTs, so 6q is
already ~hundreds of CNOTs — well past where unconstrained synthesis is useful
without further optimization, consistent with the 6-qubit failures you observed.

Q4. Gate-count follow-up (Block-ZXZ)

Current (measured from the lowered IR on the 8×8 test): the three top-level
multiplexors are gate-optimal at 12 CNOTs (gray-code, vs 18 for a naive
split), but the n=3 total is 36 CNOTs — the extra 24 come from the four
2-qubit KAK base-case blocks (6 CNOTs each, pre-existing CUDA-Q code, unchanged
here) with no cross-block diagonal merge. For reference the SBM literature
optimum is 20 (needs a 3-CNOT KAK base case and adjacent-diagonal merging)
and Block-ZXZ is 19.

Question: interest in a follow-up for (a) a gate-optimal 3-CNOT KAK base case

  • adjacent-diagonal merge (moving n=3 toward ~20), and/or (b) Block-ZXZ
    (arXiv:2403.13692, n=3 → 19 CNOTs, currently
    the lowest known count) — or keep this QSD path as the baseline for now?

Latest-research note: Block-ZXZ (Krol & Al-Ars, 2024) holds the record CNOT
count for general n; a PennyLane tutorial (May 2025) corroborates it as the
current best. Both follow-ups would be additive on top of the present QSD path.

@khalatepradnya

khalatepradnya commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator

/ok to test fc34859

Command Bot: Processing...

@khalatepradnya

Copy link
Copy Markdown
Collaborator

@thedaemon-wizard - Please comment on the issue #2242 so that it can be assigned to you.

@khalatepradnya

khalatepradnya commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator

Open questions for the mentors

These are points I'd like your call on. Each lists the current implementation choice and a research-backed alternative so you can decide. (Background references in the "Latest-research" notes below each item.)

truncated for brevity

Latest-research note: Block-ZXZ (Krol & Al-Ars, 2024) holds the record CNOT count for general n; a PennyLane tutorial (May 2025) corroborates it as the current best. Both follow-ups would be additive on top of the present QSD path.

Thanks @thedaemon-wizard, for writing these up. My preference:

Q1: Keep the default composable behavior. I do not think we need to add a strict fail-when-cannot-synthesize option in this PR, if @schweitzpgi strongly prefers it for pipeline control, it can be a follow-up, shouldn't block the Unitary Hack submission.

Q2: I would not address this by simply loosening tolerance. I do think this PR needs to handle the advertised supported range more predictably.

Q3: Keep the hard 5-qubit / dimension-32 cap for now.

Q4: Gate-count work should be follow-up only. I would not add Block-ZXZ or a 3-CNOT KAK refactor in this PR; correctness and coverage of the QSD path should come first.

@khalatepradnya khalatepradnya left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Round#2. This is shaping up quite well. I think we can broaden the test coverage for 4q and 5q examples since this PR extends to those as well.

Comment thread targettests/execution/custom_operation_toffoli.cpp
Comment thread python/tests/custom/test_qsd_decomposition.py Outdated
// the terms of the Apache License 2.0 which accompanies this distribution. //
// ========================================================================== //

// RUN: cudaq-opt --unitary-synthesis %s | FileCheck %s

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you combine with the other passes to get final lowering, you can match the exact gates used in synth_kernel8 in the Python test
// RUN: cudaq-opt --unitary-synthesis --canonicalize --apply-op-specialization --aggressive-inlining %s | FileCheck %s

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. The RUN line is now the combined lowering pipeline

// RUN: cudaq-opt --unitary-synthesis --canonicalize --apply-op-specialization --aggressive-inlining %s | FileCheck %s

and the CHECK: lines were regenerated from that fully-lowered/inlined output so they match the flat gate sequence exactly.

Side effect of the atan2 base-case fix: three pre-existing files whose inputs are truncated-literal (not perfectly unitary) matrices — bell_pair.qke, random_unitary-0.qke, random_unitary-1.qke — had one ry angle constant shift in the ~1e-8 digits, because for a not-perfectly-unitary input atan2(|u01|,|u00|) differs from acos(|u00|) by exactly that ill-conditioning amount. Those three CHECK constants were updated to the new values; all 9 FileCheck tests in test/Transforms/UnitarySynthesis/ pass.

@khalatepradnya

khalatepradnya commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator

If you "Update branch" / align with main, the CI is more likely to not have cache misses and tends to go faster.

Signed-off-by: thedamon-wizard <amon.koike@daemons.jp>
…ronger tests

- completeUnitaryBasis: use Eigen::Index (no truncating cast); rename
  `defined` -> `definedCols`.
- Reconstruction self-checks are always-on (assert removed). Decomposers carry
  a `valid` flag validated before any IR is emitted; on failure the pass leaves
  the custom op unchanged and emits an LLVM_DEBUG note (composable: no emitError,
  no signalPassFailure).
- Cap QSD at 5 qubits (matrix dimension <= 32); larger power-of-two dimensions
  are left unchanged with a debug note.
- Use llvm::popcount / llvm::countr_zero (llvm/ADT/bit.h) for the gray-code mux.
- Rename random_unitary-3q.qke -> random_unitary-5.qke and replace CHECK-DAG with
  ordered CHECK lines locking the gray-code emission sequence.
- Rewrite test_qsd_decomposition.py: full-precision rvs(_, random_state=13);
  add synth_kernel8 state-equality check; drop nearest_unitary; Toffoli execution
  stays covered by test_custom_operations.py::test_three_qubit_op.

Signed-off-by: thedamon-wizard <amon.koike@daemons.jp>
…ests

- OneQubitOpZYZ: derive the Y-rotation angle from `2*atan2(|u01|,|u00|)` instead
  of an `acos(|u00|)`/`asin(|u01|)` branch. acos/asin are ill-conditioned near
  their +/-1 endpoints and return NaN when the argument exceeds 1 by a rounding
  ULP -- which occurs for the near-degenerate controlled single-qubit sub-blocks
  produced deep in a recursive Quantum Shannon Decomposition. This made 4q/5q
  custom unitaries (e.g. cccx) synthesize to an incorrect circuit; atan2 is
  well-conditioned over the whole domain and needs no clamping.
- Make the 2-qubit KAK base case composable: bidiagonalize / extractSU2FromSO4
  clear a `bool &ok` instead of asserting, and an in-range reconstruction miss
  emits a warning and leaves the op unchanged (no emitError/signalPassFailure).
- Add targettests/execution/custom_operation_cccx.cpp (4q C3X) and
  custom_operation_c4x.cpp (5q C4X), run on the default and Quantinuum-emulate
  targets, exercising the synthesis path end-to-end across the 3-5 qubit range.
- random_unitary-5.qke: combine unitary-synthesis with op-specialization and
  aggressive-inlining in the RUN line so FileCheck locks the flat gate list;
  refresh three ZYZ constants affected by the atan2 angle formula.
- Drop the 16x16 direct-matrix python test (did not exercise the pass); 4q/5q
  coverage now lives in the execution targettests above.

Signed-off-by: thedamon-wizard <amon.koike@daemons.jp>
@thedaemon-wizard

thedaemon-wizard commented Jun 10, 2026

Copy link
Copy Markdown
Author

Thanks for the clear direction on the open questions, @khalatepradnya. Here's how each is handled in the latest push:

Q1 (failure signaling). Kept the composable default (return failure() + debug note); did not add a strict fail-when-cannot-synthesize option. It remains an easy backward-compatible follow-up for @schweitzpgi if a pipeline ever wants a hard error — not blocking this submission.

Q2 (predictability, not loosening tolerance). Two complementary changes, both grounded in measurement rather than a blanket tolerance bump.

(1) The root-cause atan2 fix (see the cccx thread) removed the dominant unpredictability — the acos/asin ill-conditioning/NaN in the 1-qubit base case that corrupted the result before any check ran. After it, n=3 and n=4 synthesize 100% of the time (0/1000 each).

(2) A principled, calibrated tolerance split for the n=5 borderline tail. The original single TOL = 1e-7 was doing two different jobs — the input-unitarity contract / numerical-safety guards (division-by-near-zero, Gram-Schmidt independence) and the post-factorization reconstruction self-checks. I separated them: TOL = 1e-7 stays on the input contract and safety guards (must stay tight), and a new RECON_TOL = 1e-5 is applied only to the reconstruction self-checks. RECON_TOL is calibrated to measured behavior, not arbitrary: instrumenting the C++ Eigen residual shows a correctly-synthesized unitary reconstructs to ~1e-10 when well-conditioned, while the rare near-degenerate cosine-sine/demultiplex leaf blocks push the residual to ~8e-7 even though the emitted circuit still reproduces the target to <4e-7 end-to-end; a genuinely-wrong decomposition reconstructs to O(0.1). So RECON_TOL sits ~1 order above the correct tail and ~4 orders below the failure regime — it accepts borderline-but-correct output while still rejecting broken circuits by a wide margin. The input contract stays tight, so this is not "simply loosening the tolerance."

To be transparent: an earlier version of this comment claimed those n=5 bails reconstructed to ~0.2 and concluded the tolerance should stay fixed. That was a measurement error — the "forced emit" had used a throwaway TOL = 1e-1 build that also loosened the algorithmic ok-flag checks inside the KAK base case, corrupting the decomposition itself. Re-measuring three independent ways (instrumented C++ residuals; a validated NumPy parser of the synthesized gate list; and a force-emit at a throwaway TOL = 1e-5 that leaves the algorithmic checks untouched) shows the bails are a borderline tail of correct circuits, not broken ones.

Result with the split: n=3 and n=4 stay at 100%; n=5 improves from ~98.9% to 99.9% (999/1000), every emitted circuit correct (worst end-to-end 1.2e-6, zero wrong by >1e-5); the lone remaining outlier bails gracefully (visible warning, op unchanged, never a wrong circuit). I also extended the composable principle to the 2-qubit KAK base case (asserts in bidiagonalize/extractSU2FromSO4 are now ok-flag checks). The residual ~0.1% n=5 leaf degeneracy is noted as future work.

Q3 (cap). Kept exactly as implemented: dimension ≤ 32 (≤ 5 qubits) is synthesized; larger power-of-two dimensions are left unchanged with a debug note.

Q4 (gate count). Agreed — correctness and coverage first. No Block-ZXZ / 3-CNOT-KAK in this PR; noted as future work only (arXiv:2403.13692).

Branch update. Rebased onto upstream/main and force-pushed with --force-with-lease to keep CI cache-friendly (no extra merge commit, sign-offs preserved).

…5q synthesis

The recursive Quantum Shannon Decomposition reused a single 1e-7 tolerance
for both the input-unitarity contract and the always-on reconstruction
self-checks. At 5 qubits, rare near-degenerate sub-blocks in the cosine-sine
decomposition and multiplexor demultiplexing legitimately accumulate host-
double reconstruction residuals up to ~8e-7 -- still producing a circuit that
reproduces the target to <4e-7 end-to-end -- so the 1e-7 self-check spuriously
bailed on ~1% of valid Haar-random 5-qubit unitaries, leaving the advertised
range less predictable than intended.

Split the tolerance into two purposes:

- TOL (1e-7): the input-unitarity contract and pure numerical-safety guards
  (near-zero determinant division, Gram-Schmidt independence) stay tight.
- RECON_TOL (1e-5): the reconstruction self-checks (bidiagonalize diagonality,
  SU2-from-SO4 verification, KAK/CSD/demultiplex reconstruction) use a
  tolerance calibrated to the measured correct-synthesis residual (~8e-7 worst)
  and ~4 orders below the genuinely-broken regime (O(0.1)).

This is a calibration to measured floating-point accumulation, not a blanket
loosening: the input contract is unchanged and the composable emitWarning/bail
path still rejects broken results. Over 1000 Haar-random unitaries per size:
n=3 and n=4 synthesize 100% (0 bail), n=5 now synthesizes 99.9% (was ~98.9%)
with every emitted circuit correct (worst end-to-end error 1.2e-6, zero wrong);
the rare residual outlier still bails gracefully. All 9 UnitarySynthesis
FileCheck tests, the cccx/c4x/toffoli execution targettests (default and
Quantinuum emulation), and the KAK/Euler/QSD Python tests pass unchanged --
the emitted gate sequences are identical; only the bail threshold moved.

Signed-off-by: thedamon-wizard <amon.koike@daemons.jp>
@khalatepradnya

khalatepradnya commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

/ok to test f03e61c

Command Bot: Processing...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[custom op] Support unitary synthesis for 3+ qubit operations

3 participants