Update code for custom op at unitaryHack2026#4693
Conversation
schweitzpgi
left a comment
There was a problem hiding this comment.
Just a few questions to improve my understanding.
| /// such as Toffoli. | ||
| Eigen::MatrixXcd completeUnitaryBasis(const Eigen::MatrixXcd &defined, int n) { | ||
| Eigen::MatrixXcd basis = defined; | ||
| int col = static_cast<int>(basis.cols()); |
There was a problem hiding this comment.
This looks like an unneeded truncating type cast. Why not just use std::size_t?
There was a problem hiding this comment.
Thanks — fixed. The truncating cast is gone. .cols() already returns
Eigen::Index (the natural signed index type of Eigen containers), so the
function now takes and uses Eigen::Index throughout instead of int:
Eigen::MatrixXcd completeUnitaryBasis(const Eigen::MatrixXcd &definedCols,
Eigen::Index n) {
Eigen::MatrixXcd basis = definedCols;
Eigen::Index col = basis.cols();
for (Eigen::Index i = 0; i < n && col < n; ++i) { ... }I preferred Eigen::Index over std::size_t so the loop/comparison stays in
Eigen's own (signed) index type and avoids any signed/unsigned mix with Eigen
indexing.
| r1.row(k) = fromS.row(k) / s(k); | ||
| } | ||
| #ifndef NDEBUG | ||
| Eigen::MatrixXcd cMat = c.cast<std::complex<double>>().asDiagonal(); |
There was a problem hiding this comment.
Can we rely on this always being/requiring double precision? Some backends are single precision.
There was a problem hiding this comment.
The decomposition math (SVD / Schur / CSD) runs in host double purely for numerical stability of the classical factorization; it is independent of the backend's execution precision. The emitted rotation angles are f64 constants (rewriter.getF64Type()) and the existing downstream lowering converts them to the target's float width (fp32/fp64) exactly as the pre-existing KAK/ZYZ paths already do — so this PR introduces no new backend-precision assumption. The double here is the host-side compile-time arithmetic, not a runtime commitment. (This block is also the one rewritten under comment #4 below.)
There was a problem hiding this comment.
Thanks, this explanation makes sense to me. I agree that the SVD/Schur/CSD work here is host-side compile-time math, and using double for that factorization is consistent with the existing ZYZ/KAK decomposition paths.
I’ll defer to @schweitzpgi on whether we need any additional backend-precision guard here.
There was a problem hiding this comment.
Thanks for accepting the host-double explanation. No code change here; happy to add a backend-precision guard later if @schweitzpgi wants one (follow-up).
|
Thank you, @thedaemon-wizard , for the contribution! I will review the PR today. A couple of things on preliminary read:
Either this is stale and should be removed, or it's missing |
There was a problem hiding this comment.
@thedaemon-wizard - Really nice work! 👏🏽 Let's polish this further before we can merge.
Command Bot: Processing... |
|
Hello @khalatepradnya, @schweitzpgi |
Hello, @khalatepradnya , I am currently reviewing the other changes as well. |
CI Summary (
|
| Job | Result |
|---|---|
binaries |
⏩ skipped |
build_and_test |
✅ success |
config_devdeps |
✅ success |
config_source_build |
⏩ skipped |
config_wheeldeps |
✅ success |
devdeps |
✅ success |
docker_image |
⏩ skipped |
gen_code_coverage |
⏩ skipped |
metadata |
✅ success |
python_metapackages |
⏩ skipped |
python_wheels |
⏩ skipped |
source_build |
⏩ skipped |
wheeldeps |
✅ success |
⏩ Skipped jobs (7) — intentionally skipped on PR builds; run on merge_group / workflow_dispatch
| Job |
|---|
binaries |
config_source_build |
docker_image |
gen_code_coverage |
python_metapackages |
python_wheels |
source_build |
All sub-jobs (42) — every matrix leg, with links
| Job | Status | Link |
|---|---|---|
| Build and test (amd64, gcc12, openmpi) / Dev environment (Debug) | ✅ success | view |
| Build and test (amd64, gcc12, openmpi) / Dev environment (Python) | ✅ success | view |
| Build and test (amd64, llvm, openmpi) / Dev environment (Debug) | ✅ success | view |
| Build and test (amd64, llvm, openmpi) / Dev environment (Python) | ✅ success | view |
| Build and test (arm64, llvm, openmpi) / Dev environment (Debug) | ✅ success | view |
| Build and test (arm64, llvm, openmpi) / Dev environment (Python) | ✅ success | view |
| CI Summary | ❔ in_progress | view |
| Configure build (devdeps) | ✅ success | view |
| Configure build (source_build) | ⏩ skipped | view |
| Configure build (wheeldeps) | ✅ success | view |
| Create CUDA Quantum installer | ⏩ skipped | view |
| Create Docker images | ⏩ skipped | view |
| Create Python metapackages | ⏩ skipped | view |
| Create Python wheels | ⏩ skipped | view |
| Gen code coverage | ⏩ skipped | view |
| Load dependencies (amd64, gcc12) / Caching | ✅ success | view |
| Load dependencies (amd64, gcc12) / Finalize | ✅ success | view |
| Load dependencies (amd64, gcc12) / Metadata | ✅ success | view |
| Load dependencies (amd64, llvm) / Caching | ✅ success | view |
| Load dependencies (amd64, llvm) / Finalize | ✅ success | view |
| Load dependencies (amd64, llvm) / Metadata | ✅ success | view |
| Load dependencies (arm64, gcc12) / Caching | ✅ success | view |
| Load dependencies (arm64, gcc12) / Finalize | ✅ success | view |
| Load dependencies (arm64, gcc12) / Metadata | ✅ success | view |
| Load dependencies (arm64, llvm) / Caching | ✅ success | view |
| Load dependencies (arm64, llvm) / Finalize | ✅ success | view |
| Load dependencies (arm64, llvm) / Metadata | ✅ success | view |
| Load source build cache | ⏩ skipped | view |
| Load wheel dependencies (amd64, 12.6) / Caching | ✅ success | view |
| Load wheel dependencies (amd64, 12.6) / Finalize | ✅ success | view |
| Load wheel dependencies (amd64, 12.6) / Metadata | ✅ success | view |
| Load wheel dependencies (amd64, 13.0) / Caching | ✅ success | view |
| Load wheel dependencies (amd64, 13.0) / Finalize | ✅ success | view |
| Load wheel dependencies (amd64, 13.0) / Metadata | ✅ success | view |
| Load wheel dependencies (arm64, 12.6) / Caching | ✅ success | view |
| Load wheel dependencies (arm64, 12.6) / Finalize | ✅ success | view |
| Load wheel dependencies (arm64, 12.6) / Metadata | ✅ success | view |
| Load wheel dependencies (arm64, 13.0) / Caching | ✅ success | view |
| Load wheel dependencies (arm64, 13.0) / Finalize | ✅ success | view |
| Load wheel dependencies (arm64, 13.0) / Metadata | ✅ success | view |
| Prepare cache clean-up | ❔ in_progress | view |
| Retrieve PR info | ✅ success | view |
✅ Required checks (6/6) — declared in .github/required-checks.yml for push
| Required check | Status | Link |
|---|---|---|
| Build and test (amd64, llvm, openmpi) / Dev environment (Debug) | ✅ success | view |
| Build and test (amd64, llvm, openmpi) / Dev environment (Python) | ✅ success | view |
| Build and test (arm64, llvm, openmpi) / Dev environment (Debug) | ✅ success | view |
| Build and test (arm64, llvm, openmpi) / Dev environment (Python) | ✅ success | view |
| Build and test (amd64, gcc12, openmpi) / Dev environment (Debug) | ✅ success | view |
| Build and test (amd64, gcc12, openmpi) / Dev environment (Python) | ✅ success | view |
Hello @khalatepradnya — done. I applied the repo's configured formatters to the changed files (clang-format for the C++ pass and yapf with the repo's |
Open questions for the mentorsThese are points I'd like your call on. Each lists the current implementation Q1. Failure-signaling channelCurrent: on a reconstruction miss or an over-cap dimension the pass emits an Question: keep the silent-composable default only, or also wire up that Latest-research note: MLIR conversion/transform passes conventionally signal Q2. Reconstruction toleranceCurrent: a fixed host- Question: keep the fixed cap, or move to a dimension-scaled / eps-relative Latest-research note: Qiskit issue Q3. 5-qubit cap scope (confirming)Implemented exactly as you suggested: Question (optional): happy to leave this as a hard constant; flagging only in Latest-research note: QSD's leading cost is ~ Q4. Gate-count follow-up (Block-ZXZ)Current (measured from the lowered IR on the 8×8 test): the three top-level Question: interest in a follow-up for (a) a gate-optimal 3-CNOT KAK base case
Latest-research note: Block-ZXZ (Krol & Al-Ars, 2024) holds the record CNOT |
Command Bot: Processing... |
|
@thedaemon-wizard - Please comment on the issue #2242 so that it can be assigned to you. |
truncated for brevity
Thanks @thedaemon-wizard, for writing these up. My preference: Q1: Keep the default composable behavior. I do not think we need to add a strict Q2: I would not address this by simply loosening tolerance. I do think this PR needs to handle the advertised supported range more predictably. Q3: Keep the hard 5-qubit / dimension-32 cap for now. Q4: Gate-count work should be follow-up only. I would not add Block-ZXZ or a 3-CNOT KAK refactor in this PR; correctness and coverage of the QSD path should come first. |
khalatepradnya
left a comment
There was a problem hiding this comment.
Round#2. This is shaping up quite well. I think we can broaden the test coverage for 4q and 5q examples since this PR extends to those as well.
| // the terms of the Apache License 2.0 which accompanies this distribution. // | ||
| // ========================================================================== // | ||
|
|
||
| // RUN: cudaq-opt --unitary-synthesis %s | FileCheck %s |
There was a problem hiding this comment.
If you combine with the other passes to get final lowering, you can match the exact gates used in synth_kernel8 in the Python test
// RUN: cudaq-opt --unitary-synthesis --canonicalize --apply-op-specialization --aggressive-inlining %s | FileCheck %s
There was a problem hiding this comment.
Done. The RUN line is now the combined lowering pipeline
// RUN: cudaq-opt --unitary-synthesis --canonicalize --apply-op-specialization --aggressive-inlining %s | FileCheck %s
and the CHECK: lines were regenerated from that fully-lowered/inlined output so they match the flat gate sequence exactly.
Side effect of the atan2 base-case fix: three pre-existing files whose inputs are truncated-literal (not perfectly unitary) matrices — bell_pair.qke, random_unitary-0.qke, random_unitary-1.qke — had one ry angle constant shift in the ~1e-8 digits, because for a not-perfectly-unitary input atan2(|u01|,|u00|) differs from acos(|u00|) by exactly that ill-conditioning amount. Those three CHECK constants were updated to the new values; all 9 FileCheck tests in test/Transforms/UnitarySynthesis/ pass.
|
If you "Update branch" / align with main, the CI is more likely to not have cache misses and tends to go faster. |
Signed-off-by: thedamon-wizard <amon.koike@daemons.jp>
…ronger tests - completeUnitaryBasis: use Eigen::Index (no truncating cast); rename `defined` -> `definedCols`. - Reconstruction self-checks are always-on (assert removed). Decomposers carry a `valid` flag validated before any IR is emitted; on failure the pass leaves the custom op unchanged and emits an LLVM_DEBUG note (composable: no emitError, no signalPassFailure). - Cap QSD at 5 qubits (matrix dimension <= 32); larger power-of-two dimensions are left unchanged with a debug note. - Use llvm::popcount / llvm::countr_zero (llvm/ADT/bit.h) for the gray-code mux. - Rename random_unitary-3q.qke -> random_unitary-5.qke and replace CHECK-DAG with ordered CHECK lines locking the gray-code emission sequence. - Rewrite test_qsd_decomposition.py: full-precision rvs(_, random_state=13); add synth_kernel8 state-equality check; drop nearest_unitary; Toffoli execution stays covered by test_custom_operations.py::test_three_qubit_op. Signed-off-by: thedamon-wizard <amon.koike@daemons.jp>
…ests - OneQubitOpZYZ: derive the Y-rotation angle from `2*atan2(|u01|,|u00|)` instead of an `acos(|u00|)`/`asin(|u01|)` branch. acos/asin are ill-conditioned near their +/-1 endpoints and return NaN when the argument exceeds 1 by a rounding ULP -- which occurs for the near-degenerate controlled single-qubit sub-blocks produced deep in a recursive Quantum Shannon Decomposition. This made 4q/5q custom unitaries (e.g. cccx) synthesize to an incorrect circuit; atan2 is well-conditioned over the whole domain and needs no clamping. - Make the 2-qubit KAK base case composable: bidiagonalize / extractSU2FromSO4 clear a `bool &ok` instead of asserting, and an in-range reconstruction miss emits a warning and leaves the op unchanged (no emitError/signalPassFailure). - Add targettests/execution/custom_operation_cccx.cpp (4q C3X) and custom_operation_c4x.cpp (5q C4X), run on the default and Quantinuum-emulate targets, exercising the synthesis path end-to-end across the 3-5 qubit range. - random_unitary-5.qke: combine unitary-synthesis with op-specialization and aggressive-inlining in the RUN line so FileCheck locks the flat gate list; refresh three ZYZ constants affected by the atan2 angle formula. - Drop the 16x16 direct-matrix python test (did not exercise the pass); 4q/5q coverage now lives in the execution targettests above. Signed-off-by: thedamon-wizard <amon.koike@daemons.jp>
|
Thanks for the clear direction on the open questions, @khalatepradnya. Here's how each is handled in the latest push: Q1 (failure signaling). Kept the composable default ( Q2 (predictability, not loosening tolerance). Two complementary changes, both grounded in measurement rather than a blanket tolerance bump. (1) The root-cause (2) A principled, calibrated tolerance split for the n=5 borderline tail. The original single To be transparent: an earlier version of this comment claimed those n=5 bails reconstructed to ~0.2 and concluded the tolerance should stay fixed. That was a measurement error — the "forced emit" had used a throwaway Result with the split: n=3 and n=4 stay at 100%; n=5 improves from ~98.9% to 99.9% (999/1000), every emitted circuit correct (worst end-to-end 1.2e-6, zero wrong by >1e-5); the lone remaining outlier bails gracefully (visible warning, op unchanged, never a wrong circuit). I also extended the composable principle to the 2-qubit KAK base case ( Q3 (cap). Kept exactly as implemented: dimension ≤ 32 (≤ 5 qubits) is synthesized; larger power-of-two dimensions are left unchanged with a debug note. Q4 (gate count). Agreed — correctness and coverage first. No Block-ZXZ / 3-CNOT-KAK in this PR; noted as future work only (arXiv:2403.13692). Branch update. Rebased onto |
…5q synthesis The recursive Quantum Shannon Decomposition reused a single 1e-7 tolerance for both the input-unitarity contract and the always-on reconstruction self-checks. At 5 qubits, rare near-degenerate sub-blocks in the cosine-sine decomposition and multiplexor demultiplexing legitimately accumulate host- double reconstruction residuals up to ~8e-7 -- still producing a circuit that reproduces the target to <4e-7 end-to-end -- so the 1e-7 self-check spuriously bailed on ~1% of valid Haar-random 5-qubit unitaries, leaving the advertised range less predictable than intended. Split the tolerance into two purposes: - TOL (1e-7): the input-unitarity contract and pure numerical-safety guards (near-zero determinant division, Gram-Schmidt independence) stay tight. - RECON_TOL (1e-5): the reconstruction self-checks (bidiagonalize diagonality, SU2-from-SO4 verification, KAK/CSD/demultiplex reconstruction) use a tolerance calibrated to the measured correct-synthesis residual (~8e-7 worst) and ~4 orders below the genuinely-broken regime (O(0.1)). This is a calibration to measured floating-point accumulation, not a blanket loosening: the input contract is unchanged and the composable emitWarning/bail path still rejects broken results. Over 1000 Haar-random unitaries per size: n=3 and n=4 synthesize 100% (0 bail), n=5 now synthesizes 99.9% (was ~98.9%) with every emitted circuit correct (worst end-to-end error 1.2e-6, zero wrong); the rare residual outlier still bails gracefully. All 9 UnitarySynthesis FileCheck tests, the cccx/c4x/toffoli execution targettests (default and Quantinuum emulation), and the KAK/Euler/QSD Python tests pass unchanged -- the emitted gate sequences are identical; only the bail threshold moved. Signed-off-by: thedamon-wizard <amon.koike@daemons.jp>
Command Bot: Processing... |
[custom op] Support unitary synthesis for 3+ qubit operations (QSD)
Closes #2242
Summary
The
unitary-synthesisoptimization pass(
lib/Optimizer/Transforms/UnitarySynthesis.cpp) decomposes the matrix of acustom operation (
cudaq.register_operation/CUDAQ_REGISTER_OPERATION) into asequence of native gates. Until now it only handled 1-qubit (ZYZ) and
2-qubit (KAK) operations; any custom operation acting on 3 or more qubits
(e.g. a Toffoli) failed to legalize
quake.custom_unitary_constantwhen targetinghardware backends such as Quantinuum.
This PR adds a general recursive Quantum Shannon Decomposition (QSD) so that
a
2^n x 2^ncustom unitary can be synthesized for 3 ≤ n ≤ 5 (matrixdimension 8–32), including multi-controlled operations such as a 4-qubit
cccxand 5-qubit
c4x. The existingTwoQubitOpKAK(n = 2) andOneQubitOpZYZ(
n = 1) decomposers are reused as the recursion base cases; the 1-qubit basecase's Y-angle extraction is made numerically robust (
atan2of magnitudesinstead of
acos/asin) so the near-degenerate sub-blocks that arise deep inthe recursion are handled correctly. Per review, the range is capped at 5 qubits
(the
4^nCNOT growth makes 6q+ impractical); larger power-of-two dimensions areleft unchanged in a composable way (the pass emits a warning /
LLVM_DEBUGnote and does not modify the IR — no
emitError, nosignalPassFailure). Tohandle the advertised 3–5 qubit range predictably, the always-on
reconstruction self-checks use a tolerance (
RECON_TOL = 1e-5) that is separatefrom, and looser than, the tight input-unitarity contract (
TOL = 1e-7). Thesplit is calibrated to measured behavior, not chosen arbitrarily: a correctly
synthesized unitary reconstructs to ~1e-10 when well-conditioned, while rare
near-degenerate sub-blocks push the residual up to ~8e-7 even though the emitted
circuit still reproduces the target to <4e-7 end-to-end; a genuinely wrong
decomposition reconstructs to O(0.1).
RECON_TOLtherefore sits ~1 order abovethe measured correct-synthesis residual and ~4 orders below the failure regime.
With this split, n=3 and n=4 synthesize 100% and n=5 synthesizes 99.9% of
Haar-random unitaries (all emitted circuits correct); the rare residual outlier
that still exceeds
RECON_TOLis left unchanged with a warning rather thanemitting a wrong circuit (see Verification).
Algorithm
One level of QSD on an
n-qubit unitaryU(acting on a most-significantmultiplexor qubit
q0andn-1lower qubits) performs:Cosine-Sine Decomposition (CSD) (arXiv quant-ph/0404089)
U = blockDiag(L0, L1) · [[C, -S], [S, C]] · blockDiag(R0, R1).Eigen has no built-in CSD, so it is assembled from the SVD of the top-left
block. Degenerate singular values (e.g. for permutation operators like
Toffoli) are handled by deterministically completing the singular-vector
basis. The central
[[C,-S],[S,C]]factor is a uniformly-controlledRyonq0with angles2·atan2(s_k, c_k).Multiplexor demultiplexing (arXiv quant-ph/0406176)
blockDiag(A, B) = (I ⊗ V) · blockDiag(D, D†) · (I ⊗ W).A·B†is unitary (hence normal); its complex Schur form yields an orthonormaleigenbasis.
blockDiag(D, D†)is a uniformly-controlledRzonq0. This isapplied to both
blockDiag(R0,R1)andblockDiag(L0,L1), producing four(n-1)-qubit sub-unitaries and three angle vectors (Rz, Ry, Rz).Uniformly-controlled rotation emission via the optimal Möttönen gray-code
construction (arXiv quant-ph/0407010, quant-ph/0406176): a
k-controlledmultiplexed rotation is emitted as
2^krotations interleaved with exactly2^kCNOTs. The per-state angles are mapped to the gray-code rotation anglesthrough the Walsh-Hadamard-like transform, and the CNOT after rotation
itargets the control flipped between gray-code words
iandi+1.The four sub-unitaries are synthesized recursively; each factorization step is
exact, so no extra global-phase correction is required at the QSD level (the base
cases track their own global phase).
CNOT cost and the gray-code (SBM) optimization
emitMuxuses the optimal gray-code construction emitting exactly2^kCNOTsper
k-controlled multiplexor, instead of the2^{k+1}-2produced by a naiverecursive split (which inserts a redundant CNOT pair at every recursion
boundary). For a 3-qubit decomposition the three top-level multiplexors each have
k = 2controls, contributing3 x 4 = 12CNOTs — down from3 x 6 = 18. Thisis the Shende-Bullock-Markov adjacent-CNOT merge specialized to a single
uniformly-controlled rotation. Further reductions for general
nare left asfuture work: the Block-ZXZ construction (arXiv:2403.13692) currently holds the
lowest known CNOT count (e.g. 19 vs. 20 for
n = 3) and would be a naturalfollow-up baseline.
Files changed
lib/Optimizer/Transforms/UnitarySynthesis.cppCSDComponents,completeUnitaryBasis,cosineSineDecomposition,demultiplex, andNQubitOpQSD(with the optimal gray-codeemitMuxfor uniformly-controlled rotations); dispatch2^n(3 <= n <= 5) dimensions inCustomUnitaryPattern. Make the 1-qubit base case numerically robust: the Y-rotation angle is2·atan2(|u01|,|u00|)instead ofacos(|u00|)/asin(|u01|), which are ill-conditioned near ±1 and returnNaNfor the near-degenerate sub-blocks (controlled single-qubit gates) produced deep in the recursion — this is what makes multi-controlled ops like a 4-qubitcccxsynthesize correctly. Reconstruction self-checks are always-on and composable: on failure (or a power-of-two dimension above the 5-qubit cap) the pass leaves thequake.custom_unitary_constantop unchanged and emits a warning /LLVM_DEBUGnote instead of asserting or failing the passpython/tests/custom/test_qsd_decomposition.pyscipy.stats.unitary_group.rvs(8, random_state=13), transcribed at full double precision): builds a flatsynth_kernel8(the exactcudaq-opt --unitary-synthesisoutput) and asserts the synthesized state matches the direct custom-op state. (4q/5q coverage lives in the execution targettests below)test/Transforms/UnitarySynthesis/random_unitary-5.qke--unitary-synthesis --canonicalize --apply-op-specialization --aggressive-inliningpipeline, with orderedCHECK:lines locking in the fully-lowered gray-code multiplexor emission sequencetargettests/execution/custom_operation_cccx.cpp--target quantinuum --emulate;// CHECK: { 1110:1000 }targettests/execution/custom_operation_c4x.cpp// CHECK: { 11110:1000 }python/tests/backends/test_Quantinuum_LocalEmulation_kernel.pytest_3q_unitary_synthesisupdated from expecting aRuntimeErrorto expecting a successful Toffoli ("110")targettests/execution/custom_operation_toffoli.cpp--emulateRUN line now expects success (110) instead of the legalization failuretest/Transforms/UnitarySynthesis/{bell_pair,random_unitary-0,random_unitary-1}.qkeryangleCHECKconstant updated in each (shift in the ~1e-8 digits) — a consequence of the robustatan2base case on these truncated-literal (not perfectly unitary) inputsToffoli custom-op execution coverage is also provided by the existing
python/tests/custom/test_custom_operations.py::test_three_qubit_op(registers the8×8 permutation and samples
"110"); that file is not modified by this PR.Test results
test/Transforms/UnitarySynthesis/*.qkepass with
cudaq-optbuilt against LLVM 22, including the newrandom_unitary-5.qkeand the 8 pre-existing ZYZ/KAK tests (no regression).
8×8 and a power-of-two
dim = 64(6-qubit) op throughcudaq-opt --unitary-synthesis --debug-only=unitary-synthesisleaves thequake.custom_unitary_constantop unchanged and prints theLLVM_DEBUGnote,with no abort and no
signalPassFailure— the pass stays composable.emits 12
quake.x(3 multiplexors x 4), down from 18 with the naiverecursive split — measured directly from the lowered IR.
k = 1..4controls and bothRy/Rz, the net rotation applied for every control basis state matches thetarget angle to
< 1e-12, confirming the angle transform and control mapping.--target quantinuum --emulate:custom_operation_toffoli.cpp→110,custom_operation_cccx.cpp(4q C3X) →{ 1110:1000 },custom_operation_c4x.cpp(5q C4X) →{ 11110:1000 }(the Quantinuum path runs the
--unitary-synthesisQSD legalization; the 4q/5qmulti-controlled cases previously produced a wrong circuit and now pass).
nthrough
cudaq-opt --unitary-synthesis, with theRECON_TOL/TOLsplit):n=3 0/1000 and n=4 0/1000 bail (every random unitary synthesizes; worst
end-to-end reconstruction 3.7e-12 at n=3, 1.7e-8 at n=4). At n=5 999/1000
synthesize (99.9%); every emitted circuit is correct (worst end-to-end
reconstruction 1.2e-6, zero circuits wrong by more than 1e-5). The single
remaining n=5 outlier whose residual still exceeds
RECON_TOLbailsgracefully — visible warning, op left unchanged, never a wrong circuit.
Instrumenting the C++ Eigen residuals confirms the distribution that motivates
RECON_TOL: per-node reconstruction residual peaks in the CSD/demultiplexblocks at ~8e-7 (well-conditioned nodes are ~1e-10), whereas a genuinely wrong
decomposition reconstructs to O(0.1) — so
RECON_TOL = 1e-5accepts theborderline-but-correct tail while still rejecting broken output by ~4 orders of
margin. This is the calibrated bound, not a blanket loosening: the
input-unitarity contract and the numerical-safety guards (division-by-near-zero,
Gram-Schmidt independence) stay at the tight
TOL = 1e-7.pytest python/tests/custom/test_qsd_decomposition.py -v→ 1/1 PASS(8×8 random). The test asserts the synthesized
synth_kernel8state equals thedirect custom-op state (
np.allclose(..., atol=1e-6)) and both match thematrix' first column. Toffoli custom-op execution is covered by
test_custom_operations.py::test_three_qubit_op("110").pytest python/tests/backends/test_Quantinuum_LocalEmulation_kernel.py -k unitary_synthesis -v→ 3/3 PASS(
test_1q/2q/3q_unitary_synthesis);test_3q_unitary_synthesisnow expects a successful Toffoli(
"110") instead of the formerRuntimeError.pytest test_kak_decomposition.py test_euler_decomposition.py→ 5/5 PASS;all 9
UnitarySynthesis/*.qkeFileCheck tests pass (the 8 pre-existing ZYZ/KAKplus the new
random_unitary-5.qke; three pre-existing files had onery-angleCHECKconstant updated for the robustatan2base case).dim = 64(6-qubit) op fed throughcudaq-opt --unitary-synthesisis left unchanged withreturncode 0— noabort, no
signalPassFailure.AI usage disclosure
In accordance with the unitaryHACK AI policy, an AI assistant (Claude) was used as
a co-pilot for: brainstorming the CSD/demultiplexing algorithm structure,
drafting test scaffolding and docstrings, and cross-checking the math against the
referenced papers. The QSD math was independently prototyped in NumPy/SciPy and
all code was read, understood, and verified by a human contributor, who can
explain every part of the implementation. No unexecuted or hallucinated APIs were
committed; the implementation builds and passes the local test suite.
Sign-off
Commits are signed off per the Developer Certificate of Origin
(
git commit -s), as required byContributing.md.