Summary
This issue tracks the v1 design and implementation of a distributed, state-partitioned runtime for Mahout QDP.
The goal is to extend QDP beyond single-device direct state preparation toward multi-GPU / multi-node execution where a logical quantum state may be partitioned across devices.
This is intentionally more ambitious than sample-sharded batch execution. v1 focuses on establishing a clear distributed state model, partition scheme, execution semantics, and result-consumption semantics.
Scope
This issue covers:
- distributed state representation
- local/global qubit modeling
- state partitioning across devices
- multi-GPU and multi-node execution
- explicit gather/reduce semantics
- initial support for heterogeneous hardware
This issue does not aim to provide:
- a general-purpose distributed task framework
- a full distributed simulator
- full fault-tolerant state migration
- dynamic repartitioning in v1
- full optimizer/query-planner features
Core Execution Model
Logical Model
The runtime operates on a logical quantum state of global_qubits qubits.
That logical state may be partitioned across multiple devices and nodes. Each device owns one partition of the global amplitude space.
This means the execution unit in v1 is not a batch of independent samples. The execution unit is a partition of a single logical state.
Partition Scheme
v1 uses contiguous amplitude-block partitioning.
For a logical state of size 2^global_qubits:
- the state is divided into P partitions
- each partition owns a contiguous range of amplitudes
- each partition is assigned to one device
This keeps metadata, addressing, and gather behavior tractable.
v1 should initially prefer equal-size partitions.
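As a rough illustration of the addressing this implies (a sketch only; the function name and shape are placeholders, not an existing QDP API), each partition's amplitude range follows directly from global_qubits and the partition count:

```python
def partition_ranges(global_qubits: int, partition_count: int) -> list[tuple[int, int]]:
    """Contiguous, equal-size amplitude ranges as (offset, length) pairs.

    Assumes partition_count evenly divides 2**global_qubits, which the
    v1 constraint partition_count = 2**k guarantees.
    """
    total = 1 << global_qubits
    if total % partition_count != 0:
        raise ValueError("partition_count must evenly divide 2**global_qubits")
    length = total // partition_count
    return [(p * length, length) for p in range(partition_count)]

# Example: a 4-qubit state split across 4 devices.
# -> [(0, 4), (4, 4), (8, 4), (12, 4)]
print(partition_ranges(global_qubits=4, partition_count=4))
```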
Local / Global Qubits
v1 should explicitly distinguish:
- global_qubits: total logical qubits of the state
- local_qubits: qubits represented by data local to a partition
- partition_count: number of state partitions
If partition_count = 2^k, then:
global_qubits = local_qubits + k
This provides a consistent mapping between the logical state and partition-local storage.
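For example, a 30-qubit logical state split into partition_count = 4 partitions (k = 2) gives local_qubits = 28, so each partition stores 2^28 amplitudes.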
Distributed State Representation
Suggested conceptual model:
- DistributedStateHandle: state_id, global_qubits, partition_scheme, partition_count, layout, dtype, ready_for_consumer
- StatePartitionRef: partition_id, node_id, device_id, offset, length, local_qubits, storage_handle, ownership_epoch
The runtime should treat the distributed state as a first-class object rather than as an implicit collection of buffers.
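A minimal sketch of these two records, assuming a Python runtime layer (field names follow the list above; the types and comments are illustrative, not an existing Mahout QDP API):

```python
from dataclasses import dataclass

@dataclass
class StatePartitionRef:
    partition_id: int
    node_id: str
    device_id: int
    offset: int             # first global amplitude index owned by this partition
    length: int             # number of amplitudes stored locally (2**local_qubits)
    local_qubits: int
    storage_handle: object  # opaque handle to the device-resident buffer
    ownership_epoch: int    # bumped whenever partition ownership changes

@dataclass
class DistributedStateHandle:
    state_id: str
    global_qubits: int
    partition_scheme: str   # v1: contiguous amplitude-block partitioning
    partition_count: int
    layout: str
    dtype: str              # e.g. "complex64"
    ready_for_consumer: bool
```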
Heterogeneous GPU Support
Heterogeneous hardware support is required, but v1 should keep it conservative.
Each worker should report:
- GPU model
- usable VRAM
- measured warmup throughput
- max safe allocation
- backend / host penalty if relevant
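For concreteness, the capability report could be as small as the following sketch (field names and units are assumptions for illustration, not a defined protocol):

```python
from dataclasses import dataclass

@dataclass
class WorkerCapabilities:
    gpu_model: str            # as reported by the vendor driver
    usable_vram_bytes: int    # VRAM usable after driver/runtime overhead
    warmup_throughput: float  # measured by a small warmup workload
    max_safe_allocation: int  # largest single buffer the worker will accept, in bytes
    host_penalty: float = 0.0 # optional backend / host transfer penalty factor
```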
For v1:
- placement may be heterogeneous
- partition sizes should remain equal initially
- stronger devices may be assigned more states or jobs, but not larger partitions of a single state
This avoids overcomplicating address mapping and collectives in the first version.
Variable-size partitions can be considered later.
Result Semantics
v1 should make the result semantics explicit.
1. GatherFullState
All partitions are gathered into one complete state.
Use when:
- downstream consumer requires a full state in one place
- validation/debugging needs a reconstructed state
- export or comparison requires a monolithic state
2. CollectiveMetricReduce
Each partition computes local partial outputs and the runtime reduces metrics or summaries.
Use when:
- downstream computation naturally supports partial aggregation
- outputs are scalars or small tensors
- full-state gather is unnecessary
3. PartitionLocalConsume
Only allowed if the downstream stage is explicitly partition-aware.
Use when:
- the next stage can consume distributed state partitions directly
- no gather is required
v1 should not assume that arbitrary downstream consumers can operate on partitioned states.
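A minimal sketch of how a consumer might declare which of the three modes it needs (the enum and request shape are assumptions for illustration, not an existing API):

```python
from dataclasses import dataclass
from enum import Enum, auto

class ResultMode(Enum):
    GATHER_FULL_STATE = auto()         # reconstruct the full state in one place
    COLLECTIVE_METRIC_REDUCE = auto()  # reduce partition-local partials into metrics
    PARTITION_LOCAL_CONSUME = auto()   # hand partitions to a partition-aware stage

@dataclass
class ResultRequest:
    mode: ResultMode
    # For COLLECTIVE_METRIC_REDUCE: names of the metrics to reduce.
    metrics: tuple[str, ...] = ()

# A consumer that only needs scalars avoids a full gather:
request = ResultRequest(mode=ResultMode.COLLECTIVE_METRIC_REDUCE, metrics=("norm",))
```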
v1 Constraints
To keep the first version implementable, v1 should enforce:
- contiguous amplitude-block partitions
- equal-size partitions
- no dynamic repartitioning
- no partition migration during execution
- no full fault-tolerant recovery of live distributed state
- explicit gather or partition-aware downstream consumption
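A validation helper along these lines could enforce the partitioning constraints when a partition plan is registered (a sketch under the v1 assumptions above, not an existing QDP check):

```python
def validate_v1_plan(global_qubits: int, partitions: list[tuple[int, int]]) -> None:
    """Check the v1 constraints: contiguous, equal-size blocks covering the amplitude space.

    `partitions` is a list of (offset, length) pairs in partition order.
    """
    total = 1 << global_qubits
    count = len(partitions)
    if count == 0 or (count & (count - 1)) != 0:
        raise ValueError("v1 expects partition_count to be a power of two")
    expected_length = total // count
    expected_offset = 0
    for offset, length in partitions:
        if length != expected_length:
            raise ValueError("v1 requires equal-size partitions")
        if offset != expected_offset:
            raise ValueError("v1 requires contiguous amplitude blocks")
        expected_offset += length
    if expected_offset != total:
        raise ValueError("partitions must cover the full amplitude space")
```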
Implementation Plan
Phase 1: Single-Node Multi-GPU State Partitioning
- define DistributedStateHandle and StatePartitionRef
- implement contiguous equal-size partition layout
- add local/global qubit metadata
- support GatherFullState
- validate correctness against single-device QDP
Phase 2: Multi-Node State-Partitioned Runtime
- add coordinator and worker registration
- distribute partitions across nodes
- add partition metadata tracking
- support CollectiveMetricReduce
Phase 3: Partition-Aware Scheduling
- use device capability reports for placement
- add topology-aware placement hooks
- define which downstream stages are partition-aware
- improve failure reporting and reassignment policy
Testing Plan
Correctness
- gathered full state matches single-device output
- partition offsets and lengths are correct
- local/global qubit mapping is consistent
- partitioned execution preserves normalization and expected values
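A reference check for the first two items could look like the following sketch (NumPy stands in for host-side buffers; the reference and gathered arrays are assumed to come from the single-device QDP path and the distributed gather, respectively):

```python
import numpy as np

def check_gather_matches_reference(reference: np.ndarray, gathered: np.ndarray) -> None:
    """Gathered full state must match the single-device reference amplitude-for-amplitude."""
    assert gathered.shape == reference.shape
    # Same global ordering, same values (up to floating-point tolerance).
    np.testing.assert_allclose(gathered, reference, rtol=1e-6, atol=1e-9)
    # Partitioned execution must preserve normalization.
    np.testing.assert_allclose(np.linalg.norm(gathered), 1.0, rtol=1e-6)
```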
Runtime Semantics
- partition ownership is tracked correctly
- gather reconstructs correct global ordering
- metric reduction matches single-device reference
- partition-aware downstream stages consume partitions correctly
Heterogeneous Hardware
- workers report capabilities consistently
- placement behaves correctly across mixed GPUs
- weaker devices are not assigned unsafe allocations
Performance
- single GPU baseline
- single node multi-GPU partitioned execution
- two-node partitioned execution
- throughput, latency, transfer volume, gather cost, reduction cost