Summary
This issue tracks the v1 design and implementation of a distributed, state-partitioned runtime for Mahout QDP.
The goal is to extend QDP beyond single-device direct state preparation toward multi-GPU / multi-node execution where a logical quantum state may be partitioned across devices.
This is intentionally more ambitious than sample-sharded batch execution. v1 focuses on establishing a clear distributed state model, partition scheme, execution semantics, and result-consumption semantics.
Scope
This issue covers:
- distributed state representation
- local/global qubit modeling
- state partitioning across devices
- multi-GPU and multi-node execution
- explicit gather/reduce semantics
- initial support for heterogeneous hardware
This issue does not aim to provide:
- a general-purpose distributed task framework
- a full distributed simulator
- full fault-tolerant state migration
- dynamic repartitioning in v1
- full optimizer/query-planner features
Core Execution Model
Logical Model
The runtime operates on a logical quantum state of global_qubits qubits.
That logical state may be partitioned across multiple devices and nodes. Each device owns one partition of the global amplitude space.
This means the execution unit in v1 is not a batch of independent samples. The execution unit is a partition of a single logical state.
Partition Scheme
v1 uses contiguous amplitude-block partitioning.
For a logical state of size 2^global_qubits:
- the state is divided into P partitions
- each partition owns a contiguous range of amplitudes
- each partition is assigned to one device
This keeps metadata, addressing, and gather behavior tractable.
v1 should initially prefer equal-size partitions.
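As a rough illustration of the addressing this implies (a sketch only; the function name and shape are placeholders, not an existing QDP API), each partition's amplitude range follows directly from global_qubits and the partition count:

```python
def partition_ranges(global_qubits: int, partition_count: int) -> list[tuple[int, int]]:
    """Contiguous, equal-size amplitude ranges as (offset, length) pairs.

    Assumes partition_count evenly divides 2**global_qubits, which the
    v1 constraint partition_count = 2**k guarantees.
    """
    total = 1 << global_qubits
    if total % partition_count != 0:
        raise ValueError("partition_count must evenly divide 2**global_qubits")
    length = total // partition_count
    return [(p * length, length) for p in range(partition_count)]

# Example: a 4-qubit state split across 4 devices.
# -> [(0, 4), (4, 4), (8, 4), (12, 4)]
print(partition_ranges(global_qubits=4, partition_count=4))
```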
Local / Global Qubits
v1 should explicitly distinguish:
- global_qubits: total logical qubits of the state
- local_qubits: qubits represented by data local to a partition
- partition_count: number of state partitions
If partition_count = 2^k, then:
global_qubits = local_qubits + k
This provides a consistent mapping between the logical state and partition-local storage.
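For example, a 30-qubit logical state split into partition_count = 4 partitions (k = 2) gives local_qubits = 28, so each partition stores 2^28 amplitudes.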
Distributed State Representation
Suggested conceptual model:
- DistributedStateHandle: state_id, global_qubits, partition_scheme, partition_count, layout, dtype, ready_for_consumer
- StatePartitionRef: partition_id, node_id, device_id, offset, length, local_qubits, storage_handle, ownership_epoch
The runtime should treat the distributed state as a first-class object rather than as an implicit collection of buffers.
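A minimal sketch of these two records, assuming a Python runtime layer (field names follow the list above; the types and comments are illustrative, not an existing Mahout QDP API):

```python
from dataclasses import dataclass

@dataclass
class StatePartitionRef:
    partition_id: int
    node_id: str
    device_id: int
    offset: int             # first global amplitude index owned by this partition
    length: int             # number of amplitudes stored locally (2**local_qubits)
    local_qubits: int
    storage_handle: object  # opaque handle to the device-resident buffer
    ownership_epoch: int    # bumped whenever partition ownership changes

@dataclass
class DistributedStateHandle:
    state_id: str
    global_qubits: int
    partition_scheme: str   # v1: contiguous amplitude-block partitioning
    partition_count: int
    layout: str
    dtype: str              # e.g. "complex64"
    ready_for_consumer: bool
```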
Heterogeneous GPU Support
Heterogeneous hardware support is required, but v1 should keep it conservative.
Each worker should report:
- GPU model
- usable VRAM
- measured warmup throughput
- max safe allocation
- backend / host penalty if relevant
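For concreteness, the capability report could be as small as the following sketch (field names and units are assumptions for illustration, not a defined protocol):

```python
from dataclasses import dataclass

@dataclass
class WorkerCapabilities:
    gpu_model: str            # as reported by the vendor driver
    usable_vram_bytes: int    # VRAM usable after driver/runtime overhead
    warmup_throughput: float  # measured by a small warmup workload
    max_safe_allocation: int  # largest single buffer the worker will accept, in bytes
    host_penalty: float = 0.0 # optional backend / host transfer penalty factor
```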
For v1:
- placement may be heterogeneous
- partition sizes should remain equal initially
- stronger devices may be assigned more states or jobs, but not larger partitions of a single state
This avoids overcomplicating address mapping and collectives in the first version.
Variable-size partitions can be considered later.
Result Semantics
v1 should make the result semantics explicit.
1. GatherFullState
All partitions are gathered into one complete state.
Use when:
- downstream consumer requires a full state in one place
- validation/debugging needs a reconstructed state
- export or comparison requires a monolithic state
2. CollectiveMetricReduce
Each partition computes local partial outputs and the runtime reduces metrics or summaries.
Use when:
- downstream computation naturally supports partial aggregation
- outputs are scalars or small tensors
- full-state gather is unnecessary
3. PartitionLocalConsume
Only allowed if the downstream stage is explicitly partition-aware.
Use when:
- the next stage can consume distributed state partitions directly
- no gather is required
v1 should not assume that arbitrary downstream consumers can operate on partitioned states.
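A minimal sketch of how a consumer might declare which of the three modes it needs (the enum and request shape are assumptions for illustration, not an existing API):

```python
from dataclasses import dataclass
from enum import Enum, auto

class ResultMode(Enum):
    GATHER_FULL_STATE = auto()         # reconstruct the full state in one place
    COLLECTIVE_METRIC_REDUCE = auto()  # reduce partition-local partials into metrics
    PARTITION_LOCAL_CONSUME = auto()   # hand partitions to a partition-aware stage

@dataclass
class ResultRequest:
    mode: ResultMode
    # For COLLECTIVE_METRIC_REDUCE: names of the metrics to reduce.
    metrics: tuple[str, ...] = ()

# A consumer that only needs scalars avoids a full gather:
request = ResultRequest(mode=ResultMode.COLLECTIVE_METRIC_REDUCE, metrics=("norm",))
```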
v1 Constraints
To keep the first version implementable, v1 should enforce:
- contiguous amplitude-block partitions
- equal-size partitions
- no dynamic repartitioning
- no partition migration during execution
- no full fault-tolerant recovery of live distributed state
- explicit gather or partition-aware downstream consumption
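A validation helper along these lines could enforce the partitioning constraints when a partition plan is registered (a sketch under the v1 assumptions above, not an existing QDP check):

```python
def validate_v1_plan(global_qubits: int, partitions: list[tuple[int, int]]) -> None:
    """Check the v1 constraints: contiguous, equal-size blocks covering the amplitude space.

    `partitions` is a list of (offset, length) pairs in partition order.
    """
    total = 1 << global_qubits
    count = len(partitions)
    if count == 0 or (count & (count - 1)) != 0:
        raise ValueError("v1 expects partition_count to be a power of two")
    expected_length = total // count
    expected_offset = 0
    for offset, length in partitions:
        if length != expected_length:
            raise ValueError("v1 requires equal-size partitions")
        if offset != expected_offset:
            raise ValueError("v1 requires contiguous amplitude blocks")
        expected_offset += length
    if expected_offset != total:
        raise ValueError("partitions must cover the full amplitude space")
```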
Implementation Plan
Phase 1: Single-Node Multi-GPU State Partitioning
- define DistributedStateHandle and StatePartitionRef
- implement contiguous equal-size partition layout
- add local/global qubit metadata
- support GatherFullState
- validate correctness against single-device QDP
Phase 2: Multi-Node State-Partitioned Runtime
- add coordinator and worker registration
- distribute partitions across nodes
- add partition metadata tracking
- support CollectiveMetricReduce
Phase 3: Partition-Aware Scheduling
- use device capability reports for placement
- add topology-aware placement hooks
- define which downstream stages are partition-aware
- improve failure reporting and reassignment policy
Testing Plan
Correctness
- gathered full state matches single-device output
- partition offsets and lengths are correct
- local/global qubit mapping is consistent
- partitioned execution preserves normalization and expected values
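A reference check for the first two items could look like the following sketch (NumPy stands in for host-side buffers; the reference and gathered arrays are assumed to come from the single-device QDP path and the distributed gather, respectively):

```python
import numpy as np

def check_gather_matches_reference(reference: np.ndarray, gathered: np.ndarray) -> None:
    """Gathered full state must match the single-device reference amplitude-for-amplitude."""
    assert gathered.shape == reference.shape
    # Same global ordering, same values (up to floating-point tolerance).
    np.testing.assert_allclose(gathered, reference, rtol=1e-6, atol=1e-9)
    # Partitioned execution must preserve normalization.
    np.testing.assert_allclose(np.linalg.norm(gathered), 1.0, rtol=1e-6)
```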
Runtime Semantics
- partition ownership is tracked correctly
- gather reconstructs correct global ordering
- metric reduction matches single-device reference
- partition-aware downstream stages consume partitions correctly
Heterogeneous Hardware
- workers report capabilities consistently
- placement behaves correctly across mixed GPUs
- weaker devices are not assigned unsafe allocations
Performance
- single GPU baseline
- single node multi-GPU partitioned execution
- two-node partitioned execution
- throughput, latency, transfer volume, gather cost, reduction cost