Self-review: gaps, bugs, and feature ideas (P0/P1/P2 + architecture)

# Self-review: gaps, bugs, and feature ideas

A critical agent-perspective audit of the repo after the initial flurry of features. Grouped by priority. Each item names the file/area and a concrete next step. Tick boxes as we land fixes.

## Summary

| Priority | Count | Rough effort |
|---|---|---|
| 🟥 P0 — must fix | 3 | ~6 h |
| 🟧 P1 — should fix | 5 | ~12 h |
| 🟨 P2 — nice to have | 7 | ~30 h |
| 🟦 Architecture thoughts | 3 | open-ended |

---

## 🟥 P0 — must fix

These risk data loss, broken end-user experience, or future bugs we can't catch with current tests.

- [ ] **`apply.py` is 870 lines — split it**
  - Currently mixes docker prefix detection, image build/probe, MIG mgmt, clock/power/persistence mgmt, `up()`, `down()`, `reset()`, `list_cosplay_containers()`, `host_gpu_status()`.
  - Suggested split:
    ```
    gpu_cosplay/
      _docker.py     # _docker(), _container_exists, _image_*, list_cosplay_containers
      _gpu.py        # _enable_mig, _lock_clock, _set_power, _query_persistence_mode
      apply.py       # up() / down() / reset() — orchestration only
      status.py      # host_gpu_status, list_cosplay_containers reports
    ```
  - Each file < 200 lines, single responsibility.

- [ ] **`state.json` has no file lock — concurrent writes corrupt it**
  - `state.add()` does `load → modify → save` without serialization. Two parallel `gpu-cosplay up` (very likely in cluster scheduling scenarios) race → silent entry loss → reset can't find the orphan.
  - Fix: wrap `_load + modify + _save` in `fcntl.flock(LOCK_EX)`. ~15 lines.

- [ ] **Zero unit tests for `gpu_cosplay_runtime.py` and `nvidia-smi` shim**
  - Runtime hook (269 lines) + shim (326 lines) = 600 lines of regex + monkey-patch code, entirely uncovered. The column-alignment bug, the pynvml mid-load bug, and the fraction-set-too-early bug were each caught only by E2E on a real H200.
  - Fix: add two test files:
    - [ ] `tests/test_shim_rewrite.py` — feed real `nvidia-smi` output snapshots in, assert name / MiB / column widths come out right.
    - [ ] `tests/test_runtime_patches.py` — mock `sys.modules['torch']` / `sys.modules['pynvml']`, verify `_patch_torch` / `_patch_pynvml` behavior.
  - Both runnable on a GPU-less CI host.

## 🟧 P1 — should fix

Friction and maintainability problems. Not blocking but the kind of debt that compounds.

- [ ] **Half-migrated naming: `Card` / `card_key` / `find_card` vs user-facing "GPU"**
  - User strings all say "GPU"; internals still say "Card". Confusing for future agents reading the code.
  - Two options: (a) document the convention in AGENTS.md and accept it; (b) one-shot rename with state.json migration (we have `test_session_compat.py` to cover legacy state).

- [ ] **`plan.py` has hardcoded host arch / SM / BW tables**
  - `_full_sm_count`, `_host_total_bw`, `_host_arch` are dicts inside Python. Every new host generation (B100/B200/GB200/R200) requires editing three places.
  - Fix: move to `gpu_cosplay/data/host_gpus.yaml`, mirror the structure of `gpus.yaml`.

- [ ] **`apply.up()` is a monolithic 150-line try/except**
  - Forward flow and rollback flow are interleaved. Adding a new mutation requires editing both branches.
  - Suggested refactor: `_RollbackStack` context manager.
    ```python
    with _RollbackStack() as rb:
        _set_persistence(gpu.index, True)
        rb.push(lambda: _set_persistence(gpu.index, original_persistence))
        _set_power(...)
        rb.push(lambda: _set_power(..., original_power))
        # ...
        rb.commit()  # success — don't roll back
    ```
  - Forward logic drops to ~30 lines; rollback mirrors it automatically.

- [ ] **Runtime hook mutates `builtins.__import__` globally**
  - If user code / IPython / debugpy also patches `__import__`, last writer wins. We'd win but break theirs.
  - Fix: preserve the previous hook, chain it. Or use `sys.meta_path` finder which is the proper extension point.

- [ ] **No integration test runs `gpu-cosplay verify` end-to-end**
  - CI runs lint + unit + docker build smoke. Nothing exercises the runtime hooks or the shim against a real GPU.
  - Options:
    - [ ] Self-hosted GH runner with NVIDIA T4/A10. Nightly workflow: spin up → `gpu-cosplay verify` → all 18 checks must PASS.
    - [ ] Stopgap: `tests/test_verify_script.py` that mocks `subprocess.run` to validate the verify script's PASS/FAIL/SKIP logic.

## 🟨 P2 — nice to have

Polish, ergonomics, and ecosystem.

- [ ] **No structured logging**
  - Scattered `print(...)` everywhere. No timestamps, no level filtering. Use stdlib `logging`; expose `--verbose` to drop to DEBUG.

- [ ] **No pre-built image on `ghcr.io/deepghs/gpu-cosplay`**
  - First `gpu-cosplay up` builds locally for ~8 min. CI already builds the image; add a push step + GHCR login + tag with git SHA and `:latest`.

- [ ] **No multi-GPU cosplay**
  - Many research questions need a 3090×2-style setup. Single-process multi-device is feasible via multiple MIG slices in the same container (NCCL P2P won't work — that's a known limit).
  - Surface: `gpu-cosplay up 3090 --num 2`; container sees `CUDA_VISIBLE_DEVICES=uuid1,uuid2`; runtime hook applies memory_fraction per device.

- [ ] **No `gpu-cosplay benchmark`**
  - `verify` checks identity disguise. We also need a command that runs a standard FP32/BF16/bandwidth microbench, compares to the target's datasheet, and prints the delta table. Quantifies "this cosplay is 33% slow on BF16 vs real 3090".
  - Most of the code exists in `examples/bench.py`; just needs a CLI wrapper.

- [ ] **Missing CONTRIBUTING.md and SECURITY.md**
  - AGENTS.md serves AI agents but human contributors who fork the repo don't find the conventions. Extract the "how to add a new card / a new host family / a new verify check" sections from AGENTS.md into CONTRIBUTING.md.

- [ ] **`examples/` only contains `bench.py`**
  - Add at least:
    - [ ] `examples/sd_inference.py` — Stable Diffusion inference (3090 vs 4090 cosplay comparison)
    - [ ] `examples/llm_oom.py` — load a 13B model and assert OOM behavior matches target VRAM
    - [ ] `examples/lora_train.py` — LoRA training on a consumer-card cosplay

- [ ] **"Hide MIG" mode incomplete**
  - `torch.cuda.get_device_properties(0).multi_processor_count` still returns the MIG slice's real SM count (32 for `2g.35gb`) instead of the target's (e.g. 82 for RTX 3090). NSight and other profilers read this.
  - Fix: extend `_CosplayDeviceProperties` proxy to lie about `multi_processor_count`, `major`/`minor`, etc.

- [ ] **No GPU spec auto-fetch**
  - When a new GPU launches, someone manually scrapes the whitepaper into `gpus.yaml`. Could implement `gpu-cosplay scrape rtx_5090 --from epoch-ai` to automate.

## 🟦 Architecture thoughts (not bugs, but worth discussing)

- [ ] **Cosplay is a single-host concept. Cluster-scale deployment isn't designed.**
  - Picture an 8-GPU H200 node with 56 × 1g.18gb slices serving 20 team members. Current CLI is single-user; no reservations / quotas / queues / GC.
  - Natural evolution: SLURM integration, K8s operator, central scheduler, web dashboard, 24h idle-session GC.
  - Not the repo's core mission, but a likely next layer.

- [ ] **Fidelity loss is undocumented quantitatively**
  - README has prose like "BF16 TC will be too strong". No numbers.
  - Should ship `docs/fidelity.md`: per (target × host) pair, run standard workloads (GEMM, SD inference, LLM forward), table real vs datasheet delta. Most convincing "honest claim".

- [ ] **Reverse direction: simulate stronger features on weaker hardware?**
  - Currently we throttle strong hardware to look weak. Could we go the other way? "I have an RTX 4090, want to develop FP8 training as if on H100" — NVIDIA's PTX JIT lets FP8 software-emulate via FP16. If supported, the tool becomes "cosplay any direction".
  - Research value is high. Not P0/P1 but worth scoping.

---

## Process notes

- All numbers above are from `wc -l` of the current `main` (commit `b6fe07d`) plus a manual read-through.
- Priority is the reviewer's call, not a contract. Argue if you disagree.
- Anything marked 🟥 should land before the next round of feature additions; otherwise we keep digging the hole.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Self-review: gaps, bugs, and feature ideas (P0/P1/P2 + architecture) #1

Self-review: gaps, bugs, and feature ideas

Summary

🟥 P0 — must fix

🟧 P1 — should fix

🟨 P2 — nice to have

🟦 Architecture thoughts (not bugs, but worth discussing)

Process notes

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Priority	Count	Rough effort
🟥 P0 — must fix	3	~6 h
🟧 P1 — should fix	5	~12 h
🟨 P2 — nice to have	7	~30 h
🟦 Architecture thoughts	3	open-ended

Self-review: gaps, bugs, and feature ideas (P0/P1/P2 + architecture) #1

Description

Self-review: gaps, bugs, and feature ideas

Summary

🟥 P0 — must fix

🟧 P1 — should fix

🟨 P2 — nice to have

🟦 Architecture thoughts (not bugs, but worth discussing)

Process notes

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions