Self-review: gaps, bugs, and feature ideas
A critical agent-perspective audit of the repo after the initial flurry of features. Grouped by priority. Each item names the file/area and a concrete next step. Tick boxes as we land fixes.
Summary
| Priority | Count | Rough effort |
| --- | --- | --- |
| 🟥 P0 — must fix | 3 | ~6 h |
| 🟧 P1 — should fix | 5 | ~12 h |
| 🟨 P2 — nice to have | 8 | ~30 h |
| 🟦 Architecture thoughts | 3 | open-ended |
🟥 P0 — must fix
These risk data loss, broken end-user experience, or future bugs we can't catch with current tests.
apply.py is 870 lines: split it
The module currently holds up(), down(), reset(), list_cosplay_containers(), and host_gpu_status() together.
state.json has no file lock — concurrent writes corrupt it
state.add() does load → modify → save without serialization. Two parallel gpu-cosplay up invocations (very likely in cluster scheduling scenarios) race, an entry is silently lost, and reset can't find the orphan.
Fix: wrap _load + modify + _save in fcntl.flock(LOCK_EX). ~15 lines.
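One way to serialize the load → modify → save cycle is an exclusive fcntl.flock held for the whole read-modify-write. The sketch below is a minimal illustration, not the repo's code: locked_state, add, the separate .lock file, and the flat JSON layout are all assumptions made for the example.

```python
import fcntl
import json
import os
from contextlib import contextmanager


@contextmanager
def locked_state(path):
    """Yield the state dict under an exclusive lock; save it on clean exit."""
    # Hypothetical layout: a sibling lock file, so the data file itself
    # can be replaced atomically while the lock is held.
    with open(path + ".lock", "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until other writers finish
        try:
            try:
                with open(path) as f:
                    state = json.load(f)
            except FileNotFoundError:
                state = {}
            yield state
            # Reached only if the caller's block did not raise.
            tmp_path = path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump(state, f, indent=2)
            os.replace(tmp_path, path)  # atomic rename on POSIX
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def add(path, key, entry):
    """Hypothetical stand-in for state.add(): one locked read-modify-write."""
    with locked_state(path) as state:
        state[key] = entry
```

fcntl is POSIX-only, and flock is advisory, so every writer has to go through the same helper for the lock to mean anything.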
Zero unit tests for gpu_cosplay_runtime.py and nvidia-smi shim
Runtime hook (269 lines) + shim (326 lines) ≈ 600 lines of regex + monkey-patch code, entirely uncovered. The column-alignment bug, the pynvml mid-load bug, and the fraction-set-too-early bug were each caught only by E2E on a real H200.
Fix: add two test files:
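tests/test_shim_rewrite.py: feed real nvidia-smi output snapshots in, assert name / MiB / column widths come out right.
tests/test_runtime_patches.py: mock sys.modules['torch'] / sys.modules['pynvml'], verify _patch_torch / _patch_pynvml behavior.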
🟧 P1 — should fix
Friction and maintainability problems. Not blocking, but the kind of debt that compounds.
Half-migrated naming: Card / card_key / find_card vs user-facing "GPU"
User strings all say "GPU"; internals still say "Card". Confusing for future agents reading the code.
Two options: (a) document the convention in AGENTS.md and accept it; (b) one-shot rename with state.json migration (we have test_session_compat.py to cover legacy state).
plan.py has hardcoded host arch / SM / BW tables
_full_sm_count, _host_total_bw, _host_arch are dicts inside Python. Every new host generation (B100/B200/GB200/R200) requires editing three places.
Fix: move to gpu_cosplay/data/host_gpus.yaml, mirror the structure of gpus.yaml.
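As a rough sketch of what a single host table could look like once parsed, the snippet below replaces three parallel dicts with one record per host. The key names (arch, full_sm_count, total_bw_gb_s), the HostGpu class, and load_host_gpus are hypothetical. The dict stands in for the result of loading host_gpus.yaml, and the H100 SXM / H200 figures are the public datasheet values.

```python
from dataclasses import dataclass

# Stand-in for the mapping a YAML loader would return for host_gpus.yaml.
# Key names are assumptions; adding a host generation means adding one entry.
HOST_GPUS_RAW = {
    "h100_sxm": {"arch": "hopper", "full_sm_count": 132, "total_bw_gb_s": 3350.0},
    "h200": {"arch": "hopper", "full_sm_count": 132, "total_bw_gb_s": 4800.0},
}


@dataclass(frozen=True)
class HostGpu:
    key: str
    arch: str
    full_sm_count: int
    total_bw_gb_s: float


def load_host_gpus(raw):
    """Build one immutable record per host; unknown or missing fields raise TypeError."""
    return {key: HostGpu(key=key, **fields) for key, fields in raw.items()}


HOST_GPUS = load_host_gpus(HOST_GPUS_RAW)


def host_total_bw(key):
    """Hypothetical replacement for a lookup in the _host_total_bw dict."""
    return HOST_GPUS[key].total_bw_gb_s
```

Because the dataclass constructor rejects missing or misspelled fields, an incomplete entry fails at load time rather than at the first lookup.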
apply.up() is a monolithic 150-line try/except
Forward flow and rollback flow are interleaved. Adding a new mutation requires editing both branches.
Fix: a _RollbackStack context manager:

```python
with _RollbackStack() as rb:
    _set_persistence(gpu.index, True)
    rb.push(lambda: _set_persistence(gpu.index, original_persistence))
    _set_power(...)
    rb.push(lambda: _set_power(..., original_power))
    # ...
    rb.commit()  # success: don't roll back
```
Forward logic drops to ~30 lines; rollback mirrors it automatically.
Runtime hook mutates builtins.__import__ globally
If user code / IPython / debugpy also patches __import__, last writer wins. We'd win but break theirs.
Fix: preserve the previous hook and chain to it. Or use a sys.meta_path finder, which is the proper extension point.
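A sketch of the sys.meta_path route, under the assumption that the runtime hook only needs to run a callback right after specific modules finish loading. PostImportFinder and _PatchingLoader are hypothetical names, and the re-entrancy guard is deliberately simple (not thread-safe).

```python
import importlib.abc
import importlib.util
import sys


class _PatchingLoader(importlib.abc.Loader):
    """Wrap the real loader and call a hook once the module has executed."""

    def __init__(self, inner, hook):
        self._inner = inner
        self._hook = hook

    def create_module(self, spec):
        return self._inner.create_module(spec)

    def exec_module(self, module):
        self._inner.exec_module(module)
        self._hook(module)  # the module is fully initialized at this point

    def __getattr__(self, name):
        # Forward anything else (get_data, is_package, ...) to the real loader.
        return getattr(self._inner, name)


class PostImportFinder(importlib.abc.MetaPathFinder):
    """Leave builtins.__import__ alone; patch selected modules after load."""

    def __init__(self, hooks):
        self._hooks = dict(hooks)  # module name -> callable(module)
        self._busy = set()         # names currently being resolved by us

    def find_spec(self, fullname, path=None, target=None):
        if fullname not in self._hooks or fullname in self._busy:
            return None
        self._busy.add(fullname)
        try:
            # Ask the remaining finders; we return None while busy.
            spec = importlib.util.find_spec(fullname)
        finally:
            self._busy.discard(fullname)
        if spec is None or spec.loader is None:
            return None
        spec.loader = _PatchingLoader(spec.loader, self._hooks[fullname])
        return spec
```

Installing it would look like sys.meta_path.insert(0, PostImportFinder({"torch": patch_fn})), where patch_fn is whatever the runtime currently does to the module. Other tools that patch __import__ or add their own finders keep working, because nothing global is replaced.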
No integration test runs gpu-cosplay verify end-to-end
CI runs lint + unit + docker build smoke. Nothing exercises the runtime hooks or the shim against a real GPU.
Options:
Self-hosted GH runner with NVIDIA T4/A10. Nightly workflow: spin up → gpu-cosplay verify → all 18 checks must PASS.
Stopgap: tests/test_verify_script.py that mocks subprocess.run to validate the verify script's PASS/FAIL/SKIP logic.
🟨 P2 — nice to have
Polish, ergonomics, and ecosystem.
No structured logging
Scattered print(...) everywhere. No timestamps, no level filtering. Use stdlib logging; expose --verbose to drop to DEBUG.
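A small sketch of the stdlib-logging setup this item describes. The logger name gpu_cosplay, configure_logging, build_parser, and the format string are assumptions, and the real CLI may not be built on argparse.

```python
import argparse
import logging

log = logging.getLogger("gpu_cosplay")  # assumed package-level logger name


def configure_logging(verbose):
    """INFO by default, DEBUG with --verbose; timestamps and levels on every line."""
    level = logging.DEBUG if verbose else logging.INFO
    # basicConfig does nothing if the root logger already has handlers,
    # so the package logger's level is also set explicitly below.
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)-7s %(name)s: %(message)s",
    )
    log.setLevel(level)


def build_parser():
    parser = argparse.ArgumentParser(prog="gpu-cosplay")
    parser.add_argument("--verbose", action="store_true",
                        help="show DEBUG-level log messages")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    configure_logging(args.verbose)
    log.debug("only visible with --verbose")
    log.info("visible by default")
```

Each existing print(...) then maps to log.info or log.debug depending on whether a user needs to see it in a normal run.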
No pre-built image on ghcr.io/deepghs/gpu-cosplay
First gpu-cosplay up builds locally for ~8 min. CI already builds the image; add a push step + GHCR login + tag with git SHA and :latest.
No multi-GPU cosplay
Many research questions need a 3090×2-style setup. Single-process multi-device is feasible via multiple MIG slices in the same container (NCCL P2P won't work — that's a known limit).
Surface: gpu-cosplay up 3090 --num 2; container sees CUDA_VISIBLE_DEVICES=uuid1,uuid2; runtime hook applies memory_fraction per device.
No gpu-cosplay benchmark
verify checks identity disguise. We also need a command that runs a standard FP32/BF16/bandwidth microbench, compares the results to the target's datasheet, and prints the delta table. It would quantify claims like "this cosplay is 33% slower on BF16 than a real 3090".
Most of the code exists in examples/bench.py; just needs a CLI wrapper.
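The delta table itself is simple arithmetic: for each metric, delta = (measured - datasheet) / datasheet. A sketch of that comparison step follows. delta_rows, format_delta_table, and the metric keys are hypothetical, the measured values are made up for illustration, and the RTX 3090 figures are the public datasheet numbers (about 35.6 FP32 TFLOPS, 936 GB/s).

```python
def delta_rows(measured, datasheet):
    """Return (metric, measured, datasheet, delta_percent) for each datasheet metric."""
    rows = []
    for metric, expected in datasheet.items():
        got = measured[metric]
        rows.append((metric, got, expected, (got - expected) / expected * 100.0))
    return rows


def format_delta_table(rows):
    """Render the rows as a fixed-width text table with signed percentages."""
    lines = [f"{'metric':<16}{'measured':>10}{'datasheet':>11}{'delta':>9}"]
    for metric, got, expected, delta in rows:
        lines.append(f"{metric:<16}{got:>10.1f}{expected:>11.1f}{delta:>+8.1f}%")
    return "\n".join(lines)


RTX_3090_DATASHEET = {"fp32_tflops": 35.6, "mem_bw_gb_s": 936.0}
# Made-up measurements, standing in for what examples/bench.py would report.
measured = {"fp32_tflops": 30.1, "mem_bw_gb_s": 880.0}

print(format_delta_table(delta_rows(measured, RTX_3090_DATASHEET)))
```

A negative delta means the cosplay is slower than the real card on that metric; a positive one means the host leaks extra performance through.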
Missing CONTRIBUTING.md and SECURITY.md
AGENTS.md serves AI agents, but human contributors who fork the repo won't find the conventions there. Extract the "how to add a new card / a new host family / a new verify check" sections from AGENTS.md into CONTRIBUTING.md.
examples/ only contains bench.py
Add at least:
examples/sd_inference.py — Stable Diffusion inference (3090 vs 4090 cosplay comparison)
examples/llm_oom.py — load a 13B model and assert OOM behavior matches target VRAM
examples/lora_train.py — LoRA training on a consumer-card cosplay
"Hide MIG" mode incomplete
torch.cuda.get_device_properties(0).multi_processor_count still returns the MIG slice's real SM count (32 for 2g.35gb) instead of the target's (e.g. 82 for RTX 3090). NSight and other profilers read this.
Fix: extend _CosplayDeviceProperties proxy to lie about multi_processor_count, major/minor, etc.
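A sketch of the override-then-delegate pattern such a proxy could use. The internals of _CosplayDeviceProperties are not shown in this issue, so DevicePropsProxy below is a hypothetical stand-in, and the SimpleNamespace merely imitates the object torch would return for a MIG slice. The RTX 3090 values (82 SMs, compute capability 8.6) are the card's published specs.

```python
from types import SimpleNamespace


class DevicePropsProxy:
    """Answer overridden attributes from a dict; delegate the rest to the real object."""

    def __init__(self, real, overrides):
        self._real = real
        self._overrides = dict(overrides)

    def __getattr__(self, name):
        # Only called when normal lookup fails, so _real/_overrides never land here
        # on an initialized instance; the guard avoids recursion on a bare one.
        if name.startswith("_"):
            raise AttributeError(name)
        if name in self._overrides:
            return self._overrides[name]
        return getattr(self._real, name)


# Imitation of the real properties of a 2g.35gb slice (illustrative values).
real_props = SimpleNamespace(
    name="NVIDIA H200 MIG 2g.35gb",
    multi_processor_count=32,
    major=9,
    minor=0,
    total_memory=35 * 1024**3,
)

RTX_3090_OVERRIDES = {
    "name": "NVIDIA GeForce RTX 3090",
    "multi_processor_count": 82,
    "major": 8,
    "minor": 6,
}

props = DevicePropsProxy(real_props, RTX_3090_OVERRIDES)
```

One caution: code that selects kernels based on major/minor may be misled by a compute capability the hardware does not actually have, so those two overrides probably deserve their own opt-in.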
No GPU spec auto-fetch
When a new GPU launches, someone manually scrapes the whitepaper into gpus.yaml. Could implement gpu-cosplay scrape rtx_5090 --from epoch-ai to automate.
🟦 Architecture thoughts (not bugs, but worth discussing)
Cosplay is a single-host concept. Cluster-scale deployment isn't designed.
Picture an 8-GPU H200 node with 56 × 1g.18gb slices serving 20 team members. Current CLI is single-user; no reservations / quotas / queues / GC.
Natural evolution: SLURM integration, K8s operator, central scheduler, web dashboard, 24h idle-session GC.
Not the repo's core mission, but a likely next layer.
Fidelity loss is undocumented quantitatively
README has prose like "BF16 TC will be too strong". No numbers.
Should ship docs/fidelity.md: for each (target × host) pair, run standard workloads (GEMM, SD inference, LLM forward) and tabulate the real vs. datasheet delta. That is the most convincing "honest claim".
Reverse direction: simulate stronger features on weaker hardware?
Currently we throttle strong hardware to look weak. Could we go the other way? "I have an RTX 4090, want to develop FP8 training as if on H100" — NVIDIA's PTX JIT lets FP8 software-emulate via FP16. If supported, the tool becomes "cosplay any direction".
Research value is high. Not P0/P1 but worth scoping.
Process notes
All numbers above are from wc -l of the current main (commit b6fe07d) plus a manual read-through.
Priority is the reviewer's call, not a contract. Argue if you disagree.
Anything marked 🟥 should land before the next round of feature additions; otherwise we keep digging the hole.