Hi YOLO-Master authors,
Thank you for sharing this amazing work! I’ve been experimenting with the pre-trained weights for my project on small object segmentation and really appreciate the idea of instance-conditional adaptive computation.
However, while analyzing the expert utilization of the pre-trained weights (YOLO-Master-v0.1-N.pt) using the official script, I noticed some unusual statistics that look like potential routing collapse. I am a bit confused and would appreciate your clarification.
I used the official script provided at ultralytics/nn/modules/moe/analysis.py to diagnose the YOLO-Master-v0.1-N.pt model on the MS COCO 2017 val dataset (5000 images).
The diagnosis report shows:
- Total Tokens Processed: 15,003. Since there are 3 router layers and 5001 forward passes (5000 val images + 1 warmup), this matches the instance-level routing design exactly (3 × 5001 = 15,003, i.e. 1 token per image per router).
- Static Expert Activation: For all 5001 images, the routers exclusively selected the exact same two experts with exactly a 50/50 split, regardless of the image content or scene complexity.
For example, in model.12.routing:
- Expert 6: 50.00% (5001 Hits)
- Expert 15: 50.00% (5001 Hits)
(Other layers like model.6.routing and model.9.routing exhibit the exact same 100% static behavior with their respective two experts).
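To make the check concrete, here is a minimal sketch of how I would summarize per-image expert selections for one router. It is not the logic of the official analysis.py; the function name `summarize_routing`, the input format (one tuple of selected expert indices per image), and `num_experts=16` are my own assumptions for illustration (the report only tells me that an expert with index 15 exists).

```python
from collections import Counter
import math


def summarize_routing(topk_per_image, num_experts):
    """Summarize expert usage for one router.

    topk_per_image: list of tuples, the expert indices selected for each image
                    (hypothetical format, not the official script's output).
    num_experts:    total experts in the layer (assumed value, see lead-in).
    """
    # Count how often each expert was selected across all images.
    counts = Counter(e for selected in topk_per_image for e in selected)
    total = sum(counts.values())
    shares = {e: counts[e] / total for e in sorted(counts)}

    # Entropy of the usage distribution, normalized by the uniform maximum.
    entropy = -sum(p * math.log(p) for p in shares.values())
    normalized_entropy = entropy / math.log(num_experts)

    # Number of distinct expert combinations seen; 1 means fully static routing.
    distinct_sets = len({tuple(sorted(selected)) for selected in topk_per_image})

    return {
        "hits": dict(counts),
        "shares": shares,
        "normalized_entropy": normalized_entropy,
        "distinct_expert_sets": distinct_sets,
    }


# Reproduces the pattern reported for model.12.routing: the same two experts every time.
static_selections = [(6, 15)] * 5001
report = summarize_routing(static_selections, num_experts=16)
print(report["hits"], report["distinct_expert_sets"])
```

With the pattern above, `distinct_expert_sets` is 1 for all 5001 forward passes, which is what I mean by static routing: the utilization percentages alone could also come from a router that alternates between different expert pairs, so the count of distinct combinations is the more direct signal.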
Since the paper emphasizes that the ES-MOE block dynamically allocates computational resources according to scene complexity, I expected the expert distribution to vary across different images (e.g., crowded scenes vs. simple backgrounds).
- Is this static routing behavior expected for the v0.1-N.pt release? Did this specific checkpoint suffer from a collapse of the load-balancing loss during early training, causing it to fall back to a static network?
- Regarding MoEPruner: I noticed the recent addition of the MoEPruner tool to prune experts with <15% utilization. Was this tool developed specifically to address the kind of routing redundancy observed in the current weights?
Environment:
- Weights: YOLO-Master-v0.1-N.pt (from assets)
- Dataset: MS COCO 2017 val
- Ultralytics Version: 8.3.240
- Device: CPU / GPU (both yield the same routing statistics)
Looking forward to your insights! Thank you again for the great contribution to the community.