Challenge score metric is slow and unstable due to stochastic tie-breaking

The current implementation of the challenge score metric uses Monte Carlo tie-breaking with `10**4` permutations to approximate the expected confusion matrix. This has several drawbacks:

- Computationally expensive (runtime grows as `O(num_permutations n log n)`)
- Non-reproducible due to randomness
- Results can vary between runs, which complicates benchmarking and CI.
- When using bootstrapping to estimate performance distributions (as we did in our paper), the repeated Monte Carlo sampling makes the metric prohibitively slow and effectively unusable.

**Proposal:**
Replace the sampling with an exact computation using the hypergeometric expectation. This removes stochasticity, guarantees reproducibility, and reduces runtime to `O(n log n)`.

**Related PR: #4** 

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Challenge score metric is slow and unstable due to stochastic tie-breaking #5

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Challenge score metric is slow and unstable due to stochastic tie-breaking #5

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions