imallona/beam

Aim

beam aims to provide benchmark evaluation and metrics.

We believe performance metrics can be formalized and stored in an open form, so that anyone running a method comparison (aka benchmark) can reuse them. We also aim to automate decisions, to reduce implicit biases.

For that we can borrow ideas from measurement theory and other fields.

beam is under development.

Background

A metric, understood as a procedure (not as the result of running it), such as accuracy or kBet or max RSS, ingests and produces similarly shaped and similarly interpretable results. A metric can be provided by multiple implementations.

  • Implementations: [{name : blabla, version: v1, language: python, license: MIT}; {name: fastbla, version: v2.2.2, language: Rust, license: GPL}]
  • Syntax
    • Dimension: vector, matrix, scalar, graph, complex (if complex, specify schema) etc
      • Schema: json schema for shape validation
    • Values: str, int, float32, bool, etc
    • File format: matrix market, csv, json, etc
  • Semantics
    • Interpretation: low is good, A better than B better than C, etc; linear or not linear
    • Range: ratio, natural plus zero, (0 - 11.5), etc
    • Scale: nominal, ordinal, interval, ratio
      • Allowed transformations: e.g. poor = 0, mid = 1; or sqrt(x), arcsin(x), etc
    • Timeseries: no (whether repeated measures are taken at perhaps regular intervals)
  • Documentation
    • Description (human readable)
  • QA/QC
    • Example inputs for CI/CD / validity testing and their expected outputs
  • Taxonomy
    • Intrinsic or depending on a truth; if depending on a truth, specify the truth
    • Truth
      • Syntax (borrow specs from above)
      • Documentation
  • Known applications: [clustering, classification]
  • Example applications: Anthony's clustering benchmark v1 with permalink X, spacehack v99.9 with permalink Y
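
As a rough illustration, a few of the fields above could be captured in a small record. The sketch below uses a plain Python dataclass; the class name `MetricCardSketch`, its field names and the example values are assumptions made for illustration and do not reproduce beam's actual schema or its `MetricCard` class.

```python
from dataclasses import dataclass

# Hypothetical, simplified record; the field names are assumptions, not beam's schema.
SCALES = ("nominal", "ordinal", "interval", "ratio")


@dataclass(frozen=True)
class MetricCardSketch:
    name: str
    dimension: str          # syntax: "scalar", "vector", "matrix", ...
    value_type: str         # syntax: "int", "float32", "str", ...
    scale: str              # semantics: one of SCALES
    higher_is_better: bool  # semantics: interpretation
    value_range: tuple      # semantics: (lower, upper); None means unbounded
    needs_truth: bool       # taxonomy: intrinsic (False) or truth-dependent (True)

    def __post_init__(self):
        # Reject scale names outside the four measurement scales.
        if self.scale not in SCALES:
            raise ValueError(f"unknown scale: {self.scale!r}")

    def in_range(self, value):
        """Check a single value against the declared range."""
        lower, upper = self.value_range
        return (lower is None or value >= lower) and (upper is None or value <= upper)


# Illustrative card for max RSS: a non-negative quantity where lower is better.
max_rss = MetricCardSketch(
    name="max_rss",
    dimension="scalar",
    value_type="int",
    scale="ratio",
    higher_is_better=False,
    value_range=(0, None),
    needs_truth=False,
)

print(max_rss.in_range(1024))
print(max_rss.in_range(-1))
```

Keeping the range and scale on the record means a consumer can reject malformed values before any comparison is attempted.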

Scales, in the measurement-theory sense, constrain which comparisons are meaningful: it makes no sense to build distances on nominal data or to compute a variance on ordinal labels, but it does make sense to cross-tabulate nominal data or to log-transform ratios.
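
One way to make these constraints checkable is a lookup from each operation to the weakest scale on which it is meaningful. The sketch below follows the usual reading of Stevens' typology; the operation names, the table and the function `is_meaningful` are assumptions for illustration, not beam's `validate_for_aggregation`.

```python
# Scales ordered from weakest to strongest.
SCALE_ORDER = ("nominal", "ordinal", "interval", "ratio")

# Weakest scale on which each operation is meaningful (illustrative table).
MIN_SCALE = {
    "cross_tabulate": "nominal",
    "mode": "nominal",
    "median": "ordinal",
    "rank": "ordinal",
    "distance": "interval",
    "variance": "interval",
    "arithmetic_mean": "interval",
    "log_transform": "ratio",
    "geometric_mean": "ratio",
}


def is_meaningful(operation, scale):
    """Return True if `operation` is meaningful on data measured on `scale`."""
    required = MIN_SCALE[operation]
    return SCALE_ORDER.index(scale) >= SCALE_ORDER.index(required)


print(is_meaningful("distance", "nominal"))
print(is_meaningful("variance", "ordinal"))
print(is_meaningful("cross_tabulate", "nominal"))
print(is_meaningful("log_transform", "ratio"))
```

The four calls correspond to the four cases named in the paragraph above.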

Metric repository and formalization

We aim to provide tested software implementations for commonly used metrics (e.g. TPR), properly annotated with metric nutrition labels, akin to dataset nutrition labels. The labels are based on the nature of the metric, including its syntax and semantics, so that input and output interfaces are clear and reusable, independently of the language/interpreter.

Why?

This project is inspired by omnibenchmark, a tool for open and continuous community benchmarking.

Resources

Started

21st Feb 2025

Status

Early stage, now installable end to end. The schema, the seed metric registry, the MCDA primitives, an ontology-aware entry point, four sensitivity primitives, a cross-dataset aggregator, and a simulated test suite are in place. The decision module is complete: five aggregations (SAW, TOPSIS, VIKOR, PROMETHEE II, COMET), five objective and subjective weighting schemes, and a re-analysis of the real Duo 2018 clustering benchmark. Phase 2 added the user-facing layer: a CSV loader, the beam.rank procedural API, a self-contained HTML report, a reproducibility manifest, a declarative beam.yaml runner, and a beam command-line interface. The heterogeneity module has started: the mixed-effects variance decomposition (beam.heterogeneity.mixed_effects, lme4 via a subprocess) is in; the Bradley-Terry trees and the Plackett-Luce extension are planned for later releases.

What is in HEAD:

  • A JSON Schema (draft 2020-12) for metric cards, covering identity, kind, inputs, output, semantics (scale, polarity, range, allowed transformations, uncertainty), comparability (including a recommended_aggregation_across_datasets enum), implementations, examples, and provenance.
  • Fourteen seed metric cards across three domains, showing the registry is domain neutral: clustering and efficiency (ari, runtime, nmi, peak_memory, accuracy, f1_score, silhouette, shannon_entropy_diff, nclust_deviation), forecasting (smape, mase), and the illustrative transportation metrics (speed, cost, co2).
  • beam.cards: card loader, MetricCard dataclass, Registry, and two bridges to the MCDA pipeline: polarities_for(metric_ids) (just polarity) and properties_for(metric_ids) (polarity, scale_type, range bounds, allowed transformations, recommended cross-dataset aggregation).
  • beam.mcda: normalization (min_max_normalize with optional declared bounds, plus the strategy dispatcher normalize), weighting (the objective equal_weights, entropy_weights, standard_deviation_weights, critic_weights and merec_weights, and the subjective ahp_weights), aggregation (weighted_sum for SAW, topsis, vikor, promethee_ii, comet), ranking (rank), and two entry points: the lower-level run(...) and the ontology-aware run_from_registry(...), which pulls polarity and bounds from the registry and refuses incompatible scale types via validate_for_aggregation. The five aggregations wrap pymcdm: beam normalizes by metric card, then calls pymcdm with an identity normalization so pymcdm runs on the already-normalized matrix. beam keeps the higher-is-better convention (VIKOR returns -Q). The weighting schemes stay beam's own, since pymcdm's weighting functions reject the zeros that beam's normalization produces and pymcdm has no AHP.
  • Sensitivity primitives: leave_one_metric_out (rank stability under metric omission), leave_one_dataset_out (rank stability under dataset omission, on a tensor input, nan-aware), smaa (Dirichlet weight sampling, rank acceptability index, central weight vector per tool, confidence factor, for all five aggregations), and smallest_weight_perturbation (Triantaphyllou-Sanchez closed-form weight delta for SAW, numeric for the other aggregations).
  • aggregate_across_datasets: reduce a tool by dataset matrix for one metric using the rule declared on its card (arithmetic_mean for bounded interval/ratio, geometric_mean for unbounded ratio per Smith 1988, median, or rank_mean).
  • beam.heterogeneity: the method-dataset heterogeneity diagnostics (PLAN Phase 4). mixed_effects fits a mixed-effects model on one metric (score ~ method + (1 | dataset), adding a (1 | dataset:method) term when a cell has replicates) in R's lme4 through a one-shot subprocess, and returns the per-method marginal means, the variance components, the dataset ICC, the interaction or residual share, and the largest-residual outlier cells. It quantifies how much of the score variance is a method-by-dataset interaction, the formal complement to leave_one_dataset_out. r_available gates the call; the R toolchain comes from the envs/heterogeneity.yml conda environment. The Bradley-Terry trees and the Plackett-Luce extension are still pending. See docs/explanations/heterogeneity-mixed-effects.md.
  • beam.scenarios: canonical simulated benchmark scenarios with documented ground truth (random with anti-correlated trade-offs, dominant, ties, odd_dataset where one method is best on most datasets but a different method is best on one odd dataset), plus two normalization-failure examples where the top-ranked method under plain min-max differs from the one under the card defaults. Used by the test suite and by a simulated scenarios vignette. beam.datasets.load_duo2018 loads the real Duo et al. 2018 clustering benchmark (14 methods by 12 datasets by 4 metrics) vendored under src/beam/data/. beam.datasets.load_m4 loads a small results table derived from the M4 forecasting competition (25 methods by 6 frequency bands by 2 metrics), reduced from the GPL-3 M4comp2018 data by src/beam/data/reduce_m4.R.
  • CI: ruff lint, ruff format, pytest on Python 3.12 and 3.13, R-side metric card validation via jsonvalidate, the mixed-effects heterogeneity tests in a micromamba environment with R and lme4, and Quarto rendering of the vignettes with artefact upload.
  • Four worked vignettes under examples/: the Duo 2018 walkthrough on the real published data, a simulated scenarios report that runs the full report layout on every canonical scenario, a cross-domain transportation example that exercises every aggregation and weighting scheme where terrain coverage is partial, and an M4 forecasting example on a large real non-bio benchmark. All four read metric semantics from the registry.
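
To make the normalization and aggregation steps in the list above concrete, here is a minimal pure-Python sketch of polarity-aware min-max normalization, a weighted sum (SAW), and a geometric mean across datasets. The function names echo those listed above, but the signatures, the toy scores, the metric names `quality` and `runtime_s`, and the handling of constant columns are assumptions for illustration. This is not beam's implementation, which also delegates the aggregations to pymcdm.

```python
import math


def min_max_normalize(column, higher_is_better=True, bounds=None):
    """Rescale one metric column so that 1.0 is always the best value.

    `bounds` is an optional declared (lower, upper); otherwise the observed
    minimum and maximum are used.
    """
    lower, upper = bounds if bounds is not None else (min(column), max(column))
    if upper == lower:
        # A constant column carries no ranking information; 0.5 is an
        # arbitrary choice made for this sketch.
        return [0.5 for _ in column]
    scaled = [(x - lower) / (upper - lower) for x in column]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def weighted_sum(rows, weights):
    """SAW: one weighted sum per tool over its normalized metric values."""
    return [sum(w * v for w, v in zip(weights, row)) for row in rows]


def geometric_mean(values):
    """Geometric mean of strictly positive values (e.g. a ratio-scale metric)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


tools = ["A", "B", "C"]
quality = [0.8, 0.6, 0.4]      # higher is better, declared bounds (0, 1)
runtime_s = [10.0, 2.0, 6.0]   # lower is better, no declared upper bound

normalized_columns = [
    min_max_normalize(quality, higher_is_better=True, bounds=(0.0, 1.0)),
    min_max_normalize(runtime_s, higher_is_better=False),
]
rows = list(zip(*normalized_columns))          # one row per tool
scores = weighted_sum(rows, weights=[0.5, 0.5])
ranking = [tool for _, tool in sorted(zip(scores, tools), reverse=True)]
print(ranking)

# Cross-dataset reduction of one tool's runtime over two datasets.
print(geometric_mean([2.0, 8.0]))
```

With equal weights, tool B ranks first here because its runtime advantage outweighs A's higher quality. Passing declared bounds for `quality` keeps its normalized values independent of which tools happen to be in the comparison.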

Install

python3.12 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev,docs]"

[docs] pulls in Jupyter and matplotlib so Quarto can execute the Python code chunks in the vignettes. [io] pulls in pandas for the CSV adapter. [dev] covers the test suite.

Quick use

The five-line path, from a CSV to an HTML report:

import beam

scores = beam.load_scores("scores.csv")          # tool by metric, or a tool by dataset by metric tensor
result = beam.rank(scores, weights="entropy", method="topsis")
beam.report(result, "report.html")
print(result.top_tool, "ranks first")

beam.rank resolves polarity, normalization, bounds and baselines from the metric cards, runs the MCDA pipeline, runs the default sensitivity analysis on the same normalization context, and builds a run manifest. The returned RunResult carries the ranking, the sensitivity reports, and the manifest. beam.report writes one self-contained HTML file with the ranking, the sensitivity, a critical-difference section when the input has more than one dataset, and a plain-language recommendation.
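
A run manifest of this kind usually records enough to tell whether two runs saw the same inputs and software. The sketch below is an assumption about what such a record might contain (an input hash, metric card versions, and a software fingerprint). The function name `build_manifest_sketch`, the keys, and the placeholder card versions are hypothetical and do not describe beam's manifest format.

```python
import hashlib
import json
import platform
import sys


def build_manifest_sketch(input_bytes, card_versions):
    """Hypothetical manifest: input hash, card versions, software fingerprint."""
    return {
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
        "card_versions": dict(sorted(card_versions.items())),
        "software": {
            "python": platform.python_version(),
            "implementation": platform.python_implementation(),
            "platform": sys.platform,
        },
    }


# Toy input and placeholder versions, for illustration only.
csv_bytes = b"tool,ari,runtime\nA,0.8,10\nB,0.6,2\n"
manifest = build_manifest_sketch(csv_bytes, {"runtime": "1.0.0", "ari": "1.0.0"})
print(json.dumps(manifest, indent=2, sort_keys=True))
```

Hashing the raw input bytes rather than the parsed table means any change to the file, including whitespace, produces a different manifest.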

The same from the command line:

beam validate scores.csv
beam rank scores.csv --weights entropy --method topsis --out result.json --report report.html
beam report result.json --out report.html
beam metric show ari
beam run beam.yaml

The lower-level entry point beam.mcda.run_from_registry(scores, metric_ids, weights=, method=) takes a 2D array directly and returns just the MCDA Result. See examples/duo2018/duo2018.qmd for the longer walkthrough and docs/tutorials/quickstart.md for a runnable quickstart.

Repository layout

src/beam/
  schema/                       metric_card.schema.json, JSON Schema (draft 2020-12), shipped as package data
  metrics/                      One YAML file per metric and version; LICENSE.md is CC-BY-4.0 (cards only)
  cards/                        Card loader, MetricCard, MetricProperties, Registry,
                                polarities_for, properties_for
  mcda/                         normalize, weights (equal, entropy, std, critic, merec, ahp),
                                topsis, weighted_sum, vikor, promethee_ii, comet, rank, run,
                                run_from_registry, registry_context, validate_for_aggregation,
                                leave_one_metric_out, leave_one_dataset_out, smaa,
                                smallest_weight_perturbation, critical_difference,
                                aggregate_across_datasets, reduce_tensor, Result
  api.py                        load-rank-report procedural API: rank, RunResult
  reporting/                    Self-contained HTML report (write_report, exposed as beam.report)
  manifest.py                   Run manifest: hashes, card versions, software fingerprint
  config.py                     Declarative beam.yaml runner (run_config)
  cli.py                        beam command-line interface (entry point beam = beam.cli:main)
  io/                           load_scores (stdlib CSV) and the optional pandas read_csv
  scenarios.py                  Canonical simulated scenarios and the transportation benchmark
  datasets.py                   load_duo2018 and load_m4 loaders for the bundled benchmarks
  heterogeneity/                mixed_effects (lme4 via subprocess), r_available, mixed_effects.R
  data/                         DuoSCClustering2018.csv (Duo et al. 2018) and provenance
tests/
  test_schema.py                Python-side metric card validation
  validate_cards.R              R-side metric card validation (jsonvalidate)
  test_cards_*.py               Cards loader, registry, polarities_for, properties_for
  test_mcda_*.py                Normalize, weights, aggregate, topsis, facade, pipeline,
                                validate, run_from_registry, sensitivity, smaa, perturbation,
                                cross_dataset
  test_scenarios.py             Ground-truth checks on the four canonical scenarios
examples/
  duo2018/duo2018.qmd           Walkthrough vignette on the real Duo 2018 data
  scenarios/scenarios.qmd       Consistency-check vignette across canonical scenarios
  transportation/transportation.qmd  Cross-domain example across all methods
  m4/m4.qmd                     M4 forecasting competition, a large real non-bio benchmark
docs/
  adr/                          Architectural decision records
  findings/                     Empirical findings log
  explanations/                 Conceptual essays (measurement theory, cards-and-pipeline)
  paper/                        Manuscript folder
.github/workflows/ci.yml        CI: Python tests, R card validation, vignette rendering

Build artefacts

  • duo2018-vignette: the rendered Duo vignette as a self-contained HTML file, uploaded by CI as a workflow artefact on every push and pull request. Download it from the Actions tab on GitHub.
  • scenarios-vignette: a second workflow artefact with the rendered simulated scenarios report.
  • metric_card.schema.json: the canonical schema. Any tool that validates JSON against it can ingest beam metric cards.
  • CITATION.cff: cff-version 1.2.0; GitHub renders a citation widget from it.
  • A wheel under dist/ after python -m build. Not currently published to PyPI.

Licence

  • Code: GPL-3.0-or-later (LICENSE).
  • Metric cards under src/beam/metrics/: CC-BY-4.0 (src/beam/metrics/LICENSE.md).

Citation

See CITATION.cff.
