Add FlashMLA benchmark perf gate #25
Code Review
This pull request introduces a new Python script, tools/benchmark_perf_gate.py, that compares baseline and current performance metrics and fails if regressions are detected. The feedback suggests two improvements: making the positional arguments optional when the script is run with the --self-test flag, and expanding the self_test function to assert regression and missing-case behavior rather than only the happy path.
    parser.add_argument("baseline")
    parser.add_argument("current")
    parser.add_argument("--self-test", action="store_true")
    args = parser.parse_args()
    if args.self_test:
        self_test()
        return 0
The positional arguments baseline and current are currently required by default in argparse. This means that running the script with --self-test (e.g., python tools/benchmark_perf_gate.py --self-test) will fail with an error stating that the positional arguments are required, forcing the user to provide dummy arguments (as noted in the PR description: python tools/benchmark_perf_gate.py --self-test x y).
To improve usability, we can make the positional arguments optional (nargs="?") and manually validate their presence only when --self-test is not specified.
-    parser.add_argument("baseline")
-    parser.add_argument("current")
-    parser.add_argument("--self-test", action="store_true")
-    args = parser.parse_args()
-    if args.self_test:
-        self_test()
-        return 0
+    parser.add_argument("baseline", nargs="?", default=None)
+    parser.add_argument("current", nargs="?", default=None)
+    parser.add_argument("--self-test", action="store_true")
+    args = parser.parse_args()
+    if args.self_test:
+        self_test()
+        return 0
+    if not args.baseline or not args.current:
+        parser.error("the following arguments are required: baseline, current")
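The suggested parsing logic can be checked in isolation with a minimal sketch like the one below. It is not the script's actual code: the `main(argv)` signature, the `build_parser` helper, the stub `self_test()`, and the placeholder print on the normal path are all assumptions made for illustration.

```python
import argparse
from typing import Optional, Sequence


def self_test() -> None:
    # Stub standing in for the script's real self-test (assumption).
    print("self-test ok")


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical helper; the real script builds its parser inline.
    parser = argparse.ArgumentParser(prog="benchmark_perf_gate.py")
    # nargs="?" makes each positional optional; it stays None when omitted.
    parser.add_argument("baseline", nargs="?", default=None)
    parser.add_argument("current", nargs="?", default=None)
    parser.add_argument("--self-test", action="store_true")
    return parser


def main(argv: Optional[Sequence[str]] = None) -> int:
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.self_test:
        self_test()
        return 0
    # The positionals are validated by hand only on the normal path.
    if args.baseline is None or args.current is None:
        # parser.error prints usage to stderr and exits with status 2.
        parser.error("the following arguments are required: baseline, current")
    # Placeholder for the real comparison (assumption).
    print(f"comparing {args.baseline} against {args.current}")
    return 0
```

With this shape, `main(["--self-test"])` returns 0 without any dummy positionals, while `main([])` or `main(["a.json"])` exits through `parser.error` with the same message argparse would have produced for required positionals.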
def self_test() -> None:
    data = compare({"case": 100.0}, {"case": 99.0})
    assert data["ok"]
    print(json.dumps({"ok": True, "rows": len(data["rows"])}, ensure_ascii=False))
The self_test function currently only verifies the happy path (where performance is within the tolerance limit). It does not verify that the performance gate actually detects regressions or missing cases, which are the primary failure modes the script is designed to catch.
Adding assertions for regression and missing cases will make the self-test much more robust and reliable.
-def self_test() -> None:
-    data = compare({"case": 100.0}, {"case": 99.0})
-    assert data["ok"]
-    print(json.dumps({"ok": True, "rows": len(data["rows"])}, ensure_ascii=False))
+def self_test() -> None:
+    # Test within tolerance
+    data_ok = compare({"case": 100.0}, {"case": 99.0})
+    assert data_ok["ok"]
+    # Test regression detection
+    data_reg = compare({"case": 100.0}, {"case": 90.0})
+    assert not data_reg["ok"]
+    assert data_reg["rows"][0]["status"] == "regression"
+    # Test missing current case detection
+    data_missing = compare({"case": 100.0}, {})
+    assert not data_missing["ok"]
+    assert data_missing["rows"][0]["status"] == "missing-current"
+    print(json.dumps({"ok": True, "rows": len(data_ok["rows"])}, ensure_ascii=False))
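The assertions in this suggestion depend on how compare classifies each case, which the review does not show. The sketch below is one possible shape, written only to make the three assertions concrete. It assumes that higher metric values are better, that a 5% drop is tolerated, and that passing rows carry an "ok" status; the `TOLERANCE` constant, the row fields other than "status", and the "ok" label are hypothetical, while the "regression" and "missing-current" labels come from the suggestion itself.

```python
import json
from typing import Any, Dict, List

TOLERANCE = 0.05  # hypothetical: allow up to a 5% drop from baseline


def compare(baseline: Dict[str, float], current: Dict[str, float],
            tolerance: float = TOLERANCE) -> Dict[str, Any]:
    rows: List[Dict[str, Any]] = []
    for name, base in sorted(baseline.items()):
        if name not in current:
            # A baseline case with no current measurement fails the gate.
            rows.append({"case": name, "baseline": base,
                         "current": None, "status": "missing-current"})
            continue
        cur = current[name]
        # Assumes higher is better (for example, throughput).
        status = "ok" if cur >= base * (1.0 - tolerance) else "regression"
        rows.append({"case": name, "baseline": base,
                     "current": cur, "status": status})
    return {"ok": all(row["status"] == "ok" for row in rows), "rows": rows}


def self_test() -> None:
    # Within tolerance: a 1% drop passes.
    data_ok = compare({"case": 100.0}, {"case": 99.0})
    assert data_ok["ok"]
    # Regression: a 10% drop exceeds the assumed 5% tolerance.
    data_reg = compare({"case": 100.0}, {"case": 90.0})
    assert not data_reg["ok"]
    assert data_reg["rows"][0]["status"] == "regression"
    # Missing case: the baseline entry has no counterpart in current.
    data_missing = compare({"case": 100.0}, {})
    assert not data_missing["ok"]
    assert data_missing["rows"][0]["status"] == "missing-current"
    print(json.dumps({"ok": True, "rows": len(data_ok["rows"])}, ensure_ascii=False))
```

If the real gate treats lower values as better (latency rather than throughput) or uses a tolerance of 10% or more, the regression input of 90.0 would need to change for the second assertion to hold.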
- Add FlashMLA benchmark performance gate
- Strengthen benchmark perf gate self test
Summary
Validation
Review notes