Add MBPP benchmark #548
Conversation
Code Review
This pull request introduces the MBPP (Mostly Basic Python Problems) benchmark implementation, including data loading from the 'sanitized' split, prompt construction, and accuracy computation. Feedback focuses on improving the robustness and security of the implementation. Key suggestions include executing model-generated code in a separate process with a timeout to mitigate security risks and infinite loops, refining the code extraction regex to handle various markdown formats, and refactoring shared utility functions to a common location to avoid duplication with other benchmarks.
```python
exec(code, namespace)
exec(test_code, namespace)
```
Using exec to run model-generated code is a significant security risk and can cause the benchmark to hang indefinitely if the model produces an infinite loop. While this pattern exists in other benchmarks in this repository, it is highly recommended to execute the code in a separate process with a strict timeout (e.g., using the multiprocessing module) to ensure the benchmarker remains robust and responsive.
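For reference, a minimal sketch of what that could look like; the function names, the `Value`-based result plumbing, and the 5-second default timeout are illustrative assumptions, not code from this PR:

```python
import multiprocessing


def _exec_in_child(code: str, test_code: str, passed) -> None:
    """Child-process target: run the code and its tests, report pass/fail."""
    try:
        namespace: dict = {}
        exec(code, namespace)
        exec(test_code, namespace)
        passed.value = True
    except Exception:
        passed.value = False


def check_code_passes_tests_sandboxed(
    code: str, test_code: str, timeout_s: float = 5.0
) -> bool:
    """Like check_code_passes_tests, but in a separate process with a timeout."""
    passed = multiprocessing.Value("b", False)  # shared pass/fail flag
    proc = multiprocessing.Process(target=_exec_in_child, args=(code, test_code, passed))
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():
        # The model produced an infinite loop or very slow code: kill the child.
        proc.terminate()
        proc.join()
        return False
    return bool(passed.value)
```

Note this mitigates hangs and isolates crashes, but a child process is still not a true sandbox against hostile code.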
The same pattern is used in humaneval.py (and the comment in check_code_passes_tests already calls out that this is a "simplified" evaluation, deferring to the official frameworks for rigorous pass@k). I'd prefer to keep mbpp consistent with humaneval.
````python
def extract_code_from_output(output: str) -> Optional[str]:
    """Extract Python code from model output (markdown block or `def ...:`)."""
    code_block_pattern = r"```(?:python)?\n(.*?)```"
````
The current regex for extracting code blocks is overly restrictive, as it requires a newline immediately after the opening backticks. Some models may output code blocks with spaces or without a newline (e.g., ```` ```python def ... ``` ````). Using `\s*` instead of `\n` makes the extraction more robust to different model output formats.
Suggested change:

````diff
-code_block_pattern = r"```(?:python)?\n(.*?)```"
+code_block_pattern = r"```(?:python)?\s*(.*?)\s*```"
````
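For illustration, a quick check of the two patterns against a fenced block with no newline after the opening backticks (the output string here is made up):

````python
import re

output = "```python def add(a, b): return a + b ```"

# The original pattern requires "\n" right after the opening fence, so it misses this:
assert re.search(r"```(?:python)?\n(.*?)```", output, re.DOTALL) is None

# The suggested pattern tolerates spaces (or no whitespace) and extracts the code:
match = re.search(r"```(?:python)?\s*(.*?)\s*```", output, re.DOTALL)
assert match and match.group(1) == "def add(a, b): return a + b"
````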
````python
def extract_code_from_output(output: str) -> Optional[str]:
    """Extract Python code from model output (markdown block or `def ...:`)."""
    code_block_pattern = r"```(?:python)?\n(.*?)```"
    match = re.search(code_block_pattern, output, re.DOTALL)
    if match:
        return match.group(1).strip()
    def_pattern = r"(def\s+\w+\([^)]*\):.*?)(?=\n\ndef\s+|\Z)"
    match = re.search(def_pattern, output, re.DOTALL)
    if match:
        return match.group(1).strip()
    return output.strip() if output.strip() else None
````
```python
def check_code_passes_tests(code: str, test_code: str) -> bool:
    """Run `code` then `test_code` (which contains assertions) in a fresh namespace.

    Returns True iff no exception is raised. Simplified vs. the official MBPP
    evaluation framework — we just want a pass/fail signal.
    """
    try:
        namespace: Dict[str, Any] = {}
        exec(code, namespace)
        exec(test_code, namespace)
        return True
    except Exception:
        return False
```
The functions extract_code_from_output and check_code_passes_tests are identical to those in humaneval.py. To improve maintainability and adhere to DRY (Don't Repeat Yourself) principles, these utility functions should be moved to a shared location like benchmarks/benchmarker/utils.py and imported in both benchmarkers.
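For concreteness, the import both benchmarkers would share under this suggestion (the utils module path comes from the comment above and is hypothetical; it does not exist in this PR):

```python
# benchmarks/benchmarker/utils.py would hold the two helpers, and both
# humaneval.py and mbpp.py would import them instead of redefining them:
from benchmarks.benchmarker.utils import (
    check_code_passes_tests,
    extract_code_from_output,
)
```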
Considered this initially: I had it as a shared code_eval.py module imported by both humaneval.py and mbpp.py. Pulled back to inline the helpers in mbpp.py to keep this PR scoped to MBPP only and avoid touching humaneval.py at all.
Add MBPP to enable benchmarking consistent with the DFlash paper.
Usage

Against a running SGLang server on `:30000`: