
Add MBPP benchmark #548

Open
tugot17 wants to merge 2 commits into sgl-project:main from tugot17:add-mbpp-benchmark

Conversation

@tugot17 commented Apr 27, 2026

Add MBPP to enable benchmarking consistent with the DFlash paper.

Usage

Against a running SGLang server on :30000:

import sys
sys.path.insert(0, "/path/to/SpecForge/benchmarks")
from benchmarker import BENCHMARKS

# Build the MBPP benchmark over 50 sampled problems.
bench = BENCHMARKS.get("mbpp")(num_samples=50)
# Run against the local server and take the first result's metrics.
m = bench.run(host="127.0.0.1", port=30000, batch_size=1)[0]
print(f"thr={m.output_throughput:.1f} tok/s  accept_len={m.accept_length:.2f}")

@tugot17 tugot17 requested a review from FrankLeeeee as a code owner April 27, 2026 15:54
Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces the MBPP (Mostly Basic Python Problems) benchmark implementation, including data loading from the 'sanitized' split, prompt construction, and accuracy computation. Feedback focuses on improving the robustness and security of the implementation. Key suggestions include executing model-generated code in a separate process with a timeout to mitigate security risks and infinite loops, refining the code extraction regex to handle various markdown formats, and refactoring shared utility functions to a common location to avoid duplication with other benchmarks.

Comment on lines +36 to +37
        exec(code, namespace)
        exec(test_code, namespace)
Contributor

severity: high (security)

Using exec to run model-generated code is a significant security risk and can cause the benchmark to hang indefinitely if the model produces an infinite loop. While this pattern exists in other benchmarks in this repository, it is highly recommended to execute the code in a separate process with a strict timeout (e.g., using the multiprocessing module) to ensure the benchmarker remains robust and responsive.
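
For reference, a minimal sketch of that suggestion; the names _exec_target and check_code_passes_tests_sandboxed are illustrative, not part of this PR:

import multiprocessing


def _exec_target(code, test_code, result_queue):
    # Runs in a child process, so exceptions, crashes, and hangs in the
    # model-generated code stay isolated from the benchmarker.
    try:
        namespace = {}
        exec(code, namespace)
        exec(test_code, namespace)
        result_queue.put(True)
    except Exception:
        result_queue.put(False)


def check_code_passes_tests_sandboxed(code, test_code, timeout_s=10.0):
    result_queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_exec_target, args=(code, test_code, result_queue))
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():
        # Still running after the deadline (e.g. an infinite loop): kill it and fail.
        proc.terminate()
        proc.join()
        return False
    return not result_queue.empty() and result_queue.get()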

Author

The same pattern is used in humaneval.py (and the comment in check_code_passes_tests already calls out that this is a "simplified" evaluation, deferring to the official frameworks for rigorous pass@k). I'd prefer to keep mbpp consistent with humaneval.

Comment thread: benchmarks/benchmarker/mbpp.py (Outdated)

def extract_code_from_output(output: str) -> Optional[str]:
    """Extract Python code from model output (markdown block or `def ...:`)."""
    code_block_pattern = r"```(?:python)?\n(.*?)```"
Contributor

severity: medium

The current regex for extracting code blocks is overly restrictive as it requires a newline immediately after the opening backticks. Some models may output code blocks with spaces or without a newline (e.g., ```python def ... ```). Using \s* instead of \n makes the extraction more robust to different model output formats.

Suggested change
- code_block_pattern = r"```(?:python)?\n(.*?)```"
+ code_block_pattern = r"```(?:python)?\s*(.*?)\s*```"
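
A quick sanity check of the relaxed pattern against both output shapes (throwaway snippet, not part of the PR):

import re

relaxed = r"```(?:python)?\s*(.*?)\s*```"
samples = [
    "```python\ndef f():\n    return 1\n```",  # newline after the opening fence
    "```python def f(): return 1 ```",  # no newline after the opening fence
]
for sample in samples:
    # Both shapes yield the bare function source.
    print(re.search(relaxed, sample, re.DOTALL).group(1))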

Author

fixed

Comment on lines +15 to +40
def extract_code_from_output(output: str) -> Optional[str]:
    """Extract Python code from model output (markdown block or `def ...:`)."""
    code_block_pattern = r"```(?:python)?\n(.*?)```"
    match = re.search(code_block_pattern, output, re.DOTALL)
    if match:
        return match.group(1).strip()
    def_pattern = r"(def\s+\w+\([^)]*\):.*?)(?=\n\ndef\s+|\Z)"
    match = re.search(def_pattern, output, re.DOTALL)
    if match:
        return match.group(1).strip()
    return output.strip() if output.strip() else None


def check_code_passes_tests(code: str, test_code: str) -> bool:
    """Run `code` then `test_code` (which contains assertions) in a fresh namespace.

    Returns True iff no exception is raised. Simplified vs. the official MBPP
    evaluation framework — we just want a pass/fail signal.
    """
    try:
        namespace: Dict[str, Any] = {}
        exec(code, namespace)
        exec(test_code, namespace)
        return True
    except Exception:
        return False
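
For illustration, a toy pass/fail round trip through these helpers (the solution and test below are made up, and the module's own imports of re and typing are assumed):

output = "Here is my solution:\n```python\ndef add(a, b):\n    return a + b\n```"
code = extract_code_from_output(output)   # pulls out the fenced function body
test_code = "assert add(2, 3) == 5"
print(check_code_passes_tests(code, test_code))  # True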
Contributor

severity: medium

The functions extract_code_from_output and check_code_passes_tests are identical to those in humaneval.py. To improve maintainability and adhere to DRY (Don't Repeat Yourself) principles, these utility functions should be moved to a shared location like benchmarks/benchmarker/utils.py and imported in both benchmarkers.
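
If the refactor were taken up, it could look like the sketch below; the module path benchmarks/benchmarker/utils.py comes from the suggestion above, and the relative import style is an assumption about the package layout:

# benchmarks/benchmarker/utils.py (hypothetical shared module)
#     would hold extract_code_from_output and check_code_passes_tests

# benchmarks/benchmarker/mbpp.py and benchmarks/benchmarker/humaneval.py
from .utils import check_code_passes_tests, extract_code_from_output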

Author

I considered this initially and had it as a shared code_eval.py module imported by both humaneval.py and mbpp.py. I pulled that back and inlined the helpers in mbpp.py to keep this PR scoped to MBPP only and avoid touching humaneval.py at all.
