Add MBPP benchmark #548
Conversation
Code Review
This pull request introduces the MBPP (Mostly Basic Python Problems) benchmark implementation, including data loading from the 'sanitized' split, prompt construction, and accuracy computation. Feedback focuses on improving the robustness and security of the implementation. Key suggestions include executing model-generated code in a separate process with a timeout to mitigate security risks and infinite loops, refining the code extraction regex to handle various markdown formats, and refactoring shared utility functions to a common location to avoid duplication with other benchmarks.
```python
exec(code, namespace)
exec(test_code, namespace)
```
Using exec to run model-generated code is a significant security risk and can cause the benchmark to hang indefinitely if the model produces an infinite loop. While this pattern exists in other benchmarks in this repository, it is highly recommended to execute the code in a separate process with a strict timeout (e.g., using the multiprocessing module) to ensure the benchmarker remains robust and responsive.
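For reference, a minimal sketch of what that could look like; the function names, the `Value`-based result plumbing, and the 5-second default timeout are illustrative assumptions, not code from this PR:

```python
import multiprocessing


def _exec_in_child(code: str, test_code: str, passed) -> None:
    """Child-process target: run the code and its tests, report pass/fail."""
    try:
        namespace: dict = {}
        exec(code, namespace)
        exec(test_code, namespace)
        passed.value = True
    except Exception:
        passed.value = False


def check_code_passes_tests_sandboxed(
    code: str, test_code: str, timeout_s: float = 5.0
) -> bool:
    """Like check_code_passes_tests, but in a separate process with a timeout."""
    passed = multiprocessing.Value("b", False)  # shared pass/fail flag
    proc = multiprocessing.Process(target=_exec_in_child, args=(code, test_code, passed))
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():
        # The model produced an infinite loop or very slow code: kill the child.
        proc.terminate()
        proc.join()
        return False
    return bool(passed.value)
```

Note this mitigates hangs and isolates crashes, but a child process is still not a true sandbox against hostile code.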
The same pattern is used in humaneval.py (and the comment in check_code_passes_tests already calls out that this is a "simplified" evaluation, deferring to the official frameworks for rigorous pass@k). I'd prefer to keep mbpp consistent with humaneval.
````python
def extract_code_from_output(output: str) -> Optional[str]:
    """Extract Python code from model output (markdown block or `def ...:`)."""
    code_block_pattern = r"```(?:python)?\n(.*?)```"
````
The current regex for extracting code blocks is overly restrictive, as it requires a newline immediately after the opening backticks. Some models may output code blocks with spaces or without a newline (e.g., ```` ```python def ... ``` ````). Using `\s*` instead of `\n` makes the extraction more robust to different model output formats.
Suggested change:

````diff
-code_block_pattern = r"```(?:python)?\n(.*?)```"
+code_block_pattern = r"```(?:python)?\s*(.*?)\s*```"
````
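For illustration, a quick check of the two patterns against a fenced block with no newline after the opening backticks (the output string here is made up):

````python
import re

output = "```python def add(a, b): return a + b ```"

# The original pattern requires "\n" right after the opening fence, so it misses this:
assert re.search(r"```(?:python)?\n(.*?)```", output, re.DOTALL) is None

# The suggested pattern tolerates spaces (or no whitespace) and extracts the code:
match = re.search(r"```(?:python)?\s*(.*?)\s*```", output, re.DOTALL)
assert match and match.group(1) == "def add(a, b): return a + b"
````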
````python
def extract_code_from_output(output: str) -> Optional[str]:
    """Extract Python code from model output (markdown block or `def ...:`)."""
    code_block_pattern = r"```(?:python)?\n(.*?)```"
    match = re.search(code_block_pattern, output, re.DOTALL)
    if match:
        return match.group(1).strip()
    def_pattern = r"(def\s+\w+\([^)]*\):.*?)(?=\n\ndef\s+|\Z)"
    match = re.search(def_pattern, output, re.DOTALL)
    if match:
        return match.group(1).strip()
    return output.strip() if output.strip() else None
````
```python
def check_code_passes_tests(code: str, test_code: str) -> bool:
    """Run `code` then `test_code` (which contains assertions) in a fresh namespace.

    Returns True iff no exception is raised. Simplified vs. the official MBPP
    evaluation framework — we just want a pass/fail signal.
    """
    try:
        namespace: Dict[str, Any] = {}
        exec(code, namespace)
        exec(test_code, namespace)
        return True
    except Exception:
        return False
```
The functions extract_code_from_output and check_code_passes_tests are identical to those in humaneval.py. To improve maintainability and adhere to DRY (Don't Repeat Yourself) principles, these utility functions should be moved to a shared location like benchmarks/benchmarker/utils.py and imported in both benchmarkers.
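For concreteness, the import both benchmarkers would share under this suggestion (the utils module path comes from the comment above and is hypothetical; it does not exist in this PR):

```python
# benchmarks/benchmarker/utils.py would hold the two helpers, and both
# humaneval.py and mbpp.py would import them instead of redefining them:
from benchmarks.benchmarker.utils import (
    check_code_passes_tests,
    extract_code_from_output,
)
```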
Considered this initially: I had it as a shared code_eval.py module imported by both humaneval.py and mbpp.py. Pulled back to inline the helpers in mbpp.py to keep this PR scoped to MBPP only and avoid touching humaneval.py at all.
Add MBPP to enable benchmarking consistent with the DFlash paper.
Usage

Against a running SGLang server on `:30000`: