feat: support multiple benchmark framework! by 123liuziming · Pull Request #200 · alibaba/loongsuite-python-agent

123liuziming · 2026-05-26T08:17:09Z

Description

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

Fixes # (issue)

Type of change

Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

Test A

Does This PR Require a Core Repo Change?

Yes. - Link to PR:
No.

Checklist:

See contributing.md for styleguide, changelog guidelines, and more.

Followed the style guidelines of this project
Changelogs have been updated
Unit tests have been added
Documentation has been updated

…ard v4 Introduce loongsuite-instrumentation-bfclv4 covering BFCL v4 (bfcl_eval) per the design in llm-dev/bfclv4/execute.md: * ENTRY span around bfcl_eval._llm_response_generation.generate_results, with a narrow swap of that module's ThreadPoolExecutor name to a contextvars-propagating subclass so worker threads inherit the ENTRY trace context. * AGENT span around BaseHandler.inference (kind=AGENT, op=invoke_agent), picking up token usage from the metadata BFCL writes back. * STEP spans created reflectively for every concrete handler discovered via bfcl_eval.constants.model_config.MODEL_CONFIG_MAPPING; each STEP re-invokes the handler's _parse_query_response_* to harvest token counts and latency. * Per-call TOOL spans emitted from bfcl_eval.eval_checker.multi_turn_eval.multi_turn_utils.execute_multi_turn_func_call (one span per func_call entry in the batch). * Provider override mapping that routes OSSMODEL handlers to vllm/sglang based on args.backend, plus contextvars-based bfcl.turn_idx / gen_ai.react.round tracking. LLM spans are intentionally not created by this plugin; they continue to be produced by the downstream vendor SDK probes (OpenAI / Anthropic / DashScope / etc.). (cherry picked from commit cccf54b) Co-authored-by: 123liuziming <32130965+123liuziming@users.noreply.github.com>

(cherry picked from commit 3d08e03) Co-authored-by: 123liuziming <32130965+123liuziming@users.noreply.github.com>

… responses When ``_query_FC`` / ``_query_prompting`` returns a streaming wrapper (e.g. ``openai-v2`` ``ChatStreamWrapper``), the LLM span and its OTel context attach are kept alive until the stream is consumed by BFCL's ``_parse_query_response_*`` after the STEP context manager has already exited. Non-LIFO context detach then leaves the prior LLM span as the "current" span, which causes subsequent STEP and TOOL spans to be parented under the previous STEP rather than under AGENT. Force-consume the streaming response inside the STEP context and replace it with a plain iterator over the cached chunks so that ``stop_llm`` (which detaches LLM context) runs in LIFO order before STEP detaches. (cherry picked from commit 5cbd049) Co-authored-by: 123liuziming <32130965+123liuziming@users.noreply.github.com>

Change-Id: I84e87248e0eec61fa8f7fa68dbe85e5181ddede8 (cherry picked from commit 2071e80) Co-authored-by: 123liuziming <32130965+123liuziming@users.noreply.github.com>

Change-Id: I71842eb28f7a3c8d5c0fb0e9e2caec31e69d19f0 (cherry picked from commit 9abf7a1) Co-authored-by: 123liuziming <32130965+123liuziming@users.noreply.github.com>

Change-Id: Ieea04708467272866f5b7d9b905a2a648e6adb2d (cherry picked from commit 80e202c) Co-authored-by: 123liuziming <32130965+123liuziming@users.noreply.github.com>

Change-Id: I0da98161cbdbe6a51b963bcc19f45a3d2d977968 (cherry picked from commit b7e7a4b) Co-authored-by: 123liuziming <32130965+123liuziming@users.noreply.github.com>

This commit introduces support for a new tool, enhancing the existing functionality and maintaining compatibility with previous integrations. Change-Id: I674acb157591b4bee6f951defbbc8a57135ce036 Co-authored-by: 123liuziming <32130965+123liuziming@users.noreply.github.com>

This commit further refines the integration of the new tool, ensuring seamless functionality with existing systems and addressing compatibility issues identified in previous versions. Change-Id: I1234567890abcdef1234567890abcdef12345678 Co-authored-by: 123liuziming <32130965+123liuziming@users.noreply.github.com>

Change-Id: I591e9e1b67fa5f3f9cd0d03270335160502d95f4

…wrapper This commit refines the instrumentation for the claw_eval framework by updating the span hierarchy and removing the UserAgentWrapper. The changes enhance clarity in the trace structure, ensuring that only relevant spans are produced during execution. Additionally, the handling of STEP spans around provider chat calls has been improved for better trace management. Change-Id: Iabcdef1234567890abcdef1234567890abcdef12

This commit introduces new attributes for capturing input and output messages in the GenAI context within the wildtool instrumentation. The changes include the addition of helper functions to serialize messages and updates to existing wrappers to utilize these new attributes. Tests have been added to ensure that the AGENT and ENTRY spans correctly capture and report these messages when content capture is enabled. Change-Id: Iabcdef1234567890abcdef1234567890abcdef12

This commit adds a new helper function, `_semconv_value`, to streamline the extraction of enum values within the instrumentation. The function is utilized in various span attribute settings to ensure consistent handling of enum values across the GenAI context. This enhancement improves code clarity and maintainability. Change-Id: Iabcdef1234567890abcdef1234567890abcdef12

Change-Id: Id33add56b2f784f4c46858f3b46134fd0076df9b

This commit introduces new functions for extracting tool definitions from test entries and improves the handling of output messages within the GenAI context. The changes include the addition of a context variable for accumulating output messages and updates to existing functions to ensure proper serialization of tool definitions. Tests have been added to validate the extraction of tool definitions and the correct behavior of the new functionality. Change-Id: Iabcdef1234567890abcdef1234567890abcdef12

This commit refactors the message handling logic within the GenAI context by removing unused classes and functions related to tool definitions and output accumulation. New helper functions for JSON serialization and message structuring have been introduced to enhance clarity and maintainability. The changes streamline the codebase and improve the overall efficiency of message processing. Change-Id: Iabcdef1234567890abcdef1234567890abcdef12

This commit introduces new functions for processing tool definitions and extracting messages from test entries within the GenAI context. Key additions include the implementation of `_test_entry_to_tool_definitions`, `_tool_description_map`, and `_parse_python_call_arguments`, which improve the handling of tool definitions and enhance the clarity of message extraction. Tests have been updated to validate these new functionalities, ensuring robust behavior in various scenarios. Change-Id: Iabcdef1234567890abcdef1234567890abcdef12

This commit introduces new attributes for tool spans, including `gen_ai.tool.call.id`, `gen_ai.tool.name`, and `gen_ai.tool.type`, to better capture tool-related metadata. Additionally, it refines the process of retrieving tool descriptions from executable classes, ensuring that relevant method docstrings are included. The changes improve the clarity and completeness of tool span data, while also updating the handling of tool call arguments and results. Tests have been adjusted to validate these enhancements. Change-Id: Iabcdef1234567890abcdef1234567890abcdef12

…atures This commit introduces several enhancements to the GenAI instrumentation, including the addition of a new function to enable message content capture based on environment variables. It refines the handling of input and output messages, ensuring that relevant attributes are recorded in spans. The changes also improve the integration of tool definitions and system instructions within the GenAI context, aligning with ARMS semantic conventions. Tests have been updated to validate these new functionalities and ensure robust behavior. Change-Id: Iabcdef1234567890abcdef1234567890abcdef12

…dling This commit introduces several new wrappers for enhanced span management within the GenAI context, including `_RunnerEntryWrapper`, `_MiniSWEObservationWrapper`, and updates to existing wrappers for better token and message handling. It refines the process of capturing input and output messages, ensuring that relevant attributes are recorded in spans. Additionally, the changes improve the integration of task and step spans, aligning with ARMS semantic conventions. Tests have been updated to validate these new functionalities and ensure robust behavior. Change-Id: Iabcdef1234567890abcdef1234567890abcdef12

Change-Id: Ia3e1ef993ef4a8578ffb15f627a7ea4967054aa2

_run_agent_loop returns None, so the AGENT span previously surfaced input only. Capture the chat's last assistant message plus react rounds and pending_completion state as the AGENT output, so the span has the expected output payload. Change-Id: I58a3cc8b5308ba41eed8143b44f5bd9f0a6feb59 Co-developed-by: Claude <noreply@anthropic.com> Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The instrumentor previously imported aliyun.semconv.trace_v2 and aliyun.sdk.extension.arms.self_monitor.self_monitor_decorator.hook_advice, neither of which is published to PyPI — non-ARMS deployments had to ship a no-op shim. Inline the gen-ai attribute keys / enum string values (mirroring the pattern used in loongsuite-instrumentation-claw-eval) and drop the self-monitoring decorator (it only collected timing metrics; functional behaviour is unchanged). Change-Id: I79da592374814632b463448929f344165566a4f1 Co-developed-by: Claude <noreply@anthropic.com>

CLAassistant · 2026-05-26T08:17:18Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ 123liuziming
❌ musi

musi seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

Copilot

Pull request overview

This PR expands the LoongSuite/OpenTelemetry instrumentation surface to cover additional benchmark frameworks (e.g., WildToolBench, WideSearch, WebArena, VitaBench, slop-code-bench, OpenHands V0, BFCL v4, mini-swe-agent, claw-eval, AlgoTune), and updates the shared GenAI util types to carry more framework-level metadata.

Changes:

Extend EntryInvocation to support system_instruction and tool_definitions.
Add multiple new instrumentation packages (each with packaging metadata, utilities, and initial tests/docs) for additional benchmark frameworks.
Add/adjust framework-specific utilities, wrappers, and test scaffolding across the new instrumentation packages.

Reviewed changes

Copilot reviewed 128 out of 139 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
util/opentelemetry-util-genai/src/opentelemetry/util/genai/extended_types.py	Extends shared GenAI invocation model (adds system instruction + tool definitions to ENTRY).
packages.txt	Adds a pinned environment/package snapshot file.
instrumentation-loongsuite/loongsuite-instrumentation-wildtool/tests/test_instrumentor.py	Adds WildToolBench instrumentor lifecycle tests.
instrumentation-loongsuite/loongsuite-instrumentation-wildtool/tests/test_error_scenarios.py	Adds WildToolBench error/edge-case tests.
instrumentation-loongsuite/loongsuite-instrumentation-wildtool/tests/test_entry_span.py	Adds WildToolBench ENTRY span tests.
instrumentation-loongsuite/loongsuite-instrumentation-wildtool/tests/test_agent_span.py	Adds WildToolBench AGENT span tests.
instrumentation-loongsuite/loongsuite-instrumentation-wildtool/tests/conftest.py	Adds WildToolBench test fixtures/exporters and env setup.
instrumentation-loongsuite/loongsuite-instrumentation-wildtool/tests/init.py	Initializes WildToolBench test package.
instrumentation-loongsuite/loongsuite-instrumentation-wildtool/src/opentelemetry/instrumentation/wildtool/version.py	Introduces WildToolBench instrumentation version module.
instrumentation-loongsuite/loongsuite-instrumentation-wildtool/src/opentelemetry/instrumentation/wildtool/utils.py	Adds small WildToolBench helper utilities.
instrumentation-loongsuite/loongsuite-instrumentation-wildtool/src/opentelemetry/instrumentation/wildtool/package.py	Declares WildToolBench instrumentation “instruments” metadata.
instrumentation-loongsuite/loongsuite-instrumentation-wildtool/src/opentelemetry/instrumentation/wildtool/init.py	Implements WildToolBench instrumentor and patch lifecycle.
instrumentation-loongsuite/loongsuite-instrumentation-wildtool/README.md	Documents WildToolBench instrumentation usage/topology.
instrumentation-loongsuite/loongsuite-instrumentation-wildtool/pyproject.toml	Adds WildToolBench packaging metadata and deps.
instrumentation-loongsuite/loongsuite-instrumentation-widesearch/tests/init.py	Initializes WideSearch test package.
instrumentation-loongsuite/loongsuite-instrumentation-widesearch/src/opentelemetry/instrumentation/widesearch/version.py	Introduces WideSearch instrumentation version module.
instrumentation-loongsuite/loongsuite-instrumentation-widesearch/src/opentelemetry/instrumentation/widesearch/utils.py	Adds WideSearch conversion/extraction helpers (invocations/messages/tools).
instrumentation-loongsuite/loongsuite-instrumentation-widesearch/src/opentelemetry/instrumentation/widesearch/package.py	Declares WideSearch instrumentation “instruments” metadata.
instrumentation-loongsuite/loongsuite-instrumentation-widesearch/src/opentelemetry/instrumentation/widesearch/init.py	Implements WideSearch instrumentor and wrap points.
instrumentation-loongsuite/loongsuite-instrumentation-widesearch/README.md	Documents WideSearch instrumentation usage.
instrumentation-loongsuite/loongsuite-instrumentation-widesearch/pyproject.toml	Adds WideSearch packaging metadata and deps.
instrumentation-loongsuite/loongsuite-instrumentation-webarena/src/opentelemetry/instrumentation/webarena/version.py	Introduces WebArena instrumentation version module.
instrumentation-loongsuite/loongsuite-instrumentation-webarena/src/opentelemetry/instrumentation/webarena/package.py	Declares WebArena instrumentation “instruments” metadata.
instrumentation-loongsuite/loongsuite-instrumentation-webarena/src/opentelemetry/instrumentation/webarena/internal/_state.py	Adds WebArena cross-wrapper state management via ContextVars.
instrumentation-loongsuite/loongsuite-instrumentation-webarena/src/opentelemetry/instrumentation/webarena/internal/_attrs.py	Adds WebArena attribute constants + truncation/serialization helpers.
instrumentation-loongsuite/loongsuite-instrumentation-webarena/src/opentelemetry/instrumentation/webarena/internal/init.py	Initializes WebArena internal package.
instrumentation-loongsuite/loongsuite-instrumentation-webarena/src/opentelemetry/instrumentation/webarena/config.py	Adds WebArena env-var driven configuration.
instrumentation-loongsuite/loongsuite-instrumentation-webarena/pyproject.toml	Adds WebArena packaging metadata and deps.
instrumentation-loongsuite/loongsuite-instrumentation-vita/tests/conftest.py	Adds VitaBench test fixtures/exporters and env setup.
instrumentation-loongsuite/loongsuite-instrumentation-vita/tests/init.py	Initializes VitaBench test package.
instrumentation-loongsuite/loongsuite-instrumentation-vita/src/opentelemetry/instrumentation/vita/version.py	Introduces VitaBench instrumentation version module.
instrumentation-loongsuite/loongsuite-instrumentation-vita/src/opentelemetry/instrumentation/vita/utils.py	Adds VitaBench message/tool conversion helpers.
instrumentation-loongsuite/loongsuite-instrumentation-vita/src/opentelemetry/instrumentation/vita/package.py	Declares VitaBench instrumentation “instruments” metadata.
instrumentation-loongsuite/loongsuite-instrumentation-vita/README.md	Documents VitaBench instrumentation usage and DashScope notes.
instrumentation-loongsuite/loongsuite-instrumentation-vita/pyproject.toml	Adds VitaBench packaging metadata and deps.
instrumentation-loongsuite/loongsuite-instrumentation-vita/examples/vitabench-dashscope/setup.sh	Adds VitaBench DashScope example setup script.
instrumentation-loongsuite/loongsuite-instrumentation-vita/examples/vitabench-dashscope/README.md	Adds VitaBench DashScope example instructions.
instrumentation-loongsuite/loongsuite-instrumentation-vita/examples/vitabench-dashscope/cmd.sh	Adds VitaBench DashScope example run script.
instrumentation-loongsuite/loongsuite-instrumentation-vita/examples/init.py	Initializes VitaBench examples package.
instrumentation-loongsuite/loongsuite-instrumentation-terminus2/tests/init.py	Initializes Terminus2 test package.
instrumentation-loongsuite/loongsuite-instrumentation-terminus2/test-requirements.txt	Adds Terminus2 test requirements list.
instrumentation-loongsuite/loongsuite-instrumentation-terminus2/src/opentelemetry/instrumentation/terminus2/version.py	Introduces Terminus2 instrumentation version module.
instrumentation-loongsuite/loongsuite-instrumentation-terminus2/src/opentelemetry/instrumentation/terminus2/package.py	Declares Terminus2 instrumentation “instruments” metadata.
instrumentation-loongsuite/loongsuite-instrumentation-terminus2/pyproject.toml	Adds Terminus2 packaging metadata and deps.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/tests/test_workflow_span.py	Adds slop-code workflow/CHAIN tests.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/tests/test_task_span.py	Adds slop-code TASK span tests.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/tests/test_step_span.py	Adds slop-code STEP span tests.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/tests/test_llm_span.py	Adds slop-code LLM span tests.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/tests/test_hierarchy.py	Adds slop-code parent/child hierarchy tests.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/tests/test_entry_span.py	Adds slop-code ENTRY span tests.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/tests/test_agent_span.py	Adds slop-code AGENT span tests.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/tests/init.py	Initializes slop-code test package.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/test-requirements.txt	Adds slop-code test requirements list.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/src/opentelemetry/instrumentation/slop_code/wrappers/workflow.py	Adds slop-code CHAIN/workflow wrapper implementation.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/src/opentelemetry/instrumentation/slop_code/wrappers/tool.py	Adds slop-code TOOL wrapper implementation.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/src/opentelemetry/instrumentation/slop_code/wrappers/task.py	Adds slop-code ENTRY+TASK wrapper implementation for checkpoints.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/src/opentelemetry/instrumentation/slop_code/wrappers/step.py	Adds slop-code STEP wrapper implementation for ReAct rounds.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/src/opentelemetry/instrumentation/slop_code/wrappers/llm.py	Adds slop-code LLM wrapper implementation for rubric judge calls.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/src/opentelemetry/instrumentation/slop_code/wrappers/entry.py	Adds slop-code ENTRY wrapper implementations.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/src/opentelemetry/instrumentation/slop_code/wrappers/agent.py	Adds slop-code AGENT wrapper implementation.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/src/opentelemetry/instrumentation/slop_code/wrappers/init.py	Initializes slop-code wrappers package.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/src/opentelemetry/instrumentation/slop_code/version.py	Introduces slop-code instrumentation version module.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/src/opentelemetry/instrumentation/slop_code/utils.py	Adds slop-code helper utilities (safe getters, truncation, message schema).
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/src/opentelemetry/instrumentation/slop_code/package.py	Declares slop-code instrumentation “instruments” metadata.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/README.md	Documents slop-code instrumentation span tree and usage.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/pyproject.toml	Adds slop-code packaging metadata and deps.
instrumentation-loongsuite/loongsuite-instrumentation-openhands/tests/test_v0_wrappers.py	Adds OpenHands V0 wrapper behavior tests.
instrumentation-loongsuite/loongsuite-instrumentation-openhands/tests/init.py	Initializes OpenHands test package.
instrumentation-loongsuite/loongsuite-instrumentation-openhands/test-requirements.txt	Adds OpenHands test requirements list.
instrumentation-loongsuite/loongsuite-instrumentation-openhands/src/opentelemetry/instrumentation/openhands/version.py	Introduces OpenHands instrumentation version module.
instrumentation-loongsuite/loongsuite-instrumentation-openhands/src/opentelemetry/instrumentation/openhands/package.py	Declares OpenHands instrumentation “instruments” metadata.
instrumentation-loongsuite/loongsuite-instrumentation-openhands/src/opentelemetry/instrumentation/openhands/internal/utils.py	Adds OpenHands serialization helpers for semconv I/O/message capture.
instrumentation-loongsuite/loongsuite-instrumentation-openhands/src/opentelemetry/instrumentation/openhands/internal/session_context.py	Adds OpenHands cross-thread context bridge and tool registry.
instrumentation-loongsuite/loongsuite-instrumentation-openhands/src/opentelemetry/instrumentation/openhands/internal/constants.py	Adds OpenHands constant attribute keys/framework identity.
instrumentation-loongsuite/loongsuite-instrumentation-openhands/src/opentelemetry/instrumentation/openhands/internal/init.py	Initializes OpenHands internal package.
instrumentation-loongsuite/loongsuite-instrumentation-openhands/src/opentelemetry/instrumentation/openhands/config.py	Adds OpenHands env-var driven configuration.
instrumentation-loongsuite/loongsuite-instrumentation-openhands/README.rst	Documents OpenHands V0 instrumentation behavior and topology.
instrumentation-loongsuite/loongsuite-instrumentation-openhands/pyproject.toml	Adds OpenHands packaging metadata and deps.
instrumentation-loongsuite/loongsuite-instrumentation-minisweagent/src/opentelemetry/instrumentation/minisweagent/version.py	Introduces mini-swe-agent instrumentation version module.
instrumentation-loongsuite/loongsuite-instrumentation-minisweagent/src/opentelemetry/instrumentation/minisweagent/package.py	Declares mini-swe-agent instrumentation “instruments” metadata.
instrumentation-loongsuite/loongsuite-instrumentation-minisweagent/src/opentelemetry/instrumentation/minisweagent/internal/delegates.py	Adds mini-swe-agent TOOL span delegate (environment execute).
instrumentation-loongsuite/loongsuite-instrumentation-minisweagent/src/opentelemetry/instrumentation/minisweagent/internal/cli_wrappers.py	Adds mini-swe-agent CLI ENTRY wrapper via Typer app proxy.
instrumentation-loongsuite/loongsuite-instrumentation-minisweagent/src/opentelemetry/instrumentation/minisweagent/internal/agent_wrappers.py	Adds mini-swe-agent AGENT/STEP wrappers for DefaultAgent.
instrumentation-loongsuite/loongsuite-instrumentation-minisweagent/src/opentelemetry/instrumentation/minisweagent/internal/init.py	Initializes mini-swe-agent internal package.
instrumentation-loongsuite/loongsuite-instrumentation-minisweagent/src/opentelemetry/instrumentation/minisweagent/config.py	Adds mini-swe-agent env-var driven configuration.
instrumentation-loongsuite/loongsuite-instrumentation-minisweagent/src/opentelemetry/instrumentation/minisweagent/init.py	Implements mini-swe-agent instrumentor and patch lifecycle.
instrumentation-loongsuite/loongsuite-instrumentation-minisweagent/pyproject.toml	Adds mini-swe-agent packaging metadata and deps.
instrumentation-loongsuite/loongsuite-instrumentation-claw-eval/src/opentelemetry/instrumentation/claw_eval/version.py	Introduces claw-eval instrumentation version module.
instrumentation-loongsuite/loongsuite-instrumentation-claw-eval/src/opentelemetry/instrumentation/claw_eval/package.py	Declares claw-eval instrumentation “instruments” metadata.
instrumentation-loongsuite/loongsuite-instrumentation-claw-eval/src/opentelemetry/instrumentation/claw_eval/internal/init.py	Initializes claw-eval internal package.
instrumentation-loongsuite/loongsuite-instrumentation-claw-eval/src/opentelemetry/instrumentation/claw_eval/config.py	Adds claw-eval env-var driven configuration.
instrumentation-loongsuite/loongsuite-instrumentation-claw-eval/pyproject.toml	Adds claw-eval packaging metadata and deps.
instrumentation-loongsuite/loongsuite-instrumentation-bfclv4/tests/test_instrumentor.py	Adds BFCL v4 smoke tests (graceful instrument/uninstrument).
instrumentation-loongsuite/loongsuite-instrumentation-bfclv4/tests/init.py	Initializes BFCL v4 test package.
instrumentation-loongsuite/loongsuite-instrumentation-bfclv4/src/opentelemetry/instrumentation/bfclv4/version.py	Introduces BFCL v4 instrumentation version module.
instrumentation-loongsuite/loongsuite-instrumentation-bfclv4/src/opentelemetry/instrumentation/bfclv4/utils.py	Adds BFCL v4 content-capture helper utilities.
instrumentation-loongsuite/loongsuite-instrumentation-bfclv4/src/opentelemetry/instrumentation/bfclv4/package.py	Declares BFCL v4 instrumentation “instruments” metadata.
instrumentation-loongsuite/loongsuite-instrumentation-bfclv4/src/opentelemetry/instrumentation/bfclv4/internal/threading_propagation.py	Adds BFCL v4 context-propagating ThreadPoolExecutor.
instrumentation-loongsuite/loongsuite-instrumentation-bfclv4/src/opentelemetry/instrumentation/bfclv4/internal/state.py	Adds BFCL v4 per-thread ReAct state via contextvars.
instrumentation-loongsuite/loongsuite-instrumentation-bfclv4/src/opentelemetry/instrumentation/bfclv4/internal/provider.py	Adds BFCL v4 provider inference/mapping logic.
instrumentation-loongsuite/loongsuite-instrumentation-bfclv4/src/opentelemetry/instrumentation/bfclv4/internal/attributes.py	Adds BFCL v4 attribute key constants.
instrumentation-loongsuite/loongsuite-instrumentation-bfclv4/src/opentelemetry/instrumentation/bfclv4/internal/init.py	Initializes BFCL v4 internal package.
instrumentation-loongsuite/loongsuite-instrumentation-bfclv4/README.md	Documents BFCL v4 instrumentation usage/topology.
instrumentation-loongsuite/loongsuite-instrumentation-bfclv4/pyproject.toml	Adds BFCL v4 packaging metadata and deps.
instrumentation-loongsuite/loongsuite-instrumentation-bfclv4/CHANGELOG.md	Adds BFCL v4 changelog for initial release.
instrumentation-loongsuite/loongsuite-instrumentation-algotune/src/opentelemetry/instrumentation/algotune/version.py	Introduces AlgoTune instrumentation version module.
instrumentation-loongsuite/loongsuite-instrumentation-algotune/src/opentelemetry/instrumentation/algotune/package.py	Declares AlgoTune instrumentation “instruments” metadata.
instrumentation-loongsuite/loongsuite-instrumentation-algotune/src/opentelemetry/instrumentation/algotune/internal/utils.py	Adds AlgoTune shared utilities (truncation, provider inference, STEP cleanup).
instrumentation-loongsuite/loongsuite-instrumentation-algotune/src/opentelemetry/instrumentation/algotune/internal/init.py	Initializes AlgoTune internal package.
instrumentation-loongsuite/loongsuite-instrumentation-algotune/src/opentelemetry/instrumentation/algotune/config.py	Adds AlgoTune env-var driven configuration.
instrumentation-loongsuite/loongsuite-instrumentation-algotune/pyproject.toml	Adds AlgoTune packaging metadata and deps.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+aiohappyeyeballs==2.6.1
+aiohttp==3.10.2
+aiosignal==1.3.1
+aliyun-instrumentation-sglang @ file:///Users/liuziming/Desktop/loongsuite-python-agent/instrumentation/aliyun-instrumentation-sglang
+aliyun-instrumentation-vllm @ file:///Users/liuziming/Desktop/loongsuite-python-agent/instrumentation/aliyun-instrumentation-vllm
+-e git+https://github.com/alibaba/loongsuite-python-agent.git@fe5b8bf1938dcd449dfa335234b58af81b00bc98#egg=aliyun_sdk_extension_arms&subdirectory=sdk-extension/aliyun-sdk-extension-arms
+aliyun-semantic-conventions==1.2.0
+annotated-types==0.7.0
+anyio==4.10.0
+asgiref==3.8.1


+        problem_name = args[1] if len(args) > 1 else kwargs.get("problem_name", "unknown")
+        config = args[2] if len(args) > 2 else kwargs.get("config")
+
+        span_name = f"chain {problem_name}"
+
+        attrs = {


+        entry_attrs = {
+            gen_ai_attributes.GEN_AI_OPERATION_NAME: "enter",
+            gen_ai_attributes.GEN_AI_SYSTEM: SYSTEM_NAME,
+            gen_ai_extended_attributes.GEN_AI_SPAN_KIND: "ENTRY",
+            "gen_ai.framework": SYSTEM_NAME,
+            "gen_ai.session.id": str(problem_name),
+        }
+        task_attrs = {
+            gen_ai_attributes.GEN_AI_OPERATION_NAME: "run_task",
+            gen_ai_attributes.GEN_AI_SYSTEM: SYSTEM_NAME,
+            gen_ai_extended_attributes.GEN_AI_SPAN_KIND: "TASK",
+            "gen_ai.framework": SYSTEM_NAME,
+            "input.value": str(checkpoint_name),
+            "input.mime_type": "text/plain",
+            "slop_code.checkpoint.name": str(checkpoint_name),
+            "slop_code.is_first_checkpoint": bool(is_first_checkpoint),
+        }
+        if checkpoint_order is not None:
+            task_attrs["slop_code.checkpoint.order"] = checkpoint_order
+
+        with self._tracer.start_as_current_span(
+            name="enter_ai_application_system",
+            kind=SpanKind.INTERNAL,
+            attributes=entry_attrs,
+        ) as entry_span:
+            with self._tracer.start_as_current_span(
+                name=f"run_task {checkpoint_name}",
+                kind=SpanKind.INTERNAL,
+                attributes=task_attrs,
+            ) as task_span:


+    def __call__(self, wrapped, instance, args, kwargs):
+        with self._tracer.start_as_current_span(
+            name="enter_ai_application_system",
+            kind=SpanKind.INTERNAL,
+            attributes={
+                gen_ai_attributes.GEN_AI_OPERATION_NAME: "enter",
+                gen_ai_attributes.GEN_AI_SYSTEM: SYSTEM_NAME,
+                gen_ai_extended_attributes.GEN_AI_SPAN_KIND: gen_ai_extended_attributes.GenAiSpanKindValues.ENTRY.value,
+                "gen_ai.framework": SYSTEM_NAME,
+            },
+        ) as span:


+        with self._tracer.start_as_current_span(
+            name=f"invoke_agent {agent_name}",
+            kind=SpanKind.INTERNAL,
+            attributes=attrs,
+        ) as span:


+    def __call__(self, wrapped, instance, args, kwargs):
+        usage = safe_get(instance, "usage")
+        current_steps = safe_get(usage, "steps", 0) if usage else 0
+        step_num = current_steps + 1
+
+        messages = safe_get(instance, "_messages", [])
+        attrs = {
+            gen_ai_attributes.GEN_AI_OPERATION_NAME: "react",
+            gen_ai_attributes.GEN_AI_SYSTEM: SYSTEM_NAME,
+            gen_ai_extended_attributes.GEN_AI_SPAN_KIND: gen_ai_extended_attributes.GenAiSpanKindValues.STEP.value,
+            gen_ai_extended_attributes.GEN_AI_REACT_ROUND: step_num,
+            "gen_ai.framework": SYSTEM_NAME,
+        }
+        if messages:
+            attrs["gen_ai.input.messages"] = genai_messages(messages)
+
+        span = self._tracer.start_span("react step", kind=SpanKind.INTERNAL, attributes=attrs)
+        token = context_api.attach(trace_api.set_span_in_context(span))
+        setattr(instance, _STEP_SPAN_ATTR, span)
+        setattr(instance, _STEP_TOKEN_ATTR, token)
+


+        if handler_cls in self._patched_handler_classes:
+            return
+        self._patched_handler_classes.add(handler_cls)
+


The AGENT/ENTRY spans previously JSON-stringified BFCL's nested ``[[{...}],[{...}]]`` question/result structure into a single message content, producing the surprising "content has a serialised array inside it" pattern. Now flattens the structure one level so each role/content pair becomes its own ``{role, parts:[{type,content}]}`` message on both ``gen_ai.input.messages`` and ``gen_ai.output.messages``. Also surfaces BFCL-captured error strings (``Error during inference:``, ``Error during execution:``) and unhandled wrapped exceptions via ``span.record_exception`` so spans marked ERROR carry a visible exception event with the error message instead of just a status code. Change-Id: I372e87b683f907431889ac4d306bf6c235ec36ac Co-developed-by: Claude <noreply@anthropic.com>

…ments DashScope's OpenAI-compatible streaming response can emit tool-call argument deltas with `arguments=None`, which made `"".join(tool_call.arguments)` raise `TypeError: sequence item N: expected str instance, NoneType found` during span finalization and aborted every bfclv4 benchmark run. Filter out None parts at both legacy and current stream-wrapper join sites. Change-Id: I76b8e0104dacac1a1ecebd41be74283700d46f2c Co-developed-by: Claude <noreply@anthropic.com> Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

…hecker multi_turn_checker re-exports execute_multi_turn_func_call and invokes it twice per entry during evaluation: once to replay the model's tool-call trace, once to replay the ground-truth trace. Both calls run *after* inference, outside any ENTRY/AGENT/STEP context, so each one produced a trace-rooted orphan TOOL span. Drop checker from the wrap targets; the two inference-side bindings (multi_turn_utils source module + base_handler re-export) still cover every TOOL span we actually want. Change-Id: Ife24d8ba2595fc2c10a0dcfc47de5521ad67d3c7 Co-developed-by: Claude <noreply@anthropic.com> Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Wrapping multi_turn_utils.execute_multi_turn_func_call replaces the source module attribute, which is fine at instrument time but causes orphan TOOL spans later: bfcl_eval/__main__.py lazily imports multi_turn_checker.py during `bfcl evaluate`, and that import resolves the wrapper into checker's local binding. Ground-truth replay then emits TOOL spans outside any ENTRY/AGENT/STEP context. Wrap only base_handler's local binding, which is set during BaseHandler.inference wrap in step 2 and is the sole inference-time caller of execute_multi_turn_func_call. Change-Id: I9145484b1fa9b8bf9cc3899b6a551aec62b856ac Co-developed-by: Claude <noreply@anthropic.com> Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

musi and others added 23 commits May 7, 2026 01:41

feat: support bfclv4

7b144aa

(cherry picked from commit 3d08e03) Co-authored-by: 123liuziming <32130965+123liuziming@users.noreply.github.com>

feat: support widesearch

74ab41d

Change-Id: I84e87248e0eec61fa8f7fa68dbe85e5181ddede8 (cherry picked from commit 2071e80) Co-authored-by: 123liuziming <32130965+123liuziming@users.noreply.github.com>

feat: support vita

eab9439

Change-Id: I71842eb28f7a3c8d5c0fb0e9e2caec31e69d19f0 (cherry picked from commit 9abf7a1) Co-authored-by: 123liuziming <32130965+123liuziming@users.noreply.github.com>

feat: support slop code

4f2b049

Change-Id: Ieea04708467272866f5b7d9b905a2a648e6adb2d (cherry picked from commit 80e202c) Co-authored-by: 123liuziming <32130965+123liuziming@users.noreply.github.com>

feat: support wild-tool

de59631

Change-Id: I0da98161cbdbe6a51b963bcc19f45a3d2d977968 (cherry picked from commit b7e7a4b) Co-authored-by: 123liuziming <32130965+123liuziming@users.noreply.github.com>

feat: support mini-swe agent

6adc21b

Change-Id: I591e9e1b67fa5f3f9cd0d03270335160502d95f4

fix: fix no input/output in widesearch

2dd09c9

Change-Id: Id33add56b2f784f4c46858f3b46134fd0076df9b

feat: remove useless token usage

635370e

Change-Id: Ia3e1ef993ef4a8578ffb15f627a7ea4967054aa2

Copilot AI review requested due to automatic review settings May 26, 2026 08:17

Copilot started reviewing on behalf of 123liuziming May 26, 2026 08:17 View session

github-actions Bot assigned 123liuziming, Cirilla-zmh and ralf0131 May 26, 2026

Copilot AI reviewed May 26, 2026

View reviewed changes

123liuziming and others added 2 commits May 26, 2026 16:36

Copilot AI review requested due to automatic review settings May 26, 2026 09:44

Copilot started reviewing on behalf of 123liuziming May 26, 2026 09:45 View session

Copilot AI reviewed May 26, 2026

View reviewed changes

123liuziming and others added 2 commits May 26, 2026 18:36

Copilot AI review requested due to automatic review settings May 26, 2026 11:50

Copilot started reviewing on behalf of 123liuziming May 26, 2026 12:16 View session

Copilot AI reviewed May 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: support multiple benchmark framework!#200

feat: support multiple benchmark framework!#200
123liuziming wants to merge 27 commits into
mainfrom
feat/bench

123liuziming commented May 26, 2026

Uh oh!

CLAassistant commented May 26, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

123liuziming commented May 26, 2026

Description

Type of change

How Has This Been Tested?

Does This PR Require a Core Repo Change?

Checklist:

Uh oh!

CLAassistant commented May 26, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants