
feat | instrumentation engine, attestation, and adapter integrations#88

Open
garrettallen14 wants to merge 38 commits into main from development

Conversation

@garrettallen14
Contributor

No description provided.

garrettallen14 and others added 17 commits March 24, 2026 17:43
…ing (#83)

* feat: context propagation and upload circuit breaker

* feat: updates + new adapters

* feat: unify context model, per-client uploads, and adapter hardening

* fix: update crewai
* feat: context propagation and upload circuit breaker

* feat: updates + new adapters

* feat: unify context model, per-client uploads, and adapter hardening

* fix: update crewai

* feat: new adapters
feat | agentforce, agno, autogen, bedrock adapters
m-peko added 12 commits May 18, 2026 15:06
Brings in the SDK samples overhaul (70+ samples), the auto-release
workflow, CHANGELOG, the custom-model update/delete API, and the
restored test files we'd lost on this branch (test_samples.py,
test_samples_e2e.py, test_mcp_server.py, etc.).

pyproject.toml conflicted in the tests/** per-file-ignores section -- I
kept main's broader set (T201, T203, ARG, B007) so the restored
test_samples_e2e.py keeps linting, and dropped my orphan
examples/instrument_{openai,langchain}.py files in favour of main's
renamed samples/integrations/{openai,langchain}_instrumented.py.
tests/conftest.py auto-merged main's "live" pytest marker.
Mirrors the _FRAMEWORK_PACKAGES pattern from ateam without dragging in
the singleton machinery. discover_installed() uses importlib.util
find_spec so detection is cheap and has no import side effects;
auto(client) instantiates and connects whichever frameworks are
importable in the current env.

Providers stay explicit since they need the user's SDK client, which we
don't have at auto() time. Same for agentforce and langfuse, which
need credentials at connect.

Both helpers are re-exported from layerlens.instrument. Drift-guard
tests pin the three lookup tables to stay consistent.
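
For reviewers, the detection side boils down to something like the sketch below. It is a minimal illustration, not the shipped code: the table contents and the optional `packages` argument are assumptions made so the example runs on its own.

```python
import importlib.util

# Hypothetical lookup table: adapter key -> importable top-level module name.
_FRAMEWORK_PACKAGES = {
    "langgraph": "langgraph",
    "crewai": "crewai",
    "semantic_kernel": "semantic_kernel",
}


def discover_installed(packages=None):
    """Return the adapter keys whose backing module can be located.

    find_spec() on a top-level name only consults the import machinery's
    finders; it does not execute the module, so there are no side effects.
    """
    packages = _FRAMEWORK_PACKAGES if packages is None else packages
    found = []
    for key, module_name in packages.items():
        try:
            spec = importlib.util.find_spec(module_name)
        except (ImportError, ValueError):
            # Broken or half-installed packages are treated as absent.
            spec = None
        if spec is not None:
            found.append(key)
    return found
```

An `auto(client)` helper would then loop over the returned keys and connect only the matching adapters.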
After every node exit we hash the output state and emit the digest as
agent.state.change, so the dashboard can diff state across nodes
without needing the raw payloads. Uses the same compute_hash as the
attestation chain so the format matches.

Constructor knobs:
- emit_state_hash=False to turn it off entirely
- state_include_keys / state_exclude_keys to scope the hash to a
  subset of the state dict

Non-serialisable state falls back to a repr-based hash so we still
emit something stable. agent.state.change is in _ALWAYS_ENABLED, so no
layer gating needed.
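
A rough sketch of the hashing step, assuming canonical JSON fed into SHA-256. The real `compute_hash` may canonicalise differently; `hash_state` and its parameter names are illustrative only.

```python
import hashlib
import json


def hash_state(state, include_keys=None, exclude_keys=None):
    """Digest a state dict, optionally scoped to a subset of its keys."""
    scoped = {
        key: value
        for key, value in state.items()
        if (include_keys is None or key in include_keys)
        and (exclude_keys is None or key not in exclude_keys)
    }
    try:
        # sort_keys makes the digest independent of dict insertion order.
        canonical = json.dumps(scoped, sort_keys=True, separators=(",", ":"))
    except (TypeError, ValueError):
        # Non-serialisable values: fall back to repr. This is only stable if
        # the objects' repr() does not embed memory addresses.
        canonical = repr(sorted(scoped.items(), key=lambda kv: str(kv[0])))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Scoping before hashing means a noisy key (a timestamp, say) can be excluded without changing how the rest of the state is digested.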
Whenever the active langgraph_node transitions between distinct named
agents we emit agent.handoff. That puts LangGraph in line with the
OpenAI Agents and Google ADK adapters, which already detect handoffs
natively.

HandoffDetector is intentionally framework-agnostic -- I'll reuse it
for the CrewAI delegation work next. Same-node revisits and the first
node observed don't emit, so the noise stays low.

Context gets scrubbed through the same allow-list ateam uses (task,
messages, objective, etc.) with long strings truncated and long lists
collapsed to placeholders, then hashed so dashboards can correlate
handoffs without seeing the raw state.
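
The transition rule is small enough to sketch. This is an illustration under assumptions: the `observe` method name and the returned dict shape are made up for the example, and the sequence counter is borrowed from the CrewAI payload rather than confirmed for this class.

```python
class HandoffDetector:
    """Report a handoff whenever the active agent changes to a different name."""

    def __init__(self):
        self._current = None
        self._sequence = 0

    def observe(self, agent_name):
        """Return a handoff record on a real transition, otherwise None."""
        previous = self._current
        if agent_name is None or agent_name == previous:
            return None  # same-node revisit: nothing to report
        self._current = agent_name
        if previous is None:
            return None  # first node observed: no source agent yet
        self._sequence += 1
        return {
            "from_agent": previous,
            "to_agent": agent_name,
            "sequence": self._sequence,
        }
```

Because the detector only sees agent names, the same object can sit behind any framework that can report which agent is currently active.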
Two related pieces.

inject_headers / extract_headers let user code stitch our traces into a
wider distributed-tracing system. If OpenTelemetry is installed we
delegate to its propagator; otherwise we build traceparent by hand
from the active TraceCollector and current span. Our 16-hex trace ids
get zero-padded to 32 hex on the wire and shortened back on extract.
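
The hand-built fallback could look roughly like this. The header layout follows W3C Trace Context (`version-traceid-parentid-flags`); the function names, the left-padding direction, and the 16-hex span id are assumptions for the sketch.

```python
import re

_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def build_traceparent(trace_id, span_id, sampled=True):
    """Format a traceparent header, left-padding a short trace id to 32 hex."""
    wire_trace_id = trace_id.lower().rjust(32, "0")
    flags = "01" if sampled else "00"
    return f"00-{wire_trace_id}-{span_id.lower()}-{flags}"


def parse_traceparent(header):
    """Return (trace_id, span_id), or None when the header is malformed."""
    match = _TRACEPARENT_RE.match(header.strip().lower())
    if match is None:
        return None
    _version, wire_trace_id, span_id, _flags = match.groups()
    if wire_trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # all-zero ids are invalid per the spec
    # Shorten back to 16 hex only when the upper half is pure padding, so a
    # genuine 32-hex id from another system is passed through untouched.
    if wire_trace_id.startswith("0" * 16):
        return wire_trace_id[16:], span_id
    return wire_trace_id, span_id
```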

gen_ai_attributes() returns a dict of OTel GenAI semconv attributes
(gen_ai.system, gen_ai.operation.name, request params, response model
/ id / finish_reasons, usage tokens). The provider emit helper now
embeds this under otel_gen_ai on every model.invoke, so OTel-aware
tooling can read the standard names without having to re-map our
internal field names.
Hierarchical crews delegate work through the built-in "Delegate work
to coworker" and "Ask question to coworker" tools, but older crewai
versions don't fire AgentDelegationStartedEvent for them. That left
the handoff invisible in our traces.

The tool-call path now matches those tool names case-insensitively and
synthesises agent.handoff with from_agent (current agent role),
to_agent (coworker arg), tool_name, a sequence number, and a sha256
hash over the scrubbed task+context. The typed-event handler bumps
the same sequence so newer crewai versions emit identical payloads.

tool_args is parsed robustly -- crewai sometimes passes it as a dict,
sometimes as a JSON string. Context scrubbing reuses _handoff.scrub_context
for parity with the LangGraph handoff format.
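
As a sketch of the matching and parsing described above (helper names are hypothetical, and the empty-dict fallback for unparseable input is an assumption):

```python
import json

# Tool names from the commit message, lower-cased for comparison.
_DELEGATION_TOOLS = {"delegate work to coworker", "ask question to coworker"}


def is_delegation_tool(tool_name):
    """Case-insensitive match against the built-in delegation tool names."""
    return isinstance(tool_name, str) and tool_name.strip().lower() in _DELEGATION_TOOLS


def parse_tool_args(tool_args):
    """Normalise tool_args to a dict whether it arrives as a dict or JSON text."""
    if isinstance(tool_args, dict):
        return dict(tool_args)
    if isinstance(tool_args, (str, bytes)):
        try:
            parsed = json.loads(tool_args)
        except (TypeError, ValueError):
            return {}  # not JSON: treat as "no usable arguments"
        return parsed if isinstance(parsed, dict) else {}
    return {}
```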
Wraps semantic-kernel's AgentChat / AgentGroupChat invoke (an async
generator that yields ChatMessageContent) and processes each yielded
message for:

- tool calls / results from FunctionCall items
- model.invoke + cost.record derived from message.metadata
- agent.handoff on agent_name turn transitions, via the shared
  HandoffDetector

A one-shot environment.config event fires per chat instance on its
first invocation, capturing the chat type, agents, plugins, and
selection / termination strategy class names.

Provider detection covers the usual suspects (gpt/o1/o3 -> openai,
claude -> anthropic, gemini -> google, etc.) and falls back to
azure_openai, since that's what MS Agent Framework fronts most of the
time.
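
A minimal sketch of that prefix rule follows. The table covers only the prefixes named above, and `detect_provider` is an illustrative name, not necessarily the one in the adapter.

```python
# (model-name prefix, provider) pairs, checked in order.
_PROVIDER_PREFIXES = (
    ("gpt", "openai"),
    ("o1", "openai"),
    ("o3", "openai"),
    ("claude", "anthropic"),
    ("gemini", "google"),
)


def detect_provider(model_name, default="azure_openai"):
    """Map a model name to a provider by prefix, falling back to the default."""
    name = (model_name or "").strip().lower()
    for prefix, provider in _PROVIDER_PREFIXES:
        if name.startswith(prefix):
            return provider
    return default
```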

Registered in the auto-detection tables, so layerlens.instrument.auto()
picks it up when semantic-kernel is installed. Coexists fine with the
existing SemanticKernelAdapter -- they instrument different surfaces
(filters vs AgentChat wrapping).
EmbeddingAdapter wraps OpenAI / Cohere / sentence-transformers and
emits embedding.create with provider, model, batch size, vector
dimensions, token usage, and latency. Pass-through when no collector
is active so it adds no overhead outside a trace.

VectorStoreAdapter does the same for Pinecone, Chroma, and Weaviate
(near_vector / near_text). retrieval.query events carry the query
shape, result count, and a min/max/mean over scores or distances.

BenchmarkImporter lives under layerlens.benchmarks rather than the
adapters tree -- it's a data-conversion utility, not an instrumentation
tracer (ateam's own docstring flagged the naming inconsistency in
their version). Reads HuggingFace Datasets, HELM result JSON, CSV,
JSON arrays, and JSONL. Optional schema_mapping renames source fields
to layerlens canonical names.
TracedMemory is a transparent proxy around any LangChain memory object.
save_context and clear are intercepted; before the call we hash the
memory's loaded variables, after the call we hash again, and if the
hash changes we emit agent.state.change. Everything else passes
through.

For workflows where save_context happens outside our control (e.g.
inside a third-party agent), MemoryMutationTracker is a context manager
that frames a logical operation and emits one event per logical
operation rather than one per save_context call.

Hashing uses the same compute_hash as the attestation chain, so
before/after digests are comparable across the LangChain and LangGraph
adapters. Non-serialisable memory contents fall back to a repr-based
hash so we still get a stable identifier.

Exported as wrap_memory / TracedMemory / MemoryMutationTracker from
layerlens.instrument.adapters.frameworks.langchain.
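
The proxy idea, sketched under assumptions: the wrapped object exposes the LangChain-style `load_memory_variables` / `save_context` / `clear` methods, `emit` is a hypothetical callback standing in for the collector, and the payload keys are made up for the example.

```python
import hashlib
import json


def _digest(value):
    """Hash a value via canonical JSON, falling back to repr()."""
    try:
        canonical = json.dumps(value, sort_keys=True)
    except (TypeError, ValueError):
        canonical = repr(value)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class TracedMemorySketch:
    """Transparent proxy that reports when a memory mutation changes state."""

    def __init__(self, memory, emit):
        self._memory = memory
        self._emit = emit

    def __getattr__(self, name):
        # Only reached when normal lookup fails, i.e. for everything this
        # class does not intercept: plain pass-through to the wrapped object.
        return getattr(self._memory, name)

    def _traced(self, method_name, *args, **kwargs):
        before = _digest(self._memory.load_memory_variables({}))
        result = getattr(self._memory, method_name)(*args, **kwargs)
        after = _digest(self._memory.load_memory_variables({}))
        if after != before:
            self._emit(
                "agent.state.change",
                {"operation": method_name, "before_hash": before, "after_hash": after},
            )
        return result

    def save_context(self, inputs, outputs):
        return self._traced("save_context", inputs, outputs)

    def clear(self):
        return self._traced("clear")
```

Clearing an already-empty memory leaves the hash unchanged, so no event is emitted for it.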
The existing layerlens.replay subpackage already drives full replay via
ReplayController; this fills in the missing persistence piece so a
TraceCollector can round-trip through disk.

TraceCollector.to_replay_dict() returns the same payload that flush()
uploads (trace_id, events, capture_config, attestation), but without
sealing the hash chain -- the collector stays usable for further
emits. _build_trace_payload now takes a seal flag; flush() still seals,
to_replay_dict doesn't.

New layerlens.replay.snapshot module:
- dump / dump_collector / load_snapshot for the file-IO side
- replay_events to re-emit captured events into a fresh collector
- serialize_adapter mirrors the per-adapter serialize_for_replay
  pattern from ateam, bundling AdapterInfo + current trace into one
  dict
GA check for protocol adapter classes. Verifies the class extends
BaseProtocolAdapter, sets non-empty PROTOCOL and PROTOCOL_VERSION,
implements connect / disconnect / adapter_info, returns the right
types from adapter_info() (AdapterInfo with adapter_type="protocol")
and probe_health() (ProtocolHealth), and that negotiate_version picks
an exact match when offered.

Result types are JSON-serialisable dataclasses; failures are
partitioned by severity, so "couldn't instantiate the class to check
runtime shape" surfaces as a warning while contract violations are
errors.

Runs against the three shipped shim adapters (a2ui, ap2, ucp) in a
parametrised test, so a regression in any of them surfaces on the next
run.

Also has to defensively ensure an asyncio event loop exists before
instantiation -- BaseProtocolAdapter creates an asyncio.Semaphore in
__init__ and the suite would otherwise break when run after
asyncio-heavy tests that closed their loop.
Two coupled changes that have to land together because either alone
breaks rye run lint.

~265 files reformatted by ruff-format. These accumulated as the
pre-commit hook's format pass touched the broader repo after main's
70+ new samples and the restored test files came in. Format-only --
no semantic edits.

The formatter also wrapped three suppressions onto the wrong line,
which I had to put back:

- probe_health's # noqa: ARG002 ended up on the return-type line
  instead of the line declaring the unused `endpoint` arg.
- langchain_core's BaseCallbackHandler import got wrapped onto
  multiple lines, leaving the # pyright: ignore on the closing paren
  where pyright doesn't honour it. Pinned to a one-liner inside
  # fmt: off/on.
- ProtocolCertificationSuite._safe_instantiate took a parameter named
  `cls`, which pyright reserves for classmethods. Renamed to
  target_cls.

Also added .venv* to .gitignore so locally-created Python alt envs
don't show up in status.
Newer crewai inspects each handler's parameter count and passes a
third `state` positional when there are 3 params. We were using
`def _handler(source, event, _m=method)` — i.e. a default-arg closure
to capture the bound method — which crewai then clobbered by passing
state as the third arg, leaving _m = state (not callable) and the
handler raising 'NoneType' object is not callable.

Switched to a factory closure (`_make_handler(target)`) so the
visible signature is exactly (source, event) — crewai takes the
2-arg path and the bound method is captured in the closure properly.

Surfaced by the new tests/e2e CrewAI delegation tests under Python
3.11 with crewai 1.14, where the real event bus dispatches handlers
through a ThreadPoolExecutor. The existing unit tests didn't catch
it because they invoke the adapter's _on_* methods directly rather
than going through the event bus.
m-peko changed the title from "feat | instrumentation engine, attestation, and 16 adapter integrations" to "feat | instrumentation engine, attestation, and adapter integrations" on May 19, 2026
Contributor

@stepdi stepdi left a comment


Summary

Substantial implementation across instrumentation, attestation, replay, and adapter integrations (~43k LOC, 30 commits). Cross-cutting code quality is unusually clean — zero hallucinated imports across 359 from layerlens.* references (verified — every target module exists in the repo), zero TODO/FIXME/HACK/XXX markers in src/, 1.19× test-to-source LOC ratio (33,433 / 28,174), all framework/provider/protocol extras properly tiered in pyproject.toml. The _base_provider.py / _base_framework.py hierarchies are well-designed and reduce per-adapter duplication.

Before merge I'd want six items addressed below — three blockers, three scope/naming decisions.


Blockers

1. CI red on Python 3.10 / 3.11 / 3.12 (3.9 passes)

Four distinct failures, all targeted:

  • test_openai_agents.py — 17+ failures with TypeError: SpanImpl.__init__() got an unexpected keyword argument 'tracing_api_key'. Upstream openai-agents SDK API change. Pin a compatible version or update the test fixture.
  • test_e2e_crewai_delegation.py::test_chain_of_delegations_keeps_sequence: assert len(handoffs) == 3 → 2. Off-by-one in chain-delegation detection.
  • SemanticKernel plugin detection: assert "MathPlugin" in plugin_names → 'MathPlugin' in set(). Likely version-pin sensitive.
  • Test isolation: assert _current_collector.get() is None → <TraceCollector object ...>. ContextVar not cleaned between tests.

2. ms_agent_framework.py imports semantic_kernel, not agent-framework

  • ms_agent_framework.py:33 does import semantic_kernel.
  • :53 sets package = "semantic-kernel".
  • Class docstring (:41): "Microsoft Agent Framework (semantic-kernel agents)."
  • _registry.py:35-38 acknowledges: "MS Agent Framework ships as part of semantic-kernel; we share the detection key. Both adapters can coexist — they instrument different surface areas (filters vs AgentChat wrapping)."

Two different PyPI packages exist:

  • semantic-kernel — already instrumented separately by semantic_kernel.py.
  • agent-framework — v1.4.0 on PyPI ("Microsoft Agent Framework for building AI Agents with Python"). Not currently instrumented.

Two options: rename module/class to semantic_kernel_agents to honestly describe what it instruments, or replace semantic_kernel imports with agent_framework.

3. Bedrock streaming + tool-call extraction incomplete

bedrock.py declares streaming support by wrapping methods (:63-70) but emits a placeholder event extra={"streaming": True, "method": method} (:187) and never aggregates chunks. The module docstring (lines 7-9) is explicit about the StreamingBody single-read constraint, but the practical effect: customers running Bedrock streaming get traces with no content, no usage, no cost.

Separately, _extract_invoke_output (:241-263) and _extract_converse_output (:284-292) both filter to text-only blocks ("text" in block), dropping tool_use blocks entirely. Direct Anthropic adapter handles tool_use (anthropic.py:48, 51, 112, 116, 285, 312); the parsing could be lifted for Bedrock-Anthropic invoke and Converse toolUse content blocks.

Also: bedrock.py:41 inherits from BaseAdapter (not MonkeyPatchProvider like other providers), re-implements emission, and reaches a private helper via from ._emit_helpers import _emit_cost # type: ignore[attr-defined] (bedrock.py:26; helper at _emit_helpers.py:164). Either promote _emit_cost to public or refactor to inherit from the base.


Concerns

4. A2UI and UCP shipped without upstream protocol references

a2ui.py (110 LOC) and ucp.py (163 LOC) sit alongside A2A, MCP, AG-UI, and AP2 protocol adapters but differ:

  • No pyproject.toml extras for either (vs a2a-sdk, mcp, etc.).
  • Zero upstream imports in either file.
  • No spec URL in module docstrings.
  • a2ui — no PyPI package by that name.
  • ucp — a ucp package exists on PyPI but it's an unrelated SMS protocol wrapper ("Python EMI UCP protocol wrapper"). A third-party universal-commerce-protocol v0.0.1 also exists from upsonic/universal-commerce-protocol on GitHub, but LayerLens's ucp.py doesn't import or interop with it.

a2ui.py defines commerce.ui.* event vocabulary with method names on_surface_created, on_user_action (lines 36-37). ucp.py defines discover_suppliers, browse_catalog, start_checkout, complete_checkout, issue_refund (lines 37-41).

Both pass ProtocolCertificationSuite.certify() with the same stamp as A2A and MCP, because the suite is a structural conformance checker (verifies issubclass, PROTOCOL_VERSION non-empty, methods callable) — not a real protocol handshake.

If A2UI/UCP are internal LayerLens proposals: prefix with layerlens_ and document them as internal observability schemas. If they're stubs for protocols that don't exist yet: drop until there's an upstream spec.

5. Haystack adapter wired into auto-detection

haystack.py _on_connect mutates _hs_tracing.tracer.actual_tracer = self._tracer globally; _registry.py:88-90 wires HaystackAdapter into the auto-detection list. If Haystack was meant to be a separate product-surface decision, this commits us to it implicitly. Worth confirming with product before merge.

6. Protocol version strings drift from upstream

| Adapter | PROTOCOL_VERSION in PR | Upstream spec | Upstream Python SDK on PyPI |
| --- | --- | --- | --- |
| A2A | "0.3.0" (a2a/adapter.py:38) | v1.0.0 (released 2026-03-12; v0.3.0 was current 2025-07-30) | a2a-sdk v1.0.3 |
| MCP | "1.0.0" (mcp/adapter.py:43) | date-format: LATEST_PROTOCOL_VERSION = "2025-11-25" in upstream python-sdk; SUPPORTED list: "2024-11-05", "2025-03-26", "2025-06-18", "2025-11-25" | mcp v1.27.1 |
| AG-UI | "0.1.0" (agui/adapter.py:27) | (didn't verify spec version directly) | ag-ui-protocol v0.1.18 |
| AP2 | "0.1.0" (ap2.py:41) | v0.2.0 (released 2026-04-28); full name is Agent Payments Protocol, not "Agent Protocol 2" | ap2 v0.1.1 |

Two issues:

  • MCP "1.0.0" is neither a valid protocol-spec version (which is date-formatted) nor the SDK version (semver 1.27.x). Either pull from mcp.types.LATEST_PROTOCOL_VERSION at runtime, or set to a sentinel until negotiation is implemented.
  • A2A "0.3.0", AP2 "0.1.0" lag current upstream. These show up in trace events and certification output.
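
For the MCP case, the runtime lookup could be as small as the sketch below. The helper name and the `"unknown"` sentinel are assumptions; `mcp.types.LATEST_PROTOCOL_VERSION` is the constant referenced in the table.

```python
def resolve_mcp_protocol_version(fallback="unknown"):
    """Return the SDK's latest protocol version, or a sentinel if mcp is absent."""
    try:
        from mcp.types import LATEST_PROTOCOL_VERSION
    except ImportError:
        # SDK not installed: report a sentinel instead of a made-up version.
        return fallback
    return str(LATEST_PROTOCOL_VERSION)
```

Resolving at runtime keeps the reported version in step with whichever SDK release is actually installed.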

Also: AP2 stands for Agent Payments Protocol, not "Agent Protocol 2." If any docstrings/README/marketing reference the latter, they need updating.


Minor / nice-to-have

7. LiteLLM adapter doesn't wire new base hooks

MonkeyPatchProvider defines extract_tool_calls (_base_provider.py:38) and aggregate_stream (:44). litellm.py delegates extract_output and extract_meta to OpenAIProvider but doesn't delegate the two new hooks — so tool calls are dropped and streaming aggregation returns the no-op default. Two-line fix:

extract_tool_calls = staticmethod(OpenAIProvider.extract_tool_calls)
aggregate_stream  = staticmethod(OpenAIProvider.aggregate_stream)

8. Ollama cost_per_second parameter is unused

OllamaProvider.__init__ accepts cost_per_second: float | None = None (ollama.py:34); stored on the instance at :36. The module docstring (line 4) advertises "an optional cost_per_second lets callers account for compute time" — but the parameter is never referenced in cost computation anywhere in the file. Either remove or apply it via duration when calculating cost.

9. CrewAI memory integration

Delegation/handoff coverage is in place via crewai_event_bus subscriptions (crewai.py lines 156-178) and agent.handoff emission at :523, 550. There's no analogue to _langchain_memory.py for CrewAI's memory store — read/write hooks aren't proxied. Not a blocker, but worth a follow-up if memory-state tracing is a goal for CrewAI parity.

10. Five silent-pass sites in framework adapters

Bare except Exception: pass at:

  • crewai.py:643
  • agno.py:187
  • llamaindex.py:610
  • pydantic_ai.py:443
  • mcp/tool_wrapper.py:49 (this one has comment # pragma: no cover - defensive)

These swallow exceptions silently. The rest of the codebase consistently either logs at debug or attaches an error to the emitted event. A short justifying comment (or logger.debug(...)) would prevent these from being read as defects in future review.

11. Attestation envelope mutability + concurrency

  • AttestationEnvelope (_envelope.py:16-17) is @dataclass without frozen=True. The envelopes property (_chain.py:28) returns a shallow copy via [copy(e) for e in self._chain] — a real defence — but frozen=True would make immutability load-bearing rather than convention-bound.
  • _chain.py contains no threading.Lock or asyncio.Lock; add_event (:39) is unprotected. Single-writer usage is fine, but a lock guard would make multi-threaded use safer.
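
A sketch of what both suggestions might look like together. Class and field names are illustrative and do not mirror `_envelope.py` or `_chain.py`; the hashing is simplified to canonical JSON plus SHA-256.

```python
import hashlib
import json
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EnvelopeSketch:
    """Immutable record: assigning to a field raises FrozenInstanceError."""

    hash: str
    previous_hash: Optional[str]


class ChainSketch:
    def __init__(self):
        self._lock = threading.Lock()
        self._chain = []
        self._last_hash = None

    def add_event(self, data):
        # The lock makes "read last hash, compute, append, update" atomic,
        # so concurrent writers cannot link to the same predecessor.
        with self._lock:
            payload = {**data, "_previous_hash": self._last_hash}
            encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
            digest = hashlib.sha256(encoded).hexdigest()
            envelope = EnvelopeSketch(hash=digest, previous_hash=self._last_hash)
            self._chain.append(envelope)
            self._last_hash = digest
            return envelope

    @property
    def envelopes(self):
        # Frozen envelopes can be shared directly; a tuple protects the list.
        with self._lock:
            return tuple(self._chain)
```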

12. Evaluation runner swallows scorer exceptions

runner.py:103-105:
```python
except Exception as exc:
    log.debug("scorer %s raised on item %s: %s", name, item.id, exc)
    item_scores[name] = 0.0
```

A broken scorer becomes indistinguishable from a legitimately failing item. Suggest attaching the exception to EvaluationRunItem.error and surfacing it in the aggregate.
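
One possible shape for that, as a sketch only: `ItemResultSketch` and `score_item` are hypothetical names, and recording `None` rather than `0.0` for a crashed scorer is a design assumption, not what runner.py does today.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ItemResultSketch:
    scores: Dict[str, Optional[float]] = field(default_factory=dict)
    errors: Dict[str, str] = field(default_factory=dict)


def score_item(item, scorers):
    """Run every scorer, keeping failures separate from genuine scores."""
    result = ItemResultSketch()
    for name, scorer in scorers.items():
        try:
            result.scores[name] = float(scorer(item))
        except Exception as exc:
            # None marks "no score", which a real 0.0 can never be confused with.
            result.scores[name] = None
            result.errors[name] = f"{type(exc).__name__}: {exc}"
    return result
```

An aggregate can then report the count of errored scorers alongside the mean of the scores that are not `None`.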

13. Replay store / dataset store default to in-memory

InMemoryReplayStore is the default in ReplayController.__init__ (controller.py:39: self._store: ReplayStore = store or InMemoryReplayStore()). InMemoryDatasetStore is the only implementation shipped. The interfaces are Protocol-based so swap-in is one line — but defaults lose state on restart. Either a docstring warning or a JSONFileStore reference impl would help.

14. Empty PR description

43k LOC merge across 336 files with no PR body. A short changelog grouped by subsystem (instrument / attestation / replay / synthetic / evaluation_runs / cli / docs) would help anyone trying to bisect later.


Strengths

  • No fake data. StochasticProvider (synthetic/providers.py:89) tags id=f"synth_{uuid.uuid4().hex[:16]}", created_at="synthetic", data["synthetic"]=True (:133-140).
  • Honest fallbacks. pricing.calculate_cost returns None for unpriced models; _emit_cost propagates None (callers see explicit None, not fake 0.0). cli/commands/evaluations.py:94 raises click.UsageError("remote dataset lookup is not yet implemented — pass --dataset-file") instead of returning empty data.
  • Real cryptographic chain. _chain.py:43 includes _previous_hash in the hashed payload (payload = {**data, "_previous_hash": self._last_hash}), not just adjacent — tampering with previous_hash breaks the hash. _signing.py:19 uses hmac_mod.compare_digest (timing-safe).
  • Defensive extractors in provider adapters — getattr(..., default) plus try/except around attribute walks. Won't crash on unexpected SDK shapes.
  • No AdapterCapability enum. Capability-without-implementation can't happen by construction.
  • Test density. 33,433 test LOC vs 28,174 source LOC. 93 test_*.py files (+ 20 conftest/init = 113 .py total in tests/). 1741 test functions, 3694 asserts. Zero empty-test files.
  • Dependency tiering. Runtime deps are just httpx + pydantic. All frameworks/providers are optional extras with Python-version gating for 3.10-only packages.
  • Zero broken imports across 359 from layerlens.* references (verified by enumerating all module paths under src/layerlens/ and checking every import target resolves).
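
For readers less familiar with the chain property called out above, a minimal sketch of why folding the previous hash into the hashed payload matters. Function names and the dict layout are illustrative, not the `_chain.py` / `_signing.py` API.

```python
import hashlib
import hmac
import json


def link_hash(data, previous_hash):
    """Hash the event together with its predecessor's hash."""
    payload = {**data, "_previous_hash": previous_hash}
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode("utf-8")).hexdigest()


def build_chain(events):
    chain, previous = [], None
    for data in events:
        digest = link_hash(data, previous)
        chain.append({"data": data, "previous_hash": previous, "hash": digest})
        previous = digest
    return chain


def verify_chain(chain):
    """Recompute every link; any edit to data or previous_hash fails."""
    previous = None
    for link in chain:
        if link["previous_hash"] != previous:
            return False
        expected = link_hash(link["data"], link["previous_hash"])
        if not hmac.compare_digest(expected, link["hash"]):
            return False
        previous = link["hash"]
    return True


def sign(chain_head, key):
    return hmac.new(key, chain_head.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_signature(chain_head, key, signature):
    # compare_digest runs in time independent of where the strings differ.
    return hmac.compare_digest(sign(chain_head, key), signature)
```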

Verdict

Three blockers (CI, MS Agent Framework naming, Bedrock streaming/tool extraction) + three product-decision items (A2UI/UCP, Haystack, protocol versions). Everything else can land as follow-ups.

garrettallen14 and others added 8 commits May 20, 2026 14:49
…-2879/2881/2883)

Per Marc's TEL-026 / TEL-028 / TEL-029 acceptance criteria, map provider-specific
fields to vendor-namespaced OTel GenAI attributes:

  gen_ai.openai.response.system_fingerprint           (TEL-026)
  gen_ai.openai.response.service_tier                 (TEL-026)
  gen_ai.openai.request.seed                          (TEL-026)
  gen_ai.anthropic.cache_read_input_tokens            (TEL-028)
  gen_ai.anthropic.cache_creation_input_tokens        (TEL-028)
  gen_ai.response.finish_reasons (now also from Anthropic stop_reason)

Wire OTel attribute mapping into the bespoke Bedrock emit path that bypasses
the standard MonkeyPatchProvider flow. Add response_id extraction across the
remaining adapters per TEL-029:

  - Bedrock: ResponseMetadata.RequestId
  - Vertex: response.response_id / response.id
  - Per-family stop_reason extraction for Bedrock invoke_model (anthropic,
    cohere, amazon, meta, mistral)

22 new tests covering vendor-namespacing edge cases and end-to-end
finish_reasons + response.id coverage across all 7 adapters (OpenAI,
Anthropic, Azure OpenAI, Vertex, Bedrock, Ollama, LiteLLM).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…-3327/3330)

Per Marc's ADP-071 Claude Code Prompt, wrap pricing in a class with the
contract he spelled out:

  PricingTable.from_default() / .from_dict() / .from_json_file()
  PricingTable.calculate_cost(model, input_tokens, output_tokens) -> CostRecord
  PricingTable.has_model() / .models() / .as_dict()

Fuzzy resolution: ``gpt-4o-2024-08-06`` -> ``gpt-4o`` (date-suffix strip),
``claude-3-5-sonnet-20990101`` -> ``claude-3-5-sonnet``. Longest-prefix
fallback disambiguates ``gpt-4o`` from ``gpt-4`` for unrecognised dated
variants. Added base-name entries for the Claude family so fuzzy-stripped
lookups resolve.
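
The resolution order can be sketched as below. The suffix pattern and the function name are assumptions for illustration; the real table may recognise more suffix shapes.

```python
import re

# Matches a trailing "-YYYY-MM-DD" or "-YYYYMMDD" date suffix.
_DATE_SUFFIX = re.compile(r"-(?:\d{4}-\d{2}-\d{2}|\d{8})$")


def resolve_model(model, table):
    """Resolve a model name to a key in `table`, or None if nothing fits."""
    if model in table:
        return model  # 1. exact match
    stripped = _DATE_SUFFIX.sub("", model)
    if stripped in table:
        return stripped  # 2. date suffix stripped
    # 3. longest known prefix, so "gpt-4o-..." prefers "gpt-4o" over "gpt-4"
    candidates = [name for name in table if model.startswith(name)]
    return max(candidates, key=len) if candidates else None
```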

LAYERLENS_PRICING_TABLE env var loads JSON overrides at runtime, satisfying
LAY-3327's "pricing updateable without code changes" AC. Override precedence:
env > caller-supplied table > bundled PRICING. Bad JSON / unreadable files
log a warning and fall back to defaults rather than crashing the request
path.

CostRecord dataclass carries cost_usd + model + input/output/cached token
counts so callers can pipe it directly into the cost.record event payload.

36 new pricing tests covering defaults, fuzzy matching, caller overrides,
cached-token discounts (Anthropic 90% / Google 75% / others 50%), env
loading, malformed-JSON resilience, and graceful unknown-model handling.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… (LAY-3326/3329/3331/3332)

Per Marc's ADP-071 Claude Code Prompt, lift streaming logic into
src/layerlens/instrument/adapters/providers/_streaming.py:

  - StreamingResponseWrapper tracks first-chunk arrival + chunk list
  - stream_chunks_sync / stream_chunks_async preserve the SDK iterator
    contract (downstream consumers see identical chunks) while feeding the
    wrapper
  - On normal completion: emit consolidated model.invoke with ttft_ms and
    streaming_duration_ms in event metadata
  - On mid-stream exception: emit agent.error with partial_meta extracted
    from accumulated chunks plus partial_chunks count, per LAY-3329/3332 DoD
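
A stripped-down sketch of the sync path, assuming two hypothetical callbacks (`on_complete`, `on_error`) in place of the real emit helpers. The `_sketch` suffix marks it as an illustration, not the function in `_streaming.py`.

```python
import time


class StreamStats:
    """Tracks first-chunk arrival and the chunks seen so far."""

    def __init__(self):
        self.started = time.monotonic()
        self.first_chunk_at = None
        self.chunks = []

    def record(self, chunk):
        if self.first_chunk_at is None:
            self.first_chunk_at = time.monotonic()
        self.chunks.append(chunk)

    @property
    def ttft_ms(self):
        if self.first_chunk_at is None:
            return None
        return (self.first_chunk_at - self.started) * 1000.0


def stream_chunks_sync_sketch(stream, on_complete, on_error):
    """Yield chunks unchanged while recording them for a consolidated event."""
    stats = StreamStats()
    try:
        for chunk in stream:
            stats.record(chunk)
            yield chunk  # the consumer sees exactly what the SDK produced
    except Exception as exc:
        on_error(exc, stats)  # partial chunks are still available here
        raise
    else:
        on_complete(stats)
```

If the consumer abandons the stream early, the generator is closed with `GeneratorExit`, which `except Exception` does not catch, so neither callback fires in this sketch.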

_base_provider.py now delegates _wrap_stream_iterator and
_wrap_async_stream_iterator to the new module. Same behavioural contract,
one implementation shared by every monkey-patched provider.

emit_llm_events grew ttft_ms / streaming_duration_ms kwargs; emit_llm_error
grew partial_meta / partial_chunks + error_type for richer agent.error
payloads.

OpenAI tool-call JSON parsing now logs a WARNING when arguments are
malformed (LAY-3331 DoD) with the offending snippet truncated for log
hygiene, rather than silently returning the raw string.

27 streaming tests including end-to-end TTFT (sync + async),
iterator-contract preservation, partial_meta on mid-stream error,
malformed-JSON warning, "no tool_calls = no events emitted", and parallel
tool-call fragment assembly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…op (LAY-3328/3332/3333/3334)

Tighten _CAPTURE_PARAMS so raw ``system``, ``messages``, ``tools``,
``tool_choice``, ``metadata``, and ``thinking`` payloads NEVER reach the
event parameters dict. derive_params builds privacy-safe summaries instead,
per LAY-3334 ACs:

  has_system: bool, system_length: int   (presence + length, NOT content)
  messages_count, message_roles          (count + role distribution, no content)
  tools_count, tool_names                (no schemas / descriptions)
  tool_choice_type, tool_choice_name     (type + name only)
  metadata_user_id                       (only field captured from metadata,
                                          for Anthropic's cost-attribution use)
  thinking_budget_tokens, thinking_type  (broken out from the thinking config)
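
To illustrate the shape of those summaries, a reduced sketch covering four of the fields. It assumes `system` is a plain string and that the role distribution is a count per role; the `_sketch` name marks it as not the shipped `derive_params`.

```python
from collections import Counter


def derive_params_sketch(kwargs):
    """Summarise request kwargs without copying any prompt or tool content."""
    params = {}

    system = kwargs.get("system")
    params["has_system"] = bool(system)
    params["system_length"] = len(system) if isinstance(system, str) else 0

    messages = kwargs.get("messages") or []
    params["messages_count"] = len(messages)
    params["message_roles"] = dict(
        Counter(m.get("role") for m in messages if isinstance(m, dict))
    )

    tools = kwargs.get("tools") or []
    params["tools_count"] = len(tools)
    params["tool_names"] = [t.get("name") for t in tools if isinstance(t, dict)]

    metadata = kwargs.get("metadata") or {}
    if "user_id" in metadata:
        # The only metadata field that is carried over.
        params["metadata_user_id"] = metadata["user_id"]
    return params
```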

extract_meta now surfaces content_block_counts (text / tool_use / thinking),
tool_use_names, and has_thinking on every response per LAY-3334.

Streaming aggregator:
  - explicit message_stop handler (Marc's AC literally names it)
  - TTFT anchored on first content_block_delta (not message_start, which
    fires before any content is generated)
  - defensive thinking_tokens read from message_start.usage and
    message_delta.usage so we pick up any future SDK signal
  - partial_meta emission on mid-stream exception including any cache
    tokens already received

11 new tests covering privacy boundaries (system content never leaks,
metadata sibling fields not captured), thinking budget capture, baseline
non-thinking responses unchanged, content-block counts incl. tool_use
names, mid-stream errors, and message_stop receipt.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per ADP-070, ``from layerlens.adapters.providers import AzureOpenAIAdapter``
(and the other 6) must succeed and expose ``connect_client(client)`` +
``health_check() -> AdapterHealth``. The canonical implementation lives at
``layerlens.instrument.adapters.providers.*Provider`` with ``.connect()``;
this commit adds a thin shim at the legacy path so the AC bullets are
verifiable without forking the code.

Wrappers cover OpenAI, Anthropic, Azure OpenAI, Vertex, Bedrock, Ollama,
LiteLLM. ``health_check`` returns a self-contained AdapterHealth dataclass
matching the legacy pydantic model's shape; no dependency on any other
adapter module so the shim works on a clean checkout.

12 tests verify:
  - Each adapter is importable from the legacy path
  - AdapterHealth + AdapterStatus have the expected shape
  - Health flips from DISCONNECTED to HEALTHY after connect_client
  - connect_client wires up real tracing end-to-end for OpenAI, Anthropic,
    Bedrock (boto3-shape mock incl. ResponseMetadata.RequestId), Vertex
    (mocked generate_content), and Ollama (mocked chat) — each producing
    model.invoke (+ cost.record where the model is priced)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
….3450)

- pyproject: add langgraph, crewai, autogen, agentforce extras + all-frameworks omnibus
- requires_pydantic="2" markers on LangGraph + CrewAI adapters
- PEP 562 lazy public-API exports for the 6 framework adapters in frameworks/__init__.py
- Five runnable sample scripts under samples/instrument/ that exit 0 with install hints when the SDK is absent
- Five reference docs under docs/adapters/frameworks/ (Agentforce includes Connected App / OAuth setup section)

Lint + 80 framework tests + 488 wider instrument tests all green.
….3450)

Asserts importing the frameworks package never eagerly pulls langgraph,
langchain-core, crewai, autogen, autogen-core, autogen-agentchat, or
semantic_kernel. Also covers AttributeError for unknown names, __dir__
advertising all 6 public adapters, and resolving AgentforceAdapter (the
only adapter whose dep ships with the default install) without leaking
the others.

mypy --strict pass over the 5 M2 adapters + the lazy-export __init__ —
zero issues.
Adapters
- google_vertex: capture GenerativeModel.model_name on connect (strip
  `models/` prefix) and inject into response meta via overridden
  _extractors so cost-record events resolve against PRICING.
- ollama: bind OLLAMA_HOST endpoint into meta on every invoke; when
  cost_per_second is set, compute infra_cost_usd from eval_duration +
  prompt_eval_duration and include in the model.invoke payload.
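
The cost arithmetic, as a sketch: Ollama reports `eval_duration` and `prompt_eval_duration` in nanoseconds, so the sum is converted to seconds before multiplying. The function name and the dict-shaped response are assumptions for the example.

```python
def infra_cost_usd(response, cost_per_second):
    """Compute-time cost from Ollama's nanosecond duration fields."""
    if cost_per_second is None:
        return None  # no rate configured: report nothing rather than 0.0
    total_ns = (response.get("eval_duration") or 0) + (
        response.get("prompt_eval_duration") or 0
    )
    return cost_per_second * (total_ns / 1_000_000_000)
```

For example, 2.0 s of generation plus 0.5 s of prompt evaluation at 0.001 USD per second comes to 0.0025 USD.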

Consumable surface
- pyproject: new `providers-vertex` and `providers-ollama` extras (canonical
  M3 names per AC); existing `google-vertex` and `ollama` kept as aliases.
- providers/__init__.py: PEP 562 lazy public API for OpenAI, Anthropic,
  AzureOpenAI, Bedrock, GoogleVertex, Ollama, LiteLLM. Default install
  stays lean.
- samples: google_vertex/example.py + ollama/example.py, both exit 0 with
  install / setup hints when SDK or daemon is absent.
- docs: google_vertex.md (SA-JSON + ADC sections per AC) and ollama.md
  (`ollama serve` setup + cost_per_second explanation per AC).

Marc-prep
- 19 new adapter unit tests (Vertex: SimpleNamespace SDK mocks; Ollama:
  dict-shape fixtures matching the ollama package).
- 4 new lazy-import regression tests for providers/__init__.py.
- mypy --strict clean over the 3 edited source files.
- ruff check + ruff format clean.

3 participants