Surface prompt_tokens_details.cached_tokens (and parallel cache fields) in TokenUsage #236

@chubes4

Summary

OpenAI returns usage.prompt_tokens_details.cached_tokens on every text-generation response for cache-eligible models (gpt-4o family, gpt-5 family, etc.), but AbstractOpenAiCompatibleTextGenerationModel::parseGenerationResponse only reads the three top-level fields and discards the rest of the usage object before any downstream code can see it.

This makes OpenAI prompt caching unobservable from any caller, because this field is the only place OpenAI exposes cache hit rate. Caching is a major cost lever (50–75% input discount on hit). Without this number, downstream applications cannot measure whether their prompt structure is actually cache-friendly, or whether changes to that structure improved or regressed the cache hit rate.

The same gap will apply to Anthropic once an Anthropic provider implementation lands: Anthropic returns usage.cache_creation_input_tokens and usage.cache_read_input_tokens in its usage object, and the current base class would discard those too.

Where the data is dropped

src/Providers/OpenAiCompatibleImplementation/AbstractOpenAiCompatibleTextGenerationModel.php (around lines 604–618):

if (isset($responseData['usage']) && is_array($responseData['usage'])) {
    $usage = $responseData['usage'];

    $tokenUsage = new TokenUsage(
        $usage['prompt_tokens'] ?? 0,
        $usage['completion_tokens'] ?? 0,
        $usage['total_tokens'] ?? 0
    );
} else {
    $tokenUsage = new TokenUsage(0, 0, 0);
}

// Use any other data from the response as provider-specific response metadata.
$additionalData = $responseData;
unset($additionalData['id'], $additionalData['choices'], $additionalData['usage']);

The prompt_tokens_details.cached_tokens value is present on the response, never read into TokenUsage, and then explicitly stripped from $additionalData so callers cannot recover it through the metadata passthrough either.
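
The drop can be reproduced outside the library. The following is a minimal sketch, not the library's code: `parseUsageLikeBaseClass` is a hypothetical stand-alone function that mirrors the quoted snippet, and the payload values are made up for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-alone reproduction of the quoted parsing logic.
 * Returns the three token counts and the metadata passthrough array.
 */
function parseUsageLikeBaseClass(array $responseData): array
{
    $usage = (isset($responseData['usage']) && is_array($responseData['usage']))
        ? $responseData['usage']
        : [];

    // Only the three top-level counts are read, as in the quoted snippet.
    $counts = [
        'promptTokens' => $usage['prompt_tokens'] ?? 0,
        'completionTokens' => $usage['completion_tokens'] ?? 0,
        'totalTokens' => $usage['total_tokens'] ?? 0,
    ];

    // The whole usage object is removed from the passthrough metadata.
    $additionalData = $responseData;
    unset($additionalData['id'], $additionalData['choices'], $additionalData['usage']);

    return [$counts, $additionalData];
}

// Illustrative payload; the numbers are invented for this example.
$responseData = [
    'id' => 'chatcmpl-example',
    'model' => 'gpt-4o',
    'choices' => [],
    'usage' => [
        'prompt_tokens' => 2048,
        'completion_tokens' => 100,
        'total_tokens' => 2148,
        'prompt_tokens_details' => ['cached_tokens' => 1024],
    ],
];

[$counts, $additionalData] = parseUsageLikeBaseClass($responseData);
```

After this runs, `$counts` holds only the three totals and `$additionalData` no longer has a `usage` key, so the cached token count is unreachable through either path.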

Symptom

Callers measuring per-request usage see promptTokens: N, with no indication of which portion of N was served from cache. Multi-turn and repeated-prefix workloads are especially affected, because those are exactly the cases where caching helps most and the metric is most useful.
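
For context, this is roughly the metric a caller would want to compute if the field were surfaced. It is a sketch under assumptions: `cacheHitRate` is a hypothetical helper, and it assumes OpenAI semantics, where cached_tokens counts a subset of prompt_tokens.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: fraction of prompt tokens served from cache.
 * Returns null when the provider did not report a cached count, or when
 * there were no prompt tokens to divide by.
 */
function cacheHitRate(int $promptTokens, ?int $cachedTokens): ?float
{
    if ($cachedTokens === null || $promptTokens <= 0) {
        return null;
    }

    return $cachedTokens / $promptTokens;
}
```

Returning null rather than 0.0 for a missing count keeps "provider did not report" distinguishable from "nothing was cached", which matters when comparing prompt structures.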

Suggested shape

Add nullable cache fields to TokenUsage, parsed when the provider supplies them, left null otherwise:

  • ?int $cachedTokens — for OpenAI's prompt_tokens_details.cached_tokens and Anthropic's cache_read_input_tokens.
  • ?int $cacheCreationTokens — for Anthropic's cache_creation_input_tokens. (No OpenAI analog today, since OpenAI's prompt caching is automatic and not write-tracked, but the field is useful for clean Anthropic surfacing and for any future provider that distinguishes cache-create vs cache-read.)
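
A minimal sketch of what this could look like follows. Everything here is hypothetical: `TokenUsageSketch`, `fromUsageArray`, the getter names, and the constructor argument order are illustrative and are not the library's actual `TokenUsage` API. The three existing counts are read the same way as in the quoted snippet, and the syntax is kept to PHP 7.4 as an assumption about the supported version.

```php
<?php
declare(strict_types=1);

/** Hypothetical stand-in for TokenUsage with two nullable cache fields. */
final class TokenUsageSketch
{
    private int $promptTokens;
    private int $completionTokens;
    private int $totalTokens;
    private ?int $cachedTokens;
    private ?int $cacheCreationTokens;

    public function __construct(
        int $promptTokens,
        int $completionTokens,
        int $totalTokens,
        ?int $cachedTokens = null,
        ?int $cacheCreationTokens = null
    ) {
        $this->promptTokens = $promptTokens;
        $this->completionTokens = $completionTokens;
        $this->totalTokens = $totalTokens;
        $this->cachedTokens = $cachedTokens;
        $this->cacheCreationTokens = $cacheCreationTokens;
    }

    /** Builds the DTO from a decoded usage array; cache fields stay null if absent. */
    public static function fromUsageArray(array $usage): self
    {
        // OpenAI nests the cache-read count; Anthropic reports it at the top level.
        $cached = $usage['prompt_tokens_details']['cached_tokens']
            ?? $usage['cache_read_input_tokens']
            ?? null;
        $created = $usage['cache_creation_input_tokens'] ?? null;

        return new self(
            (int) ($usage['prompt_tokens'] ?? 0),
            (int) ($usage['completion_tokens'] ?? 0),
            (int) ($usage['total_tokens'] ?? 0),
            is_int($cached) ? $cached : null,
            is_int($created) ? $created : null
        );
    }

    public function getPromptTokens(): int { return $this->promptTokens; }
    public function getCompletionTokens(): int { return $this->completionTokens; }
    public function getTotalTokens(): int { return $this->totalTokens; }
    public function getCachedTokens(): ?int { return $this->cachedTokens; }
    public function getCacheCreationTokens(): ?int { return $this->cacheCreationTokens; }
}
```

Defaulting the new constructor arguments to null would keep existing three-argument call sites working, which is one reason to append the fields rather than reorder them. How Anthropic's own input and output counts should map onto the existing three properties is left open in this sketch.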

This follows the precedent set by commit 305b80d ("Add support for thought tokens in token usage data"): both are provider-optional usage fields that benefit from being first-class on the DTO rather than buried in additionalData.

Relationship to #158

#158 (Improve handling of unprovided usage statistics) proposes making the existing TokenUsage properties nullable and the TokenUsage instance itself nullable on GenerativeAiResult. That refactor and this addition are aligned, since both push toward a more flexible TokenUsage shape, but they are independent enough to land in either order.

I'm happy to send a PR either way once a maintainer indicates a preference on which lands first and on the field naming.

Ecosystem signal

This is a widely-recognized gap in OpenAI-compatible clients right now; recent issues filed against other libraries with the same shape of bug include NousResearch/hermes-agent#25400, braintrustdata/braintrust-sdk-dotnet#53, PostHog/posthog-python#574, and BerriAI/litellm#22984. Including the field puts php-ai-client on the right side of the pattern as more callers start measuring cache hit rate as a first-class metric.

AI assistance

  • AI assistance: Yes
  • Tool(s): opencode (Claude Opus 4.7)
  • Used for: Drafted the issue body after Chris asked me to diagnose why a downstream Data Machine workload could not observe prompt-cache hit rate. I located the parsing gap, verified the discarded field, checked for existing issues/PRs in this repo and the wider ecosystem, and drafted this issue for Chris to review. Chris is the submitter and is responsible for the report; AI assistance is descriptive, not a disclaimer.
