ParseResponse serialization duplicates keys #64

@DemirTonchev

ParseResponse serialization produces both class and class_ fields when the result of the parse method is dumped to a file:

response.model_dump_json(indent=2)

produces JSON like this:

{
  "chunks": [...], 
  "class_": "full",  # <<<<<<<
  "identifier": "full",
  "markdown": "<string>",
  "pages": [0],
  "class": "full"  # <<<<<<<
}

Furthermore, downloading the blob from S3 using the job's output_url and then trying to load the model like this:

import json
from io import BytesIO
import httpx

async def download_blob(presigned_url: str):
    async with httpx.AsyncClient() as client:
        async with client.stream("GET", presigned_url) as response:
            response.raise_for_status()
            buffer = BytesIO()
            async for chunk in response.aiter_bytes():
                buffer.write(chunk)
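            return buffer


# Top-level await works in a notebook or async REPL; in a script, wrap these calls in a coroutine run with asyncio.run().
buf = await download_blob(response.output_url)
parsed = ParseResponse.model_validate_json(buf.getvalue())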

fails with:

ValidationError: 1 validation error for ParseResponse
splits.0.class
  Field required [type=missing, input_value={'class_': 'full', 'ident...a94-9681-e3c8860228dd']}, input_type=dict]

Using ParseResponse.model_validate_json(buf.getvalue(), by_name=True) succeeds, but this is not mentioned in the documentation.
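
For anyone trying to reproduce the validation side in isolation, here is a minimal sketch of what I think is happening. The Split and LenientSplit models are hypothetical stand-ins, not the SDK's actual definitions. I am assuming the real field is declared as class_ with an alias of "class". The sketch uses the model-level populate_by_name setting, because the per-call by_name=True argument needs a recent pydantic release (2.11 or later, as far as I know).

```python
from pydantic import BaseModel, ConfigDict, Field, ValidationError


# Hypothetical stand-in for one entry of ParseResponse.splits;
# the real SDK model may be defined differently.
class Split(BaseModel):
    class_: str = Field(alias="class")
    identifier: str


# Same fields, but input may use either the alias or the Python field name.
class LenientSplit(Split):
    model_config = ConfigDict(populate_by_name=True)


# Payload shaped like the JSON stored on S3: it carries "class_" but no "class".
payload = '{"class_": "full", "identifier": "full"}'

try:
    Split.model_validate_json(payload)
    strict_ok = True
except ValidationError:
    # By default only the alias "class" is accepted, so the field is reported missing.
    strict_ok = False

lenient = LenientSplit.model_validate_json(payload)

# Dumping by alias yields a single "class" key and no "class_" key.
dumped = lenient.model_dump(by_alias=True)
```

If the SDK models behave like this, the stored JSON only round-trips when validation by field name is enabled, which would match the error above.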

Expected behavior:

model_dump_json() should produce only 'class' or 'class_', not both
JSON from S3 should deserialize correctly
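
Until this is fixed, a possible workaround is to normalize the downloaded JSON before validating it. This is only a sketch using the standard library. normalize_class_keys is a hypothetical helper, not part of the SDK, and it assumes that every class_ key is meant to mirror class.

```python
import json


def normalize_class_keys(node):
    """Recursively rename 'class_' keys to 'class'; an existing 'class' value wins."""
    if isinstance(node, dict):
        out = {}
        for key, value in node.items():
            if key == "class_":
                # Promote 'class_' only when no 'class' value has been set yet.
                out.setdefault("class", normalize_class_keys(value))
            else:
                # A real 'class' key overwrites any value promoted from 'class_'.
                out[key] = normalize_class_keys(value)
        return out
    if isinstance(node, list):
        return [normalize_class_keys(item) for item in node]
    return node


# Illustrative payload shaped like the stored output (values are made up).
raw = '{"splits": [{"class_": "full", "identifier": "full", "pages": [0]}]}'
cleaned = json.dumps(normalize_class_keys(json.loads(raw)))
```

The cleaned string can then be passed to ParseResponse.model_validate_json(cleaned) without by_name=True, assuming the model expects the "class" key.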
