Skip to content

Fix handling of non-alphanumeric entities in cleaned text#81

Open
jansaldo wants to merge 1 commit into
release/v1.5.0from
fix/drop-non-alphanumeric-entities
Open

Fix handling of non-alphanumeric entities in cleaned text#81
jansaldo wants to merge 1 commit into
release/v1.5.0from
fix/drop-non-alphanumeric-entities

Conversation

@jansaldo
Copy link
Copy Markdown
Contributor

@jansaldo jansaldo commented May 22, 2026

This pull request refines the anonymization post-processing logic to better handle empty or invalid entities. The main improvements include filtering out entities with empty cleaned text and updating the entity processing pipeline for greater robustness.

Entity filtering and cleaning improvements:

  • Updated the process method in AnonymizationEntityCleaner to return None for entities whose cleaned text is empty, ensuring that such entities are removed from the results.
  • Modified the entity processing pipeline in the __call__ method to filter out entities with empty cleaned text after processing, instead of only filtering out punctuation.

Code cleanup:

  • Removed unused imports (punctuation from string) and reordered imports for clarity in core.py.

Summary by Sourcery

Refine anonymization post-processing to drop entities whose cleaned text becomes empty and simplify entity handling in the transform.

Bug Fixes:

  • Ensure entities with empty cleaned text are excluded from anonymization results instead of being retained as invalid entries.

Enhancements:

  • Streamline the anonymization entity processing pipeline to process and filter entities in a single pass and remove unused imports.

@jansaldo jansaldo requested a review from Copilot May 22, 2026 16:41
@sourcery-ai
Copy link
Copy Markdown

sourcery-ai Bot commented May 22, 2026

Reviewer's guide (collapsed on small PRs)

Reviewer's Guide

Refines anonymization post-processing by dropping entities whose cleaned text becomes empty and simplifying the entity filtering pipeline, plus minor import cleanup.

Flow diagram for updated anonymization entity cleaning pipeline

flowchart TD
    A[Input DataItem] --> B[deepcopy item]
    B --> C[get_element item_field_entities]
    C --> D{for each ent in ents}
    D -->|ent| E[process ent]

    subgraph Process_ent
        E --> F[cleaned_text = pattern.sub]
        F --> G{cleaned_text empty?}
        G -->|yes| H[return None]
        G -->|no| I[update ent attrs aymurai_alt_text and indices]
        I --> J[return ent]
    end

    H --> K[skip entity]
    J --> L[include entity in new list]

    D -->|all processed| M[set item_field_entities to filtered list]
    M --> N[return item]
Loading

File-Level Changes

Change Details Files
Return None for entities whose cleaned text becomes empty during processing so they can be removed downstream.
  • After cleaning an entity's text with the regex pattern, check if the resulting cleaned_text is falsy
  • If cleaned_text is empty, immediately return None instead of updating attributes and indices
  • Preserve the existing logic for updating alt text and index attributes when cleaned_text is non-empty
aymurai/transforms/anonymization_postprocess/core.py
Update the entity processing pipeline to rely on process() for filtering and to write filtered entities back into the item.
  • Stop pre-filtering entities by checking that ent['text'] is not a punctuation-only token
  • Map entities through process(ent) and inline-filter out any for which process returns None
  • Assign the resulting filtered list directly to item[self.field]["entities"] before returning the item
aymurai/transforms/anonymization_postprocess/core.py
Clean up and reorder imports for clarity and to remove unused symbols.
  • Remove unused import of punctuation from the string module
  • Reorder imports so Transform is imported once, grouped with other aymurai imports
aymurai/transforms/anonymization_postprocess/core.py

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link
Copy Markdown

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 2 issues, and left some high level feedback:

  • The process method can now return None but its return type annotation and any related docstring still indicate dict; consider updating the signature (e.g., to Optional[dict]) to reflect the new behavior and avoid confusion for callers and type checkers.
  • In __call__, the comment refers to filtering out entities with empty alt text, but the logic actually filters based on cleaned_text in process; it may be clearer to align the terminology in the comment with the actual field/behavior used in process.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The `process` method can now return `None` but its return type annotation and any related docstring still indicate `dict`; consider updating the signature (e.g., to `Optional[dict]`) to reflect the new behavior and avoid confusion for callers and type checkers.
- In `__call__`, the comment refers to filtering out entities with empty alt text, but the logic actually filters based on `cleaned_text` in `process`; it may be clearer to align the terminology in the comment with the actual field/behavior used in `process`.

## Individual Comments

### Comment 1
<location path="aymurai/transforms/anonymization_postprocess/core.py" line_range="47-48" />
<code_context>
         # Clean the text
         cleaned_text = pattern.sub("", original_text)

+        if not cleaned_text:
+            return None
+
         # Update the entity's alt text and indices
</code_context>
<issue_to_address>
**suggestion:** Align the return type annotation and docstring with the new `None` return path.

The method can now return `None` when `cleaned_text` is empty, but the type hint and docstring still indicate it always returns a `dict`. Please update the signature (e.g. `-> dict | None`) and docstring to match, or avoid returning `None` by returning `ent` and handling the filtering elsewhere.

Suggested implementation:

```python
class AnonymizationEntityCleaner(Transform):
    def _clean_entity(
        self,
        ent: dict,
        pattern,
        original_text: str,
        start_char: int,
        leading_chars_removed: int,
    ) -> dict | None:
        """Clean a single anonymization entity.

        The entity's alternative text and index attributes are updated in place.

        Args:
            ent: The entity to clean.
            pattern: Compiled regex used to remove unwanted text.
            original_text: The original text for the entity.
            start_char: The original start character index of the entity.
            leading_chars_removed: Number of leading characters removed from the original text.

        Returns:
            dict | None: The updated entity, or ``None`` if no cleaned text remains.
        """
        # Clean the text
        cleaned_text = pattern.sub("", original_text)

        if not cleaned_text:
            return None

        # Update the entity's alt text and indices
        ent["attrs"]["aymurai_alt_text"] = cleaned_text
        ent["attrs"]["aymurai_alt_start_char"] = start_char + leading_chars_removed
        return ent

    def __call__(self, item: DataItem) -> DataItem:
        """
        Post-process anonymization entities in the given item.

        Returns:
            DataItem: processed item
        """
        item = deepcopy(item)
        ents = get_element(item, [self.field, "entities"]) or []

```

I had to reconstruct some surrounding context because only a fragment of the class was provided:

1. If the helper method currently has a different name or signature than `_clean_entity(...): -> dict`, adjust the `def _clean_entity(...)` line in the replacement block to match your existing parameters, and only change the return annotation to `-> dict | None`.
2. If there is already a `__call__` (or other transform entrypoint) method in `AnonymizationEntityCleaner`, remove or merge the `__call__` implementation I added so you do not duplicate the method. Keep your existing logic, only updating any calls to the helper method to handle the new `None` return (e.g. by filtering out `None` values).
3. Ensure your project’s minimum Python version supports `dict | None`. If you are constrained to Python < 3.10, change it to `-> Optional[dict]` and import `Optional` from `typing`.
</issue_to_address>

### Comment 2
<location path="aymurai/transforms/anonymization_postprocess/core.py" line_range="66-69" />
<code_context>
         """
         item = deepcopy(item)
-
         ents = get_element(item, [self.field, "entities"]) or []

-        # Filter out predictions that are punctuation marks only
-        ents = [ent for ent in ents if ent["text"] not in punctuation]
-        ents = [self.process(ent) for ent in ents]
+        # Filter out predictions with empty alt text and update the rest
+        item[self.field]["entities"] = [
+            out for ent in ents if (out := self.process(ent)) is not None
+        ]
</code_context>
<issue_to_address>
**issue (bug_risk):** Guard against missing `self.field` / `entities` when assigning back to `item`.

`get_element` can return `None` when `self.field` or its `"entities"` key is missing, but the new code always does `item[self.field]["entities"] = ...`, which will raise in that case. Since the previous version only read from `ents`, this wasn’t exposed before. If these keys might be absent, initialize them first (e.g. `item.setdefault(self.field, {}).setdefault("entities", [])`) before assigning.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment on lines +47 to +48
if not cleaned_text:
return None
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Align the return type annotation and docstring with the new None return path.

The method can now return None when cleaned_text is empty, but the type hint and docstring still indicate it always returns a dict. Please update the signature (e.g. -> dict | None) and docstring to match, or avoid returning None by returning ent and handling the filtering elsewhere.

Suggested implementation:

class AnonymizationEntityCleaner(Transform):
    def _clean_entity(
        self,
        ent: dict,
        pattern,
        original_text: str,
        start_char: int,
        leading_chars_removed: int,
    ) -> dict | None:
        """Clean a single anonymization entity.

        The entity's alternative text and index attributes are updated in place.

        Args:
            ent: The entity to clean.
            pattern: Compiled regex used to remove unwanted text.
            original_text: The original text for the entity.
            start_char: The original start character index of the entity.
            leading_chars_removed: Number of leading characters removed from the original text.

        Returns:
            dict | None: The updated entity, or ``None`` if no cleaned text remains.
        """
        # Clean the text
        cleaned_text = pattern.sub("", original_text)

        if not cleaned_text:
            return None

        # Update the entity's alt text and indices
        ent["attrs"]["aymurai_alt_text"] = cleaned_text
        ent["attrs"]["aymurai_alt_start_char"] = start_char + leading_chars_removed
        return ent

    def __call__(self, item: DataItem) -> DataItem:
        """
        Post-process anonymization entities in the given item.

        Returns:
            DataItem: processed item
        """
        item = deepcopy(item)
        ents = get_element(item, [self.field, "entities"]) or []

I had to reconstruct some surrounding context because only a fragment of the class was provided:

  1. If the helper method currently has a different name or signature than _clean_entity(...): -> dict, adjust the def _clean_entity(...) line in the replacement block to match your existing parameters, and only change the return annotation to -> dict | None.
  2. If there is already a __call__ (or other transform entrypoint) method in AnonymizationEntityCleaner, remove or merge the __call__ implementation I added so you do not duplicate the method. Keep your existing logic, only updating any calls to the helper method to handle the new None return (e.g. by filtering out None values).
  3. Ensure your project’s minimum Python version supports dict | None. If you are constrained to Python < 3.10, change it to -> Optional[dict] and import Optional from typing.

Comment on lines 66 to +69
ents = get_element(item, [self.field, "entities"]) or []

# Filter out predictions that are punctuation marks only
ents = [ent for ent in ents if ent["text"] not in punctuation]
ents = [self.process(ent) for ent in ents]
# Filter out predictions with empty alt text and update the rest
item[self.field]["entities"] = [
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): Guard against missing self.field / entities when assigning back to item.

get_element can return None when self.field or its "entities" key is missing, but the new code always does item[self.field]["entities"] = ..., which will raise in that case. Since the previous version only read from ents, this wasn’t exposed before. If these keys might be absent, initialize them first (e.g. item.setdefault(self.field, {}).setdefault("entities", [])) before assigning.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the anonymization post-processing transform to drop entities whose cleaned text becomes empty, and refactors the entity update/filter pipeline accordingly.

Changes:

  • Make AnonymizationEntityCleaner.process() return None when cleaning produces an empty string, so these entities can be removed.
  • Replace the punctuation-only filter with a single-pass process+filter step that writes the processed entities back into the item.
  • Clean up imports in core.py.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +47 to +49
if not cleaned_text:
return None

Comment on lines +68 to +69
# Filter out predictions with empty alt text and update the rest
item[self.field]["entities"] = [
Comment on lines +68 to +71
# Filter out predictions with empty alt text and update the rest
item[self.field]["entities"] = [
out for ent in ents if (out := self.process(ent)) is not None
]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants