Fix handling of non-alphanumeric entities in cleaned text by jansaldo · Pull Request #81 · AymurAI/backend

jansaldo · 2026-05-22T16:41:41Z

This pull request refines the anonymization post-processing logic to better handle empty or invalid entities. The main improvements include filtering out entities with empty cleaned text and updating the entity processing pipeline for greater robustness.

Entity filtering and cleaning improvements:

Updated the process method in AnonymizationEntityCleaner to return None for entities whose cleaned text is empty, ensuring that such entities are removed from the results.
Modified the entity processing pipeline in the __call__ method to filter out entities with empty cleaned text after processing, instead of only filtering out punctuation.

Code cleanup:

Removed unused imports (punctuation from string) and reordered imports for clarity in core.py.

Summary by Sourcery

Refine anonymization post-processing to drop entities whose cleaned text becomes empty and simplify entity handling in the transform.

Bug Fixes:

Ensure entities with empty cleaned text are excluded from anonymization results instead of being retained as invalid entries.

Enhancements:

Streamline the anonymization entity processing pipeline to process and filter entities in a single pass and remove unused imports.

…ty cleaned text

sourcery-ai · 2026-05-22T16:41:47Z

Reviewer's guide (collapsed on small PRs)

Reviewer's Guide

Refines anonymization post-processing by dropping entities whose cleaned text becomes empty and simplifying the entity filtering pipeline, plus minor import cleanup.

Flow diagram for updated anonymization entity cleaning pipeline

flowchart TD
    A[Input DataItem] --> B[deepcopy item]
    B --> C[get_element item_field_entities]
    C --> D{for each ent in ents}
    D -->|ent| E[process ent]

    subgraph Process_ent
        E --> F[cleaned_text = pattern.sub]
        F --> G{cleaned_text empty?}
        G -->|yes| H[return None]
        G -->|no| I[update ent attrs aymurai_alt_text and indices]
        I --> J[return ent]
    end

    H --> K[skip entity]
    J --> L[include entity in new list]

    D -->|all processed| M[set item_field_entities to filtered list]
    M --> N[return item]

File-Level Changes

Change	Details	Files
Return None for entities whose cleaned text becomes empty during processing so they can be removed downstream.	After cleaning an entity's text with the regex pattern, check if the resulting cleaned_text is falsy If cleaned_text is empty, immediately return None instead of updating attributes and indices Preserve the existing logic for updating alt text and index attributes when cleaned_text is non-empty	`aymurai/transforms/anonymization_postprocess/core.py`
Update the entity processing pipeline to rely on process() for filtering and to write filtered entities back into the item.	Stop pre-filtering entities by checking that ent['text'] is not a punctuation-only token Map entities through process(ent) and inline-filter out any for which process returns None Assign the resulting filtered list directly to item[self.field]["entities"] before returning the item	`aymurai/transforms/anonymization_postprocess/core.py`
Clean up and reorder imports for clarity and to remove unused symbols.	Remove unused import of punctuation from the string module Reorder imports so Transform is imported once, grouped with other aymurai imports	`aymurai/transforms/anonymization_postprocess/core.py`

Tips and commands

Interacting with Sourcery

Trigger a new review: Comment @sourcery-ai review on the pull request.
Continue discussions: Reply directly to Sourcery's review comments.
Generate a GitHub issue from a review comment: Ask Sourcery to create an
issue from a review comment by replying to it. You can also reply to a
review comment with @sourcery-ai issue to create an issue from it.
Generate a pull request title: Write @sourcery-ai anywhere in the pull
request title to generate a title at any time. You can also comment
@sourcery-ai title on the pull request to (re-)generate the title at any time.
Generate a pull request summary: Write @sourcery-ai summary anywhere in
the pull request body to generate a PR summary at any time exactly where you
want it. You can also comment @sourcery-ai summary on the pull request to
(re-)generate the summary at any time.
Generate reviewer's guide: Comment @sourcery-ai guide on the pull
request to (re-)generate the reviewer's guide at any time.
Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
pull request to resolve all Sourcery comments. Useful if you've already
addressed all the comments and don't want to see them anymore.
Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
request to dismiss all existing Sourcery reviews. Especially useful if you
want to start fresh with a new review - don't forget to comment
@sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

Enable or disable review features such as the Sourcery-generated pull request
summary, the reviewer's guide, and others.
Change the review language.
Add, remove or edit custom review instructions.
Adjust other review settings.

Getting Help

Contact our support team for questions or feedback.
Visit our documentation for detailed guides and information.
Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.

sourcery-ai

Hey - I've found 2 issues, and left some high level feedback:

The process method can now return None but its return type annotation and any related docstring still indicate dict; consider updating the signature (e.g., to Optional[dict]) to reflect the new behavior and avoid confusion for callers and type checkers.
In __call__, the comment refers to filtering out entities with empty alt text, but the logic actually filters based on cleaned_text in process; it may be clearer to align the terminology in the comment with the actual field/behavior used in process.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- The `process` method can now return `None` but its return type annotation and any related docstring still indicate `dict`; consider updating the signature (e.g., to `Optional[dict]`) to reflect the new behavior and avoid confusion for callers and type checkers.
- In `__call__`, the comment refers to filtering out entities with empty alt text, but the logic actually filters based on `cleaned_text` in `process`; it may be clearer to align the terminology in the comment with the actual field/behavior used in `process`.

## Individual Comments

### Comment 1
<location path="aymurai/transforms/anonymization_postprocess/core.py" line_range="47-48" />
<code_context>
         # Clean the text
         cleaned_text = pattern.sub("", original_text)

+        if not cleaned_text:
+            return None
+
         # Update the entity's alt text and indices
</code_context>
<issue_to_address>
**suggestion:** Align the return type annotation and docstring with the new `None` return path.

The method can now return `None` when `cleaned_text` is empty, but the type hint and docstring still indicate it always returns a `dict`. Please update the signature (e.g. `-> dict | None`) and docstring to match, or avoid returning `None` by returning `ent` and handling the filtering elsewhere.

Suggested implementation:

```python
class AnonymizationEntityCleaner(Transform):
    def _clean_entity(
        self,
        ent: dict,
        pattern,
        original_text: str,
        start_char: int,
        leading_chars_removed: int,
    ) -> dict | None:
        """Clean a single anonymization entity.

        The entity's alternative text and index attributes are updated in place.

        Args:
            ent: The entity to clean.
            pattern: Compiled regex used to remove unwanted text.
            original_text: The original text for the entity.
            start_char: The original start character index of the entity.
            leading_chars_removed: Number of leading characters removed from the original text.

        Returns:
            dict | None: The updated entity, or ``None`` if no cleaned text remains.
        """
        # Clean the text
        cleaned_text = pattern.sub("", original_text)

        if not cleaned_text:
            return None

        # Update the entity's alt text and indices
        ent["attrs"]["aymurai_alt_text"] = cleaned_text
        ent["attrs"]["aymurai_alt_start_char"] = start_char + leading_chars_removed
        return ent

    def __call__(self, item: DataItem) -> DataItem:
        """
        Post-process anonymization entities in the given item.

        Returns:
            DataItem: processed item
        """
        item = deepcopy(item)
        ents = get_element(item, [self.field, "entities"]) or []

```

I had to reconstruct some surrounding context because only a fragment of the class was provided:

1. If the helper method currently has a different name or signature than `_clean_entity(...): -> dict`, adjust the `def _clean_entity(...)` line in the replacement block to match your existing parameters, and only change the return annotation to `-> dict | None`.
2. If there is already a `__call__` (or other transform entrypoint) method in `AnonymizationEntityCleaner`, remove or merge the `__call__` implementation I added so you do not duplicate the method. Keep your existing logic, only updating any calls to the helper method to handle the new `None` return (e.g. by filtering out `None` values).
3. Ensure your project’s minimum Python version supports `dict | None`. If you are constrained to Python < 3.10, change it to `-> Optional[dict]` and import `Optional` from `typing`.
</issue_to_address>

### Comment 2
<location path="aymurai/transforms/anonymization_postprocess/core.py" line_range="66-69" />
<code_context>
         """
         item = deepcopy(item)
-
         ents = get_element(item, [self.field, "entities"]) or []

-        # Filter out predictions that are punctuation marks only
-        ents = [ent for ent in ents if ent["text"] not in punctuation]
-        ents = [self.process(ent) for ent in ents]
+        # Filter out predictions with empty alt text and update the rest
+        item[self.field]["entities"] = [
+            out for ent in ents if (out := self.process(ent)) is not None
+        ]
</code_context>
<issue_to_address>
**issue (bug_risk):** Guard against missing `self.field` / `entities` when assigning back to `item`.

`get_element` can return `None` when `self.field` or its `"entities"` key is missing, but the new code always does `item[self.field]["entities"] = ...`, which will raise in that case. Since the previous version only read from `ents`, this wasn’t exposed before. If these keys might be absent, initialize them first (e.g. `item.setdefault(self.field, {}).setdefault("entities", [])`) before assigning.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

sourcery-ai · 2026-05-22T16:43:22Z

+        if not cleaned_text:
+            return None


suggestion: Align the return type annotation and docstring with the new None return path.

The method can now return None when cleaned_text is empty, but the type hint and docstring still indicate it always returns a dict. Please update the signature (e.g. -> dict | None) and docstring to match, or avoid returning None by returning ent and handling the filtering elsewhere.

Suggested implementation:

class AnonymizationEntityCleaner(Transform): def _clean_entity( self, ent: dict, pattern, original_text: str, start_char: int, leading_chars_removed: int, ) -> dict | None: """Clean a single anonymization entity. The entity's alternative text and index attributes are updated in place. Args: ent: The entity to clean. pattern: Compiled regex used to remove unwanted text. original_text: The original text for the entity. start_char: The original start character index of the entity. leading_chars_removed: Number of leading characters removed from the original text. Returns: dict | None: The updated entity, or ``None`` if no cleaned text remains. """ # Clean the text cleaned_text = pattern.sub("", original_text) if not cleaned_text: return None # Update the entity's alt text and indices ent["attrs"]["aymurai_alt_text"] = cleaned_text ent["attrs"]["aymurai_alt_start_char"] = start_char + leading_chars_removed return ent def __call__(self, item: DataItem) -> DataItem: """ Post-process anonymization entities in the given item. Returns: DataItem: processed item """ item = deepcopy(item) ents = get_element(item, [self.field, "entities"]) or []

I had to reconstruct some surrounding context because only a fragment of the class was provided:

If the helper method currently has a different name or signature than _clean_entity(...): -> dict, adjust the def _clean_entity(...) line in the replacement block to match your existing parameters, and only change the return annotation to -> dict | None.

If there is already a __call__ (or other transform entrypoint) method in AnonymizationEntityCleaner, remove or merge the __call__ implementation I added so you do not duplicate the method. Keep your existing logic, only updating any calls to the helper method to handle the new None return (e.g. by filtering out None values).

Ensure your project’s minimum Python version supports dict | None. If you are constrained to Python < 3.10, change it to -> Optional[dict] and import Optional from typing.

sourcery-ai · 2026-05-22T16:43:22Z

        ents = get_element(item, [self.field, "entities"]) or []

-        # Filter out predictions that are punctuation marks only
-        ents = [ent for ent in ents if ent["text"] not in punctuation]
-        ents = [self.process(ent) for ent in ents]
+        # Filter out predictions with empty alt text and update the rest
+        item[self.field]["entities"] = [


issue (bug_risk): Guard against missing self.field / entities when assigning back to item.

get_element can return None when self.field or its "entities" key is missing, but the new code always does item[self.field]["entities"] = ..., which will raise in that case. Since the previous version only read from ents, this wasn’t exposed before. If these keys might be absent, initialize them first (e.g. item.setdefault(self.field, {}).setdefault("entities", [])) before assigning.

Copilot

Pull request overview

This PR updates the anonymization post-processing transform to drop entities whose cleaned text becomes empty, and refactors the entity update/filter pipeline accordingly.

Changes:

Make AnonymizationEntityCleaner.process() return None when cleaning produces an empty string, so these entities can be removed.
Replace the punctuation-only filter with a single-pass process+filter step that writes the processed entities back into the item.
Clean up imports in core.py.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+        if not cleaned_text:
+            return None
+


+        # Filter out predictions with empty alt text and update the rest
+        item[self.field]["entities"] = [


+        # Filter out predictions with empty alt text and update the rest
+        item[self.field]["entities"] = [
+            out for ent in ents if (out := self.process(ent)) is not None
+        ]


🐛 Fix handling of non-alphanumeric entities by returning None for emp…

ad347a4

…ty cleaned text

jansaldo requested a review from Copilot May 22, 2026 16:41

jansaldo requested a review from conrabeatriz May 22, 2026 16:41

Copilot started reviewing on behalf of jansaldo May 22, 2026 16:42 View session

sourcery-ai Bot reviewed May 22, 2026

View reviewed changes

Copilot AI reviewed May 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix handling of non-alphanumeric entities in cleaned text#81

Fix handling of non-alphanumeric entities in cleaned text#81
jansaldo wants to merge 1 commit into
release/v1.5.0from
fix/drop-non-alphanumeric-entities

jansaldo commented May 22, 2026 •

edited by sourcery-ai Bot

Loading

Uh oh!

sourcery-ai Bot commented May 22, 2026 •

edited

Loading

Reviewer's Guide

Flow diagram for updated anonymization entity cleaning pipeline

File-Level Changes

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

sourcery-ai Bot left a comment

Uh oh!

sourcery-ai Bot May 22, 2026

Uh oh!

sourcery-ai Bot May 22, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		# Filter out predictions with empty alt text and update the rest
		item[self.field]["entities"] = [

Conversation

jansaldo commented May 22, 2026 • edited by sourcery-ai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by Sourcery

Uh oh!

sourcery-ai Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewer's Guide

Flow diagram for updated anonymization entity cleaning pipeline

File-Level Changes

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

sourcery-ai Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

sourcery-ai Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jansaldo commented May 22, 2026 •

edited by sourcery-ai Bot

Loading

sourcery-ai Bot commented May 22, 2026 •

edited

Loading