Fix handling of non-alphanumeric entities in cleaned text#81
Conversation
Reviewer's guide (collapsed on small PRs)Reviewer's GuideRefines anonymization post-processing by dropping entities whose cleaned text becomes empty and simplifying the entity filtering pipeline, plus minor import cleanup. Flow diagram for updated anonymization entity cleaning pipelineflowchart TD
A[Input DataItem] --> B[deepcopy item]
B --> C[get_element item_field_entities]
C --> D{for each ent in ents}
D -->|ent| E[process ent]
subgraph Process_ent
E --> F[cleaned_text = pattern.sub]
F --> G{cleaned_text empty?}
G -->|yes| H[return None]
G -->|no| I[update ent attrs aymurai_alt_text and indices]
I --> J[return ent]
end
H --> K[skip entity]
J --> L[include entity in new list]
D -->|all processed| M[set item_field_entities to filtered list]
M --> N[return item]
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 2 issues, and left some high level feedback:
- The
processmethod can now returnNonebut its return type annotation and any related docstring still indicatedict; consider updating the signature (e.g., toOptional[dict]) to reflect the new behavior and avoid confusion for callers and type checkers. - In
__call__, the comment refers to filtering out entities with empty alt text, but the logic actually filters based oncleaned_textinprocess; it may be clearer to align the terminology in the comment with the actual field/behavior used inprocess.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The `process` method can now return `None` but its return type annotation and any related docstring still indicate `dict`; consider updating the signature (e.g., to `Optional[dict]`) to reflect the new behavior and avoid confusion for callers and type checkers.
- In `__call__`, the comment refers to filtering out entities with empty alt text, but the logic actually filters based on `cleaned_text` in `process`; it may be clearer to align the terminology in the comment with the actual field/behavior used in `process`.
## Individual Comments
### Comment 1
<location path="aymurai/transforms/anonymization_postprocess/core.py" line_range="47-48" />
<code_context>
# Clean the text
cleaned_text = pattern.sub("", original_text)
+ if not cleaned_text:
+ return None
+
# Update the entity's alt text and indices
</code_context>
<issue_to_address>
**suggestion:** Align the return type annotation and docstring with the new `None` return path.
The method can now return `None` when `cleaned_text` is empty, but the type hint and docstring still indicate it always returns a `dict`. Please update the signature (e.g. `-> dict | None`) and docstring to match, or avoid returning `None` by returning `ent` and handling the filtering elsewhere.
Suggested implementation:
```python
class AnonymizationEntityCleaner(Transform):
def _clean_entity(
self,
ent: dict,
pattern,
original_text: str,
start_char: int,
leading_chars_removed: int,
) -> dict | None:
"""Clean a single anonymization entity.
The entity's alternative text and index attributes are updated in place.
Args:
ent: The entity to clean.
pattern: Compiled regex used to remove unwanted text.
original_text: The original text for the entity.
start_char: The original start character index of the entity.
leading_chars_removed: Number of leading characters removed from the original text.
Returns:
dict | None: The updated entity, or ``None`` if no cleaned text remains.
"""
# Clean the text
cleaned_text = pattern.sub("", original_text)
if not cleaned_text:
return None
# Update the entity's alt text and indices
ent["attrs"]["aymurai_alt_text"] = cleaned_text
ent["attrs"]["aymurai_alt_start_char"] = start_char + leading_chars_removed
return ent
def __call__(self, item: DataItem) -> DataItem:
"""
Post-process anonymization entities in the given item.
Returns:
DataItem: processed item
"""
item = deepcopy(item)
ents = get_element(item, [self.field, "entities"]) or []
```
I had to reconstruct some surrounding context because only a fragment of the class was provided:
1. If the helper method currently has a different name or signature than `_clean_entity(...): -> dict`, adjust the `def _clean_entity(...)` line in the replacement block to match your existing parameters, and only change the return annotation to `-> dict | None`.
2. If there is already a `__call__` (or other transform entrypoint) method in `AnonymizationEntityCleaner`, remove or merge the `__call__` implementation I added so you do not duplicate the method. Keep your existing logic, only updating any calls to the helper method to handle the new `None` return (e.g. by filtering out `None` values).
3. Ensure your project’s minimum Python version supports `dict | None`. If you are constrained to Python < 3.10, change it to `-> Optional[dict]` and import `Optional` from `typing`.
</issue_to_address>
### Comment 2
<location path="aymurai/transforms/anonymization_postprocess/core.py" line_range="66-69" />
<code_context>
"""
item = deepcopy(item)
-
ents = get_element(item, [self.field, "entities"]) or []
- # Filter out predictions that are punctuation marks only
- ents = [ent for ent in ents if ent["text"] not in punctuation]
- ents = [self.process(ent) for ent in ents]
+ # Filter out predictions with empty alt text and update the rest
+ item[self.field]["entities"] = [
+ out for ent in ents if (out := self.process(ent)) is not None
+ ]
</code_context>
<issue_to_address>
**issue (bug_risk):** Guard against missing `self.field` / `entities` when assigning back to `item`.
`get_element` can return `None` when `self.field` or its `"entities"` key is missing, but the new code always does `item[self.field]["entities"] = ...`, which will raise in that case. Since the previous version only read from `ents`, this wasn’t exposed before. If these keys might be absent, initialize them first (e.g. `item.setdefault(self.field, {}).setdefault("entities", [])`) before assigning.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| if not cleaned_text: | ||
| return None |
There was a problem hiding this comment.
suggestion: Align the return type annotation and docstring with the new None return path.
The method can now return None when cleaned_text is empty, but the type hint and docstring still indicate it always returns a dict. Please update the signature (e.g. -> dict | None) and docstring to match, or avoid returning None by returning ent and handling the filtering elsewhere.
Suggested implementation:
class AnonymizationEntityCleaner(Transform):
def _clean_entity(
self,
ent: dict,
pattern,
original_text: str,
start_char: int,
leading_chars_removed: int,
) -> dict | None:
"""Clean a single anonymization entity.
The entity's alternative text and index attributes are updated in place.
Args:
ent: The entity to clean.
pattern: Compiled regex used to remove unwanted text.
original_text: The original text for the entity.
start_char: The original start character index of the entity.
leading_chars_removed: Number of leading characters removed from the original text.
Returns:
dict | None: The updated entity, or ``None`` if no cleaned text remains.
"""
# Clean the text
cleaned_text = pattern.sub("", original_text)
if not cleaned_text:
return None
# Update the entity's alt text and indices
ent["attrs"]["aymurai_alt_text"] = cleaned_text
ent["attrs"]["aymurai_alt_start_char"] = start_char + leading_chars_removed
return ent
def __call__(self, item: DataItem) -> DataItem:
"""
Post-process anonymization entities in the given item.
Returns:
DataItem: processed item
"""
item = deepcopy(item)
ents = get_element(item, [self.field, "entities"]) or []I had to reconstruct some surrounding context because only a fragment of the class was provided:
- If the helper method currently has a different name or signature than
_clean_entity(...): -> dict, adjust thedef _clean_entity(...)line in the replacement block to match your existing parameters, and only change the return annotation to-> dict | None. - If there is already a
__call__(or other transform entrypoint) method inAnonymizationEntityCleaner, remove or merge the__call__implementation I added so you do not duplicate the method. Keep your existing logic, only updating any calls to the helper method to handle the newNonereturn (e.g. by filtering outNonevalues). - Ensure your project’s minimum Python version supports
dict | None. If you are constrained to Python < 3.10, change it to-> Optional[dict]and importOptionalfromtyping.
| ents = get_element(item, [self.field, "entities"]) or [] | ||
|
|
||
| # Filter out predictions that are punctuation marks only | ||
| ents = [ent for ent in ents if ent["text"] not in punctuation] | ||
| ents = [self.process(ent) for ent in ents] | ||
| # Filter out predictions with empty alt text and update the rest | ||
| item[self.field]["entities"] = [ |
There was a problem hiding this comment.
issue (bug_risk): Guard against missing self.field / entities when assigning back to item.
get_element can return None when self.field or its "entities" key is missing, but the new code always does item[self.field]["entities"] = ..., which will raise in that case. Since the previous version only read from ents, this wasn’t exposed before. If these keys might be absent, initialize them first (e.g. item.setdefault(self.field, {}).setdefault("entities", [])) before assigning.
There was a problem hiding this comment.
Pull request overview
This PR updates the anonymization post-processing transform to drop entities whose cleaned text becomes empty, and refactors the entity update/filter pipeline accordingly.
Changes:
- Make
AnonymizationEntityCleaner.process()returnNonewhen cleaning produces an empty string, so these entities can be removed. - Replace the punctuation-only filter with a single-pass process+filter step that writes the processed entities back into the item.
- Clean up imports in
core.py.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if not cleaned_text: | ||
| return None | ||
|
|
| # Filter out predictions with empty alt text and update the rest | ||
| item[self.field]["entities"] = [ |
| # Filter out predictions with empty alt text and update the rest | ||
| item[self.field]["entities"] = [ | ||
| out for ent in ents if (out := self.process(ent)) is not None | ||
| ] |
This pull request refines the anonymization post-processing logic to better handle empty or invalid entities. The main improvements include filtering out entities with empty cleaned text and updating the entity processing pipeline for greater robustness.
Entity filtering and cleaning improvements:
processmethod inAnonymizationEntityCleanerto returnNonefor entities whose cleaned text is empty, ensuring that such entities are removed from the results.__call__method to filter out entities with empty cleaned text after processing, instead of only filtering out punctuation.Code cleanup:
punctuationfromstring) and reordered imports for clarity incore.py.Summary by Sourcery
Refine anonymization post-processing to drop entities whose cleaned text becomes empty and simplify entity handling in the transform.
Bug Fixes:
Enhancements: