Added language detection through the distant path of AA. #1031

Draft
eephyne wants to merge 4 commits into calibrain:main from eephyne:feature-search_wider

@eephyne eephyne commented May 28, 2026

  • Added language detection through the distant path of AA.
  • Added an option in the settings to disable this feature.

This prevents Shelfmark from discarding many results because the language is missing in AA.
With this option enabled, it parses the distant path (I don't know what it is actually called) to look for a language and sets the language accordingly.

I tested it with different requests and languages, and the parsing seems OK to me, but it can probably be improved.
The option can be enabled or disabled in the settings under Direct Download > Download Source.

- Added option in setting to disable this feature.
Copilot AI review requested due to automatic review settings May 28, 2026 14:47

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds an opt-in feature to the Direct Download source that infers a book's language from the "distant path" (file path in search results) when the language metadata is missing, with corresponding settings, parsing logic, and tests.

Changes:

  • New DIRECT_DOWNLOAD_LANGUAGE_FROM_PATH checkbox setting (default off).
  • Logic in direct_download.py to extract a distant path from result rows, detect language via bracket/keyed/name/code patterns with alias mapping from book-languages.json, and apply local language filtering after parsing.
  • Extensive new tests covering detection, false-positive avoidance, legacy behavior, and search-level local filtering.
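
The detection step summarized in the list above can be pictured roughly as follows. This is a minimal sketch, not the PR's code: the function name `detect_language_from_path`, the tiny `ALIASES` table, and both regexes are illustrative assumptions. The PR itself is described as building its alias map from book-languages.json and using more patterns than the two shown here.

```python
import re
from typing import Optional

# Hypothetical alias table; the PR derives its map from book-languages.json.
ALIASES = {
    "fr": "fr", "french": "fr",
    "en": "en", "english": "en",
    "de": "de", "german": "de",
}

# Bracketed tags such as "[BD FR]" or "[French]".
BRACKET_PATTERN = re.compile(r"\[([^\]]+)\]")
# Keyed tags such as "lang-fr" or "language: de".
KEYED_PATTERN = re.compile(
    r"\blang(?:uage)?\s*[:._-]?\s*([A-Za-z]{2,3})\b", re.IGNORECASE
)


def detect_language_from_path(path: Optional[str]) -> Optional[str]:
    """Return a language code inferred from a path, or None when nothing matches."""
    if not path:
        return None
    # Bracketed tags are checked first, token by token.
    for content in BRACKET_PATTERN.findall(path):
        for token in content.lower().split():
            if token in ALIASES:
                return ALIASES[token]
    # Fall back to an explicit "lang"/"language" key.
    keyed = KEYED_PATTERN.search(path)
    if keyed and keyed.group(1).lower() in ALIASES:
        return ALIASES[keyed.group(1).lower()]
    return None
```

The sketch returns the first alias it finds and makes no attempt to rank competing candidates, so it should be read as an outline of the idea only.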

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| shelfmark/config/settings.py | Adds new CheckboxField for the path-language toggle. |
| shelfmark/release_sources/direct_download.py | Implements distant-path extraction, language inference, alias map, and conditional local filtering in search_books. |
| tests/config/test_download_settings.py | Verifies the new settings field exists with expected default/description. |
| tests/direct_download/test_search_queries.py | Adds tests for distant-path language detection, edge cases, and the new local filtering path. |

Comment on lines +709 to +722
detected_from_path = _detect_language_from_distant_path(distant_path)

# Temporary visual diagnostics for field mapping and path-language inference.
if _is_language_from_path_enabled():
    logger.info(
        "DD lang debug | id=%s | title=%s | raw_lang=%s | detected_from_path=%s | distant_path=%s | cell11=%s | row=%s",
        record_id,
        _short_debug(title),
        _short_debug(language),
        _short_debug(detected_from_path),
        _short_debug(distant_path, limit=260),
        _short_debug(cells[10].get_text(" ", strip=True), limit=140),
        _short_debug(row.get_text(" ", strip=True), limit=260),
    )
Comment on lines +712 to +713
if _is_language_from_path_enabled():
logger.info(
_short_debug(row.get_text(" ", strip=True), limit=260),
)

if _is_language_from_path_enabled() and _is_missing_or_placeholder_language(language):
Comment on lines +290 to +292

def _extract_distant_path(row: Tag) -> str | None:
"""Extract distant path hints from a direct-download search row."""
_LANGUAGE_NAME_TOKEN_PATTERN = re.compile(r"[a-z]{3,}(?:-[a-z0-9]+)?")
_LANGUAGE_ALIAS_TO_CODE: dict[str, str] | None = None
_LANGUAGE_PLACEHOLDERS = frozenset({"", "-", "--", "—", "unknown", "unk", "n/a", "na"})
_AMBIGUOUS_SHORT_LANGUAGE_CODES = frozenset({"de", "en", "it", "la", "no", "or", "is", "in"})
r"\b(?:bd|lang(?:uage)?)\s*[:._-]?\s*([A-Za-z]{2,3})\b",
re.IGNORECASE,
)
_LANGUAGE_NAME_TOKEN_PATTERN = re.compile(r"[a-z]{3,}(?:-[a-z0-9]+)?")
Comment on lines +576 to +581
# When path-language inference is enabled, language filtering must happen
# after row parsing, otherwise source-side lang filters drop rows too early.
if not (path_language_enabled and requested_langs):
    for value in filters.lang or []:
        if value and value != "all":
            filters_query += f"&lang={quote(value)}"
Comment on lines +463 to +480
import shelfmark.release_sources.direct_download as dd

captured_url: dict[str, str] = {}

original_get = dd.config.get

def _fake_get(key: str, default=None, user_id=None):
    del user_id
    if key == "DIRECT_DOWNLOAD_LANGUAGE_FROM_PATH":
        return True
    return original_get(key, default)

monkeypatch.setattr(dd.config, "get", _fake_get)
monkeypatch.setattr(dd.network, "get_aa_base_url", lambda: "https://mirror.example")
monkeypatch.setattr(dd.network, "AAMirrorSelector", lambda: object())

def _fake_html_get_page(url: str, selector, allow_bypasser_fallback=False):
    del selector, allow_bypasser_fallback
Comment thread shelfmark/release_sources/direct_download.py Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 29, 2026 07:42

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

Comment on lines +726 to +729
# Temporary visual diagnostics for field mapping and path-language inference.
if _is_language_from_path_enabled():
    logger.info(
        "DD lang debug | id=%s | title=%s | raw_lang=%s | detected_from_path=%s | distant_path=%s | cell11=%s | row=%s",
Comment on lines +224 to +226
_LANGUAGE_CODE_TOKEN_PATTERN = re.compile(
r"(?:^|[\s_./\\\-\[(])([A-Za-z]{2,3})(?=$|[\s_./\\\-)\]])"
)
r"\b(?:bd|lang(?:uage)?)\s*[:._-]?\s*([A-Za-z]{2,3})\b",
re.IGNORECASE,
)
_LANGUAGE_NAME_TOKEN_PATTERN = re.compile(r"[a-z]{3,}(?:-[a-z0-9]+)?")
Comment on lines +266 to +270
def _language_alias_to_code() -> dict[str, str]:
    """Build alias->code map from bundled language metadata."""
    global _LANGUAGE_ALIAS_TO_CODE
    if _LANGUAGE_ALIAS_TO_CODE is not None:
        return _LANGUAGE_ALIAS_TO_CODE
Comment on lines +593 to +596
if not (path_language_enabled and requested_langs):
    for value in filters.lang or []:
        if value and value != "all":
            filters_query += f"&lang={quote(value)}"
Comment on lines +462 to +465
def test_search_books_filters_language_locally_when_path_language_enabled(monkeypatch):
    import shelfmark.release_sources.direct_download as dd

    captured_url: dict[str, str] = {}
)
_LANGUAGE_NAME_TOKEN_PATTERN = re.compile(r"[a-z]{3,}(?:-[a-z0-9]+)?")
_LANGUAGE_ALIAS_TO_CODE: dict[str, str] | None = None
_LANGUAGE_PLACEHOLDERS = frozenset({"", "-", "--", "—", "unknown", "unk", "n/a", "na"})
eephyne added 2 commits May 29, 2026 13:39
…eedback

Downgrade temporary per-row language diagnostics from INFO to DEBUG to reduce production log noise.
Cache path-language toggle per parsed row and skip distant-path extraction when the feature is disabled.
Make language alias cache initialization thread-safe with a lock to avoid race conditions on first load.
Reduce language false positives by tightening name-token matching and preferring non-ambiguous strong candidates.
Keep server-side language filtering enabled and apply local path-based filtering as an additional refinement.
Remove redundant normalized placeholder handling for em-dash language values.
Update Direct Download setting description to document language-filter trade-offs.
Fix test indentation consistency and update assertions for restored server-side lang query behavior.
Add regression coverage for bracket-order ambiguity (e.g., EN marker appearing before FR marker).
…accepts books without language metadata and adjusts the language filters for lgli files.
Copilot AI review requested due to automatic review settings June 4, 2026 14:22

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

Comment on lines +1443 to +1452
CheckboxField(
    key="DIRECT_DOWNLOAD_LANGUAGE_FROM_PATH",
    label="Detect Language From Distant Path",
    description=(
        "When language metadata is missing, parse the distant path and set language "
        "from tags like [BD FR]. Falls back to unknown when not detected. "
        "Note: source-side language filters still apply and may exclude poorly tagged rows."
    ),
    default=False,
),
Comment on lines +758 to +769
# Temporary visual diagnostics for field mapping and path-language inference.
if path_language_enabled:
    logger.debug(
        "DD lang debug | id=%s | title=%s | raw_lang=%s | detected_from_path=%s | distant_path=%s | cell11=%s | row=%s",
        record_id,
        _short_debug(title),
        _short_debug(language),
        _short_debug(detected_from_path),
        _short_debug(distant_path, limit=260),
        _short_debug(cells[10].get_text(" ", strip=True), limit=140),
        _short_debug(row.get_text(" ", strip=True), limit=260),
    )
Comment on lines +774 to +779
logger.debug(
    "DD lang debug resolved | id=%s | final_lang=%s | fallback=%s",
    record_id,
    _short_debug(language),
    "unknown" if detected_from_path is None else "detected",
)
Comment on lines +323 to +328
normalized = re.sub(
    r"\s+\.(epub|mobi|azw3|fb2|djvu|cbz|cbr|pdf|zip|rar|m4b|mp3)\b",
    r".\1",
    normalized,
    flags=re.IGNORECASE,
)
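
To make the substitution above concrete, here is a small standalone sketch. The wrapper name `collapse_space_before_extension` is made up for illustration; the pattern and replacement are the ones quoted in the review excerpt.

```python
import re

# Same pattern as in the excerpt: whitespace followed by a known file extension.
_EXTENSION_GAP = re.compile(
    r"\s+\.(epub|mobi|azw3|fb2|djvu|cbz|cbr|pdf|zip|rar|m4b|mp3)\b",
    flags=re.IGNORECASE,
)


def collapse_space_before_extension(text: str) -> str:
    """Remove whitespace that ended up in front of a known file extension."""
    return _EXTENSION_GAP.sub(r".\1", text)
```

Because of the trailing `\b`, a string such as `file .pdfx` is left alone: the extension has to end at a word boundary for the whitespace to be removed.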
Comment on lines +280 to +286
mapping: dict[str, str] = {}
data_path = Path(__file__).resolve().parents[2] / "data" / "book-languages.json"

try:
    raw = json.loads(data_path.read_text(encoding="utf-8"))
except (OSError, ValueError, TypeError):
    _LANGUAGE_ALIAS_TO_CODE = {}
Comment on lines +434 to +440
def _book_matches_requested_languages(book_language: str | None, requested: set[str]) -> bool:
    """Return True when a book language matches normalized requested filters.

    A book whose language is unknown (None) passes through: the server-side
    ``&lang=`` filter already constrained the result set, so dropping rows
    that simply lack metadata would hide relevant results.
    """
@eephyne eephyne marked this pull request as draft June 4, 2026 15:10
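
One way to read the `_book_matches_requested_languages` docstring quoted in the last review is sketched below. The review excerpt does not show the function body, so everything here is an assumption: the name `book_matches_requested_languages`, the lowercasing and trimming, and the choice to accept every book when no language was requested.

```python
from typing import Optional, Set


def book_matches_requested_languages(
    book_language: Optional[str], requested: Set[str]
) -> bool:
    """Local refinement: keep a book unless its known language is not requested."""
    if not requested:
        return True  # Assumption: no requested languages means no local filtering.
    if book_language is None:
        return True  # Unknown language passes through, as the docstring describes.
    return book_language.strip().lower() in requested
```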
calibrain added a commit that referenced this pull request Jun 15, 2026
…#1010, #1021, #1025, #1040) (#1066)

## Backport bug fixes from `NemesisHubris/litfinder`

Forwards a curated set of bug fixes from
[NemesisHubris/litfinder](https://github.com/NemesisHubris/litfinder) —
a community fork of this project — that address open issues here. All
commits preserve original authorship via `git cherry-pick`; this PR is a
backport rather than original work. Each fix has been reviewed locally,
lint/format-cleaned to match this repo's existing ruff config, and
verified with the test suite. Rebrand strings, license switches, and
features have been deliberately excluded.

### Upstream issues addressed

- **#999** — Mirror URLs with query params no longer break search
requests (strip query string/fragment in `normalize_http_url`)
- **#956** — Apprise notifications now respect the configured proxy
(proxy env vars injected before dispatch)
- **#1025** — rTorrent: separate `RTORRENT_AUDIOBOOK_LABEL` setting,
falls back to book label if unset
- **#1010** — Stop button in Activity no longer makes the panel
disappear (snapshot refresh on cancel)
- **#1021** — Anna's Archive slow-download countdown now caps retries
instead of looping forever
- **#1040** — Empty destination directory cleaned up when write probe
fails
- **PR #1031** — Language detection from Anna's Archive distant path
when listing metadata is missing
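
The #999 fix listed above (dropping the query string and fragment from mirror URLs) can be approximated with the standard library. This is a sketch under assumptions: `strip_query_and_fragment` is a hypothetical name, and the project's `normalize_http_url` likely does more than this.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query_and_fragment(url: str) -> str:
    """Return the URL with its query string and fragment removed."""
    parts = urlsplit(url)
    # Rebuild from scheme, host, and path only; query and fragment are blanked.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
```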

### Additional fixes (no open issue but clear bugs)

- **fix: Python 2 `except` syntax across 27 files** — `except X, Y:` is
a SyntaxError in Python 3 and prevents affected modules from importing
at runtime. Mechanical sweep to `except (X, Y):`.
- **fix(abb): info hash validation with magnet fallback** — adds
SHA-1/SHA-256 hex validation on extracted info hashes; falls back to
scanning the full page for a magnet link (e.g. posted in comments) when
the table value is malformed. Also extends the exact-phrase fallback to
manual queries and defaults the ABB listing language to `en` when
missing, preventing valid results from being hidden by the language
filter. Includes a small test-fixture fix (`test(abb): use valid hex
info hashes in scraper test fixtures`) since the existing fixtures used
non-hex placeholders that the new validation correctly rejects.
- **fix: Anna's Archive title parser** — handles nested edition spans
and filters `lgli` catalog descriptor entries (e.g. "Book/Online Audio")
that were polluting search results.
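
The hex validation described for info hashes in the list above can be sketched as follows. Assumptions: the helper name `is_valid_info_hash` is hypothetical, and the accepted lengths are 40 hex characters for SHA-1 and 64 for SHA-256, which is what those digests produce in hexadecimal form. The magnet-link fallback is not shown.

```python
import re

# Exactly 40 hex characters (SHA-1) or exactly 64 (SHA-256).
_INFO_HASH_PATTERN = re.compile(r"(?:[0-9a-fA-F]{40}|[0-9a-fA-F]{64})")


def is_valid_info_hash(value: str) -> bool:
    """Return True when the value is a well-formed hex info hash."""
    return _INFO_HASH_PATTERN.fullmatch(value.strip()) is not None
```

A check like this rejects non-hex placeholders, which is consistent with the note above that the old test fixtures had to be replaced with valid hex values.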

### Deliberately not included

- LitFinder rebranding (UI strings, Apprise app ID, logo). The `fix:
three upstream bugs` commit (#999/#956/#1025) was cherry-picked with
Apprise app-id, description, and logo-URL strings reverted from
"LitFinder" back to "Shelfmark"; noted in the commit body.
- Features from the LitFinder fork (multi-variant title search,
multi-book flat-folder grouping, fuzzy text matching, "Leave in Place"
output handler, admin display name, custom-source plugin system). These
are larger behavior changes that each warrant their own focused review —
happy to send any of them separately if of interest.
- LitFinder-specific test environment and CI infrastructure.

### Verification

- Backend: **1879 passed**, 96 skipped (1 preexisting failure on
`seleniumbase`-dependent test in local venv; runs fine in the standard
Docker image with the `browser` extra)
- Lint, format, dead-code: all clean against this repo's existing
ruff/vulture config
- One follow-up cleanup commit (`style: ruff lint and format fixes for
ported commits`) brings the cherry-picked code into compliance with this
repo's ruff settings — no behavior changes there

### Etiquette / credit

Per-commit authorship preserved by cherry-pick. The only edits to the
original commits are:
- `fix: three upstream bugs` — Apprise rebrand strings reverted to
"Shelfmark" (noted in commit body, original author retained as
`Co-Authored-By` via cherry-pick)
- One follow-up `style:` commit for ruff config alignment

Big thanks to [@NemesisHubris](https://github.com/NemesisHubris) for the
original work in LitFinder; this PR exists to make sure these fixes
reach Shelfmark's wider user base. Happy to revise scope, split into
smaller PRs, or split off the Py2 cleanup separately if that's
preferable.

---------

Co-authored-by: NemesisHubris <155838970+NemesisHubris@users.noreply.github.com>
Co-authored-by: CaliBrain <calibrain@l4n.xyz>