Added language detection through the distant path of AA. by eephyne · Pull Request #1031 · calibrain/shelfmark

eephyne · 2026-05-28T14:47:53Z

Added language detection through the distant path of AA.
Added an option in the settings to disable this feature.

This avoids Shelfmark dropping many results because the language is missing in AA.
With this option enabled, it parses the distant path (I don't know what it is actually called) to look for a language and sets the language accordingly.

I tested it with different requests and languages, and the parsing seems OK to me, but it can probably be improved.
The option can be enabled or disabled in the settings under Direct Download > Download Source.


Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds an opt-in feature to the Direct Download source that infers a book's language from the "distant path" (file path in search results) when the language metadata is missing, with corresponding settings, parsing logic, and tests.

Changes:

- New DIRECT_DOWNLOAD_LANGUAGE_FROM_PATH checkbox setting (default off).
- Logic in direct_download.py to extract a distant path from result rows, detect language via bracket/keyed/name/code patterns with alias mapping from book-languages.json, and apply local language filtering after parsing.
- Extensive new tests covering detection, false-positive avoidance, legacy behavior, and search-level local filtering.
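
The bracket-tag part of the detection described in this overview can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the function name `detect_language_from_path`, the `ALIASES` table, and the sample paths are all hypothetical, and the real implementation builds its alias map from book-languages.json and also handles keyed, name, and code patterns.

```python
from __future__ import annotations

import re

# Hypothetical alias table for illustration only; the PR derives its
# mapping from the bundled book-languages.json file.
ALIASES = {
    "fr": "fr", "french": "fr",
    "en": "en", "english": "en",
    "de": "de", "german": "de",
}

# Matches the content of bracketed tags such as "[BD FR]" or "[French]".
BRACKET_TAG = re.compile(r"\[([^\[\]]+)\]")


def detect_language_from_path(path: str) -> str | None:
    """Return the first language code found inside a bracketed tag, else None."""
    for tag in BRACKET_TAG.findall(path):
        for token in tag.lower().split():
            code = ALIASES.get(token)
            if code is not None:
                return code
    return None


print(detect_language_from_path("lgli/Comics/[BD FR] Asterix 01.cbz"))
print(detect_language_from_path("lgli/Books/Some Novel.epub"))
```

Returning `None` when nothing matches keeps the "falls back to unknown" behavior in the caller's hands rather than guessing a language.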

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| shelfmark/config/settings.py | Adds new CheckboxField for the path-language toggle. |
| shelfmark/release_sources/direct_download.py | Implements distant-path extraction, language inference, alias map, and conditional local filtering in `search_books`. |
| tests/config/test_download_settings.py | Verifies the new settings field exists with expected default/description. |
| tests/direct_download/test_search_queries.py | Adds tests for distant-path language detection, edge cases, and the new local filtering path. |


+        detected_from_path = _detect_language_from_distant_path(distant_path)
+
+        # Temporary visual diagnostics for field mapping and path-language inference.
+        if _is_language_from_path_enabled():
+            logger.info(
+                "DD lang debug | id=%s | title=%s | raw_lang=%s | detected_from_path=%s | distant_path=%s | cell11=%s | row=%s",
+                record_id,
+                _short_debug(title),
+                _short_debug(language),
+                _short_debug(detected_from_path),
+                _short_debug(distant_path, limit=260),
+                _short_debug(cells[10].get_text(" ", strip=True), limit=140),
+                _short_debug(row.get_text(" ", strip=True), limit=260),
+            )


+        if _is_language_from_path_enabled():
+            logger.info(


+                _short_debug(row.get_text(" ", strip=True), limit=260),
+            )
+
+        if _is_language_from_path_enabled() and _is_missing_or_placeholder_language(language):


+
+def _extract_distant_path(row: Tag) -> str | None:
+    """Extract distant path hints from a direct-download search row."""


+_LANGUAGE_NAME_TOKEN_PATTERN = re.compile(r"[a-z]{3,}(?:-[a-z0-9]+)?")
+_LANGUAGE_ALIAS_TO_CODE: dict[str, str] | None = None
+_LANGUAGE_PLACEHOLDERS = frozenset({"", "-", "--", "—", "unknown", "unk", "n/a", "na"})
+_AMBIGUOUS_SHORT_LANGUAGE_CODES = frozenset({"de", "en", "it", "la", "no", "or", "is", "in"})
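
The diff shows the `_LANGUAGE_PLACEHOLDERS` constant and a call to `_is_missing_or_placeholder_language(language)`, but not the body of that check. One plausible shape for it is sketched below; the normalization (strip plus lowercase) is an assumption, and the function is renamed without the leading underscore to make clear it is not the PR's code.

```python
from __future__ import annotations

# Copied from the diff excerpt above.
_LANGUAGE_PLACEHOLDERS = frozenset({"", "-", "--", "—", "unknown", "unk", "n/a", "na"})


def is_missing_or_placeholder_language(language: str | None) -> bool:
    """Treat None and placeholder strings as 'no usable language metadata'.

    Assumption: values are compared after trimming whitespace and lowercasing.
    """
    if language is None:
        return True
    return language.strip().lower() in _LANGUAGE_PLACEHOLDERS


for value in (None, " Unknown ", "N/A", "fr"):
    print(repr(value), is_missing_or_placeholder_language(value))
```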


+    r"\b(?:bd|lang(?:uage)?)\s*[:._-]?\s*([A-Za-z]{2,3})\b",
+    re.IGNORECASE,
+)
+_LANGUAGE_NAME_TOKEN_PATTERN = re.compile(r"[a-z]{3,}(?:-[a-z0-9]+)?")


+    # When path-language inference is enabled, language filtering must happen
+    # after row parsing, otherwise source-side lang filters drop rows too early.
+    if not (path_language_enabled and requested_langs):
+        for value in filters.lang or []:
+            if value and value != "all":
+                filters_query += f"&lang={quote(value)}"
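
To make the effect of this condition easier to see, the snippet can be isolated into a small standalone function. The wrapper name `build_lang_query` and its parameters are hypothetical; only the loop body and the `quote` call come from the diff.

```python
from urllib.parse import quote


def build_lang_query(langs, path_language_enabled, requested_langs):
    """Return the '&lang=...' query fragment, or '' when filtering is deferred.

    Mirrors the condition in the diff: when path-language inference is on AND
    specific languages were requested, no source-side lang filter is sent, so
    rows with missing metadata survive until local filtering.
    """
    query = ""
    if not (path_language_enabled and requested_langs):
        for value in langs or []:
            if value and value != "all":
                query += f"&lang={quote(value)}"
    return query


print(build_lang_query(["fr", "all", ""], False, {"fr"}))
print(build_lang_query(["fr"], True, {"fr"}))
```

With the feature disabled the first call yields `&lang=fr`; with it enabled and a language requested, the fragment is empty and filtering has to happen after row parsing.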


+        import shelfmark.release_sources.direct_download as dd
+
+        captured_url: dict[str, str] = {}
+
+        original_get = dd.config.get
+
+        def _fake_get(key: str, default=None, user_id=None):
+                del user_id
+                if key == "DIRECT_DOWNLOAD_LANGUAGE_FROM_PATH":
+                        return True
+                return original_get(key, default)
+
+        monkeypatch.setattr(dd.config, "get", _fake_get)
+        monkeypatch.setattr(dd.network, "get_aa_base_url", lambda: "https://mirror.example")
+        monkeypatch.setattr(dd.network, "AAMirrorSelector", lambda: object())
+
+        def _fake_html_get_page(url: str, selector, allow_bypasser_fallback=False):
+                del selector, allow_bypasser_fallback


Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

+        # Temporary visual diagnostics for field mapping and path-language inference.
+        if _is_language_from_path_enabled():
+            logger.info(
+                "DD lang debug | id=%s | title=%s | raw_lang=%s | detected_from_path=%s | distant_path=%s | cell11=%s | row=%s",


+_LANGUAGE_CODE_TOKEN_PATTERN = re.compile(
+    r"(?:^|[\s_./\\\-\[(])([A-Za-z]{2,3})(?=$|[\s_./\\\-)\]])"
+)
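
A quick way to see why short code tokens need extra care is to run this pattern over a sample path. The path, the `KNOWN_CODES` set, and the filtering step below are illustrative assumptions; only the regex and the ambiguous-code set are taken from the diff.

```python
import re

# Pattern copied from the diff: a 2-3 letter token delimited by separators.
_LANGUAGE_CODE_TOKEN_PATTERN = re.compile(
    r"(?:^|[\s_./\\\-\[(])([A-Za-z]{2,3})(?=$|[\s_./\\\-)\]])"
)
# Set copied from the diff.
_AMBIGUOUS_SHORT_LANGUAGE_CODES = frozenset({"de", "en", "it", "la", "no", "or", "is", "in"})

# Hypothetical set of valid codes, standing in for book-languages.json.
KNOWN_CODES = {"fr", "en", "de", "in"}

path = "comics/[BD FR]/Asterix in Italy.cbz"
tokens = _LANGUAGE_CODE_TOKEN_PATTERN.findall(path)
print(tokens)  # ['BD', 'FR', 'in', 'cbz']

# Keep only tokens that are known codes and not ordinary short words.
strong = [
    t.lower()
    for t in tokens
    if t.lower() in KNOWN_CODES and t.lower() not in _AMBIGUOUS_SHORT_LANGUAGE_CODES
]
print(strong)  # ['fr']
```

The word "in" from the title and the file extension both match the raw pattern, which is the kind of false positive the ambiguous-code set is presumably meant to suppress.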


+    r"\b(?:bd|lang(?:uage)?)\s*[:._-]?\s*([A-Za-z]{2,3})\b",
+    re.IGNORECASE,
+)
+_LANGUAGE_NAME_TOKEN_PATTERN = re.compile(r"[a-z]{3,}(?:-[a-z0-9]+)?")


+def _language_alias_to_code() -> dict[str, str]:
+    """Build alias->code map from bundled language metadata."""
+    global _LANGUAGE_ALIAS_TO_CODE
+    if _LANGUAGE_ALIAS_TO_CODE is not None:
+        return _LANGUAGE_ALIAS_TO_CODE


+    if not (path_language_enabled and requested_langs):
+        for value in filters.lang or []:
+            if value and value != "all":
+                filters_query += f"&lang={quote(value)}"


+def test_search_books_filters_language_locally_when_path_language_enabled(monkeypatch):
+        import shelfmark.release_sources.direct_download as dd
+
+        captured_url: dict[str, str] = {}


+)
+_LANGUAGE_NAME_TOKEN_PATTERN = re.compile(r"[a-z]{3,}(?:-[a-z0-9]+)?")
+_LANGUAGE_ALIAS_TO_CODE: dict[str, str] | None = None
+_LANGUAGE_PLACEHOLDERS = frozenset({"", "-", "--", "—", "unknown", "unk", "n/a", "na"})


…eedback

- Downgrade temporary per-row language diagnostics from INFO to DEBUG to reduce production log noise.
- Cache path-language toggle per parsed row and skip distant-path extraction when the feature is disabled.
- Make language alias cache initialization thread-safe with a lock to avoid race conditions on first load.
- Reduce language false positives by tightening name-token matching and preferring non-ambiguous strong candidates.
- Keep server-side language filtering enabled and apply local path-based filtering as an additional refinement.
- Remove redundant normalized placeholder handling for em-dash language values.
- Update Direct Download setting description to document language-filter trade-offs.
- Fix test indentation consistency and update assertions for restored server-side lang query behavior.
- Add regression coverage for bracket-order ambiguity (e.g., EN marker appearing before FR marker).
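
The lock-protected cache initialization mentioned in this commit message is commonly written with a double-checked pattern. The sketch below shows that general technique, not the commit's actual code: `_load_aliases` is a hypothetical stand-in for reading book-languages.json, and the names are illustrative.

```python
from __future__ import annotations

import threading

_alias_cache: dict[str, str] | None = None
_alias_cache_lock = threading.Lock()


def _load_aliases() -> dict[str, str]:
    """Hypothetical loader; the PR reads its data from book-languages.json."""
    return {"french": "fr", "fr": "fr", "english": "en", "en": "en"}


def language_alias_to_code() -> dict[str, str]:
    """Return the shared alias map, building it at most once across threads."""
    global _alias_cache
    if _alias_cache is not None:  # fast path: already built, no lock needed
        return _alias_cache
    with _alias_cache_lock:
        if _alias_cache is None:  # re-check: another thread may have won the race
            _alias_cache = _load_aliases()
        return _alias_cache


print(language_alias_to_code()["french"])
```

The second `is None` check inside the lock is what prevents two threads that both passed the fast path from loading the file twice.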

…epts books without language metadata and adjusts the language filters for lgli files.

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

+        CheckboxField(
+            key="DIRECT_DOWNLOAD_LANGUAGE_FROM_PATH",
+            label="Detect Language From Distant Path",
+            description=(
+                "When language metadata is missing, parse the distant path and set language "
+                "from tags like [BD FR]. Falls back to unknown when not detected. "
+                "Note: source-side language filters still apply and may exclude poorly tagged rows."
+            ),
+            default=False,
+        ),


+        # Temporary visual diagnostics for field mapping and path-language inference.
+        if path_language_enabled:
+            logger.debug(
+                "DD lang debug | id=%s | title=%s | raw_lang=%s | detected_from_path=%s | distant_path=%s | cell11=%s | row=%s",
+                record_id,
+                _short_debug(title),
+                _short_debug(language),
+                _short_debug(detected_from_path),
+                _short_debug(distant_path, limit=260),
+                _short_debug(cells[10].get_text(" ", strip=True), limit=140),
+                _short_debug(row.get_text(" ", strip=True), limit=260),
+            )


+            logger.debug(
+                "DD lang debug resolved | id=%s | final_lang=%s | fallback=%s",
+                record_id,
+                _short_debug(language),
+                "unknown" if detected_from_path is None else "detected",
+            )


+        normalized = re.sub(
+            r"\s+\.(epub|mobi|azw3|fb2|djvu|cbz|cbr|pdf|zip|rar|m4b|mp3)\b",
+            r".\1",
+            normalized,
+            flags=re.IGNORECASE,
+        )
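
The substitution above rejoins a file extension that has been separated from its name by whitespace, presumably an artifact of extracting cell text with a space separator. A standalone demonstration follows; the helper name `rejoin_extension` and the sample strings are hypothetical, while the regex is the one from the diff.

```python
import re

# Regex copied from the diff: whitespace followed by a known file extension.
_DETACHED_EXTENSION = re.compile(
    r"\s+\.(epub|mobi|azw3|fb2|djvu|cbz|cbr|pdf|zip|rar|m4b|mp3)\b",
    re.IGNORECASE,
)


def rejoin_extension(text: str) -> str:
    """Collapse 'name .ext' into 'name.ext' for the listed extensions only."""
    return _DETACHED_EXTENSION.sub(r".\1", text)


print(rejoin_extension("Asterix [FR] .cbz"))  # Asterix [FR].cbz
print(rejoin_extension("notes .txt"))         # notes .txt (not a listed extension)
```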


+        mapping: dict[str, str] = {}
+        data_path = Path(__file__).resolve().parents[2] / "data" / "book-languages.json"
+
+        try:
+            raw = json.loads(data_path.read_text(encoding="utf-8"))
+        except (OSError, ValueError, TypeError):
+            _LANGUAGE_ALIAS_TO_CODE = {}


+def _book_matches_requested_languages(book_language: str | None, requested: set[str]) -> bool:
+    """Return True when a book language matches normalized requested filters.
+
+    A book whose language is unknown (None) passes through: the server-side
+    ``&lang=`` filter already constrained the result set, so dropping rows
+    that simply lack metadata would hide relevant results.
+    """
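
Only the signature and docstring of this helper appear in the excerpt. A body consistent with that docstring might look like the sketch below; the handling of an empty `requested` set and the lowercase normalization are assumptions, and the function is renamed without the leading underscore to avoid presenting it as the PR's implementation.

```python
from __future__ import annotations


def book_matches_requested_languages(book_language: str | None, requested: set[str]) -> bool:
    """Return True when a book should be kept for the requested languages.

    Assumptions: `requested` holds lowercase codes, and an empty set means
    no language filter was requested.
    """
    if not requested:
        return True
    if book_language is None:
        # Unknown language passes through, as the docstring in the diff states.
        return True
    return book_language.strip().lower() in requested


print(book_matches_requested_languages(None, {"fr"}))
print(book_matches_requested_languages("FR", {"fr"}))
print(book_matches_requested_languages("en", {"fr"}))
```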


@NemesisHubris

…#1010, #1021, #1025, #1040) (#1066)

## Backport bug fixes from `NemesisHubris/litfinder`

Forwards a curated set of bug fixes from [NemesisHubris/litfinder](https://github.com/NemesisHubris/litfinder), a community fork of this project, that address open issues here. All commits preserve original authorship via `git cherry-pick`; this PR is a backport rather than original work.

Each fix has been reviewed locally, lint/format-cleaned to match this repo's existing ruff config, and verified with the test suite. Rebrand strings, license switches, and features have been deliberately excluded.

### Upstream issues addressed

- **#999**: Mirror URLs with query params no longer break search requests (strip query string/fragment in `normalize_http_url`)
- **#956**: Apprise notifications now respect the configured proxy (proxy env vars injected before dispatch)
- **#1025**: rTorrent: separate `RTORRENT_AUDIOBOOK_LABEL` setting, falls back to book label if unset
- **#1010**: Stop button in Activity no longer makes the panel disappear (snapshot refresh on cancel)
- **#1021**: Anna's Archive slow-download countdown now caps retries instead of looping forever
- **#1040**: Empty destination directory cleaned up when write probe fails
- **PR #1031**: Language detection from Anna's Archive distant path when listing metadata is missing

### Additional fixes (no open issue but clear bugs)

- **fix: Python 2 `except` syntax across 27 files**: `except X, Y:` is a SyntaxError in Python 3 and prevents affected modules from importing at runtime. Mechanical sweep to `except (X, Y):`.
- **fix(abb): info hash validation with magnet fallback**: adds SHA-1/SHA-256 hex validation on extracted info hashes; falls back to scanning the full page for a magnet link (e.g. posted in comments) when the table value is malformed. Also extends the exact-phrase fallback to manual queries and defaults the ABB listing language to `en` when missing, preventing valid results from being hidden by the language filter. Includes a small test-fixture fix (`test(abb): use valid hex info hashes in scraper test fixtures`) since the existing fixtures used non-hex placeholders that the new validation correctly rejects.
- **fix: Anna's Archive title parser**: handles nested edition spans and filters `lgli` catalog descriptor entries (e.g. "Book/Online Audio") that were polluting search results.

### Deliberately not included

- LitFinder rebranding (UI strings, Apprise app ID, logo). The `fix: three upstream bugs` commit (#999/#956/#1025) was cherry-picked with Apprise app-id, description, and logo-URL strings reverted from "LitFinder" back to "Shelfmark"; noted in the commit body.
- Features from the LitFinder fork (multi-variant title search, multi-book flat-folder grouping, fuzzy text matching, "Leave in Place" output handler, admin display name, custom-source plugin system). These are larger behavior changes that each warrant their own focused review; happy to send any of them separately if of interest.
- LitFinder-specific test environment and CI infrastructure.

### Verification

- Backend: **1879 passed**, 96 skipped (1 preexisting failure on `seleniumbase`-dependent test in local venv; runs fine in the standard Docker image with the `browser` extra)
- Lint, format, dead-code: all clean against this repo's existing ruff/vulture config
- One follow-up cleanup commit (`style: ruff lint and format fixes for ported commits`) brings the cherry-picked code into compliance with this repo's ruff settings; no behavior changes there

### Etiquette / credit

Per-commit authorship preserved by cherry-pick. The only edits to the original commits are:

- `fix: three upstream bugs`: Apprise rebrand strings reverted to "Shelfmark" (noted in commit body, original author retained as `Co-Authored-By` via cherry-pick)
- One follow-up `style:` commit for ruff config alignment

Big thanks to [@NemesisHubris](https://github.com/NemesisHubris) for the original work in LitFinder; this PR exists to make sure these fixes reach Shelfmark's wider user base. Happy to revise scope, split into smaller PRs, or split off the Py2 cleanup separately if that's preferable.

---------

Co-authored-by: NemesisHubris <155838970+NemesisHubris@users.noreply.github.com>
Co-authored-by: CaliBrain <calibrain@l4n.xyz>

- Added language detection through the distant path of AA.

94f74b9

- Added option in setting to disable this feature.

Copilot AI review requested due to automatic review settings May 28, 2026 14:47

Copilot AI reviewed May 28, 2026


eephyne mentioned this pull request May 28, 2026

Can't find certain books #996

Open

Potential fix for pull request finding

a24d325

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings May 29, 2026 07:42

Copilot AI reviewed May 29, 2026


eephyne added 2 commits May 29, 2026 13:39

Improves language handling for search results: acc…

e0912b4

…epts books without language metadata and adjusts the language filters for lgli files.

Copilot AI review requested due to automatic review settings June 4, 2026 14:22

Copilot AI reviewed Jun 4, 2026


eephyne marked this pull request as draft June 4, 2026 15:10

This was referenced Jun 14, 2026

Port: LitFinder v1.4.6 (bug fixes and features, no plugin system) spin-drift/shelfmark-plus#2

Merged

Backport bug fixes from NemesisHubris/litfinder (addresses #999, #956, #1010, #1021, #1025, #1040) #1066

Merged

eephyne wants to merge 4 commits into calibrain:main from eephyne:feature-search_wider
