
Implement NIF-based citation extraction with precise character offsets #839

Open
anshuman9468 wants to merge 6 commits into dbpedia:master from anshuman9468:fix-nif-citations

Conversation

Contributor

@anshuman9468 anshuman9468 commented Mar 16, 2026

This PR implements NIF-based citation extraction to provide precise character offsets and direct links to external sources using itsrdf:taIdentRef. This replaces the previous line-number-based citations.
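For illustration, here is a minimal, self-contained sketch of the citation lookup the PR describes: detecting a `#cite_note-` anchor and resolving it against a citation-ID → URL map. The class and method names (CitationSketch, isCitationHref, citationIdOf) are illustrative only, not the PR's actual API.

```java
import java.util.HashMap;
import java.util.Map;

public class CitationSketch {
    // An in-page citation anchor looks like href="#cite_note-X".
    static boolean isCitationHref(String href) {
        return href != null && href.startsWith("#cite_note-");
    }

    // Drop the leading '#' to obtain the citation ID used as the map key.
    static String citationIdOf(String href) {
        return href.substring(1);
    }

    public static void main(String[] args) {
        // The map would be populated from list items like <li id="cite_note-X">
        // that contain an external link; this entry is made up for the demo.
        Map<String, String> citationMap = new HashMap<>();
        citationMap.put("cite_note-X", "https://example.org/source");

        String href = "#cite_note-X";
        if (isCitationHref(href)) {
            String url = citationMap.get(citationIdOf(href));
            // The extractor would then emit: <annotationUri> itsrdf:taIdentRef <url>
            System.out.println(citationIdOf(href) + " -> " + url);
        }
    }
}
```

The real implementation lives in LinkExtractor.java and HtmlNifExtractor.scala (see the walkthrough below); this snippet only shows the shape of the lookup.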

Summary by CodeRabbit

  • New Features

    • Citation link detection and tracking with citation ID support
    • Citation URL mapping to improve resolved citation links
    • Added English language mapping for addons
  • Bug Fixes

    • Safer disambiguation title handling with a fallback default
    • More robust language list retrieval with improved error handling
    • Adjusted handling of reference/superscript elements during HTML extraction
  • Chores

    • Streamlined snapshot deployment command/configuration

@coderabbitai

coderabbitai Bot commented Mar 16, 2026

📝 Walkthrough

Adds citation support to NIF extraction: Link model gains citation fields; LinkExtractor marks citation anchors; HtmlNifExtractor collects citation target URLs and emits citation and taIdentRef triples; removes sup.reference from nif removal; safer disambiguation title lookup.

Changes

  • CI & Config (.github/workflows/snapshot_deploy.yml, core/src/main/resources/nifextractionconfig.json): adds the Maven property skipNexusStagingDeployMojo=true to the snapshot deploy step; removes "sup.reference" from the default nif-remove-elements.
  • Link model (core/src/main/java/org/dbpedia/extraction/nif/Link.java): adds citation (boolean) and citationId (String) fields with public getters/setters.
  • Extraction parsing (core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java): detects anchors like #cite_note-..., marks those links as citations, and records their citation IDs; reflows formatting in the parsing logic.
  • Citation mapping & emission (core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala): adds a citationMap that collects citation-ID → external-URL pairs from list items; emits Citation-type and taIdentRef triples for citation links when a URL is available.
  • Disambiguation fallback (core/src/main/scala/org/dbpedia/extraction/mappings/DisambiguationExtractor.scala): replaces direct map access with getOrElse(language.wikiCode, " (disambiguation)") to provide a safe default.
  • Language list retrieval (core/src/main/scala/org/dbpedia/extraction/util/Language.scala): wraps the HTTP call and response parsing with safer control flow, User-Agent handling, a 2xx status check, trimming/filtering of lines, and robust error handling.
  • Language data (core/src/main/resources/addonlangs.json): adds an en mapping entry (wikiCode/name/isoCode/iso639_3).
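A sketch of what such an entry could look like, assuming addonlangs.json holds a list of objects with exactly these four fields; the surrounding file structure is assumed, not taken from the PR:

```json
{
  "wikiCode": "en",
  "name": "English",
  "isoCode": "en",
  "iso639_3": "eng"
}
```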

Sequence Diagram

sequenceDiagram
    participant HTML as HTML Parser
    participant LE as LinkExtractor
    participant EM as ElementMap
    participant HNE as HtmlNifExtractor
    participant TG as TripleGenerator

    HTML->>LE: encounter anchor (`<a href="#cite_note-X">`)
    LE->>LE: set link.citation = true\nset citationId = "cite_note-X"
    LE->>EM: store Link (citation=true)

    HTML->>HNE: scan list item `id=cite_note-X` with external link
    HNE->>HNE: extract external URL\ncitationMap["cite_note-X"] = URL

    EM->>HNE: provide Link for anchor
    HNE->>HNE: detect Link.citation == true
    HNE->>TG: emit Citation-type triple
    HNE->>TG: emit taIdentRef triple with URL from citationMap
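In RDF terms, the final two steps of the diagram would produce triples roughly like the following; the subject URI and the external URL are made up for illustration, and the exact annotation URI scheme is not specified in this summary:

```turtle
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .

<http://example.org/nif/annotation-for-cite_note-X>
    a                 <http://dbpedia.org/ontology/Citation> ;
    itsrdf:taIdentRef <https://example.org/cited-source> .
```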

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 5.56%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (4 passed)
  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title directly describes the main objective of the PR (implementing NIF-based citation extraction with precise character offsets), which aligns with the citation-handling changes across the NIF extraction pipeline.
  • Linked Issues Check ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check ✅ Passed: check skipped because no linked issues were found for this pull request.



@sonarqubecloud

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala (1)

289-300: Consider extracting hardcoded URIs to constants.

The ontology and ITS RDF namespace URIs are hardcoded. For consistency with the rest of the codebase (which uses RdfNamespace), consider using constants:

Extract to constants
// Could define at class or package level:
private val CitationType = "http://dbpedia.org/ontology/Citation"
private val ItsRdfTaIdentRef = "http://www.w3.org/2005/11/its/rdf#taIdentRef"

// Or use RdfNamespace if appropriate entries exist/can be added
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala` around
lines 289 - 300, Replace the hardcoded ontology/ITS URIs in the HtmlNifExtractor
branch (the strings passed to nifLinks when link.isCitation and in the UriUtils
Success case) with shared constants: add private vals (e.g., CitationType =
"http://dbpedia.org/ontology/Citation" and ItsRdfTaIdentRef =
"http://www.w3.org/2005/11/its/rdf#taIdentRef") at the class or package level
and reference those constants in the calls to nifLinks (the occurrences around
link.isCitation, nifLinks(..., RdfNamespace.RDF.append("type"), CitationType,
...) and nifLinks(..., ItsRdfTaIdentRef, ...)); alternatively, if appropriate
entries exist in RdfNamespace, use RdfNamespace for the ITS URI instead of a raw
string.
core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java (1)

197-211: Confusing substring logic works by accident.

The code checks for ?title= in the URI but standard wiki links (e.g., /wiki/Philosopher) don't contain this parameter. The extraction works by coincidence:

  • indexOf("?title=") returns -1 when not found
  • -1 + 7 = 6, which equals /wiki/.length()

This makes the code fragile and confusing. Consider rewriting to explicitly extract the path segment:

Clearer implementation
 private String cleanLink(String uri, boolean external) {
     if (!external) {
         String linkPrefix = "/wiki/";
         String linkPrefix2 = "./";
         if (uri.contains(linkPrefix)) {
-            uri = uri.substring(uri.indexOf("?title=") + 7);
+            uri = uri.substring(uri.indexOf(linkPrefix) + linkPrefix.length());
         } else if (uri.contains(linkPrefix2)) {
-            uri = uri.substring(uri.indexOf("?title=") + 3);
+            uri = uri.substring(uri.indexOf(linkPrefix2) + linkPrefix2.length());
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java` around
lines 197 - 211, The substring logic in LinkExtractor that uses
indexOf("?title=")+7 is fragile; update the block that handles non-external URIs
to explicitly extract the article name: if uri contains "/wiki/" take the
substring after the literal "/wiki/" (e.g., uri.substring(uri.indexOf("/wiki/")
+ "/wiki/".length())), else if uri contains "./" or starts with "./" take the
substring after "./" (uri.substring(uri.indexOf("./") + 2)), and then build the
full DBpedia resource URL using this cleaned article name and
this.context.language (preserving the existing conditional that uses
"http://{lang}.dbpedia.org/resource/" vs "http://dbpedia.org/resource/"); ensure
you do not rely on "?title=" or negative indexes.
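To see why the original works only by coincidence, a small standalone demonstration (the class name UriCleanDemo is illustrative, not part of the codebase):

```java
public class UriCleanDemo {
    // The fragile original: indexOf("?title=") returns -1 for /wiki/ links,
    // and -1 + 7 == 6 == "/wiki/".length(), so it happens to strip the prefix.
    static String fragile(String uri) {
        return uri.substring(uri.indexOf("?title=") + 7);
    }

    // The explicit form recommended in the review comment above.
    static String explicit(String uri) {
        String prefix = "/wiki/";
        return uri.substring(uri.indexOf(prefix) + prefix.length());
    }

    public static void main(String[] args) {
        String uri = "/wiki/Philosopher";
        System.out.println(fragile(uri));   // "Philosopher", but by accident
        System.out.println(explicit(uri));  // "Philosopher", by design
    }
}
```

Both methods return the same result for well-formed /wiki/ links, but only the explicit version fails loudly (rather than silently mis-slicing) if the URI shape ever changes.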

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 572a5f8a-0c06-4a64-bdbc-ffd4e60b539a

📥 Commits

Reviewing files that changed from the base of the PR and between e3dfe8b and 7879400.

📒 Files selected for processing (6)
  • .github/workflows/snapshot_deploy.yml
  • core/src/main/java/org/dbpedia/extraction/nif/Link.java
  • core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java
  • core/src/main/resources/nifextractionconfig.json
  • core/src/main/scala/org/dbpedia/extraction/mappings/DisambiguationExtractor.scala
  • core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala
💤 Files with no reviewable changes (1)
  • core/src/main/resources/nifextractionconfig.json

Comment on lines 186 to 194
private void setUri(String uri) {
    if (uri != null) {
        tempLink.setUri(uri);
        tempLink.setExternal(true);
        inLink = true;
    } else {
        tempLink = new Link();
    }
}

⚠️ Potential issue | 🟠 Major

Bug: All links are incorrectly marked as external.

The setUri method unconditionally sets external = true for all non-null URIs. This marks internal wiki links as external when they shouldn't be.

The method is called for both internal wiki links (e.g., lines 102, 113) and actual external links (e.g., lines 106, 146), but the distinction is lost.

Proposed fix: Pass external flag to setUri
- private void setUri(String uri) {
+ private void setUri(String uri, boolean external) {
     if (uri != null) {
         tempLink.setUri(uri);
-        tempLink.setExternal(true);
+        tempLink.setExternal(external);
         inLink = true;
     } else {
         tempLink = new Link();
     }
 }

Then update call sites:

  • Lines 102, 113, 136: setUri(uri, false);
  • Lines 106, 146: setUri(uri, true);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java` around
lines 186 - 194, The setUri method currently hardcodes
tempLink.setExternal(true) which marks all non-null URIs as external; change the
signature to setUri(String uri, boolean external) and set
tempLink.setExternal(external) (preserving the existing tempLink reset when
uri==null and inLink=true when non-null). Update all call sites in LinkExtractor
that call setUri(uri) so internal wiki link usages call setUri(uri, false) and
true external-link usages call setUri(uri, true); keep the null-handling path
unchanged.

private val language = context.language

- private val replaceString = DisambiguationExtractorConfig.disambiguationTitlePartMap(language.wikiCode)
+ private val replaceString = DisambiguationExtractorConfig.disambiguationTitlePartMap.getOrElse(language.wikiCode, " (disambiguation)")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

English fallback may cause incorrect behavior for unsupported languages.

When a language is not in disambiguationTitlePartMap, this falls back to the English suffix " (disambiguation)". For non-English languages (e.g., Turkish, Vietnamese, etc.) that aren't in the map, this will silently use an incorrect suffix that won't match their actual disambiguation page titles.

Consider logging a warning when using the fallback, or throwing an error to make missing language configurations explicit:

Alternative: Log warning for missing configuration
- private val replaceString = DisambiguationExtractorConfig.disambiguationTitlePartMap.getOrElse(language.wikiCode, " (disambiguation)")
+ private val replaceString = DisambiguationExtractorConfig.disambiguationTitlePartMap.get(language.wikiCode) match {
+   case Some(suffix) => suffix
+   case None =>
+     // Consider logging: s"No disambiguation suffix configured for language ${language.wikiCode}, using English fallback"
+     " (disambiguation)"
+ }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@core/src/main/scala/org/dbpedia/extraction/mappings/DisambiguationExtractor.scala`
at line 25, The current initialization of replaceString uses
DisambiguationExtractorConfig.disambiguationTitlePartMap.getOrElse(language.wikiCode,
" (disambiguation)") which silently falls back to the English suffix for
unsupported languages; update the logic in DisambiguationExtractor (the
replaceString initialization) to first check whether disambiguationTitlePartMap
contains language.wikiCode and if not either (a) log a warning including
language.wikiCode and the fallback value, or (b) throw a clear exception to make
missing configurations explicit depending on project policy; ensure you
reference DisambiguationExtractorConfig.disambiguationTitlePartMap and
language.wikiCode so the change is applied where replaceString is defined.

Comment thread core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java Outdated
Comment thread core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java Outdated
anshuman9468 and others added 2 commits April 25, 2026 01:20
Co-authored-by: Ted Thibodeau Jr <tthibodeau@openlinksw.com>
Co-authored-by: Ted Thibodeau Jr <tthibodeau@openlinksw.com>

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java (1)

186-194: ⚠️ Potential issue | 🟠 Major

setUri still unconditionally marks every link as external.

The external = true hardcoding previously flagged has not been addressed in this revision. Internal wiki links routed through setUri at lines 102, 113, and 136 will still be emitted as external. Fix as previously suggested (pass an external flag through setUri).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java` around
lines 186 - 194, The setUri method currently hardcodes
tempLink.setExternal(true), so change its signature to accept an external flag
(e.g., setUri(String uri, boolean external)) or add an overload, and set
tempLink.setExternal(external) instead of true; keep tempLink.setUri(uri) and
inLink handling as-is and preserve the else branch that resets tempLink. Update
all callers of setUri (the three call sites that currently invoke setUri with a
single String) to pass the correct boolean (true for external URLs, false for
internal wiki links) so internal links are not marked external.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c8c2190f-17ef-443d-9891-5e9d05cb7a60

📥 Commits

Reviewing files that changed from the base of the PR and between 7879400 and d18b166.

📒 Files selected for processing (1)
  • core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java

@sonarqubecloud

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
core/src/main/scala/org/dbpedia/extraction/util/Language.scala (1)

133-155: Close the HttpResponse and consolidate User-Agent setup.

Two small hygiene items on this block:

  1. The HttpResponse returned by client.execute(request) is a CloseableHttpResponse and should be closed to release the underlying connection back to the pool. Currently only the entity stream is closed. This code runs once at startup so the impact is minimal, but it's idiomatic to close it explicitly (e.g. via a nested try-finally, or response.close() in a dedicated finally).

  2. The User-Agent header is configured in two places: lines 37-40 (custom UA, before the try) and lines 45-47 (default UA, inside the try). The behavior is correct but the split makes the intent harder to follow. Consolidate the UA string resolution and set it once before calling execute.

♻️ Sketch of a consolidated shape
-    // Apply User-Agent conditionally
-    if (customUserAgentEnabled) {
-      request.setHeader("User-Agent", customUserAgentText)
-    }
-
-    val wikiLanguageCodes =
-      try {
-        // Set a default User-Agent if none is provided
-        if (!customUserAgentEnabled) {
-          request.setHeader("User-Agent", "DBpedia-Extraction-Framework/1.0 (https://github.com/dbpedia/extraction-framework; dbpedia@infai.org)")
-        }
-
-        val response = client.execute(request)
-        val status = response.getStatusLine.getStatusCode
-        if (status >= 200 && status < 300) {
-          val stream = response.getEntity.getContent
-          try {
-            Source.fromInputStream(stream).getLines()
-              .map(_.trim)
-              .filter(line => line.nonEmpty && !line.contains(" ")) // Basic filter for language codes
-              .toList
-          } finally {
-            stream.close()
-          }
-        } else {
-          logger.log(Level.WARNING, "Language list fetch failed with status " + status + " from " + wikipediaLanguageUrl)
-          List.empty
-        }
-      } catch {
-        case e: Exception =>
-          logger.log(Level.WARNING, "Could not fetch language list from " + wikipediaLanguageUrl + ": " + e.getMessage)
-          List.empty // Fallback to empty list, addonlangs.json will still be used
-      } finally {
-        client.close()
-      }
+    val defaultUserAgent = "DBpedia-Extraction-Framework/1.0 (https://github.com/dbpedia/extraction-framework; dbpedia@infai.org)"
+    request.setHeader("User-Agent", if (customUserAgentEnabled) customUserAgentText else defaultUserAgent)
+
+    val wikiLanguageCodes =
+      try {
+        val response = client.execute(request)
+        try {
+          val status = response.getStatusLine.getStatusCode
+          if (status >= 200 && status < 300) {
+            val stream = response.getEntity.getContent
+            try {
+              Source.fromInputStream(stream).getLines()
+                .map(_.trim)
+                .filter(line => line.nonEmpty && !line.contains(" "))
+                .toList
+            } finally stream.close()
+          } else {
+            logger.log(Level.WARNING, "Language list fetch failed with status " + status + " from " + wikipediaLanguageUrl)
+            List.empty
+          }
+        } finally response.close()
+      } catch {
+        case e: Exception =>
+          logger.log(Level.WARNING, "Could not fetch language list from " + wikipediaLanguageUrl + ": " + e.getMessage)
+          List.empty // Fallback to empty list, addonlangs.json will still be used
+      } finally {
+        client.close()
+      }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/main/scala/org/dbpedia/extraction/util/Language.scala` around lines
133 - 155, The HttpResponse from client.execute(request) is a
CloseableHttpResponse and must be explicitly closed to return the connection to
the pool; wrap the execute call so you capture the CloseableHttpResponse (e.g. a
nested try/finally) and ensure response.close() always runs after closing the
entity stream. Also consolidate the User-Agent resolution (the logic currently
split before and inside the try) into a single UA string computed once and set
on the request before calling client.execute(request) so the header is only
determined in one place; update references to client.execute, request,
CloseableHttpResponse and wikipediaLanguageUrl accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 75f7ed91-a9ab-42ee-b765-a87b6db64956

📥 Commits

Reviewing files that changed from the base of the PR and between d18b166 and 63ab243.

📒 Files selected for processing (2)
  • core/src/main/resources/addonlangs.json
  • core/src/main/scala/org/dbpedia/extraction/util/Language.scala

Comment on lines +136 to +144
val stream = response.getEntity.getContent
try {
  Source.fromInputStream(stream).getLines()
    .map(_.trim)
    .filter(line => line.nonEmpty && !line.contains(" ")) // Basic filter for language codes
    .toList
} finally {
  stream.close()
}

⚠️ Potential issue | 🟡 Minor

Specify UTF-8 explicitly when reading the langlist response.

Source.fromInputStream(stream) uses the JVM's default charset, which varies by platform and can corrupt non-ASCII bytes. The Wikimedia langlist endpoint is UTF-8, and Codec is already imported at line 10 but currently unused. Making the charset explicit avoids environment-dependent parsing.

🛠️ Suggested fix
-          val stream = response.getEntity.getContent
-          try {
-            Source.fromInputStream(stream).getLines()
-              .map(_.trim)
-              .filter(line => line.nonEmpty && !line.contains(" ")) // Basic filter for language codes
-              .toList
-          } finally {
-            stream.close()
-          }
+          val stream = response.getEntity.getContent
+          try {
+            Source.fromInputStream(stream)(Codec.UTF8).getLines()
+              .map(_.trim)
+              .filter(line => line.nonEmpty && !line.contains(" ")) // Basic filter for language codes
+              .toList
+          } finally {
+            stream.close()
+          }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/main/scala/org/dbpedia/extraction/util/Language.scala` around lines
136 - 144, The code reads the langlist InputStream with
Source.fromInputStream(stream) which uses the platform default charset; change
it to explicitly use UTF-8 (use the existing Codec import, e.g.
Source.fromInputStream(stream)(Codec.UTF8) or an equivalent withCodec/codec
call) so the Wikimedia UTF-8 response is decoded correctly; keep the same
trimming/filtering logic and ensure the stream.close() in the finally block
remains unchanged.

@anshuman9468
Contributor Author

Hi @TallTed, could you please review this PR and merge it when you get a chance?
