Implement NIF-based citation extraction with precise character offsets#839
anshuman9468 wants to merge 6 commits into dbpedia:master
Conversation
📝 Walkthrough

Adds citation support to NIF extraction: the Link model gains citation fields; LinkExtractor marks citation anchors; HtmlNifExtractor collects citation target URLs and emits citation and taIdentRef triples; removes the nifextractionconfig.json resource.
Sequence Diagram

```mermaid
sequenceDiagram
    participant HTML as HTML Parser
    participant LE as LinkExtractor
    participant EM as ElementMap
    participant HNE as HtmlNifExtractor
    participant TG as TripleGenerator
    HTML->>LE: encounter anchor (<a href="#cite_note-X">)
    LE->>LE: set link.citation = true<br/>set citationId = "cite_note-X"
    LE->>EM: store Link (citation=true)
    HTML->>HNE: scan list item id=cite_note-X with external link
    HNE->>HNE: extract external URL<br/>citationMap["cite_note-X"] = URL
    EM->>HNE: provide Link for anchor
    HNE->>HNE: detect Link.citation == true
    HNE->>TG: emit Citation-type triple
    HNE->>TG: emit taIdentRef triple with URL from citationMap
```
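The flow above can be illustrated with a hedged sketch of the kind of NIF output the new code paths would produce. The subject URI, offsets, and cited URL below are invented for illustration and are not taken from the PR; only the `dbo:Citation` type and `itsrdf:taIdentRef` predicate come from the review discussion.

```turtle
# Hypothetical output sketch — subject URI, offsets, and target URL invented.
@prefix nif:    <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix dbo:    <http://dbpedia.org/ontology/> .

<http://dbpedia.org/nif/Example?char=120,123>
    a dbo:Citation ;
    nif:beginIndex "120" ;
    nif:endIndex   "123" ;
    itsrdf:taIdentRef <http://example.org/cited-source> .
```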
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 2
🧹 Nitpick comments (2)
core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala (1)
289-300: Consider extracting hardcoded URIs to constants.

The ontology and ITS RDF namespace URIs are hardcoded. For consistency with the rest of the codebase (which uses RdfNamespace), consider using constants:

Extract to constants

```scala
// Could define at class or package level:
private val CitationType = "http://dbpedia.org/ontology/Citation"
private val ItsRdfTaIdentRef = "http://www.w3.org/2005/11/its/rdf#taIdentRef"
// Or use RdfNamespace if appropriate entries exist/can be added
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala` around lines 289-300, replace the hardcoded ontology/ITS URIs in the HtmlNifExtractor branch (the strings passed to nifLinks when link.isCitation and in the UriUtils Success case) with shared constants: add private vals (e.g., CitationType = "http://dbpedia.org/ontology/Citation" and ItsRdfTaIdentRef = "http://www.w3.org/2005/11/its/rdf#taIdentRef") at the class or package level and reference those constants in the calls to nifLinks (the occurrences around link.isCitation, nifLinks(..., RdfNamespace.RDF.append("type"), CitationType, ...) and nifLinks(..., ItsRdfTaIdentRef, ...)); alternatively, if appropriate entries exist in RdfNamespace, use RdfNamespace for the ITS URI instead of a raw string.

core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java (1)
197-211: Confusing substring logic works by accident.

The code checks for `?title=` in the URI, but standard wiki links (e.g., `/wiki/Philosopher`) don't contain this parameter. The extraction works by coincidence:

- `indexOf("?title=")` returns `-1` when not found
- `-1 + 7 = 6`, which equals `"/wiki/".length()`

This makes the code fragile and confusing. Consider rewriting to explicitly extract the path segment:

Clearer implementation
```diff
 private String cleanLink(String uri, boolean external) {
   if (!external) {
     String linkPrefix = "/wiki/";
     String linkPrefix2 = "./";
     if (uri.contains(linkPrefix)) {
-      uri = uri.substring(uri.indexOf("?title=") + 7);
+      uri = uri.substring(uri.indexOf(linkPrefix) + linkPrefix.length());
     } else if (uri.contains(linkPrefix2)) {
-      uri = uri.substring(uri.indexOf("?title=") + 3);
+      uri = uri.substring(uri.indexOf(linkPrefix2) + linkPrefix2.length());
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java` around lines 197 - 211, The substring logic in LinkExtractor that uses indexOf("?title=")+7 is fragile; update the block that handles non-external URIs to explicitly extract the article name: if uri contains "/wiki/" take the substring after the literal "/wiki/" (e.g., uri.substring(uri.indexOf("/wiki/") + "/wiki/".length())), else if uri contains "./" or starts with "./" take the substring after "./" (uri.substring(uri.indexOf("./") + 2)), and then build the full DBpedia resource URL using this cleaned article name and this.context.language (preserving the existing conditional that uses "http://{lang}.dbpedia.org/resource/" vs "http://dbpedia.org/resource/"); ensure you do not rely on "?title=" or negative indexes.
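The explicit extraction the prompt describes can be sketched as a small self-contained program. This is a hypothetical simplification: the method name, the handling of only two prefixes, and the pass-through fallback are assumptions here, and the real LinkExtractor has additional branches (language-specific resource URLs, external links) not shown.

```java
// Hypothetical, simplified sketch of the explicit prefix handling the review
// prompt describes; not the actual LinkExtractor implementation.
public class CleanLinkSketch {

    static String cleanInternalLink(String uri) {
        String wikiPrefix = "/wiki/";
        String relativePrefix = "./";
        if (uri.contains(wikiPrefix)) {
            // Take the substring after the literal "/wiki/" instead of relying
            // on indexOf("?title=") returning -1 by coincidence (-1 + 7 == 6).
            return uri.substring(uri.indexOf(wikiPrefix) + wikiPrefix.length());
        } else if (uri.contains(relativePrefix)) {
            return uri.substring(uri.indexOf(relativePrefix) + relativePrefix.length());
        }
        return uri; // no known prefix: leave unchanged
    }

    public static void main(String[] args) {
        System.out.println(cleanInternalLink("/wiki/Philosopher")); // Philosopher
        System.out.println(cleanInternalLink("./Philosopher"));     // Philosopher
    }
}
```

Either branch now names the prefix it strips, so the intent survives a change to the URL scheme, unlike the `?title=` arithmetic.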
📒 Files selected for processing (6)
- .github/workflows/snapshot_deploy.yml
- core/src/main/java/org/dbpedia/extraction/nif/Link.java
- core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java
- core/src/main/resources/nifextractionconfig.json
- core/src/main/scala/org/dbpedia/extraction/mappings/DisambiguationExtractor.scala
- core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala
💤 Files with no reviewable changes (1)
- core/src/main/resources/nifextractionconfig.json
```java
private void setUri(String uri) {
    if (uri != null) {
        tempLink.setUri(uri);
        tempLink.setExternal(true);
        inLink = true;
    } else {
        tempLink = new Link();
    }
}
```
Bug: All links are incorrectly marked as external.
The setUri method unconditionally sets external = true for all non-null URIs. This marks internal wiki links as external when they shouldn't be.
The method is called for both internal wiki links (e.g., lines 102, 113) and actual external links (e.g., lines 106, 146), but the distinction is lost.
Proposed fix: Pass external flag to setUri
```diff
- private void setUri(String uri) {
+ private void setUri(String uri, boolean external) {
    if (uri != null) {
      tempLink.setUri(uri);
-     tempLink.setExternal(true);
+     tempLink.setExternal(external);
      inLink = true;
    } else {
      tempLink = new Link();
    }
  }
```

Then update call sites:

- Lines 102, 113, 136: `setUri(uri, false);`
- Lines 106, 146: `setUri(uri, true);`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java` around
lines 186 - 194, The setUri method currently hardcodes
tempLink.setExternal(true) which marks all non-null URIs as external; change the
signature to setUri(String uri, boolean external) and set
tempLink.setExternal(external) (preserving the existing tempLink reset when
uri==null and inLink=true when non-null). Update all call sites in LinkExtractor
that call setUri(uri) so internal wiki link usages call setUri(uri, false) and
true external-link usages call setUri(uri, true); keep the null-handling path
unchanged.
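The proposed fix can be sketched end-to-end in a self-contained program. The minimal `Link` class below is an assumption made only so the sketch compiles on its own; its field and accessor names mirror the PR discussion but are not taken from the actual source file.

```java
// Hypothetical minimal Link model plus the flag-passing setUri the review
// proposes; field names are assumptions, not the real Link.java.
public class SetUriSketch {

    static class Link {
        private String uri;
        private boolean external;
        void setUri(String uri) { this.uri = uri; }
        void setExternal(boolean external) { this.external = external; }
        String getUri() { return uri; }
        boolean isExternal() { return external; }
    }

    Link tempLink = new Link();
    boolean inLink = false;

    // external is supplied by the caller instead of being hardcoded to true.
    void setUri(String uri, boolean external) {
        if (uri != null) {
            tempLink.setUri(uri);
            tempLink.setExternal(external);
            inLink = true;
        } else {
            tempLink = new Link(); // reset, as in the original null branch
        }
    }

    public static void main(String[] args) {
        SetUriSketch extractor = new SetUriSketch();
        extractor.setUri("/wiki/Philosopher", false); // internal wiki link
        System.out.println(extractor.tempLink.isExternal()); // false
    }
}
```

The null-handling path is unchanged; only the hardcoded `true` moves out to the call sites, which know whether the anchor was internal or external.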
```diff
 private val language = context.language

-private val replaceString = DisambiguationExtractorConfig.disambiguationTitlePartMap(language.wikiCode)
+private val replaceString = DisambiguationExtractorConfig.disambiguationTitlePartMap.getOrElse(language.wikiCode, " (disambiguation)")
```
English fallback may cause incorrect behavior for unsupported languages.
When a language is not in disambiguationTitlePartMap, this falls back to the English suffix " (disambiguation)". For non-English languages (e.g., Turkish, Vietnamese, etc.) that aren't in the map, this will silently use an incorrect suffix that won't match their actual disambiguation page titles.
Consider logging a warning when using the fallback, or throwing an error to make missing language configurations explicit:
Alternative: Log warning for missing configuration

```diff
-private val replaceString = DisambiguationExtractorConfig.disambiguationTitlePartMap.getOrElse(language.wikiCode, " (disambiguation)")
+private val replaceString = DisambiguationExtractorConfig.disambiguationTitlePartMap.get(language.wikiCode) match {
+  case Some(suffix) => suffix
+  case None =>
+    // Consider logging: s"No disambiguation suffix configured for language ${language.wikiCode}, using English fallback"
+    " (disambiguation)"
+}
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```scala
private val replaceString = DisambiguationExtractorConfig.disambiguationTitlePartMap.get(language.wikiCode) match {
  case Some(suffix) => suffix
  case None =>
    // Consider logging: s"No disambiguation suffix configured for language ${language.wikiCode}, using English fallback"
    " (disambiguation)"
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@core/src/main/scala/org/dbpedia/extraction/mappings/DisambiguationExtractor.scala`
at line 25, The current initialization of replaceString uses
DisambiguationExtractorConfig.disambiguationTitlePartMap.getOrElse(language.wikiCode,
" (disambiguation)") which silently falls back to the English suffix for
unsupported languages; update the logic in DisambiguationExtractor (the
replaceString initialization) to first check whether disambiguationTitlePartMap
contains language.wikiCode and if not either (a) log a warning including
language.wikiCode and the fallback value, or (b) throw a clear exception to make
missing configurations explicit depending on project policy; ensure you
reference DisambiguationExtractorConfig.disambiguationTitlePartMap and
language.wikiCode so the change is applied where replaceString is defined.
Co-authored-by: Ted Thibodeau Jr <tthibodeau@openlinksw.com>
♻️ Duplicate comments (1)
core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java (1)
186-194: ⚠️ Potential issue | 🟠 Major

`setUri` still unconditionally marks every link as external.

The `external = true` hardcoding previously flagged has not been addressed in this revision. Internal wiki links routed through `setUri` at lines 102, 113, and 136 will still be emitted as external. Fix as previously suggested (pass an `external` flag through `setUri`).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java` around lines 186 - 194, The setUri method currently hardcodes tempLink.setExternal(true), so change its signature to accept an external flag (e.g., setUri(String uri, boolean external)) or add an overload, and set tempLink.setExternal(external) instead of true; keep tempLink.setUri(uri) and inLink handling as-is and preserve the else branch that resets tempLink. Update all callers of setUri (the three call sites that currently invoke setUri with a single String) to pass the correct boolean (true for external URLs, false for internal wiki links) so internal links are not marked external.
📒 Files selected for processing (1)
core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java
Actionable comments posted: 1
🧹 Nitpick comments (1)
core/src/main/scala/org/dbpedia/extraction/util/Language.scala (1)
133-155: Close the `HttpResponse` and consolidate User-Agent setup.

Two small hygiene items on this block:

1. The `HttpResponse` returned by `client.execute(request)` is a `CloseableHttpResponse` and should be closed to release the underlying connection back to the pool. Currently only the entity stream is closed. This code runs once at startup so the impact is minimal, but it's idiomatic to close it explicitly (e.g. via a nested `try`-`finally`, or `response.close()` in a dedicated `finally`).

2. The User-Agent header is configured in two places: lines 37-40 (custom UA, before the `try`) and lines 45-47 (default UA, inside the `try`). The behavior is correct but the split makes the intent harder to follow. Consolidate the UA string resolution and set it once before calling `execute`.

♻️ Sketch of a consolidated shape
```diff
-// Apply User-Agent conditionally
-if (customUserAgentEnabled) {
-  request.setHeader("User-Agent", customUserAgentText)
-}
-
-val wikiLanguageCodes =
-  try {
-    // Set a default User-Agent if none is provided
-    if (!customUserAgentEnabled) {
-      request.setHeader("User-Agent", "DBpedia-Extraction-Framework/1.0 (https://github.com/dbpedia/extraction-framework; dbpedia@infai.org)")
-    }
-
-    val response = client.execute(request)
-    val status = response.getStatusLine.getStatusCode
-    if (status >= 200 && status < 300) {
-      val stream = response.getEntity.getContent
-      try {
-        Source.fromInputStream(stream).getLines()
-          .map(_.trim)
-          .filter(line => line.nonEmpty && !line.contains(" ")) // Basic filter for language codes
-          .toList
-      } finally {
-        stream.close()
-      }
-    } else {
-      logger.log(Level.WARNING, "Language list fetch failed with status " + status + " from " + wikipediaLanguageUrl)
-      List.empty
-    }
-  } catch {
-    case e: Exception =>
-      logger.log(Level.WARNING, "Could not fetch language list from " + wikipediaLanguageUrl + ": " + e.getMessage)
-      List.empty // Fallback to empty list, addonlangs.json will still be used
-  } finally {
-    client.close()
-  }
+val defaultUserAgent = "DBpedia-Extraction-Framework/1.0 (https://github.com/dbpedia/extraction-framework; dbpedia@infai.org)"
+request.setHeader("User-Agent", if (customUserAgentEnabled) customUserAgentText else defaultUserAgent)
+
+val wikiLanguageCodes =
+  try {
+    val response = client.execute(request)
+    try {
+      val status = response.getStatusLine.getStatusCode
+      if (status >= 200 && status < 300) {
+        val stream = response.getEntity.getContent
+        try {
+          Source.fromInputStream(stream).getLines()
+            .map(_.trim)
+            .filter(line => line.nonEmpty && !line.contains(" "))
+            .toList
+        } finally stream.close()
+      } else {
+        logger.log(Level.WARNING, "Language list fetch failed with status " + status + " from " + wikipediaLanguageUrl)
+        List.empty
+      }
+    } finally response.close()
+  } catch {
+    case e: Exception =>
+      logger.log(Level.WARNING, "Could not fetch language list from " + wikipediaLanguageUrl + ": " + e.getMessage)
+      List.empty // Fallback to empty list, addonlangs.json will still be used
+  } finally {
+    client.close()
+  }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/src/main/scala/org/dbpedia/extraction/util/Language.scala` around lines 133 - 155, The HttpResponse from client.execute(request) is a CloseableHttpResponse and must be explicitly closed to return the connection to the pool; wrap the execute call so you capture the CloseableHttpResponse (e.g. a nested try/finally) and ensure response.close() always runs after closing the entity stream. Also consolidate the User-Agent resolution (the logic currently split before and inside the try) into a single UA string computed once and set on the request before calling client.execute(request) so the header is only determined in one place; update references to client.execute, request, CloseableHttpResponse and wikipediaLanguageUrl accordingly.
📒 Files selected for processing (2)
- core/src/main/resources/addonlangs.json
- core/src/main/scala/org/dbpedia/extraction/util/Language.scala
```scala
val stream = response.getEntity.getContent
try {
  Source.fromInputStream(stream).getLines()
    .map(_.trim)
    .filter(line => line.nonEmpty && !line.contains(" ")) // Basic filter for language codes
    .toList
} finally {
  stream.close()
}
```
Specify UTF-8 explicitly when reading the langlist response.
Source.fromInputStream(stream) uses the JVM's default charset, which varies by platform and can corrupt non-ASCII bytes. The Wikimedia langlist endpoint is UTF-8, and Codec is already imported at line 10 but currently unused. Making the charset explicit avoids environment-dependent parsing.
🛠️ Suggested fix
```diff
 val stream = response.getEntity.getContent
 try {
-  Source.fromInputStream(stream).getLines()
+  Source.fromInputStream(stream)(Codec.UTF8).getLines()
     .map(_.trim)
     .filter(line => line.nonEmpty && !line.contains(" ")) // Basic filter for language codes
     .toList
 } finally {
   stream.close()
 }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```scala
val stream = response.getEntity.getContent
try {
  Source.fromInputStream(stream)(Codec.UTF8).getLines()
    .map(_.trim)
    .filter(line => line.nonEmpty && !line.contains(" ")) // Basic filter for language codes
    .toList
} finally {
  stream.close()
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@core/src/main/scala/org/dbpedia/extraction/util/Language.scala` around lines
136 - 144, The code reads the langlist InputStream with
Source.fromInputStream(stream) which uses the platform default charset; change
it to explicitly use UTF-8 (use the existing Codec import, e.g.
Source.fromInputStream(stream)(Codec.UTF8) or an equivalent withCodec/codec
call) so the Wikimedia UTF-8 response is decoded correctly; keep the same
trimming/filtering logic and ensure the stream.close() in the finally block
remains unchanged.
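The same pitfall exists on the Java side of this codebase, so the charset-explicit pattern is worth sketching there too. This is an illustrative analogue of `Source.fromInputStream(stream)(Codec.UTF8)`, not code from the PR; the method name and the sample input are invented.

```java
// Sketch: reading a language-code list with an explicit UTF-8 charset,
// the Java analogue of Source.fromInputStream(stream)(Codec.UTF8).
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;
import java.util.stream.Collectors;

public class Utf8ReadSketch {

    static List<String> readLanguageCodes(InputStream stream) throws IOException {
        // InputStreamReader with StandardCharsets.UTF_8 avoids the
        // platform-default charset that plain readers would use.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(stream, StandardCharsets.UTF_8))) {
            return reader.lines()
                .map(String::trim)
                .filter(line -> !line.isEmpty() && !line.contains(" ")) // basic filter
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = "en\nde\nnot a code\n\nfr\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readLanguageCodes(new ByteArrayInputStream(bytes)));
        // [en, de, fr]
    }
}
```

Pinning the charset at the decoder boundary makes parsing deterministic across JVMs regardless of `file.encoding`.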
Hi @TallTed, could you please review this PR and merge it when you get a chance?



This PR implements NIF-based citation extraction to provide precise character offsets and direct links to external sources via `itsrdf:taIdentRef`, replacing the previous line-number-based citations.