Implement NIF-based citation extraction with precise character offsets#839
anshuman9468 wants to merge 6 commits into dbpedia:master
Conversation
📝 Walkthrough

Adds citation support to NIF extraction: the Link model gains citation fields; LinkExtractor marks citation anchors; HtmlNifExtractor collects citation target URLs and emits citation and taIdentRef triples; removes the nifextractionconfig.json resource.
Sequence Diagram

```mermaid
sequenceDiagram
    participant HTML as HTML Parser
    participant LE as LinkExtractor
    participant EM as ElementMap
    participant HNE as HtmlNifExtractor
    participant TG as TripleGenerator
    HTML->>LE: encounter anchor (<a href="#cite_note-X">)
    LE->>LE: set link.citation = true<br/>set citationId = "cite_note-X"
    LE->>EM: store Link (citation=true)
    HTML->>HNE: scan list item id=cite_note-X with external link
    HNE->>HNE: extract external URL<br/>citationMap["cite_note-X"] = URL
    EM->>HNE: provide Link for anchor
    HNE->>HNE: detect Link.citation == true
    HNE->>TG: emit Citation-type triple
    HNE->>TG: emit taIdentRef triple with URL from citationMap
```
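The flow above can be illustrated with a hedged sketch of the kind of NIF output the new code paths would produce. The subject URI, offsets, and cited URL below are invented for illustration and are not taken from the PR; only the `dbo:Citation` type and `itsrdf:taIdentRef` predicate come from the review discussion.

```turtle
# Hypothetical output sketch — subject URI, offsets, and target URL invented.
@prefix nif:    <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix dbo:    <http://dbpedia.org/ontology/> .

<http://dbpedia.org/nif/Example?char=120,123>
    a dbo:Citation ;
    nif:beginIndex "120" ;
    nif:endIndex   "123" ;
    itsrdf:taIdentRef <http://example.org/cited-source> .
```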
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 2
🧹 Nitpick comments (2)
core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala (1)
289-300: Consider extracting hardcoded URIs to constants.

The ontology and ITS RDF namespace URIs are hardcoded. For consistency with the rest of the codebase (which uses RdfNamespace), consider using constants:

Extract to constants

```scala
// Could define at class or package level:
private val CitationType = "http://dbpedia.org/ontology/Citation"
private val ItsRdfTaIdentRef = "http://www.w3.org/2005/11/its/rdf#taIdentRef"
// Or use RdfNamespace if appropriate entries exist/can be added
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala` around lines 289-300, replace the hardcoded ontology/ITS URIs in the HtmlNifExtractor branch (the strings passed to nifLinks when link.isCitation and in the UriUtils Success case) with shared constants: add private vals (e.g., CitationType = "http://dbpedia.org/ontology/Citation" and ItsRdfTaIdentRef = "http://www.w3.org/2005/11/its/rdf#taIdentRef") at the class or package level and reference those constants in the calls to nifLinks (the occurrences around link.isCitation, nifLinks(..., RdfNamespace.RDF.append("type"), CitationType, ...) and nifLinks(..., ItsRdfTaIdentRef, ...)); alternatively, if appropriate entries exist in RdfNamespace, use RdfNamespace for the ITS URI instead of a raw string.

core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java (1)
197-211: Confusing substring logic works by accident.

The code checks for `?title=` in the URI, but standard wiki links (e.g., `/wiki/Philosopher`) don't contain this parameter. The extraction works by coincidence:

- `indexOf("?title=")` returns `-1` when not found
- `-1 + 7 = 6`, which equals `"/wiki/".length()`

This makes the code fragile and confusing. Consider rewriting to explicitly extract the path segment:

Clearer implementation
```diff
 private String cleanLink(String uri, boolean external) {
   if (!external) {
     String linkPrefix = "/wiki/";
     String linkPrefix2 = "./";
     if (uri.contains(linkPrefix)) {
-      uri = uri.substring(uri.indexOf("?title=") + 7);
+      uri = uri.substring(uri.indexOf(linkPrefix) + linkPrefix.length());
     } else if (uri.contains(linkPrefix2)) {
-      uri = uri.substring(uri.indexOf("?title=") + 3);
+      uri = uri.substring(uri.indexOf(linkPrefix2) + linkPrefix2.length());
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java` around lines 197 - 211, The substring logic in LinkExtractor that uses indexOf("?title=")+7 is fragile; update the block that handles non-external URIs to explicitly extract the article name: if uri contains "/wiki/" take the substring after the literal "/wiki/" (e.g., uri.substring(uri.indexOf("/wiki/") + "/wiki/".length())), else if uri contains "./" or starts with "./" take the substring after "./" (uri.substring(uri.indexOf("./") + 2)), and then build the full DBpedia resource URL using this cleaned article name and this.context.language (preserving the existing conditional that uses "http://{lang}.dbpedia.org/resource/" vs "http://dbpedia.org/resource/"); ensure you do not rely on "?title=" or negative indexes.
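The explicit extraction the prompt describes can be sketched as a small self-contained program. This is a hypothetical simplification: the method name, the handling of only two prefixes, and the pass-through fallback are assumptions here, and the real LinkExtractor has additional branches (language-specific resource URLs, external links) not shown.

```java
// Hypothetical, simplified sketch of the explicit prefix handling the review
// prompt describes; not the actual LinkExtractor implementation.
public class CleanLinkSketch {

    static String cleanInternalLink(String uri) {
        String wikiPrefix = "/wiki/";
        String relativePrefix = "./";
        if (uri.contains(wikiPrefix)) {
            // Take the substring after the literal "/wiki/" instead of relying
            // on indexOf("?title=") returning -1 by coincidence (-1 + 7 == 6).
            return uri.substring(uri.indexOf(wikiPrefix) + wikiPrefix.length());
        } else if (uri.contains(relativePrefix)) {
            return uri.substring(uri.indexOf(relativePrefix) + relativePrefix.length());
        }
        return uri; // no known prefix: leave unchanged
    }

    public static void main(String[] args) {
        System.out.println(cleanInternalLink("/wiki/Philosopher")); // Philosopher
        System.out.println(cleanInternalLink("./Philosopher"));     // Philosopher
    }
}
```

Either branch now names the prefix it strips, so the intent survives a change to the URL scheme, unlike the `?title=` arithmetic.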
📒 Files selected for processing (6)
- .github/workflows/snapshot_deploy.yml
- core/src/main/java/org/dbpedia/extraction/nif/Link.java
- core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java
- core/src/main/resources/nifextractionconfig.json
- core/src/main/scala/org/dbpedia/extraction/mappings/DisambiguationExtractor.scala
- core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala
💤 Files with no reviewable changes (1)
- core/src/main/resources/nifextractionconfig.json
```java
private void setUri(String uri) {
    if (uri != null) {
        tempLink.setUri(uri);
        tempLink.setExternal(true);
        inLink = true;
    } else {
        tempLink = new Link();
    }
}
```
Bug: All links are incorrectly marked as external.
The setUri method unconditionally sets external = true for all non-null URIs. This marks internal wiki links as external when they shouldn't be.
The method is called for both internal wiki links (e.g., lines 102, 113) and actual external links (e.g., lines 106, 146), but the distinction is lost.
Proposed fix: Pass external flag to setUri
```diff
- private void setUri(String uri) {
+ private void setUri(String uri, boolean external) {
    if (uri != null) {
      tempLink.setUri(uri);
-     tempLink.setExternal(true);
+     tempLink.setExternal(external);
      inLink = true;
    } else {
      tempLink = new Link();
    }
  }
```

Then update call sites:

- Lines 102, 113, 136: `setUri(uri, false);`
- Lines 106, 146: `setUri(uri, true);`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java` around
lines 186 - 194, The setUri method currently hardcodes
tempLink.setExternal(true) which marks all non-null URIs as external; change the
signature to setUri(String uri, boolean external) and set
tempLink.setExternal(external) (preserving the existing tempLink reset when
uri==null and inLink=true when non-null). Update all call sites in LinkExtractor
that call setUri(uri) so internal wiki link usages call setUri(uri, false) and
true external-link usages call setUri(uri, true); keep the null-handling path
unchanged.
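The proposed fix can be sketched end-to-end in a self-contained program. The minimal `Link` class below is an assumption made only so the sketch compiles on its own; its field and accessor names mirror the PR discussion but are not taken from the actual source file.

```java
// Hypothetical minimal Link model plus the flag-passing setUri the review
// proposes; field names are assumptions, not the real Link.java.
public class SetUriSketch {

    static class Link {
        private String uri;
        private boolean external;
        void setUri(String uri) { this.uri = uri; }
        void setExternal(boolean external) { this.external = external; }
        String getUri() { return uri; }
        boolean isExternal() { return external; }
    }

    Link tempLink = new Link();
    boolean inLink = false;

    // external is supplied by the caller instead of being hardcoded to true.
    void setUri(String uri, boolean external) {
        if (uri != null) {
            tempLink.setUri(uri);
            tempLink.setExternal(external);
            inLink = true;
        } else {
            tempLink = new Link(); // reset, as in the original null branch
        }
    }

    public static void main(String[] args) {
        SetUriSketch extractor = new SetUriSketch();
        extractor.setUri("/wiki/Philosopher", false); // internal wiki link
        System.out.println(extractor.tempLink.isExternal()); // false
    }
}
```

The null-handling path is unchanged; only the hardcoded `true` moves out to the call sites, which know whether the anchor was internal or external.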
```diff
 private val language = context.language

-private val replaceString = DisambiguationExtractorConfig.disambiguationTitlePartMap(language.wikiCode)
+private val replaceString = DisambiguationExtractorConfig.disambiguationTitlePartMap.getOrElse(language.wikiCode, " (disambiguation)")
```
English fallback may cause incorrect behavior for unsupported languages.
When a language is not in disambiguationTitlePartMap, this falls back to the English suffix " (disambiguation)". For non-English languages (e.g., Turkish, Vietnamese, etc.) that aren't in the map, this will silently use an incorrect suffix that won't match their actual disambiguation page titles.
Consider logging a warning when using the fallback, or throwing an error to make missing language configurations explicit:
Alternative: Log warning for missing configuration

```diff
-private val replaceString = DisambiguationExtractorConfig.disambiguationTitlePartMap.getOrElse(language.wikiCode, " (disambiguation)")
+private val replaceString = DisambiguationExtractorConfig.disambiguationTitlePartMap.get(language.wikiCode) match {
+  case Some(suffix) => suffix
+  case None =>
+    // Consider logging: s"No disambiguation suffix configured for language ${language.wikiCode}, using English fallback"
+    " (disambiguation)"
+}
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```scala
private val replaceString = DisambiguationExtractorConfig.disambiguationTitlePartMap.get(language.wikiCode) match {
  case Some(suffix) => suffix
  case None =>
    // Consider logging: s"No disambiguation suffix configured for language ${language.wikiCode}, using English fallback"
    " (disambiguation)"
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@core/src/main/scala/org/dbpedia/extraction/mappings/DisambiguationExtractor.scala`
at line 25, The current initialization of replaceString uses
DisambiguationExtractorConfig.disambiguationTitlePartMap.getOrElse(language.wikiCode,
" (disambiguation)") which silently falls back to the English suffix for
unsupported languages; update the logic in DisambiguationExtractor (the
replaceString initialization) to first check whether disambiguationTitlePartMap
contains language.wikiCode and if not either (a) log a warning including
language.wikiCode and the fallback value, or (b) throw a clear exception to make
missing configurations explicit depending on project policy; ensure you
reference DisambiguationExtractorConfig.disambiguationTitlePartMap and
language.wikiCode so the change is applied where replaceString is defined.
Co-authored-by: Ted Thibodeau Jr <tthibodeau@openlinksw.com>
♻️ Duplicate comments (1)
core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java (1)
186-194: ⚠️ Potential issue | 🟠 Major

`setUri` still unconditionally marks every link as external.

The `external = true` hardcoding previously flagged has not been addressed in this revision. Internal wiki links routed through `setUri` at lines 102, 113, and 136 will still be emitted as external. Fix as previously suggested (pass an `external` flag through `setUri`).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java` around lines 186 - 194, The setUri method currently hardcodes tempLink.setExternal(true), so change its signature to accept an external flag (e.g., setUri(String uri, boolean external)) or add an overload, and set tempLink.setExternal(external) instead of true; keep tempLink.setUri(uri) and inLink handling as-is and preserve the else branch that resets tempLink. Update all callers of setUri (the three call sites that currently invoke setUri with a single String) to pass the correct boolean (true for external URLs, false for internal wiki links) so internal links are not marked external.
📒 Files selected for processing (1)
core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java
Actionable comments posted: 1
🧹 Nitpick comments (1)
core/src/main/scala/org/dbpedia/extraction/util/Language.scala (1)
133-155: Close the `HttpResponse` and consolidate User-Agent setup.

Two small hygiene items on this block:

1. The `HttpResponse` returned by `client.execute(request)` is a `CloseableHttpResponse` and should be closed to release the underlying connection back to the pool. Currently only the entity stream is closed. This code runs once at startup so the impact is minimal, but it's idiomatic to close it explicitly (e.g. via a nested `try`-`finally`, or `response.close()` in a dedicated `finally`).

2. The User-Agent header is configured in two places: lines 37-40 (custom UA, before the `try`) and lines 45-47 (default UA, inside the `try`). The behavior is correct but the split makes the intent harder to follow. Consolidate the UA string resolution and set it once before calling `execute`.

♻️ Sketch of a consolidated shape
```diff
-// Apply User-Agent conditionally
-if (customUserAgentEnabled) {
-  request.setHeader("User-Agent", customUserAgentText)
-}
-
-val wikiLanguageCodes =
-  try {
-    // Set a default User-Agent if none is provided
-    if (!customUserAgentEnabled) {
-      request.setHeader("User-Agent", "DBpedia-Extraction-Framework/1.0 (https://github.com/dbpedia/extraction-framework; dbpedia@infai.org)")
-    }
-
-    val response = client.execute(request)
-    val status = response.getStatusLine.getStatusCode
-    if (status >= 200 && status < 300) {
-      val stream = response.getEntity.getContent
-      try {
-        Source.fromInputStream(stream).getLines()
-          .map(_.trim)
-          .filter(line => line.nonEmpty && !line.contains(" ")) // Basic filter for language codes
-          .toList
-      } finally {
-        stream.close()
-      }
-    } else {
-      logger.log(Level.WARNING, "Language list fetch failed with status " + status + " from " + wikipediaLanguageUrl)
-      List.empty
-    }
-  } catch {
-    case e: Exception =>
-      logger.log(Level.WARNING, "Could not fetch language list from " + wikipediaLanguageUrl + ": " + e.getMessage)
-      List.empty // Fallback to empty list, addonlangs.json will still be used
-  } finally {
-    client.close()
-  }
+val defaultUserAgent = "DBpedia-Extraction-Framework/1.0 (https://github.com/dbpedia/extraction-framework; dbpedia@infai.org)"
+request.setHeader("User-Agent", if (customUserAgentEnabled) customUserAgentText else defaultUserAgent)
+
+val wikiLanguageCodes =
+  try {
+    val response = client.execute(request)
+    try {
+      val status = response.getStatusLine.getStatusCode
+      if (status >= 200 && status < 300) {
+        val stream = response.getEntity.getContent
+        try {
+          Source.fromInputStream(stream).getLines()
+            .map(_.trim)
+            .filter(line => line.nonEmpty && !line.contains(" "))
+            .toList
+        } finally stream.close()
+      } else {
+        logger.log(Level.WARNING, "Language list fetch failed with status " + status + " from " + wikipediaLanguageUrl)
+        List.empty
+      }
+    } finally response.close()
+  } catch {
+    case e: Exception =>
+      logger.log(Level.WARNING, "Could not fetch language list from " + wikipediaLanguageUrl + ": " + e.getMessage)
+      List.empty // Fallback to empty list, addonlangs.json will still be used
+  } finally {
+    client.close()
+  }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/src/main/scala/org/dbpedia/extraction/util/Language.scala` around lines 133 - 155, The HttpResponse from client.execute(request) is a CloseableHttpResponse and must be explicitly closed to return the connection to the pool; wrap the execute call so you capture the CloseableHttpResponse (e.g. a nested try/finally) and ensure response.close() always runs after closing the entity stream. Also consolidate the User-Agent resolution (the logic currently split before and inside the try) into a single UA string computed once and set on the request before calling client.execute(request) so the header is only determined in one place; update references to client.execute, request, CloseableHttpResponse and wikipediaLanguageUrl accordingly.
📒 Files selected for processing (2)
- core/src/main/resources/addonlangs.json
- core/src/main/scala/org/dbpedia/extraction/util/Language.scala
```scala
val stream = response.getEntity.getContent
try {
  Source.fromInputStream(stream).getLines()
    .map(_.trim)
    .filter(line => line.nonEmpty && !line.contains(" ")) // Basic filter for language codes
    .toList
} finally {
  stream.close()
}
```
Specify UTF-8 explicitly when reading the langlist response.
Source.fromInputStream(stream) uses the JVM's default charset, which varies by platform and can corrupt non-ASCII bytes. The Wikimedia langlist endpoint is UTF-8, and Codec is already imported at line 10 but currently unused. Making the charset explicit avoids environment-dependent parsing.
🛠️ Suggested fix
```diff
 val stream = response.getEntity.getContent
 try {
-  Source.fromInputStream(stream).getLines()
+  Source.fromInputStream(stream)(Codec.UTF8).getLines()
     .map(_.trim)
     .filter(line => line.nonEmpty && !line.contains(" ")) // Basic filter for language codes
     .toList
 } finally {
   stream.close()
 }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```scala
val stream = response.getEntity.getContent
try {
  Source.fromInputStream(stream)(Codec.UTF8).getLines()
    .map(_.trim)
    .filter(line => line.nonEmpty && !line.contains(" ")) // Basic filter for language codes
    .toList
} finally {
  stream.close()
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@core/src/main/scala/org/dbpedia/extraction/util/Language.scala` around lines
136 - 144, The code reads the langlist InputStream with
Source.fromInputStream(stream) which uses the platform default charset; change
it to explicitly use UTF-8 (use the existing Codec import, e.g.
Source.fromInputStream(stream)(Codec.UTF8) or an equivalent withCodec/codec
call) so the Wikimedia UTF-8 response is decoded correctly; keep the same
trimming/filtering logic and ensure the stream.close() in the finally block
remains unchanged.
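The same pitfall exists on the Java side of this codebase, so the charset-explicit pattern is worth sketching there too. This is an illustrative analogue of `Source.fromInputStream(stream)(Codec.UTF8)`, not code from the PR; the method name and the sample input are invented.

```java
// Sketch: reading a language-code list with an explicit UTF-8 charset,
// the Java analogue of Source.fromInputStream(stream)(Codec.UTF8).
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;
import java.util.stream.Collectors;

public class Utf8ReadSketch {

    static List<String> readLanguageCodes(InputStream stream) throws IOException {
        // InputStreamReader with StandardCharsets.UTF_8 avoids the
        // platform-default charset that plain readers would use.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(stream, StandardCharsets.UTF_8))) {
            return reader.lines()
                .map(String::trim)
                .filter(line -> !line.isEmpty() && !line.contains(" ")) // basic filter
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = "en\nde\nnot a code\n\nfr\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readLanguageCodes(new ByteArrayInputStream(bytes)));
        // [en, de, fr]
    }
}
```

Pinning the charset at the decoder boundary makes parsing deterministic across JVMs regardless of `file.encoding`.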
Hi @TallTed, could you please review this PR and merge it when you get a chance?



This PR implements NIF-based citation extraction to provide precise character offsets and direct links to external sources via `itsrdf:taIdentRef`, replacing the previous line-number-based citations.