The following page:
https://netflixtechblog.com/full-cycle-developers-at-netflix-a08c31f83249
has img tags with an empty or missing src attribute. I think the src is set via JavaScript on scroll, with the noscript tags right after the img tags serving as a fallback.
Here's a piece of the page's HTML:
<img alt="" class="iq ir t u v is ak c" width="687" height="60" role="presentation"><noscript><img alt="" class="t u v is ak" src="https://miro.medium.com/max/1374/1*JnixtUHJjNYXNT15P42eJQ.png" width="687" height="60" srcSet="https://miro.medium.com/max/552/1*JnixtUHJjNYXNT15P42eJQ.png 276w, https://miro.medium.com/max/1104/1*JnixtUHJjNYXNT15P42eJQ.png 552w, https://miro.medium.com/max/1280/1*JnixtUHJjNYXNT15P42eJQ.png 640w, https://miro.medium.com/max/1374/1*JnixtUHJjNYXNT15P42eJQ.png 687w" sizes="687px" role="presentation"/></noscript></div></div></div><figcaption class="jd je cm ck cl jf jg en b eo ep fv" data-selectable-paragraph="">SDLC components</figcaption></figure>
This causes Readability to return empty images for the large images, and only tiny thumbnails when using ReadabilityExtended.
I was able to solve the issue by finding all img tags with a missing src, checking whether each such Element has a noscript sibling with an img in it, and, if so, copying the src from the noscript img to the original img.
I placed the following code at the very beginning of the protected open fun removeNoscripts(document: Document) {} function in Preprocessor.kt:
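// Requires: import org.jsoup.Jsoup and import org.jsoup.parser.Parser
try {
    // Find every img whose src is empty or missing
    document.select("img[src=\"\"], img:not([src])").forEach { img ->
        // Look for a noscript sibling that carries the real image
        img.siblingElements().select("noscript").firstOrNull()?.let { noscript ->
            // Re-parse the noscript content and read the fallback img's src, if there is one
            val fallbackSrc = Jsoup.parse(noscript.html(), "", Parser.xmlParser())
                .selectFirst("img")?.attr("src")
            if (!fallbackSrc.isNullOrEmpty()) {
                img.attr("src", fallbackSrc)
            }
        }
    }
} catch (e: Exception) {
    println("Exception in setting img for missing src from noscript tags: ${e.message}")
}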
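For anyone who wants to try this outside of Preprocessor.kt, here is a minimal standalone sketch of the same idea. It assumes only Jsoup on the classpath. The function name restoreSrcFromNoscript is hypothetical and not part of Readability4J, and the HTML in main is a trimmed version of the snippet above. Unlike the code above, it only accepts a noscript that is the very next element sibling of the img, which is a stricter matching rule I chose so that several img/noscript pairs under one parent do not all pick up the first noscript.

```kotlin
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.parser.Parser

// Hypothetical helper, not part of Readability4J: fills in the src of lazy-loaded
// images from the <noscript> fallback that directly follows them.
// Returns the number of img elements that were updated.
fun restoreSrcFromNoscript(document: Document): Int {
    var restored = 0
    for (img in document.select("img")) {
        // attr() returns "" when the attribute is absent, so this covers both cases
        if (img.attr("src").isNotBlank()) continue

        // Assumption: only a <noscript> that is the very next element sibling counts
        val noscript = img.nextElementSibling() ?: continue
        if (noscript.tagName() != "noscript") continue

        // Re-parse the noscript content, as in the snippet above
        val fallbackSrc = Jsoup.parse(noscript.html(), "", Parser.xmlParser())
            .selectFirst("img")?.attr("src")
        if (fallbackSrc.isNullOrBlank()) continue

        img.attr("src", fallbackSrc)
        restored++
    }
    return restored
}

fun main() {
    // Trimmed version of the markup from the page above
    val html = """
        <figure><div>
          <img alt="" width="687" height="60">
          <noscript><img alt="" src="https://miro.medium.com/max/1374/1*JnixtUHJjNYXNT15P42eJQ.png" width="687" height="60"/></noscript>
        </div><figcaption>SDLC components</figcaption></figure>
    """.trimIndent()
    val document = Jsoup.parse(html)
    val restored = restoreSrcFromNoscript(document)
    println("restored=$restored src=${document.selectFirst("figure > div > img")?.attr("src")}")
}
```

This sketch copies only src, like the code above. The srcSet and sizes attributes on the noscript img are left alone.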