XPaths in a YAML get you clean JSON from HTML.
Tired of writing yet another scraping script? Tired of recompiling your scrapers every time an XPath changes?
By using a simple, strict YAML format as input, xsee lets you turn an HTML page into finely tuned JSON.
No release yet as it is still in very early development.
xsee/
├── docs
├── engines                 # Different implementations
│   ├── cpp
│   ├── js
│   └── python
├── scripts                 # Small utility scripts
└── tests                   # Shared tests folder
    ├── example             # Regression test examples
    │   ├── expected.json   # Expected output
    │   ├── input.html      # Input to process
    │   └── xsee.yaml       # Minimal scraping config
    └── test.py             # Runs for every engine, every test
Imagine you have a messy HTML page:
<header class="site-header-v2">
  <div class="banner-ad">Buy Crypto Now!</div>
  <h1>Tech Gadget Emporium</h1>
</header>
<main id="content-7721">
  <section class="grid-layout">
    <div class="product card-style-prime">
      <div class="img-wrapper">
        <img src="kb.jpg" />
        <span class="tooltip">Bestseller</span>
      </div>
      <div class="details">
        <h2 class="title">Mechanical Keyboard</h2>
        <div class="price-container">
          <span class="p">$120</span>
          <span class="old-price">$150</span>
        </div>
        <ul class="tag-cloud">
          <li>peripherals</li>
          <li>gaming</li>
          <li>usb-c</li>
        </ul>
      </div>
      <script>trackImpression('prod_01');</script>
    </div>
    <div class="spacer-ads">Some Garbled Mess</div>
    <div class="product card-style-prime">
      <h2 class="title">Wireless Mouse</h2>
      <span class="p">$60</span>
      <ul class="tag-cloud">
        <li>ergonomic</li>
        <li>battery-powered</li>
      </ul>
    </div>
  </section>
</main>

Once you have found the XPaths that lead to the information you want, write them into xsee.yaml:
store_name: "//h1"
catalog:
  - "//div[contains(@class, 'product')]"
  - name: ".//h2"
    price: ".//span[@class='p']"
    tags: [ ".//li", "." ]

And run xsee with:
curl http://yourfavoritewebsite/ > input.html
xsee input.html --yaml xsee.yaml

You directly get this as output:
{
  "store_name": "Tech Gadget Emporium",
  "catalog": [
    {
      "name": "Mechanical Keyboard",
      "price": "$120",
      "tags": ["peripherals", "gaming", "usb-c"]
    },
    {
      "name": "Wireless Mouse",
      "price": "$60",
      "tags": ["ergonomic", "battery-powered"]
    }
  ]
}

XSEE replaces procedural scraping scripts with a structural contract, treating the DOM as a queryable data source.
XSEE is text first, and explicitly does no data processing other than extracting raw information from the DOM. Any further processing is left to other tools of your choice.
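Since prices and similar fields come out as raw strings, any typing or cleanup happens downstream. The following is a minimal sketch of such a post-processing step in Python, using the example output above as input. The helper name parse_price and the choice of Decimal are assumptions made for illustration; they are not part of XSEE.

```python
import json
from decimal import Decimal

# Raw output copied from the example above. In practice this would be
# read from wherever the xsee output was saved.
raw = """
{
  "store_name": "Tech Gadget Emporium",
  "catalog": [
    {"name": "Mechanical Keyboard", "price": "$120",
     "tags": ["peripherals", "gaming", "usb-c"]},
    {"name": "Wireless Mouse", "price": "$60",
     "tags": ["ergonomic", "battery-powered"]}
  ]
}
"""


def parse_price(text):
    """Hypothetical helper: turn a string such as "$120" into a Decimal.

    A missing leaf is reported as null (None in Python), so None is passed through.
    """
    if text is None:
        return None
    return Decimal(text.lstrip("$").replace(",", ""))


data = json.loads(raw)

# Build typed copies of the catalog entries without mutating the raw data.
products = [
    {**item, "price": parse_price(item["price"])}
    for item in data["catalog"]
]

total = sum(p["price"] for p in products)
print(total)  # 180
```

Keeping this step outside the extractor means the same xsee.yaml can feed several consumers that each interpret the strings differently.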
XSEE uses XPath 1.0 for best portability. Each XSEE implementation applies normalize-space() to every extracted string.
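For reference, XPath 1.0 defines normalize-space() as stripping leading and trailing whitespace and collapsing each internal run of whitespace into a single space, where whitespace means only space, tab, carriage return and line feed. A rough Python equivalent, shown here only to illustrate the effect on extracted text (it is not taken from any XSEE engine), could look like this:

```python
import re

# XPath 1.0 whitespace: space, tab, carriage return, line feed.
# Other Unicode spaces (e.g. U+00A0 no-break space) are left untouched.
_XPATH_WS = re.compile(r"[ \t\r\n]+")


def normalize_space(text):
    """Approximate XPath 1.0 normalize-space() for a Python string."""
    # Collapse every whitespace run to one space, then trim the ends.
    return _XPATH_WS.sub(" ", text).strip(" ")


print(normalize_space("  Mechanical \n\t Keyboard  "))  # Mechanical Keyboard
```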
XSEE uses three simple patterns to map DOM elements to data:
- Leaf: key: "xpath", e.g. title: "//h1", url: "//a/@href"
  Extracts only the textContent of the first matching node or, for an XPath attribute selector (e.g., @src, @href, @content), the first matching attribute value. Returns null if not found.

- Group: key: {group}, e.g. meta: { author: "//span", date: "//time" }
  Used to group related data.

- Iterator: key: ["selector", "extractor"], e.g. related_articles: [ "//li", ".//a/@href" ]
  This is the only allowed type of list (a 2-tuple). It iterates over all nodes in the DOM found by the first XPath (the selector) and applies the second element (the extractor) to each of them.
  The extractor can be either a single "xpath" or a {group}. Extraction XPaths must use the ./ or .// relative prefix so that they stay relative to the parent node and do not leak context (the engine must enforce this by throwing an error).
  If the selector finds no results, it returns []; if the extractor finds no results, it returns []. Leaves inside groups are handled normally and yield null when the leaf is not found.
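The three patterns above can be checked mechanically before any HTML is touched. Below is a minimal sketch of such a check in Python, working on the dictionary a YAML loader would produce for the example xsee.yaml. The names ConfigError, is_relative and classify are hypothetical and do not come from any XSEE engine. Two assumptions go beyond the text: a bare "." is accepted as a relative extractor because the example config uses it, and the selector of an iterator nested inside another iterator is also required to be relative.

```python
class ConfigError(ValueError):
    """Hypothetical error type for a config that breaks the pattern rules."""


def is_relative(xpath):
    # "./" also covers ".//"; a bare "." is allowed by assumption (see lead-in).
    return xpath == "." or xpath.startswith("./")


def classify(node, path="$", in_iterator=False):
    """Describe a config node as leaf, group or iterator, or raise ConfigError."""
    if isinstance(node, str):  # Leaf: key: "xpath"
        if in_iterator and not is_relative(node):
            raise ConfigError(f"{path}: {node!r} must start with ./ or .//")
        return "leaf"
    if isinstance(node, dict):  # Group: key: {group}
        return {
            key: classify(value, f"{path}.{key}", in_iterator)
            for key, value in node.items()
        }
    if isinstance(node, list):  # Iterator: key: ["selector", "extractor"]
        if len(node) != 2 or not isinstance(node[0], str):
            raise ConfigError(f"{path}: a list must be [selector, extractor]")
        selector, extractor = node
        if in_iterator and not is_relative(selector):
            raise ConfigError(f"{path}: {selector!r} must start with ./ or .//")
        if isinstance(extractor, list):
            raise ConfigError(f"{path}: extractor must be an xpath or a group")
        return ["iterator", classify(extractor, f"{path}[]", in_iterator=True)]
    raise ConfigError(f"{path}: unsupported value of type {type(node).__name__}")


# The example xsee.yaml from above, as a YAML loader would parse it.
config = {
    "store_name": "//h1",
    "catalog": [
        "//div[contains(@class, 'product')]",
        {
            "name": ".//h2",
            "price": ".//span[@class='p']",
            "tags": [".//li", "."],
        },
    ],
}

shape = classify(config)
print(shape["catalog"][0])  # iterator
```

A config such as {"links": ["//li", "//a/@href"]} is rejected by this sketch, because the extractor "//a/@href" would escape the node selected by "//li".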
