xsee

XPaths in a YAML get you clean JSON from HTML.

Tired of writing yet another scraping script? Tired to recompile your scrapers after an XPath changes?

By using a simple strict YAML format as input, xsee allows you to turn an HTML page into a fine tuned JSON.

No release yet as it is still in very early development.

Project structure

xsee/
├── docs
├── engines                   # Different implementations
│   ├── cpp
│   ├── js
│   └── python
├── scripts                   # Small utility scripts
└── tests                     # Shared tests folder
    ├── example               # Regression test examples
    │   ├── expected.json     # Expected output
    │   ├── input.html        # Input to process
    │   └── xsee.yaml         # Minimal scraping config
    └── test.py               # Runs for every engine, every test

Example

Imagine you have a messy HTML page

View Messy Source HTML

<header class="site-header-v2">
  <div class="banner-ad">Buy Crypto Now!</div>
  <h1>Tech Gadget Emporium</h1> 
</header>

<main id="content-7721">
  <section class="grid-layout">
    <div class="product card-style-prime">
      <div class="img-wrapper">
        <img src="kb.jpg" />
        <span class="tooltip">Bestseller</span>
      </div>
      <div class="details">
        <h2 class="title">Mechanical Keyboard</h2> 
        <div class="price-container">
          <span class="p">$120</span> 
          <span class="old-price">$150</span>
        </div>
        <ul class="tag-cloud">
          <li>peripherals</li>
          <li>gaming</li>
          <li>usb-c</li>
        </ul>
      </div>
      <script>trackImpression('prod_01');</script>
    </div>

    <div class="spacer-ads">Some Garbled Mess</div>

    <div class="product card-style-prime">
      <h2 class="title">Wireless Mouse</h2>
      <span class="p">$60</span>
      <ul class="tag-cloud">
        <li>ergonomic</li>
        <li>battery-powered</li>
      </ul>
    </div>
  </section>
</main>

Once you have found the XPaths that lead to your desired information and compiled the xsee.yaml

store_name: "//h1"
catalog:
  - "//div[contains(@class, 'product')]"
  - name: ".//h2"
    price: ".//span[@class='p']"
    tags: [ ".//li", "." ]

And run xsee with

curl http://yourfavoritewebsite/ > input.html
xsee input.html --yaml xsee.yaml

You directly get this as output

View Structured JSON Output

{
  "store_name": "Tech Gadget Emporium",
  "catalog": [
    {
      "name": "Mechanical Keyboard",
      "price": "$120",
      "tags": ["peripherals", "gaming", "usb-c"]
    },
    {
      "name": "Wireless Mouse",
      "price": "$60",
      "tags": ["ergonomic", "battery-powered"]
    }
  ]
}

Rules of thumb

XSEE replaces procedural scraping scripts with a structural contract, treating the DOM as a queryable data source.

XSEE is text first, and explicitly does no data processing other than extracting raw information for the DOM. Processing is left to be done to other tools of your choice.

XSEE uses XPath 1.0 for best portability. The implementation of XSEE applies a normalize-space() to the string extracted.

The patterns

XSEE uses three simple patterns to map DOM elements to data:

Leaf: key: "xpath" e.g. title: "//h1", url : "//a/@href"

Extracts only the textContent of the first matching node or its first XPath attribute selector (e.g., @src, @href, @content). Returns null if not found.
Group: key: {group} e.g. meta: { author: "//span", date: "//time" }

Used to group related data.
Iterator: key: ["selector", "extractor"] e.g. related_articles: [ "//li", ".//a/@href" ]

This is the only allowed type of list (2-tuple). Iterates over all objects in the DOM found by the first XPath selector and applies the second extractor to each element.

The extractor can be either a single "xpath" or a {group}. Extraction XPaths must use the ./ or .// relative prefix to remain relative to the parent and prevent context leak (the engine must enforce this by throwing an error).

If selector finds no results, returns [], if extractor finds no results, it returns []. Leafs inside groups are handled normally as null when leaf is not found.

Name		Name	Last commit message	Last commit date
Latest commit History 44 Commits
.github/workflows		.github/workflows
docs		docs
engines		engines
scripts		scripts
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

xsee

Project structure

Example

Rules of thumb

The patterns

References

About

Uh oh!

Releases

Contributors 2

Languages

Folders and files

Latest commit

History

Repository files navigation

xsee

Project structure

Example

Rules of thumb

The patterns

References

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Contributors 2

Languages