Skip to content

a-lberto/xsee

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

44 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Eye-glasses with attached multiple stacking lenses

xsee

XPaths in a YAML get you clean JSON from HTML.


Tired of writing yet another scraping script? Tired to recompile your scrapers after an XPath changes?

By using a simple strict YAML format as input, xsee allows you to turn an HTML page into a fine tuned JSON.

No release yet as it is still in very early development.

Project structure

xsee/
├── docs
├── engines                   # Different implementations
│   ├── cpp
│   ├── js
│   └── python
├── scripts                   # Small utility scripts
└── tests                     # Shared tests folder
    ├── example               # Regression test examples
    │   ├── expected.json     # Expected output
    │   ├── input.html        # Input to process
    │   └── xsee.yaml         # Minimal scraping config
    └── test.py               # Runs for every engine, every test

Example

Imagine you have a messy HTML page

View Messy Source HTML
<header class="site-header-v2">
  <div class="banner-ad">Buy Crypto Now!</div>
  <h1>Tech Gadget Emporium</h1> 
</header>

<main id="content-7721">
  <section class="grid-layout">
    <div class="product card-style-prime">
      <div class="img-wrapper">
        <img src="kb.jpg" />
        <span class="tooltip">Bestseller</span>
      </div>
      <div class="details">
        <h2 class="title">Mechanical Keyboard</h2> 
        <div class="price-container">
          <span class="p">$120</span> 
          <span class="old-price">$150</span>
        </div>
        <ul class="tag-cloud">
          <li>peripherals</li>
          <li>gaming</li>
          <li>usb-c</li>
        </ul>
      </div>
      <script>trackImpression('prod_01');</script>
    </div>

    <div class="spacer-ads">Some Garbled Mess</div>

    <div class="product card-style-prime">
      <h2 class="title">Wireless Mouse</h2>
      <span class="p">$60</span>
      <ul class="tag-cloud">
        <li>ergonomic</li>
        <li>battery-powered</li>
      </ul>
    </div>
  </section>
</main>

Once you have found the XPaths that lead to your desired information and compiled the xsee.yaml

store_name: "//h1"
catalog:
  - "//div[contains(@class, 'product')]"
  - name: ".//h2"
    price: ".//span[@class='p']"
    tags: [ ".//li", "." ]

And run xsee with

curl http://yourfavoritewebsite/ > input.html
xsee input.html --yaml xsee.yaml

You directly get this as output

View Structured JSON Output
{
  "store_name": "Tech Gadget Emporium",
  "catalog": [
    {
      "name": "Mechanical Keyboard",
      "price": "$120",
      "tags": ["peripherals", "gaming", "usb-c"]
    },
    {
      "name": "Wireless Mouse",
      "price": "$60",
      "tags": ["ergonomic", "battery-powered"]
    }
  ]
}

Rules of thumb

XSEE replaces procedural scraping scripts with a structural contract, treating the DOM as a queryable data source.

XSEE is text first, and explicitly does no data processing other than extracting raw information for the DOM. Processing is left to be done to other tools of your choice.

XSEE uses XPath 1.0 for best portability. The implementation of XSEE applies a normalize-space() to the string extracted.

The patterns

XSEE uses three simple patterns to map DOM elements to data:

  1. Leaf: key: "xpath" e.g. title: "//h1", url : "//a/@href"

    Extracts only the textContent of the first matching node or its first XPath attribute selector (e.g., @src, @href, @content). Returns null if not found.

  2. Group: key: {group} e.g. meta: { author: "//span", date: "//time" }

    Used to group related data.

  3. Iterator: key: ["selector", "extractor"] e.g. related_articles: [ "//li", ".//a/@href" ]

    This is the only allowed type of list (2-tuple). Iterates over all objects in the DOM found by the first XPath selector and applies the second extractor to each element.

    The extractor can be either a single "xpath" or a {group}. Extraction XPaths must use the ./ or .// relative prefix to remain relative to the parent and prevent context leak (the engine must enforce this by throwing an error).

    If selector finds no results, returns [], if extractor finds no results, it returns []. Leafs inside groups are handled normally as null when leaf is not found.

References

About

XPaths in a strictly structured YAML transforming wild HTML into clean JSON data, recursively

Resources

License

Stars

Watchers

Forks

Releases

No releases published