
Anchor structured extraction #4

Description

@hoijui

Currently, anchors extracted from documents end up in a flat namespace, while they usually exist in a tree-like namespace. That structure is given by the header level of the anchor, or by the level of the header that contains the anchor (if there are anchors other than the headers themselves), together with all the super-headers of that header.
This is at least the case for Markdown and HTML, and probably for most other document formats as well.

Example Markdown document (doc.md):

# Top

## First Sub

bla bla bla

### A Sub Sub

bli bli bli

## Second Sub

blu blu blu

### B Sub Sub

tri tra tralala

<a name="in-text"/>

Flat extraction:

doc.md#top
doc.md#first-sub
doc.md#a-sub-sub
doc.md#second-sub
doc.md#b-sub-sub
doc.md#in-text

Structured extraction:

doc.md#
    \ top
        \ first-sub
            \ a-sub-sub
        \ second-sub
            \ b-sub-sub
                \ in-text
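
To make the idea concrete, here is a minimal sketch of how such a tree could be built from a Markdown file. It is not taken from the tool's code: the names `AnchorNode`, `slugify`, `extract_anchor_tree` and `render` are hypothetical, the slug rule (lowercase, drop punctuation, spaces to hyphens) is only an approximation of what GitHub does, and fenced code blocks are not skipped. Non-header anchors are attached to the innermost header they appear under.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Assumption: ATX-style headers ("# Title") and <a name="..."> / <a id="..."> anchors only.
HEADER_RE = re.compile(r'^(#{1,6})\s+(.+?)\s*#*\s*$')
ANCHOR_RE = re.compile(r'<a\s+(?:name|id)="([^"]+)"')


@dataclass
class AnchorNode:
    name: str
    level: int  # 0 = document root, 1-6 = header level, parent level + 1 for in-text anchors
    children: List["AnchorNode"] = field(default_factory=list)


def slugify(title: str) -> str:
    """Approximate GitHub-style slug: lowercase, strip punctuation, spaces to hyphens."""
    slug = re.sub(r"[^\w\- ]", "", title.strip().lower())
    return slug.replace(" ", "-")


def extract_anchor_tree(markdown: str, doc_name: str) -> AnchorNode:
    root = AnchorNode(doc_name + "#", 0)
    stack = [root]  # chain of currently open headers, from root to innermost
    for line in markdown.splitlines():
        header = HEADER_RE.match(line)
        if header:
            level = len(header.group(1))
            # Close every header that is at the same level or deeper.
            while stack[-1].level >= level:
                stack.pop()
            node = AnchorNode(slugify(header.group(2)), level)
            stack[-1].children.append(node)
            stack.append(node)
            continue
        # In-text anchors become children of the innermost open header.
        for name in ANCHOR_RE.findall(line):
            stack[-1].children.append(AnchorNode(name, stack[-1].level + 1))
    return root


def render(node: AnchorNode, depth: int = 0) -> List[str]:
    """Render the tree in the indented notation used above."""
    lines = [node.name if depth == 0 else "    " * depth + "\\ " + node.name]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


sample = "# Top\n\n## First Sub\n\n### A Sub Sub\n\n## Second Sub\n\n### B Sub Sub\n\n<a name=\"in-text\"/>\n"
print("\n".join(render(extract_anchor_tree(sample, "doc.md"))))
```

The stack of open headers is what turns the flat sequence into a tree: a new header closes everything at its own level or deeper and is then attached to whatever remains on top.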

Why

This is useful when analyzing changes in documents. For example, if a title has been renamed but the overall structure has stayed the same, one might be able to generate an auto-fix for a broken link whose fragment (the part that is meant to map to an anchor) no longer resolves.
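
As a rough illustration of such an auto-fix, the sketch below looks up where the missing anchor sat in the old tree and proposes whatever anchor now occupies the same position in the new tree. Everything here is an assumption made for the example: trees are plain `(name, children)` tuples, and `find_position`, `node_at` and `suggest_fragment` are hypothetical helpers. The heuristic is only meaningful when the two trees have the same shape, which a real implementation would have to verify first.

```python
from typing import Optional, Tuple

# Assumption: a tree is a (anchor name, list of child trees) tuple.
Tree = Tuple[str, list]


def find_position(tree: Tree, name: str, path: tuple = ()) -> Optional[tuple]:
    """Return the child-index path to the first node called `name`, or None."""
    if tree[0] == name:
        return path
    for index, child in enumerate(tree[1]):
        found = find_position(child, name, path + (index,))
        if found is not None:
            return found
    return None


def node_at(tree: Tree, path: tuple) -> Optional[Tree]:
    """Follow a child-index path; return None if it leaves the tree."""
    for index in path:
        if index >= len(tree[1]):
            return None
        tree = tree[1][index]
    return tree


def suggest_fragment(old_tree: Tree, new_tree: Tree, fragment: str) -> Optional[str]:
    """Propose a replacement for a fragment that no longer resolves."""
    if find_position(new_tree, fragment) is not None:
        return fragment  # the anchor still exists, nothing to fix
    position = find_position(old_tree, fragment)
    if position is None:
        return None  # the anchor was never known, no basis for a guess
    candidate = node_at(new_tree, position)
    return candidate[0] if candidate else None


old = ("doc.md#", [("top", [("first-sub", [("a-sub-sub", [])]),
                            ("second-sub", [("b-sub-sub", [("in-text", [])])])])])
# Same structure, but "First Sub" has been renamed to "Introduction".
new = ("doc.md#", [("top", [("introduction", [("a-sub-sub", [])]),
                            ("second-sub", [("b-sub-sub", [("in-text", [])])])])])

print(suggest_fragment(old, new, "first-sub"))  # introduction
```

With a flat list of anchors this kind of positional matching is not possible, since there is no parent/child information to compare.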


Labels: enhancement (New feature or request)
