Support archiving plaintext files/markdown files by LVerneyEC · Pull Request #1153 · OpenTermsArchive/engine

LVerneyEC · 2025-05-16T09:23:13Z

Hi,

This adds the possibility to archive raw content served over HTTP (plaintext or markdown). This would typically cover archiving content served through https://raw.githubusercontent.com.

Best,

MattiSG · 2025-05-19T12:56:35Z

Hi @LVerneyEC!

Thank you for this suggestion 🙂
Can you please provide a few examples of contractual documents that currently cannot be tracked and that would become trackable with this changeset?

Thank you!

LVerneyEC · 2025-05-19T13:08:04Z

Not directly a contractual document, but an example would be https://github.com/xai-org/grok-prompts/blob/main/ask_grok_summarizer.j2. Would be way easier to track through the raw github endpoint :)

MattiSG · 2025-06-01T16:15:22Z

Thanks for clarifying the intention @LVerneyEC and for this suggestion!

Considering that adding a feature to the engine means guaranteeing its maintenance, we want to make sure that every additional feature aligns with Open Terms Archive’s Design Principles, that there are clear use cases associated with each of them, and that software quality is ensured 🙂

In the current case, we do appreciate the upcoming relevance of prompts, and would add software licenses as potential cases as well. We will request the following elements before proceeding with merging:

1. At least five examples of contractual documents that are provided to end users in the newly supported format. This could be obtained by a simple online search (ensuring that the TXT files are indeed the ones that are shown to the end users, in accordance with principle 3) for existing terms types.
   - If no such cases currently exist, seeing the provided example, new terms types such as “Source Code License” or “System Prompt” could be added, that would then support listing examples.
2. Automated tests are added for each new supported format, to ensure the stability of the feature over time. You can liaise with @Ndpnt for test design.

For the provided example (grok prompts stored on GitHub), while the topic is definitely exciting, the relevance of Open Terms Archive for tracking seems a bit remote, as one could simply clone the repository for history preservation, and directly subscribe to RSS to be notified of changes to that specific file 🙂

LVerneyEC · 2026-05-06T15:13:14Z

Quick update on this one because we just got another use case with a TXT file served by X: https://code.europa.eu/dsa/terms-and-conditions-database/vlops-and-vloses/vlop-vlose-declarations/-/merge_requests/82.

The current workaround is to have a dummy selector (e.g. select: body, even though there is no body). The downside is that this loses all the newline characters.

Ndpnt · 2026-05-11T15:25:18Z

Hi @LVerneyEC,

Thanks for keeping this moving and for the additional use case. Quick recap on where things stand:

1. The terms type needs to be discussed in terms-types first. "LLM Documentation Index" is not a recognized terms type in https://github.com/OpenTermsArchive/terms-types. That discussion, including the principle 3 question, should happen there before the engine grows support to track such content. llms.txt is a convention aimed at LLM crawlers rather than end users, so the same concern as the initial Grok example applies. Could you open an issue on terms-types?
2. The corpus of five examples is still needed, per @MattiSG's earlier comment.
3. On the implementation side, a few points would need to be addressed before merging:
   - The MIME-type check is too strict. sourceDocument.mimeType == mime.getType('txt') is an exact-match comparison against 'text/plain', but both raw.githubusercontent.com and docs.x.com return text/plain; charset=utf-8. The check fails and content falls back to extractFromHTML, reproducing the bug the PR aims to fix. The PDF branch in the code handles this via mime.getExtension(...).
   - The declaration schema is not updated (it still requires select for non-PDF, hence the lingering select: body workaround).
   - Automated tests are still missing.

Happy to take a closer look at the PR once (1) and (2) are addressed.
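
To illustrate the MIME-type point above, here is a minimal sketch of a parameter-tolerant check. It is not the engine's code: the helper names essenceOf and isPlainText and the PLAIN_TEXT_TYPES set are hypothetical, and it uses plain string handling instead of the mime package so that it runs on its own. It assumes the Content-Type value is available as a string and that text/plain and text/markdown are the formats to accept.

```javascript
// Hypothetical helper (not part of the engine): returns the bare media type
// of a Content-Type value, dropping parameters such as "charset=utf-8".
function essenceOf(contentType) {
  return String(contentType ?? '').split(';')[0].trim().toLowerCase();
}

// Assumed list of formats to treat as raw text; adjust to what the PR supports.
const PLAIN_TEXT_TYPES = new Set(['text/plain', 'text/markdown']);

// Hypothetical predicate: true when the response should skip HTML extraction.
function isPlainText(contentType) {
  return PLAIN_TEXT_TYPES.has(essenceOf(contentType));
}

// The exact-match comparison fails as soon as a charset parameter is present.
console.log('text/plain; charset=utf-8' == 'text/plain'); // false

// Comparing on the bare media type accepts the same header.
console.log(isPlainText('text/plain; charset=utf-8')); // true
console.log(isPlainText('text/html; charset=utf-8')); // false
```

Normalizing the header before comparing keeps the check independent of whichever charset or other parameters a given server appends.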

Support archiving plaintext files/markdown files

8f948d5


LVerneyEC wants to merge 1 commit into OpenTermsArchive:main from LVerneyEC:markdown
