Support archiving plaintext files/markdown files #1153

Open
LVerneyEC wants to merge 1 commit into OpenTermsArchive:main from LVerneyEC:markdown

@LVerneyEC
Contributor

Hi,

This adds the possibility to archive raw content served over HTTP (plaintext or Markdown). This would typically cover archiving content served through https://raw.githubusercontent.com.

Best,

@MattiSG
Member

MattiSG commented May 19, 2025

Hi @LVerneyEC!

Thank you for this suggestion 🙂
Can you please provide a few examples of contractual documents that currently cannot be tracked and that would become trackable with this changeset?

Thank you!

@LVerneyEC
Contributor Author

Not directly a contractual document, but an example would be https://github.com/xai-org/grok-prompts/blob/main/ask_grok_summarizer.j2. It would be way easier to track through the raw GitHub endpoint :)

@MattiSG
Member

MattiSG commented Jun 1, 2025

Thanks for clarifying the intention @LVerneyEC and for this suggestion!

Considering that adding a feature to the engine means guaranteeing its maintenance, we want to make sure that every additional feature aligns with Open Terms Archive’s Design Principles, that there are clear use cases associated with each of them, and that software quality is ensured 🙂

In the current case, we do appreciate the upcoming relevance of prompts, and would add software licenses as potential cases as well. We will request the following elements before proceeding with merging:

  • At least five examples of contractual documents that are provided to end users in the newly supported format. These could be obtained through a simple online search for existing terms types (ensuring that the TXT files are indeed the ones shown to end users, in accordance with principle 3).
    • If no such cases currently exist, then given the provided example, new terms types such as “Source Code License” or “System Prompt” could be added, which would then support listing examples.
  • Automated tests are added for each new supported format, to ensure the stability of the feature over time. You can liaise with @Ndpnt for test design.

For the provided example (grok prompts stored on GitHub), while the topic is definitely exciting, the relevance of Open Terms Archive for tracking seems a bit remote, as one could simply clone the repository for history preservation, and directly subscribe to RSS to be notified of changes to that specific file 🙂

@LVerneyEC
Contributor Author

Quick update on this one because we just got another use case with a TXT file served by X: https://code.europa.eu/dsa/terms-and-conditions-database/vlops-and-vloses/vlop-vlose-declarations/-/merge_requests/82.

The current workaround is to use a dummy selector (e.g. select: body, even though there is no body). The downside is that this loses all the newline characters.
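To illustrate the downside mentioned above, here is a minimal sketch of how newlines can disappear when plaintext goes through an HTML pipeline. It assumes the loss comes from HTML-style whitespace collapsing; `collapseWhitespace` is a hypothetical stand-in for illustration, not the engine's actual code.

```javascript
// Hypothetical simulation: HTML rendering rules treat runs of spaces, tabs
// and newlines as a single space. This is an assumption about why the
// `select: body` workaround flattens plaintext, not code from the engine.
function collapseWhitespace(text) {
  return text.replace(/[ \t\r\n]+/g, ' ').trim();
}

const raw = 'Section 1\n\nYou agree to the terms.\nSection 2\n';
const flattened = collapseWhitespace(raw);

console.log(JSON.stringify(flattened));
// "Section 1 You agree to the terms. Section 2"
```

A dedicated plaintext path would presumably keep `raw` as-is, so paragraph and line breaks survive in the archived version.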

@Ndpnt
Contributor

Ndpnt commented May 11, 2026

Hi @LVerneyEC,

Thanks for keeping this moving and for the additional use case. Quick recap on where things stand:

  1. The terms type needs to be discussed in terms-types first. "LLM Documentation Index" is not a recognized terms type in https://github.com/OpenTermsArchive/terms-types. That discussion should happen there before the engine grows support to track such content, including the principle 3 question. llms.txt is a convention aimed at LLM crawlers rather than end users, so the same concern as the initial Grok example applies. Could you open an issue on terms-types?

  2. The corpus of five examples is still needed, per @MattiSG's earlier comment.

  3. On the implementation side, a few points would need to be addressed before merging:

  • The MIME-type check is too strict. sourceDocument.mimeType == mime.getType('txt') is strict equality against 'text/plain', but both raw.githubusercontent.com and docs.x.com return text/plain; charset=utf-8. The check fails and content falls back to extractFromHTML, reproducing the bug the PR aims to fix. The PDF branch in the code handles this via mime.getExtension(...).
  • The declaration schema is not updated (it still requires select for non-PDF, hence the lingering select: body workaround).
  • Automated tests are still missing.
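As a rough illustration of the MIME-type point above, one possible approach is to reduce the `Content-Type` value to its bare media type before comparing. This is only a sketch: `getEssence`, `isPlaintext` and the `PLAINTEXT_TYPES` set are hypothetical names, not part of the PR or of the `mime` package, and the list of accepted types is an assumption.

```javascript
// Hypothetical helper (not from the PR): strip parameters such as
// "charset=utf-8" and normalize case, leaving only the media type.
function getEssence(contentType) {
  if (typeof contentType !== 'string') return '';
  return contentType.split(';')[0].trim().toLowerCase();
}

// Assumed set of media types to route to a plaintext extractor.
const PLAINTEXT_TYPES = new Set(['text/plain', 'text/markdown']);

function isPlaintext(contentType) {
  return PLAINTEXT_TYPES.has(getEssence(contentType));
}

console.log(isPlaintext('text/plain; charset=utf-8')); // true
console.log(isPlaintext('text/plain'));                // true
console.log(isPlaintext('text/html; charset=utf-8'));  // false
```

With a check along these lines, a response served as `text/plain; charset=utf-8` would no longer fall through to the HTML extraction path.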

Happy to take a closer look at the PR once (1) and (2) are addressed.
