Skip to content

ufal/atrium-translator

Repository files navigation


πŸ›οΈ ATRIUM - LINDAT Translation Wrapper 🌍

A modular Python wrapper specifically designed for the LINDAT Translation API 1. Following project scope requirements, this tool is strictly focused on processing XML and its direct derivatives (ALTO XML and AMCR metadata records). It identifies the source language using FastText 2, translates the content to English (or other target languages), optionally overrides domain-specific terms using a Tag-and-Protect vocabulary strategy backed by UDPipe lemmatisation 3, and safely reconstructs the original XML structure.

πŸ“š Table of Contents


✨ Features

  • 🎯 Dedicated XML Processing: Narrowly defined and optimised exclusively for ALTO XML and AMCR metadata to ensure universal, safe, and easy usage.
  • πŸ“– ALTO Translation Mode: Translates only the CONTENT attributes natively. Tied to a simple flag (--alto) so users don't need to provide complex configurations.
  • πŸ›οΈ AMCR Metadata Mode: Translates specific elements based on a provided list of XPaths (e.g., amcr-fields.txt πŸ“Ž), safely puts them back into the XML, and features deep recursive namespace extraction to handle OAI-PMH envelopes.
  • βœ… XSD Validation: Optionally validates AMCR outputs against an XSD schema (e.g., https://api.aiscr.cz/schema/amcr/2.2/amcr.xsd) to guarantee structural integrity.
  • πŸ“Š Supplementary CSV Logging: Automatically produces a supplementary QA CSV file with columns: file, page_num, line_num, text_<source_lang>, text_<target_lang> for easy manual checking of translations.
  • πŸ•΅οΈ Language Detection with Intelligent Fallback: Automatically identifies the source language using FastText (Facebook) 2. If the detection confidence is below 0.2, it defaults to Czech (cs) to ensure the pipeline continues seamlessly.
  • πŸ”€ Tag-and-Protect Vocabulary Overriding: When a vocabulary CSV is supplied, domain-specific terms are protected before translation using unique placeholder tags. Single-word terms are matched by lemma via the LINDAT UDPipe API 3; multi-word phrases use case-insensitive substring matching (longest match first). Vocabulary translations are then restored after the NMT call, ensuring controlled terminology is never garbled.
  • πŸ—‚οΈ Automated Vocabulary Harvesting: The bundled load_vocab.py script downloads Czechβ†’English term pairs from both the AMCR OAI-PMH API and the TEATER GraphQL API and merges them into a single ready-to-use CSV.
  • πŸ”— LINDAT API Integration: Seamlessly connects to the LINDAT Translation API (v2) 1. Uses smart, space-aware chunking (max 4,000 characters) to protect word boundaries and prevent API truncation errors.

πŸ› οΈ Prerequisites

  1. Clone the project files:
git clone https://github.com/ufal/atrium-translator.git
  1. Create a virtual environment and activate it (optional but recommended):
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  1. Install the required Python packages:
cd atrium-translator
pip install -r requirements.txt

Note on fasttext: The upstream package requires a C++ compiler at build time. If your environment lacks build tools, install the pre-built wheel instead:

pip install fasttext-wheel

πŸ“‚ Project Structure

atrium-translator/
β”œβ”€β”€ main.py                    # πŸš€ Entry point – CLI routing for ALTO vs. AMCR processing
β”œβ”€β”€ load_vocab.py              # πŸ—‚οΈ Vocabulary harvester (AMCR OAI-PMH + TEATER GraphQL β†’ CSV)
β”œβ”€β”€ atrium_paradata.py         # πŸ“Š Unified provenance/paradata logger
β”œβ”€β”€ requirements.txt           # πŸ“¦ Python dependencies
β”œβ”€β”€ config.txt                 # βš™οΈ Configuration parameters
β”œβ”€β”€ amcr-fields.txt            # πŸ“„ List of AMCR XPath targets for XML translation
β”œβ”€β”€ amcr-inputs.txt            # πŸ“„ List of AMCR metadata input files (XML) to be processed
β”œβ”€β”€ processors/
β”‚   β”œβ”€β”€ __init__.py            # πŸ“¦ Package marker
β”‚   β”œβ”€β”€ identifier.py          # 🌍 FastText language identification (ISO 639-3 to 639-1 mapping)
β”‚   β”œβ”€β”€ lemmatizer.py          # πŸ”€ UDPipe-based lemmatizer for vocabulary term matching
β”‚   └── translator.py          # πŸ”„ LINDAT API client with Tag-and-Protect vocabulary support
β”œβ”€β”€ data_samples/
β”‚   β”œβ”€β”€ vocabulary.csv         # πŸ“˜ Czechβ†’English domain vocabulary (AMCR/TEATER thesaurus terms)
β”‚   β”œβ”€β”€ my_documents/          # πŸ“‚ Sample input files (ALTO XML and downloaded AMCR metadata XMLs)
β”‚   β”‚   β”œβ”€β”€ MTX201501307.alto.xml  # πŸ“Ž Sample ALTO XML file for testing
β”‚   β”‚   └── ...
β”‚   └── translated_files/      # πŸ“‚ Output directory for translated XML files and their logs
β”‚       β”œβ”€β”€ MTX201501307_en.alto.xml  # πŸ“Ž Translated ALTO XML output file
β”‚       β”œβ”€β”€ MTX201501307_log.csv      # πŸ“Ž Supplementary CSV log for the translated ALTO XML file
β”‚       └── ...
β”œβ”€β”€ paradata/
β”‚   β”œβ”€β”€ <date>-<time>_translator.json  # πŸ“Š Aggregated log of all translations for analysis
β”‚   └── ...
└── utils.py                   # πŸ”§ ALTO & AMCR parsing, CSV logging, XSD validation, and XML tree reconstruction

πŸ’» Usage

Run the wrapper from the command line. The default target language is English (en).

πŸ“– ALTO XML Mode

Use the --alto flag together with --formats alto.xml (or set formats = alto.xml in config.txt). This processes ALTO files by strictly targeting their String CONTENT attributes.

python main.py ./data_samples/my_documents --alto --formats alto.xml --target_lang en

Example of ALTO XML processing:

The translation is performed per TextBlock, and the translated words are redistributed back into the individual CONTENT attributes of each String element within a TextLine.

πŸ›οΈ AMCR Metadata Mode

Process AMCR records by passing your list of XPaths and optionally providing an XSD URL for validation.

python main.py amcr-inputs.txt --xpaths amcr-fields.txt --xsd https://api.aiscr.cz/schema/amcr/2.2/amcr.xsd --target_lang en

OR

python main.py amcr-inputs.txt --xpaths amcr-fields.txt --target_lang en

Examples of input files are downloaded into my_documents πŸ“‚ and their filenames start with C- according to the amcr-inputs.txt πŸ“Ž list of input files.

Examples of output files are saved in translated_files πŸ“‚ and include .csv log files (containing only processed lines) alongside .xml files translated to the target language.

πŸ“˜ Vocabulary / Tag-and-Protect

Provide a two-column CSV (source_lemma,target_translation) to activate the Tag-and-Protect strategy. When enabled, domain-specific terms are shielded from the NMT model and replaced with guaranteed vocabulary translations instead.

python main.py amcr-inputs.txt --xpaths amcr-fields.txt --vocabulary data_samples/vocabulary.csv --target_lang en

Or set the path in config.txt:

vocabulary = data_samples/vocabulary.csv

How it works:

  1. Multi-word phrase pass – phrases containing spaces (e.g. fotografie udΓ‘losti) are matched case-insensitively, longest match first, and replaced with __TERM_N__ placeholder tags.
  2. Single-word lemma pass – the remaining text is lemmatised via the LINDAT UDPipe API 3. Tokens whose base form appears in the vocabulary are similarly tagged.
  3. Translation – the tagged text (with unknown-looking placeholders) is sent to the LINDAT Translation API. NMT models leave unrecognised tokens untouched.
  4. Restoration – all __TERM_N__ tags in the translated output are replaced with the corresponding vocabulary translations.

If no vocabulary file is provided, the translator behaves exactly as before (no UDPipe calls are made).

πŸ—‚οΈ Harvesting the Vocabulary

The load_vocab.py script downloads term pairs automatically from two sources and merges them into a single CSV:

Source Endpoint Method
AMCR https://api.aiscr.cz/2.2/oai?set=heslo OAI-PMH ListRecords with resumption token paging
TEATER https://teater.aiscr.cz/api/graphql GraphQL introspection β†’ exportAll or search-based fallback
# Full harvest (both sources):
python load_vocab.py

# Skip one source:
python load_vocab.py --skip-teater
python load_vocab.py --skip-amcr

# Custom output path and request delay:
python load_vocab.py --out my_vocab.csv --delay 0.5

The merged vocabulary is written to data_samples/vocabulary.csv by default (AMCR entries take precedence over TEATER on key collision).

βš™οΈ Configuration File Support

Instead of passing all arguments via the command line, you can use a configuration file config.txt to define default paths and parameters. CLI arguments always take precedence over config file values β€” the config file supplies defaults only for arguments that are not explicitly passed on the command line.

Example config.txt:

[DEFAULT]
input_path = ./data_samples/my_documents
source_lang = auto
target_lang = en
formats = xml,txt
fields = amcr-fields.txt
output = ./data_samples/translated_files

# Optional: path to a vocabulary CSV file (source_lemma,target_translation).
# Leave blank or comment out to disable.
vocabulary = data_samples/vocabulary.csv

βš™οΈ Supported Arguments

  • input_path: Path to a single source file, a directory containing XML files, or a .txt file listing URLs.
  • --output, -o: Output file path (for single-file mode) or output directory (for batch mode).
  • --source_lang, -src: Source language code (e.g., cs, fr). Use auto to auto-detect. Default: cs.
  • --target_lang, -tgt: Target language code (e.g., en, cs). Default: en.
  • --formats: Comma-separated list of file extensions to process (e.g., alto.xml,txt or xml,txt). Default: xml.
  • --config, -c: Path to the configuration file (default: config.txt).
  • --alto: Flag to enable ALTO XML in-place translation mode.
  • --xpaths: Path to a .txt file containing XPaths for AMCR metadata translation.
  • --xsd: Optional URL or local path to an XSD file for AMCR output validation.
  • --vocabulary: Path to a CSV vocabulary file (source_lemma,target_translation) to activate Tag-and-Protect term overriding.

🧠 Logic Overview

  1. Routing: The script determines if it is running in ALTO mode (--alto) or AMCR mode (--xpaths).
  2. Extraction & Translation:
    • ALTO: Iterates through Page β†’ TextLine β†’ String. Extracts the CONTENT attribute, reconstructs the entire line for contextual API translation, and perfectly redistributes the translated words back into the CONTENT attributes.
    • AMCR: Uses deep recursive namespace extraction (vital for OAI-PMH API envelopes). Finds elements matching the provided XPaths, translates their text content, and replaces it in the tree.
  3. Language Identification: The text is analysed by FastText 2 to determine the source language. If the confidence score is below 0.2, the system automatically defaults to Czech (cs).
  4. Vocabulary Overriding (optional): When a vocabulary CSV is loaded, the Tag-and-Protect strategy is applied before the NMT call. Multi-word phrases are matched first (longest-first substring), then single-word terms are matched via UDPipe lemmatisation 3. Matched terms are replaced with __TERM_N__ placeholders, translated safely through the API, and then restored with the controlled vocabulary translations.
  5. Translation: Text (with any protected placeholders) is passed to the LINDAT Translation API 1. Texts longer than 4,000 characters are safely chunked at the nearest space to prevent mid-word cuts.
  6. Output: Generates the translated .xml file preserving all original tags/namespaces, alongside a supplementary _log.csv file containing the line-by-line translation data for manual QA review. Optionally validates AMCR output against an XSD schema.

Paradata logs

The wrapper generates a supplementary CSV log file for each processed XML file, named with the pattern <original_filename>_log.csv. This log contains the following columns:

Column ALTO value AMCR value
file source filename (stem) source filename (stem)
page_num page index (1-based) (empty)
line_num TextLine element ID full XPath expression
text_<source_lang> original CONTENT text original element text
text_<target_lang> translated text translated text

The column names for the source and target text are dynamic: they reflect the actual language codes in use (e.g., text_auto / text_en when running with --source_lang auto --target_lang en).

Moreover, the paradata πŸ“‚ directory contains aggregated JSON logs of all processed files, allowing run-level metadata (timing, counts, skipped files) to be queried for analysis and reporting.

πŸ™ Acknowledgements

For support write to: lutsai.k@gmail.com responsible for this GitHub repository 4 πŸ”—

  • Developed by UFAL 5 πŸ‘₯
  • Funded by ATRIUM 6 πŸ’°
  • Shared by ATRIUM 6 & UFAL 5 πŸ”—
  • Translation API: LINDAT/CLARIAH-CZ Translation Service 1 πŸ”—
  • Lemmatisation API: LINDAT/CLARIAH-CZ UDPipe Service 3 πŸ”—
  • Language Identification: Facebook FastText 2 πŸ”—

©️ 2026 UFAL & ATRIUM

Footnotes

  1. https://lindat.mff.cuni.cz/services/translation/ ↩ ↩2 ↩3 ↩4

  2. https://huggingface.co/facebook/fasttext-language-identification ↩ ↩2 ↩3 ↩4

  3. https://lindat.mff.cuni.cz/services/udpipe/ ↩ ↩2 ↩3 ↩4 ↩5

  4. https://github.com/ufal/atrium-translator ↩

  5. https://ufal.mff.cuni.cz/home-page ↩ ↩2

  6. https://atrium-research.eu/ ↩ ↩2

About

ATRIUM project in-place translation of XML (ALTO/AMCR) files into English language

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages