A modular Python wrapper specifically designed for the LINDAT Translation API 1. Following project scope requirements, this tool is strictly focused on processing XML and its direct derivatives (ALTO XML and AMCR metadata records). It identifies the source language using FastText 2, translates the content to English (or other target languages), optionally overrides domain-specific terms using a Tag-and-Protect vocabulary strategy backed by UDPipe lemmatisation 3, and safely reconstructs the original XML structure.
- Features
- Prerequisites
- Project Structure
- Usage
- Logic Overview
- Paradata logs
- Acknowledgements
## Features

- **Dedicated XML Processing**: narrowly defined and optimised exclusively for ALTO XML and AMCR metadata to ensure universal, safe, and easy usage.
- **ALTO Translation Mode**: translates only the `CONTENT` attributes natively. Tied to a simple flag (`--alto`) so users don't need to provide complex configurations.
- **AMCR Metadata Mode**: translates specific elements based on a provided list of XPaths (e.g., `amcr-fields.txt`), safely puts them back into the XML, and features deep recursive namespace extraction to handle OAI-PMH envelopes.
- **XSD Validation**: optionally validates AMCR outputs against an XSD schema (e.g., https://api.aiscr.cz/schema/amcr/2.2/amcr.xsd) to guarantee structural integrity.
- **Supplementary CSV Logging**: automatically produces a supplementary QA CSV file with columns `file`, `page_num`, `line_num`, `text_<source_lang>`, `text_<target_lang>` for easy manual checking of translations.
- **Language Detection with Intelligent Fallback**: automatically identifies the source language using FastText (Facebook) 2. If the detection confidence is below 0.2, it defaults to Czech (`cs`) so the pipeline continues seamlessly.
- **Tag-and-Protect Vocabulary Overriding**: when a vocabulary CSV is supplied, domain-specific terms are protected before translation using unique placeholder tags. Single-word terms are matched by lemma via the LINDAT UDPipe API 3; multi-word phrases use case-insensitive substring matching (longest match first). Vocabulary translations are then restored after the NMT call, ensuring controlled terminology is never garbled.
- **Automated Vocabulary Harvesting**: the bundled `load_vocab.py` script downloads Czech–English term pairs from both the AMCR OAI-PMH API and the TEATER GraphQL API and merges them into a single ready-to-use CSV.
- **LINDAT API Integration**: seamlessly connects to the LINDAT Translation API (v2) 1. Uses smart, space-aware chunking (max 4,000 characters) to protect word boundaries and prevent API truncation errors.
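The README does not show the chunking code itself; under the stated constraints (4,000-character limit, cuts only at word boundaries) a minimal sketch might look like this. `chunk_text` is a hypothetical name, not the wrapper's actual function:

```python
def chunk_text(text: str, max_len: int = 4000) -> list[str]:
    """Split text into chunks of at most max_len characters,
    cutting at the last space before the limit so words stay whole."""
    chunks = []
    while len(text) > max_len:
        # Find the last space inside the window; fall back to a hard cut
        # only if the window contains no space at all.
        cut = text.rfind(" ", 0, max_len)
        if cut <= 0:
            cut = max_len
        chunks.append(text[:cut])
        text = text[cut:].lstrip(" ")
    if text:
        chunks.append(text)
    return chunks
```

Each chunk can then be sent to the API independently and the results concatenated, since no word is ever split across a boundary.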
## Prerequisites

- Clone the project files:

  ```shell
  git clone https://github.com/ufal/atrium-translator.git
  ```

- Create a virtual environment and activate it (optional but recommended):

  ```shell
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install the required Python packages:

  ```shell
  cd atrium-translator
  pip install -r requirements.txt
  ```

Note on `fasttext`: the upstream package requires a C++ compiler at build time. If your environment lacks build tools, install the pre-built wheel instead:

```shell
pip install fasttext-wheel
```
## Project Structure

```
atrium-translator/
├── main.py                # Entry point – CLI routing for ALTO vs. AMCR processing
├── load_vocab.py          # Vocabulary harvester (AMCR OAI-PMH + TEATER GraphQL → CSV)
├── atrium_paradata.py     # Unified provenance/paradata logger
├── requirements.txt       # Python dependencies
├── config.txt             # Configuration parameters
├── amcr-fields.txt        # List of AMCR XPath targets for XML translation
├── amcr-inputs.txt        # List of AMCR metadata input files (XML) to be processed
├── processors/
│   ├── __init__.py        # Package marker
│   ├── identifier.py      # FastText language identification (ISO 639-3 to 639-1 mapping)
│   ├── lemmatizer.py      # UDPipe-based lemmatizer for vocabulary term matching
│   └── translator.py      # LINDAT API client with Tag-and-Protect vocabulary support
├── data_samples/
│   ├── vocabulary.csv     # Czech–English domain vocabulary (AMCR/TEATER thesaurus terms)
│   ├── my_documents/      # Sample input files (ALTO XML and downloaded AMCR metadata XMLs)
│   │   ├── MTX201501307.alto.xml  # Sample ALTO XML file for testing
│   │   └── ...
│   └── translated_files/  # Output directory for translated XML files and their logs
│       ├── MTX201501307_en.alto.xml  # Translated ALTO XML output file
│       ├── MTX201501307_log.csv      # Supplementary CSV log for the translated ALTO XML file
│       └── ...
├── paradata/
│   ├── <date>-<time>_translator.json  # Aggregated log of all translations for analysis
│   └── ...
└── utils.py               # ALTO & AMCR parsing, CSV logging, XSD validation, and XML tree reconstruction
```
## Usage

Run the wrapper from the command line. The default target language is English (`en`).

### ALTO XML mode

Use the `--alto` flag together with `--formats alto.xml` (or set `formats = alto.xml` in
`config.txt`). This processes ALTO files by strictly targeting their String `CONTENT` attributes.

```shell
python main.py ./data_samples/my_documents --alto --formats alto.xml --target_lang en
```

Example of ALTO XML processing:

- Input: `MTX201501307.alto.xml`
- Output: `MTX201501307_en.alto.xml`
The translation is performed per `TextBlock`, and the translated words are redistributed back into the
individual `CONTENT` attributes of each `String` element within a `TextLine`.
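The join-translate-redistribute cycle can be sketched roughly as follows. This is an illustration only: `translate_alto_line` and the surplus-word handling are assumptions, not the wrapper's actual code.

```python
import xml.etree.ElementTree as ET

def local(tag):
    """Strip any namespace prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

def translate_alto_line(textline, translate):
    """Join the CONTENT attributes of one TextLine, translate the whole
    line for context, then redistribute the words back into the String
    elements (surplus words are appended to the last String)."""
    strings = [el for el in textline.iter() if local(el.tag) == "String"]
    source = " ".join(s.get("CONTENT", "") for s in strings)
    words = translate(source).split()
    for i, s in enumerate(strings):
        if i < len(strings) - 1:
            s.set("CONTENT", words[i] if i < len(words) else "")
        else:
            s.set("CONTENT", " ".join(words[i:]))
```

Translating the whole line rather than word by word gives the NMT model sentence context, which is why the redistribution step is needed at all.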
### AMCR metadata mode

Process AMCR records by passing your list of XPaths and optionally providing an XSD URL for validation:

```shell
python main.py amcr-inputs.txt --xpaths amcr-fields.txt --xsd https://api.aiscr.cz/schema/amcr/2.2/amcr.xsd --target_lang en
```

or, without validation:

```shell
python main.py amcr-inputs.txt --xpaths amcr-fields.txt --target_lang en
```

Example input files are downloaded into `my_documents`; their filenames start with `C-`, following the `amcr-inputs.txt` list of input files. Example output files are saved in `translated_files` and include `.csv` log files (containing only processed lines) alongside `.xml` files translated to the target language.
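The namespace-agnostic matching described above (needed because OAI-PMH envelopes bind varying prefixes) can be illustrated with a small local-name path matcher. Both `find_by_local_path` and the `nazev` field are hypothetical, used here only to show the idea:

```python
import xml.etree.ElementTree as ET

def find_by_local_path(root, path):
    """Match elements by a '/'-separated path of local names,
    ignoring namespaces entirely."""
    parts = path.strip("/").split("/")
    matches = [root] if root.tag.rsplit("}", 1)[-1] == parts[0] else []
    for part in parts[1:]:
        matches = [child for el in matches for child in el
                   if child.tag.rsplit("}", 1)[-1] == part]
    return matches
```

Once the target elements are found, their `.text` can be translated in place and the tree serialised back with all original namespaces intact.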
### Vocabulary overriding

Provide a two-column CSV (`source_lemma,target_translation`) to activate the Tag-and-Protect strategy.
When enabled, domain-specific terms are shielded from the NMT model and replaced with guaranteed vocabulary translations instead.

```shell
python main.py amcr-inputs.txt --xpaths amcr-fields.txt --vocabulary data_samples/vocabulary.csv --target_lang en
```

Or set the path in `config.txt`:

```
vocabulary = data_samples/vocabulary.csv
```

How it works:

- Multi-word phrase pass: phrases containing spaces (e.g. *fotografie události*) are matched case-insensitively, longest match first, and replaced with `__TERM_N__` placeholder tags.
- Single-word lemma pass: the remaining text is lemmatised via the LINDAT UDPipe API 3. Tokens whose base form appears in the vocabulary are similarly tagged.
- Translation: the tagged text (with unknown-looking placeholders) is sent to the LINDAT Translation API. NMT models leave unrecognised tokens untouched.
- Restoration: all `__TERM_N__` tags in the translated output are replaced with the corresponding vocabulary translations.

If no vocabulary file is provided, the translator behaves exactly as before (no UDPipe calls are made).
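A stripped-down illustration of the Tag-and-Protect round trip follows. Note one deliberate simplification: this sketch uses substring matching for single words as well, whereas the real tool matches single words by UDPipe lemma.

```python
import re

def protect_terms(text, vocab):
    """Replace vocabulary terms with __TERM_N__ placeholders (longest
    match first) and return the tagged text plus the mapping needed to
    restore the guaranteed translations afterwards."""
    mapping = {}
    for n, term in enumerate(sorted(vocab, key=len, reverse=True)):
        tag = f"__TERM_{n}__"
        text, count = re.subn(re.escape(term), tag, text, flags=re.IGNORECASE)
        if count:
            mapping[tag] = vocab[term]
    return text, mapping

def restore_terms(translated, mapping):
    """Swap placeholders back for the controlled vocabulary translations."""
    for tag, target in mapping.items():
        translated = translated.replace(tag, target)
    return translated
```

Because the `__TERM_N__` tokens look like unknown words, the NMT model copies them through verbatim, and the restoration pass can substitute the controlled translations afterwards.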
### Vocabulary harvesting

The `load_vocab.py` script downloads term pairs automatically from two sources and merges them into a single CSV:

| Source | Endpoint | Method |
|---|---|---|
| AMCR | https://api.aiscr.cz/2.2/oai?set=heslo | OAI-PMH `ListRecords` with resumption-token paging |
| TEATER | https://teater.aiscr.cz/api/graphql | GraphQL introspection: `exportAll` or search-based fallback |

```shell
# Full harvest (both sources):
python load_vocab.py

# Skip one source:
python load_vocab.py --skip-teater
python load_vocab.py --skip-amcr

# Custom output path and request delay:
python load_vocab.py --out my_vocab.csv --delay 0.5
```

The merged vocabulary is written to `data_samples/vocabulary.csv` by default (AMCR entries take
precedence over TEATER on key collision).
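The merge-with-precedence behaviour can be expressed in a few lines. `merge_vocabularies` is a hypothetical helper for illustration, not the actual `load_vocab.py` code:

```python
import csv

def merge_vocabularies(amcr_rows, teater_rows, out_path):
    """Merge two lists of (source, target) pairs into one CSV,
    letting AMCR entries win when the same source term appears twice."""
    merged = dict(teater_rows)      # TEATER first ...
    merged.update(dict(amcr_rows))  # ... so AMCR overwrites on collision
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(sorted(merged.items()))
    return merged
```

Inserting the lower-priority source first and letting `dict.update` overwrite is the standard idiom for "last writer wins" precedence.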
### Configuration file

Instead of passing all arguments via the command line, you can use a configuration
file `config.txt` to define default paths and parameters. CLI arguments always take
precedence over config file values: the config file supplies defaults only for
arguments that are not explicitly passed on the command line.

Example `config.txt`:

```ini
[DEFAULT]
input_path = ./data_samples/my_documents
source_lang = auto
target_lang = en
formats = xml,txt
fields = amcr-fields.txt
output = ./data_samples/translated_files
# Optional: path to a vocabulary CSV file (source_lemma,target_translation).
# Leave blank or comment out to disable.
vocabulary = data_samples/vocabulary.csv
```

CLI arguments:

- `input_path`: path to a single source file, a directory containing XML files, or a `.txt` file listing URLs.
- `--output`, `-o`: output file path (for single-file mode) or output directory (for batch mode).
- `--source_lang`, `-src`: source language code (e.g., `cs`, `fr`). Use `auto` to auto-detect. Default: `cs`.
- `--target_lang`, `-tgt`: target language code (e.g., `en`, `cs`). Default: `en`.
- `--formats`: comma-separated list of file extensions to process (e.g., `alto.xml,txt` or `xml,txt`). Default: `xml`.
- `--config`, `-c`: path to the configuration file (default: `config.txt`).
- `--alto`: flag to enable ALTO XML in-place translation mode.
- `--xpaths`: path to a `.txt` file containing XPaths for AMCR metadata translation.
- `--xsd`: optional URL or local path to an XSD file for AMCR output validation.
- `--vocabulary`: path to a CSV vocabulary file (`source_lemma,target_translation`) to activate Tag-and-Protect term overriding.
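The described precedence (CLI over config defaults) is commonly implemented with `configparser` plus `argparse.set_defaults`; the following is a sketch of that pattern, not the wrapper's actual parsing code:

```python
import argparse
import configparser

def parse_args(argv, config_text):
    """Read defaults from a config.txt [DEFAULT] section, then let any
    explicitly passed CLI flags override them."""
    cfg = configparser.ConfigParser()
    cfg.read_string(config_text)
    parser = argparse.ArgumentParser()
    parser.add_argument("--target_lang", "-tgt")
    parser.add_argument("--source_lang", "-src")
    # Config values become defaults; explicit CLI flags still win.
    parser.set_defaults(**dict(cfg["DEFAULT"]))
    return parser.parse_args(argv)
```

With this arrangement an empty command line yields the config values, while any flag given explicitly replaces only that one default.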
## Logic Overview

- Routing: the script determines whether it is running in ALTO mode (`--alto`) or AMCR mode (`--xpaths`).
- Extraction & Translation:
  - ALTO: iterates through `Page` → `TextLine` → `String`. Extracts the `CONTENT` attribute, reconstructs the entire line for contextual API translation, and redistributes the translated words back into the `CONTENT` attributes.
  - AMCR: uses deep recursive namespace extraction (vital for OAI-PMH API envelopes). Finds elements matching the provided XPaths, translates their text content, and replaces it in the tree.
- Language Identification: the text is analysed by FastText 2 to determine the source language. If the confidence score is below 0.2, the system automatically defaults to Czech (`cs`).
- Vocabulary Overriding (optional): when a vocabulary CSV is loaded, the Tag-and-Protect strategy is applied before the NMT call. Multi-word phrases are matched first (longest-first substring), then single-word terms are matched via UDPipe lemmatisation 3. Matched terms are replaced with `__TERM_N__` placeholders, translated safely through the API, and then restored with the controlled vocabulary translations.
- Translation: text (with any protected placeholders) is passed to the LINDAT Translation API 1. Texts longer than 4,000 characters are safely chunked at the nearest space to prevent mid-word cuts.
- Output: generates the translated `.xml` file preserving all original tags/namespaces, alongside a supplementary `_log.csv` file containing the line-by-line translation data for manual QA review. Optionally validates AMCR output against an XSD schema.
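The confidence-fallback rule from the language-identification step can be isolated into a tiny function. Here `predict` stands in for a FastText `model.predict` call (which returns `__label__`-prefixed labels with probabilities), and the function itself is illustrative:

```python
def detect_language(text, predict, threshold=0.2, fallback="cs"):
    """Return the detected language code, or the fallback ('cs') when
    the classifier's confidence is below the threshold."""
    label, confidence = predict(text)
    code = label.replace("__label__", "")  # FastText label convention
    return code if confidence >= threshold else fallback
```

Defaulting to Czech rather than aborting keeps batch runs moving even when a record contains too little text for reliable detection.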
## Paradata logs

The wrapper generates a supplementary CSV log file for each processed XML file, named with the
pattern `<original_filename>_log.csv`. This log contains the following columns:

| Column | ALTO value | AMCR value |
|---|---|---|
| `file` | source filename (stem) | source filename (stem) |
| `page_num` | page index (1-based) | (empty) |
| `line_num` | `TextLine` element ID | full XPath expression |
| `text_<source_lang>` | original `CONTENT` text | original element text |
| `text_<target_lang>` | translated text | translated text |

The column names for the source and target text are dynamic: they reflect the actual language
codes in use (e.g., `text_auto` / `text_en` when running with `--source_lang auto --target_lang en`).
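Building that dynamic header is straightforward; the following `write_qa_log` helper is illustrative only, not the wrapper's actual function:

```python
import csv
import io

def write_qa_log(rows, source_lang, target_lang):
    """Render the supplementary QA log with language-specific column names."""
    header = ["file", "page_num", "line_num",
              f"text_{source_lang}", f"text_{target_lang}"]
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()
```

Deriving the column names from the run's language codes means the same QA tooling works regardless of which language pair was requested.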
Moreover, the `paradata` directory contains aggregated JSON logs of all processed files, allowing run-level metadata (timing, counts, skipped files) to be queried for analysis and reporting.
For support write to lutsai.k@gmail.com, responsible for this GitHub repository 4.

## Acknowledgements

- Developed by UFAL 5
- Funded by ATRIUM 6
- Shared by ATRIUM 6 & UFAL 5
- Translation API: LINDAT/CLARIAH-CZ Translation Service 1
- Lemmatisation API: LINDAT/CLARIAH-CZ UDPipe Service 3
- Language Identification: Facebook FastText 2

© 2026 UFAL & ATRIUM