Repository containing the source code to reproduce the ABT and LDA experiments reported in the paper "Semantic Representations based on Neural Topic Models" and the PhD thesis "Semantic Representations based on Language Models":
@article{pantoja2026abt,
title = {Semantic Representations based on Neural Topic Models},
author = {Pantoja, F. and Santanchè, A. and Medeiros, C.},
journal = {Journal of Universal Computer Science},
year = {2026}
}
@phdthesis{pantoja2025thesis,
title = {Semantic Representations based on Language Models},
author = {Pantoja, F.},
school = {Universidade Estadual de Campinas},
year = {2025}
}
This project provides a Jupyter environment packaged as a Docker container ready to run the methods. Jupyter enables editing/running the notebooks directly in the browser.
Docker containers provide an isolated environment with the minimal configuration required to run the code. See the Docker and Docker Compose documentation for installation instructions.
Open the Linux terminal using Ctrl + Alt + T.
Navigate to the folder you downloaded/extracted and build the Docker image using the following commands:
cd labPLN
docker build . -t labpln
Then, start up the container:
docker compose up
Once the process is complete, you can access the notebook at:
http://127.0.0.1:8888/lab?token=[TOKEN]
Replace the [TOKEN] placeholder with the actual token displayed in the terminal.
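If you prefer not to copy the token by hand, the substitution can be sketched in Python. This is only an illustration: the helper name notebook_url and the example log line are hypothetical, and the pattern assumes the token appears in the terminal output as a token=... query parameter made of letters and digits.

```python
import re

# Assumption: the token is printed as a "token=<letters and digits>" query parameter.
TOKEN_PATTERN = re.compile(r"[?&]token=([0-9a-zA-Z]+)")


def notebook_url(log_line, host="127.0.0.1", port=8888):
    """Return the JupyterLab URL for the token found in log_line, or None."""
    match = TOKEN_PATTERN.search(log_line)
    if match is None:
        return None
    return f"http://{host}:{port}/lab?token={match.group(1)}"


# Made-up log line for illustration; real tokens are longer.
example_line = "    http://127.0.0.1:8888/lab?token=abc123def456"
print(notebook_url(example_line))
```

The function returns None when a line carries no token, so it can be applied to every line of the container output until one matches.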
In the Jupyter interface, navigate to the notebooks/topic_modeling directory to access the following notebooks:
- ABT: Code to reproduce the Attention-based Topics model.
- LDA: Code to reproduce the Latent Dirichlet Allocation model.
- ABT[minimal]: A reduced version of ABT (without figure plots), provided to check the computational costs.
The environment includes the MovieLens dataset. The input is the movies.csv file, whose title column holds short sentences.
Additional datasets can be added to the input folder.
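As a rough sketch of how the title column can be read outside the notebooks, the snippet below uses only the Python standard library. The function name read_titles and the sample rows are illustrative assumptions, not code from this repository; the notebooks may load the file differently.

```python
import csv
import io


def read_titles(csv_file, column="title"):
    """Return the non-empty values of `column` from an open CSV file."""
    reader = csv.DictReader(csv_file)
    titles = []
    for row in reader:
        # Short rows yield None for missing fields, so fall back to "".
        value = (row.get(column) or "").strip()
        if value:
            titles.append(value)
    return titles


# Illustrative rows in the movieId,title,genres layout of MovieLens movies.csv.
SAMPLE = (
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation\n"
    '2,"American President, The (1995)",Comedy|Drama\n'
)

print(read_titles(io.StringIO(SAMPLE)))
```

To read the real file, pass an open file handle instead of the in-memory sample, for example open(path, newline="", encoding="utf-8"), where path points to movies.csv in the input folder. A dataset added to the input folder can reuse the same function by passing a different column name.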
The CliCR dataset analyzed in the paper must be requested from the dataset authors (https://github.com/clips/clicr). The input file used in this project is train1.0.json, which includes a title field along with other attributes describing each clinical case.
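Since the exact nesting of train1.0.json is not described here, the following sketch makes no assumption about it: it walks the parsed JSON and collects every string stored under a title key. The function name collect_titles and the sample structure are invented for illustration and do not reflect the real CliCR layout.

```python
import json


def collect_titles(node, key="title"):
    """Recursively gather every string stored under `key` in parsed JSON."""
    titles = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key and isinstance(value, str):
                titles.append(value)
            else:
                titles.extend(collect_titles(value, key))
    elif isinstance(node, list):
        for item in node:
            titles.extend(collect_titles(item, key))
    return titles


# Made-up structure for illustration only, not the actual CliCR schema.
sample = json.loads(
    '{"data": [{"document": {"title": "Case A", "context": "..."}},'
    ' {"document": {"title": "Case B"}}]}'
)
print(collect_titles(sample))
```

For the real file, replace the sample with json.load applied to an open handle on train1.0.json once the dataset has been obtained from its authors.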