Run Python 3.12 if possible.
This repository contains code for getting information on data sources, scraping data, and formatting it.
sitemap_to_json.py fetches sitemaps from URLs for the AAU public data and writes JSON files listing every URL with its last-modified date.
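As a rough illustration of the sitemap-to-JSON step, here is a minimal sketch using only the standard library. It is not the code in this repository: the function name `sitemap_to_records`, the output keys `url` and `lastmod`, and the example.org URLs are assumptions made for the example. It parses sitemap XML that has already been downloaded, so no network access is needed to try it.

```python
import json
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_to_records(xml_text):
    """Return a list of {"url", "lastmod"} dicts from sitemap XML text.

    Hypothetical helper: the key names are chosen for this example only.
    """
    root = ET.fromstring(xml_text)
    records = []
    for url_node in root.findall("sm:url", SITEMAP_NS):
        loc = url_node.findtext("sm:loc", default="", namespaces=SITEMAP_NS)
        lastmod = url_node.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        records.append(
            {
                "url": loc.strip(),
                # <lastmod> is optional in the protocol, so it may be missing.
                "lastmod": lastmod.strip() if lastmod else None,
            }
        )
    return records


# Made-up sitemap used only to demonstrate the function.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/a</loc><lastmod>2024-01-01</lastmod></url>
  <url><loc>https://example.org/b</loc></url>
</urlset>"""

if __name__ == "__main__":
    print(json.dumps(sitemap_to_records(SAMPLE), indent=2))
```

Entries without a `<lastmod>` element get `null` in the JSON output instead of being dropped, so the list of URLs stays complete.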
Start by creating a virtual environment in the folder:
- In VS Code, press Ctrl + Shift + P, search for Python: Create Environment, and select .venv
- On Linux, create a folder, e.g. my_venv/, and run ```python -m venv /path/to/my_venv/```, then run ```source my_venv/bin/activate``` to activate the virtual environment in your terminal (```deactivate``` exits the virtual environment)
Install the necessary libraries from requirements.txt:
- pip install -r requirements.txt
To start scraping some of this public data, open web_scraper.py and configure the URLs to iterate over. The scraper then generates one JSON file per URL containing the scraped metadata and the chunked content.
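To give an idea of what "chunked content" can look like, here is a minimal sketch of character-based chunking with overlap. It is an assumption about the general technique, not the logic in web_scraper.py: the names `chunk_text` and `build_record`, the default sizes, and the record keys are all hypothetical.

```python
import json


def chunk_text(text, max_chars=500, overlap=50):
    """Split text into chunks of at most max_chars characters.

    Consecutive chunks share `overlap` characters so that a sentence cut
    at a boundary still appears whole in one of the two chunks.
    """
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    step = max_chars - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


def build_record(url, title, text):
    """Bundle metadata and chunks into one JSON-serialisable dict (hypothetical layout)."""
    return {
        "url": url,
        "metadata": {"title": title, "length": len(text)},
        "chunks": chunk_text(text),
    }


if __name__ == "__main__":
    record = build_record("https://example.org/page", "Example", "lorem ipsum " * 100)
    print(json.dumps(record, indent=2)[:300])
```

Splitting on a fixed character count is the simplest option; splitting on paragraphs or sentences usually gives cleaner chunks but needs more code.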
To clear the aau_data folder:
- PowerShell: del ./aau_data\*
- *sh: rm -r ./aau_data/
To activate the virtual environment:
- PowerShell: ./venv/scripts/activate
- Unix systems: source /path/to/venv/bin/activate
If you get an error like "you're not allowed to run scripts on your system", run Set-ExecutionPolicy RemoteSigned in a terminal with administrator privileges.
If you install any new libraries, remember to update requirements.txt: pip freeze > requirements.txt
The following variables need to be defined in an env file located in the repository root:
- MONGODB_URI: the connection string for the MongoDB database.
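For reference, here is a minimal sketch of how such a variable could be read using only the standard library. It assumes the env file is named `.env` and contains plain `KEY=VALUE` lines. Both are assumptions, as are the helper names `load_env_file` and `get_mongodb_uri`; the repository itself may load the file differently.

```python
import os


def load_env_file(path):
    """Parse KEY=VALUE lines from an env file into a dict.

    Blank lines, lines starting with '#', and lines without '=' are skipped.
    """
    values = {}
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Drop optional surrounding quotes around the value.
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_mongodb_uri(env_path=".env"):
    """Return MONGODB_URI from the process environment, else from the env file."""
    uri = os.environ.get("MONGODB_URI")
    if uri is None and os.path.exists(env_path):
        uri = load_env_file(env_path).get("MONGODB_URI")
    if not uri:
        raise RuntimeError("MONGODB_URI is not set")
    return uri
```

A variable already set in the shell takes precedence over the file, which makes it easy to override the connection string for a single run.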