Python version

Use Python 3.12 if possible.
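A quick way to check the interpreter before running anything is sketched below. The helper name `meets_recommended` is hypothetical and not part of this repository; it only compares the running version against the 3.12 recommendation above.

```python
import sys

# The README recommends 3.12; treat it as the preferred minimum (an assumption).
RECOMMENDED = (3, 12)


def meets_recommended(version=None, minimum=RECOMMENDED):
    """Return True if `version` (default: the running interpreter) is at least `minimum`."""
    if version is None:
        version = sys.version_info
    # Compare only (major, minor) so patch releases do not matter.
    return tuple(version[:2]) >= tuple(minimum)


if not meets_recommended():
    print(f"Warning: Python {sys.version_info.major}.{sys.version_info.minor} detected; 3.12 is recommended.")
```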

DataCollection

This repository contains code for gathering information on data sources, scraping data, and formatting it. sitemap_to_json.py fetches sitemaps from URLs for the AAU public data and writes JSON files containing every URL and its last-modified date.
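As a rough illustration of the sitemap step, the sketch below turns sitemap XML into a list of URL and last-modified pairs using only the standard library. It is not the repository's actual script: the function names `parse_sitemap` and `fetch_sitemap`, the output keys, and the example URLs are assumptions, and it only handles plain `<urlset>` sitemaps that follow the sitemaps.org schema.

```python
import json
import xml.etree.ElementTree as ET
from urllib.request import urlopen

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return a list of {"url", "lastmod"} dicts from sitemap XML (str or bytes)."""
    root = ET.fromstring(xml_text)
    entries = []
    for url_node in root.findall("sm:url", SITEMAP_NS):
        loc = url_node.findtext("sm:loc", default="", namespaces=SITEMAP_NS)
        lastmod = url_node.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        entries.append({
            "url": loc.strip(),
            # <lastmod> is optional in the protocol, so it may be missing.
            "lastmod": lastmod.strip() if lastmod else None,
        })
    return entries


def fetch_sitemap(sitemap_url):
    """Download a sitemap and parse it (hypothetical helper, needs network access)."""
    with urlopen(sitemap_url, timeout=30) as response:
        return parse_sitemap(response.read())


# Example with inline XML so the sketch runs without network access.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/a</loc><lastmod>2024-01-15</lastmod></url>
  <url><loc>https://example.org/b</loc></url>
</urlset>"""
print(json.dumps(parse_sitemap(SAMPLE), indent=2))
```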

Get started

Start by creating a virtual environment in the repository folder. In VS Code, press Ctrl + Shift + P, search for Python: Create Environment, and select .venv

- On Linux, create a folder, e.g. my_venv/, and run ```python -m venv /path/to/my_venv/```, then run ```source my_venv/bin/activate``` to activate the virtual environment in your terminal (```deactivate``` exits the virtual environment)

Install the necessary libraries from requirements.txt - pip install -r requirements.txt

To start scraping some of this public data, open web_scraper.py and configure the URLs to iterate over. The scraper then generates one JSON file per URL, containing the scraped metadata and the chunked content.
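To make the shape of such a per-URL JSON record concrete, here is a minimal sketch built on the standard library. It does not reproduce web_scraper.py: the names `chunk_text` and `build_record`, the record keys, and the character-based chunking with overlap are all assumptions chosen for illustration.

```python
import json
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the <title> and visible text, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.parts = []
        self._skip_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()
        elif not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into chunks of at most chunk_size characters, each sharing `overlap` characters with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks, start, step = [], 0, chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


def build_record(url, html, chunk_size=500, overlap=50):
    """Build one JSON-serialisable record for a scraped page (hypothetical layout)."""
    parser = _TextExtractor()
    parser.feed(html)
    text = " ".join(parser.parts)
    return {
        "url": url,
        "metadata": {"title": parser.title},
        "chunks": chunk_text(text, chunk_size, overlap),
    }


page = "<html><head><title>Demo</title></head><body><p>Hello</p><p>world</p></body></html>"
print(json.dumps(build_record("https://example.org/demo", page), indent=2))
```

Character-based chunking is only one option; the repository's splitter_config.py may define a different strategy.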

Useful scripts:

Clean the aau_data folder

PowerShell: del ./aau_data\*

*sh: rm -r ./aau_data/
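If a single cross-platform command is preferred, the same cleanup can be sketched in Python. The helper `clean_folder` is hypothetical and not part of the repository. Unlike `rm -r ./aau_data/`, it keeps the folder itself and only removes its contents, which matches the PowerShell variant.

```python
import shutil
from pathlib import Path


def clean_folder(folder):
    """Delete everything inside `folder` but keep the folder. Returns the number of removed entries."""
    root = Path(folder)
    if not root.is_dir():
        return 0  # nothing to clean
    removed = 0
    for entry in root.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove subfolders recursively
        else:
            entry.unlink()  # remove files and symlinks
        removed += 1
    return removed


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    print(f"Removed {clean_folder('./aau_data')} entries")
```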

Activate virtual environment

PowerShell: ./venv/scripts/activate

Unix systems: source /path/to/venv/bin/activate

Note:

If you get an error like "you're not allowed to run scripts on your system", run Set-ExecutionPolicy RemoteSigned in a terminal with administrator privileges.

Update dependencies

If you install any new libraries, remember to update requirements.txt - pip freeze > requirements.txt

Environment variables

The following variables need to be defined in an env file located in the repository root:

MONGODB_URI: the connection string to the mongo database.
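How the variable is read is not shown here, so the following is only a sketch using the standard library. The file name `.env`, the `KEY=VALUE` line format, and the helper names `load_env_file` and `get_mongodb_uri` are assumptions; the repository may well use a dedicated package for this instead.

```python
import os


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines into a dict, ignoring blank lines and # comments."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Strip surrounding whitespace and optional quotes around the value.
            values[key.strip()] = value.strip().strip("'\"")
    return values


def get_mongodb_uri(env_path=".env"):
    """Prefer the process environment, then fall back to the env file."""
    uri = os.environ.get("MONGODB_URI")
    if not uri and os.path.exists(env_path):
        uri = load_env_file(env_path).get("MONGODB_URI")
    if not uri:
        raise RuntimeError("MONGODB_URI is not defined")
    return uri
```

Failing early with a clear error when the variable is missing tends to be easier to debug than a connection failure later on.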
