SW7-AAU-Concierge-Project/DataCollection


Python version

Use Python 3.12 if possible.

DataCollection

This repository contains code for getting information on data sources, scraping data, and formatting it. sitemap_to_json.py fetches sitemaps from URLs for the AAU public data and writes JSON files with all the URLs and their last modified dates.
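As a rough illustration of the sitemap step, the sketch below parses a sitemap XML string and collects each URL with its last modified date. It is not the code in sitemap_to_json.py: the function name `sitemap_to_entries`, the output keys `url` and `lastmod`, and the example.com sample are all assumptions, and fetching over the network is left out so the example runs offline.

```python
import json
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_to_entries(xml_text: str) -> list[dict]:
    """Return one {"url", "lastmod"} dict per <url> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    entries = []
    for url_node in root.findall("sm:url", SITEMAP_NS):
        loc = url_node.findtext("sm:loc", default="", namespaces=SITEMAP_NS)
        lastmod = url_node.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        entries.append({
            "url": loc.strip(),
            # <lastmod> is optional in the protocol, so it may be missing.
            "lastmod": lastmod.strip() if lastmod else None,
        })
    return entries


if __name__ == "__main__":
    # Made-up sample data, not an actual AAU sitemap.
    sample = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.com/a</loc><lastmod>2024-01-01</lastmod></url>"
        "<url><loc>https://example.com/b</loc></url>"
        "</urlset>"
    )
    print(json.dumps(sitemap_to_entries(sample), indent=2))
```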

Get started

Start by creating a virtual environment in the folder. In VS Code, press Ctrl + Shift + P, search for Python: Create Environment, and select .venv

- On Linux, create a folder, e.g. my_venv/, and run ```python -m venv /path/to/my_venv/```, then run ```source my_venv/bin/activate``` to activate the virtual environment in your terminal (```deactivate``` exits the virtual environment)

Install the necessary libraries from requirements.txt - pip install -r requirements.txt

To start scraping some of this public data, open web_scraper.py and configure the URLs to iterate over. For each URL, a JSON file is generated with the scraped metadata and chunked content.
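To give an idea of what a per-URL JSON with metadata and chunked content could look like, here is a minimal sketch. The record layout, the word-based chunking, and the names `chunk_text` and `build_record` are assumptions for illustration; web_scraper.py may chunk and structure its output differently.

```python
import json


def chunk_text(text: str, max_words: int = 200) -> list[str]:
    """Split text into chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


def build_record(url: str, title: str, text: str, max_words: int = 200) -> dict:
    """Combine metadata and chunked content into one JSON-serialisable dict."""
    return {
        "url": url,
        "metadata": {"title": title},  # hypothetical metadata field
        "chunks": chunk_text(text, max_words),
    }


if __name__ == "__main__":
    record = build_record(
        url="https://example.com/page",  # placeholder URL
        title="Example page",
        text="one two three four five",
        max_words=2,
    )
    print(json.dumps(record, indent=2))
```

Chunking on word counts keeps the sketch dependency-free; a real pipeline might instead split on sentences or tokens.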

Useful scripts:

Clean the aau_data folder

Powershell: del ./aau_data\*

*sh: rm -r ./aau_data/
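If a shell-independent option is preferred, the same cleanup can be sketched in Python. This is only a suggestion, not a script in the repository; `clean_folder` is a hypothetical helper. It empties the folder but keeps the folder itself, which matches the PowerShell command rather than the rm -r variant.

```python
import shutil
from pathlib import Path


def clean_folder(folder: Path) -> int:
    """Delete everything inside folder, keep the folder, return the number of entries removed."""
    if not folder.is_dir():
        return 0
    removed = 0
    for entry in folder.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove sub-folders recursively
        else:
            entry.unlink()  # remove files and symlinks
        removed += 1
    return removed


if __name__ == "__main__":
    count = clean_folder(Path("./aau_data"))
    print(f"Removed {count} entries from aau_data")
```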

Activate virtual environment

Windows: ./venv/scripts/activate

Unix systems: source /path/to/venv/bin/activate

Note:

If you get an error like "you're not allowed to run scripts on your system", run Set-ExecutionPolicy RemoteSigned in a terminal with administrator privileges.

Update dependencies

If you install any new libraries, remember to update requirements.txt - pip freeze > requirements.txt

Environment variables

The following variables need to be defined in an env file located in the repository root:

  • MONGODB_URI: the connection string to the mongo database.
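One way a script could pick up this variable is sketched below, using only the standard library. The file name `.env`, the KEY=VALUE format, and the helpers `load_env_file` and `get_mongodb_uri` are assumptions; the repository's scripts may load the file differently.

```python
import os
from pathlib import Path


def load_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    if not path.is_file():
        return values
    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, since connection strings can contain "=".
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_mongodb_uri(env_path: Path = Path(".env")) -> str:
    """Prefer the process environment, then fall back to the env file."""
    uri = os.environ.get("MONGODB_URI") or load_env_file(env_path).get("MONGODB_URI")
    if not uri:
        raise RuntimeError("MONGODB_URI is not set")
    return uri
```

Failing early with a clear error when the variable is missing tends to be easier to debug than a connection error later on.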

About

Repo for resources and scripts that help collect data for the project: web scraping, (API) retrieval, etc.
