This project is meant to run in a Docker container. The attached Dockerfile builds the image, and the compose file assumes the image is called mlworker. The most important configuration is the environment variables: REDIS_URL and MODEL_PATH are required, while MODEL_IMAGE_SIZE is optional and defaults to 512 when not provided.
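As an illustration of the required/optional split, here is a minimal sketch of how a worker could read these variables. The function name `load_settings` and the returned dictionary keys are hypothetical and not taken from this project; only the variable names and the default of 512 come from the text above.

```python
import os

DEFAULT_IMAGE_SIZE = 512  # default stated in this README


def load_settings(env=None):
    """Read worker settings, failing fast when a required variable is missing.

    Hypothetical helper: the real worker may read its configuration differently.
    """
    env = os.environ if env is None else env
    # REDIS_URL and MODEL_PATH are required; treat empty strings as missing.
    missing = [name for name in ("REDIS_URL", "MODEL_PATH") if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    return {
        "redis_url": env["REDIS_URL"],
        "model_path": env["MODEL_PATH"],
        # MODEL_IMAGE_SIZE is optional and falls back to the default.
        "image_size": int(env.get("MODEL_IMAGE_SIZE") or DEFAULT_IMAGE_SIZE),
    }
```

Calling `load_settings()` with no argument reads from the process environment; passing a plain dictionary makes the function easy to exercise in tests.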
Example setup on a local machine would be:
docker build --no-cache -t mlworker .
docker compose up
- The job needs to be placed on the sortFileQueue in Redis
- The input needs to be JSON:
{ "id": "pdf_file_id", "url": "url_for_pdf_file" }
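A minimal sketch of submitting a job in the input shape above. It assumes the queue is a plain Redis list fed with RPUSH, which this README does not confirm; the worker may instead rely on a queue library with its own message format. The helper names `build_job` and `enqueue_job` are hypothetical.

```python
import json

INPUT_QUEUE = "sortFileQueue"  # queue name from this README


def build_job(file_id, url):
    """Serialize a job as JSON with the two fields the worker expects."""
    return json.dumps({"id": file_id, "url": url})


def enqueue_job(client, file_id, url):
    """Push a job onto the input queue and return the serialized payload.

    Assumption: `client` exposes an `rpush(name, *values)` method, as the
    redis-py client does, and the queue is a plain Redis list.
    """
    payload = build_job(file_id, url)
    client.rpush(INPUT_QUEUE, payload)
    return payload
```

With the redis-py package installed, a client could be created with `redis.Redis.from_url(os.environ["REDIS_URL"])` and passed in as `client`.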
- The worker will place the job on the processorQueue in Redis
- The result is JSON:
{ "id": "pdf_file_id", "allSorted": boolean, "partialSort": boolean, "type": 4, "documents": [ { "className": "Some Bank", "pages": [1, 2, 3], "date": "2019-01-31", "amount": null } ] }
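On the consuming side, a result message in the shape above could be read along these lines. This is only a sketch: the function name `summarize_result` and the `pageCount` key are hypothetical, and the meanings of `type`, `allSorted`, and `partialSort` are not interpreted here because this README does not define them.

```python
import json


def summarize_result(raw):
    """Parse a result message and return (file id, per-document summaries)."""
    result = json.loads(raw)
    summaries = []
    for doc in result.get("documents", []):
        summaries.append({
            "className": doc["className"],
            "pageCount": len(doc["pages"]),
            "date": doc.get("date"),
            "amount": doc.get("amount"),  # JSON null is parsed as None
        })
    return result["id"], summaries
```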
- Download pdf from url
- Convert pdf to an array of images
- Predict what class each image is
- Attempt to get the page number for each image based on what class was predicted
- Create document(s)
- For now, only PDFs that contain a single document are sorted
- For now, only documents with incrementing page numbers are sorted
- Then the date is extracted based on the class
- If the class requires an amount, it also attempts to extract it
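The two "for now" restrictions in the steps above can be sketched as a single check. This is not the project's actual code: the function name and the input shape (one `(class_name, page_number)` tuple per image) are hypothetical. It also assumes that "one document" means every page was predicted as the same class and that "incrementing" means each page number is exactly one more than the previous one.

```python
def build_single_document(predictions):
    """Return a document dict if the pages form one sortable document, else None.

    `predictions` holds one (class_name, page_number) tuple per image, in the
    order the pages appear in the PDF. page_number is None when no page
    number could be read for that image.
    """
    if not predictions:
        return None
    classes = {cls for cls, _ in predictions}
    if len(classes) != 1:
        return None  # assumed to mean more than one document in the PDF
    pages = [page for _, page in predictions]
    if any(page is None for page in pages):
        return None  # a page number could not be determined
    # Assumed meaning of "incrementing": strictly consecutive page numbers.
    if any(b - a != 1 for a, b in zip(pages, pages[1:])):
        return None
    return {"className": classes.pop(), "pages": pages}
```

Date and amount extraction would then run on the returned document, depending on its class.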