Phase 1

1. PDFiD indicators from the PDF:
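
As a rough illustration of what such indicators look like, the sketch below counts a few PDF name keywords in the raw bytes of a file. It is not PDFiD itself and does not reproduce its handling of obfuscated names; the keyword list and the function name `count_indicators` are assumptions made for this example.

```python
import re

# Short, assumed list of names that are often treated as triage indicators.
KEYWORDS = ["/JS", "/JavaScript", "/OpenAction", "/AA", "/Launch", "/EmbeddedFile"]


def count_indicators(pdf_bytes):
    """Count how often each keyword occurs as a whole PDF name in pdf_bytes."""
    counts = {}
    for name in KEYWORDS:
        # The lookahead stops "/AA" from matching inside a longer name like "/AAPL".
        pattern = re.compile(re.escape(name.encode("ascii")) + rb"(?![A-Za-z0-9])")
        counts[name] = len(pattern.findall(pdf_bytes))
    return counts


# Usage (file name taken from the commands in this README):
# with open("example.pdf", "rb") as fh:
#     print(count_indicators(fh.read()))
```

Names hidden inside compressed streams or written with hex escapes are not counted by this sketch.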

2. Use OCR to extract text from images (open source tools):

sudo apt-get install tesseract-ocr-all (installs all languages)
sudo apt-get install libtesseract-dev
pip install pytesseract
pip install pillow
For working with PDF files:
- sudo apt-get install imagemagick
- pip install wand
sudo apt-get install python-opencv (for cv2)
Run command: python image_to_string.py example.png OR python image_to_string.py example2.jpg eng+rus
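
For orientation, here is a minimal sketch of what an OCR script of this kind might do, assuming Python 3 with pytesseract and Pillow installed as above. It is not the repository's image_to_string.py; the helper names `parse_args` and `ocr_image` are hypothetical.

```python
def parse_args(argv):
    """Return (image_path, lang) from a command line like the ones above.

    The language defaults to "eng"; several languages are joined with "+".
    """
    if len(argv) < 2:
        raise SystemExit("usage: python ocr_sketch.py IMAGE [LANG]")
    image_path = argv[1]
    lang = argv[2] if len(argv) > 2 else "eng"
    return image_path, lang


def ocr_image(image_path, lang="eng"):
    """Run Tesseract on one image and return the recognized text."""
    # Imported lazily so the argument handling works without the OCR stack.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang)


# Usage:
# import sys
# path, lang = parse_args(sys.argv)
# print(ocr_image(path, lang))
```

A language string such as eng+rus only works if the matching Tesseract language data is installed.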

3. Save the thumbnail picture using PIL:

sudo apt-get install libmagickwand-dev
pip install wand
pip install pillow
pip install numpy (if the code throws an error when you run it: pip uninstall numpy)
If the code throws the error "Exception message: not authorized ..... @ error/constitute.c/Read.../...": on Ubuntu 18.04, in /etc/ImageMagick-6/policy.xml, near the end of the file, change the rights from none to read|write for these policies:
- domain="coder" rights="read|write" pattern="PDF"
- domain="coder" rights="read|write" pattern="XPS"
- domain="coder" rights="read|write" pattern="PS"
Run command: python thumbnailPDF_firstPage.py example.pdf
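
The sketch below shows one way a first-page thumbnail could be produced with Wand, assuming Python 3 and an ImageMagick policy that allows reading PDFs as described above. It is not the repository's script; the function names, the 200-pixel limit and the 72 dpi resolution are assumptions for illustration.

```python
def thumbnail_size(width, height, max_side=200):
    """Scale (width, height) to fit a max_side square, keeping aspect ratio."""
    scale = min(max_side / width, max_side / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


def save_first_page_thumbnail(pdf_path, out_path, max_side=200):
    """Render page 1 of pdf_path and save it as a small JPEG at out_path."""
    # Imported lazily; needs Wand plus ImageMagick with PDF read rights.
    from wand.color import Color
    from wand.image import Image

    # The "[0]" suffix asks ImageMagick to load the first page only.
    with Image(filename=pdf_path + "[0]", resolution=72) as page:
        # Flatten transparency onto white, since JPEG has no alpha channel.
        page.background_color = Color("white")
        page.alpha_channel = "remove"
        new_width, new_height = thumbnail_size(page.width, page.height, max_side)
        page.resize(new_width, new_height)
        page.format = "jpeg"
        page.save(filename=out_path)


# Usage:
# save_first_page_thumbnail("example.pdf", "thumbnail.jpg")
```

Images already smaller than the limit are left at their original size rather than enlarged.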

4. Extract text from a PDF file:

5. Extract URLs from a PDF file:
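
One simple approach, sketched here under the assumption that the text of the PDF has already been extracted to a string, is to scan that text with a regular expression. The pattern and the name `extract_urls` are illustrative only and are not taken from the repository's exampleURL.py.

```python
import re

# Assumed pattern: http(s) URLs running up to whitespace or a common delimiter.
URL_RE = re.compile(r"https?://[^\s<>\"'()\[\]]+", re.IGNORECASE)


def extract_urls(text):
    """Return the unique URLs found in text, in order of first appearance."""
    seen = []
    for match in URL_RE.findall(text):
        url = match.rstrip(".,;:!?")  # drop trailing sentence punctuation
        if url and url not in seen:
            seen.append(url)
    return seen
```

This only sees URLs that appear in the visible text; links stored solely as annotations in the PDF structure would need a PDF parser.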

6. Extract /Root/Lang from a PDF file:
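
As a rough heuristic, the /Lang entry can sometimes be read straight from the raw bytes of the file. The sketch below assumes the value is stored as an uncompressed literal string such as /Lang (en-US); it does not check that the entry belongs to the document catalog (/Root), and the name `find_lang` is hypothetical.

```python
import re

# Matches a literal-string entry such as "/Lang (en-US)" or "/Lang(ru)".
LANG_RE = re.compile(rb"/Lang\s*\(([^)]*)\)")


def find_lang(pdf_bytes):
    """Return the first /Lang value found in the raw PDF bytes, or None."""
    match = LANG_RE.search(pdf_bytes)
    if match is None:
        return None
    return match.group(1).decode("latin-1")


# Usage:
# with open("example.pdf", "rb") as fh:
#     print(find_lang(fh.read()))
```

If the catalog sits inside a compressed object stream, or the value is written as a hex string, this returns None and a real PDF parser is needed.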

7. Extract URLs from the JavaScript of a PDF file:

git clone https://github.com/jesparza/peepdf.git peepdf (downloads peepdf, used by the script to extract URLs from harmless, malicious, damaging, hidden and obfuscated JavaScript)
Run command: python extractJS.py
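
To illustrate why obfuscated JavaScript needs decoding before URLs show up, the sketch below assumes the JavaScript source has already been pulled out of the PDF (for example with peepdf) and is available as a string. It undoes only three light encodings and is not the logic of extractJS.py; all names in it are hypothetical.

```python
import re
from urllib.parse import unquote

URL_RE = re.compile(r"https?://[^\s\"'<>\\)]+", re.IGNORECASE)
HEX_ESCAPE_RE = re.compile(r"\\x([0-9a-fA-F]{2})")
UNICODE_ESCAPE_RE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_simple_escapes(js_source):
    """Undo \\xNN, \\uNNNN and %NN encodings, three common light obfuscations."""
    text = HEX_ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), js_source)
    text = UNICODE_ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), text)
    return unquote(text)


def urls_in_javascript(js_source):
    """Return unique URLs visible in the source before or after decoding."""
    found = []
    for candidate in (js_source, decode_simple_escapes(js_source)):
        for url in URL_RE.findall(candidate):
            if url not in found:
                found.append(url)
    return found
```

URLs assembled at run time, for instance by string concatenation or eval, are not recovered by static decoding like this.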
