1.PDFID indicators from the pdf:
- Download pdfid_v0_2_5.zip from Didier Stevens' blog
- Run command
python pdfid.py example.pdf
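To give a feel for the kind of indicators this step reports, here is a minimal sketch that counts a few PDF name keywords in the raw bytes of a file. It is not pdfid.py itself: the keyword list is only a subset, the function names are made up for illustration, and unlike PDFiD it does not decode hex-escaped names.

```python
import re

# Illustrative subset of the names PDFiD reports; the real tool checks more.
KEYWORDS = [
    "/Page", "/Encrypt", "/ObjStm", "/JS", "/JavaScript",
    "/AA", "/OpenAction", "/AcroForm", "/Launch", "/EmbeddedFile",
]


def count_keywords(data):
    """Count occurrences of each keyword in raw PDF bytes."""
    counts = {}
    for keyword in KEYWORDS:
        # The lookahead rejects longer names, so "/Page" does not match "/Pages".
        pattern = re.escape(keyword.encode("ascii")) + rb"(?![A-Za-z0-9])"
        counts[keyword] = len(re.findall(pattern, data))
    return counts


def count_keywords_in_file(path):
    """Hypothetical helper: read a file and count keywords in it."""
    with open(path, "rb") as handle:
        return count_keywords(handle.read())
```

Calling `count_keywords_in_file("example.pdf")` returns a dict of keyword counts. Non-zero counts for names such as /JS, /JavaScript, /OpenAction or /Launch are the usual reason to look at a file more closely.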
2.Use OCR to extract text from images (open source tools):
sudo apt-get install tesseract-ocr-all  # all languages
sudo apt-get install libtesseract-dev
pip install pytesseract
pip install pillow
- For working with pdf files:
sudo apt-get install imagemagick
pip install wand
sudo apt-get install python-opencv  # for cv2
- Run command
python image_to_string.py example.png
OR
python image_to_string.py example2.jpg eng+rus
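The two command forms above (image only, or image plus a language string such as eng+rus) could be handled roughly as follows. This is a sketch, not the actual image_to_string.py: the function names and the default language "eng" are assumptions. The OCR call is pytesseract's `image_to_string`.

```python
def parse_args(argv):
    """Return (image_path, lang) from an argv-style list.

    The default language "eng" is an assumption for this sketch.
    """
    if len(argv) < 2:
        raise SystemExit("usage: image_to_string.py IMAGE [LANG]")
    image_path = argv[1]
    lang = argv[2] if len(argv) > 2 else "eng"
    return image_path, lang


def ocr_image(image_path, lang="eng"):
    """Run Tesseract on one image and return the recognized text."""
    # Imported here so the module loads even if the OCR stack is missing.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        # lang accepts "+"-joined language codes, e.g. "eng+rus"
        return pytesseract.image_to_string(img, lang=lang)
```

A script built on this would call `ocr_image(*parse_args(sys.argv))` and print the result. Each language named in the string needs its Tesseract data installed, which the tesseract-ocr-all package covers.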
3.Save the thumbnail picture using PIL:
sudo apt-get install libmagickwand-dev
pip install wand
pip install pillow
pip install numpy
- If the code throws an error when you run it:
pip uninstall numpy
- If the code throws the error "Exception message: not authorized ..... @ error/constitute.c/Read.../...": on Ubuntu 18.04, edit /etc/ImageMagick-6/policy.xml and, near the end of the file, change the rights from none to read|write for these entries:
domain="coder" rights="read|write" pattern="PDF"
domain="coder" rights="read|write" pattern="XPS"
domain="coder" rights="read|write" pattern="PS"
- Run command
python thumbnailPDF_firstPage-pdf.py example.pdf
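One way such a script might work, given the packages installed above, is to let Wand render the first page and let PIL shrink it. The sketch below is an assumption about the approach, not the contents of thumbnailPDF_firstPage-pdf.py. The function names, the "_thumb.png" suffix, and the default size and resolution are all hypothetical.

```python
import io
import os


def thumbnail_path(pdf_path, suffix="_thumb.png"):
    """Build the output file name next to the PDF (the suffix is an assumption)."""
    base, _ext = os.path.splitext(pdf_path)
    return base + suffix


def save_first_page_thumbnail(pdf_path, max_size=(200, 200), resolution=72):
    """Render page 1 of a PDF with Wand and save a PIL thumbnail of it."""
    # Imported here so the module loads without ImageMagick installed.
    from wand.image import Image as WandImage
    from PIL import Image as PILImage

    # "[0]" asks ImageMagick to read only the first page.
    with WandImage(filename=pdf_path + "[0]", resolution=resolution) as page:
        png_bytes = page.make_blob("png")

    out_path = thumbnail_path(pdf_path)
    with PILImage.open(io.BytesIO(png_bytes)) as img:
        img.thumbnail(max_size)  # resizes in place, keeping the aspect ratio
        img.save(out_path)
    return out_path
```

If ImageMagick's policy blocks PDF reading, the `WandImage(...)` call is where the "not authorized" error described above would surface.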
4.Extract text from a pdf file:
pip install pdfminer
- Run command
python exampleTEXT.py *.pdf
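A minimal sketch of text extraction follows. It assumes the pdfminer.six fork, which provides `pdfminer.high_level.extract_text`; the older pdfminer package installed above may need a different, lower-level API. The helper names are hypothetical, and splitting on form feeds relies on the text converter emitting one form feed at the end of each page.

```python
def split_pages(text):
    """Split extracted text into pages on form feeds, dropping empty pages."""
    return [page for page in text.split("\f") if page.strip()]


def pdf_to_pages(pdf_path):
    """Extract the text of a PDF and return it as a list of page strings."""
    # Assumes pdfminer.six; imported here so the module loads without it.
    from pdfminer.high_level import extract_text

    return split_pages(extract_text(pdf_path))
```

For several files, as in the `*.pdf` command above, the shell expands the pattern and a script would simply loop over `sys.argv[1:]`.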
5.Extract URLs from a pdf file:
pip install pyPdf
- Run command
python exampleURL.py *.pdf
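Links in a PDF usually live in page annotations, under /Annots, then /A, then /URI. The sketch below walks that structure. It uses pypdf, the maintained successor of pyPdf, rather than pyPdf itself, so the calls differ from what exampleURL.py may use. The function names are hypothetical.

```python
def _resolve(obj):
    """Follow an indirect reference if there is one; plain dicts pass through."""
    return obj.get_object() if hasattr(obj, "get_object") else obj


def uris_from_annotations(annots):
    """Collect /URI targets from a list of annotation dictionaries."""
    uris = []
    for ref in annots:
        annot = _resolve(ref)
        if "/A" not in annot:
            continue  # annotation without an action, e.g. a text note
        action = _resolve(annot["/A"])
        if "/URI" in action:
            uris.append(str(action["/URI"]))
    return uris


def extract_urls(pdf_path):
    """Return every link URI found in the page annotations of a PDF."""
    # Assumes pypdf is installed; imported here so the module loads without it.
    from pypdf import PdfReader

    urls = []
    for page in PdfReader(pdf_path).pages:
        if "/Annots" in page:
            urls.extend(uris_from_annotations(_resolve(page["/Annots"])))
    return urls
```

This only finds URLs stored as link annotations. URLs that appear as plain page text or inside JavaScript need the text and JS steps instead.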
6.Extract /Root/Lang from a pdf file:
pip install pyPdf
- Run command
python detectLang.py
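The /Lang entry sits in the document catalog, which the trailer points to as /Root. A minimal sketch of reading it follows, again assuming pypdf in place of pyPdf. The function names are hypothetical and detectLang.py may work differently.

```python
def lang_from_catalog(catalog):
    """Return the /Lang value of a catalog dictionary, or None if absent."""
    if "/Lang" in catalog:
        return str(catalog["/Lang"])
    return None


def detect_lang(pdf_path):
    """Read /Root/Lang from a PDF file."""
    # Assumes pypdf is installed; imported here so the module loads without it.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return lang_from_catalog(reader.trailer["/Root"])
```

Many PDFs do not set /Lang at all, so a `None` result is common and says nothing about the language of the text itself.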
7.Extract URLs from the JS of a pdf file:
git clone https://github.com/jesparza/peepdf.git peepdf
- Download the script for extracting URLs from harmless, malicious, damaging, hidden and obfuscated JavaScript
- Run command
python extractJS.py
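Once the JavaScript has been pulled out of the PDF (the job peepdf does here), finding URLs in it can be sketched with a regular expression. The code below is an illustration only, not extractJS.py. It takes the JavaScript source as a string, and the only obfuscation it handles is plain percent-escaping. The pattern and function name are assumptions.

```python
import re
from urllib.parse import unquote

# Stops at whitespace, quotes, angle brackets, ")" and backslashes.
URL_RE = re.compile(r"""https?://[^\s'"<>)\\]+""", re.IGNORECASE)


def urls_in_javascript(js_source):
    """Return unique URLs found in JS source, in order of first appearance."""
    seen = set()
    urls = []
    # Second pass over a percent-decoded copy catches unescape("%68%74...") tricks.
    for text in (js_source, unquote(js_source)):
        for match in URL_RE.findall(text):
            if match not in seen:
                seen.add(match)
                urls.append(match)
    return urls
```

Heavier obfuscation (string concatenation, `%uXXXX` escapes, `eval` chains) would need the script to be executed or emulated. A regular expression alone will not see through it.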