11/8/2022

Pdfextractor python slate

In this blog, I have compared various Python packages for extracting text from PDF files.

    # Using PDFminer
    from io import StringIO

    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage

    path = r"\.Downloads\RuchaSawarkar.pdf"

    def convert_pdf_to_txt(path):
        rsrcmgr = PDFResourceManager()   # shared store for fonts, images, etc.
        retstr = StringIO()              # in-memory buffer that collects the text
        codec = 'utf-8'
        laparams = LAParams()            # default layout-analysis parameters
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        maxpages = 0                     # 0 means no page limit
        caching = True
        pagenos = set()                  # empty set means all pages
        with open(path, 'rb') as fp:
            for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                          password=password, caching=caching,
                                          check_extractable=True):
                interpreter.process_page(page)
            text = retstr.getvalue()
        device.close()
        retstr.close()
        return text

    pdf_miner_text = convert_pdf_to_txt(path)

PDFminer is yet another pure-Python package, and it works only with PDF files. Besides extracting text, it can convert PDF files into other formats such as HTML and XML. There are several versions of PDFminer, and the latest is compatible with Python 3.6 and above. PDFminer is used through its Python API, and several parameters have to be set when calling it; a full description of the parameters can be found in the PDFminer documentation. Extraction with this package takes slightly more time than with the other pure-Python packages.

The PDFminer code for extracting text from a PDF is longer and more tedious than the simple code needed for the other packages, which are given below along with the input PDF and the extracted output text.
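Since the comparison also looks at how long each package takes, the timing could be measured with a small helper along the following lines. This is only a sketch using the standard library: the names `time_extractor` and `compare_extractors` are hypothetical and not part of PDFminer or any other package, and it assumes each extractor is a function that takes a file path and returns the extracted text as a string.

```python
import time
from typing import Callable, Dict, Tuple


def time_extractor(extract: Callable[[str], str], pdf_path: str) -> Tuple[str, float]:
    """Run one extractor on pdf_path and return (text, elapsed_seconds)."""
    start = time.perf_counter()
    text = extract(pdf_path)
    elapsed = time.perf_counter() - start
    return text, elapsed


def compare_extractors(
    extractors: Dict[str, Callable[[str], str]], pdf_path: str
) -> Dict[str, dict]:
    """Time each named extractor on the same file and summarise the results."""
    report = {}
    for name, extract in extractors.items():
        text, elapsed = time_extractor(extract, pdf_path)
        # Record run time and output size so packages can be compared side by side.
        report[name] = {"seconds": elapsed, "characters": len(text)}
    return report
```

For example, `compare_extractors({"pdfminer": convert_pdf_to_txt}, path)` would time the PDFminer function from this post on the sample file; functions wrapping the other packages could be added to the same dictionary. A single run is a rough measurement, so repeating it a few times and averaging would give a fairer comparison.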