Multiple Letters (for mailing) of different page counts in PDF, need to break out with Python

  pdf, python

The situation is that I have large PDFs, each containing an unpredictable number of letters. Each letter starts with "To Whom it may concern", and the number of pages before the next letter starts is unpredictable. For example, I may have a 300-page PDF with 50 two-page letters, 50 one-page letters and 50 three-page letters, all mixed up. Ultimately I need to create one file per letter length. So for the example above, I need one PDF containing all the 1-page letters, one containing all the 2-page letters and one containing all the 3-page letters.

I am successfully identifying each page that contains "To Whom" using the script below, which I adapted from the code at https://www.blog.pythonlibrary.org/2018/05/03/exporting-data-from-pdfs-with-python/

I’m using pdfminer.six to extract text because PyPDF2 isn’t extracting text successfully.

Where I’m stuck is what to do after I’ve identified which pages have "To Whom". How should I identify the ending page after each "To Whom"?
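One possible way to think about the ending page, as a minimal sketch: if every letter begins on a "To Whom" page, then a letter ends on the page just before the next "To Whom" page, and the last letter ends on the final page of the PDF. The function names below (`letter_ranges`, `group_pages_by_length`) are hypothetical helpers, not part of pdfminer.six, and the sketch assumes the per-page yes/no results have been collected into a list of booleans with 0-based page indices. Any pages before the first "To Whom" page are ignored.

```python
from collections import defaultdict


def letter_ranges(is_start):
    """Return (first_page, last_page) index pairs, one per letter.

    is_start is a list with one boolean per page: True where the page
    contains the letter opening ("To Whom"), False otherwise.
    """
    starts = [i for i, flag in enumerate(is_start) if flag]
    # Each letter ends on the page before the next letter starts;
    # the last letter ends on the final page of the document.
    ends = [s - 1 for s in starts[1:]] + [len(is_start) - 1]
    return list(zip(starts, ends))


def group_pages_by_length(is_start):
    """Map each letter length to the page indices of all letters of that length."""
    batches = defaultdict(list)
    for first, last in letter_ranges(is_start):
        length = last - first + 1
        batches[length].extend(range(first, last + 1))
    return dict(batches)


# Example: a 7-page PDF holding letters of 1, 2, 3 and 1 pages.
flags = [True, True, False, True, False, False, True]
print(letter_ranges(flags))          # [(0, 0), (1, 2), (3, 5), (6, 6)]
print(group_pages_by_length(flags))  # {1: [0, 6], 2: [1, 2], 3: [3, 4, 5]}
```

Keeping the page indices in document order within each length means the letters stay in their original sequence inside each output batch.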

 miner_text_generator.py

import io
import re

from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage


# required user input
#----------------------------------------------------------------------------------------------------

inputpdf_nm = 'test1.pdf'
inputpdf_nm_root = 'test1'


# setup env
#----------------------------------------------------------------------------------------------------

workdir = 'C:/Users/nathan/Documents/_work/TnA/python_pdf/print_batcher/'
tempdir = workdir + 'temp/'
pdf2txt_call = 'python C:/Users/nathan/AppData/Local/Programs/Python/Python38/Scripts/pdf2txt.py'


page_cnt_current = 0


# https://www.blog.pythonlibrary.org/2018/05/03/exporting-data-from-pdfs-with-python/
#----------------------------------------------------------------------------------------------------

def extract_text_by_page(pdf_path):
    """Yield the extracted text of each page of the PDF, in page order."""
    with open(pdf_path, 'rb') as fh:
        for page in PDFPage.get_pages(fh,
                                      caching=True,
                                      check_extractable=True):
            resource_manager = PDFResourceManager()
            fake_file_handle = io.StringIO()
            converter = TextConverter(resource_manager, fake_file_handle)
            page_interpreter = PDFPageInterpreter(resource_manager, converter)
            page_interpreter.process_page(page)

            text = fake_file_handle.getvalue()

            # close open handles before yielding, so they are released
            # even if the caller stops iterating early
            converter.close()
            fake_file_handle.close()

            yield text

# original version of this function
# def extract_text(pdf_path):
#     for page in extract_text_by_page(pdf_path):
#         print(page)
#         print()

# new version of this function
def extract_text(pdf_path):
    # print each page number, followed by 'yes' or 'no' depending on
    # whether the page contains the letter opening "To Whom"
    for page_num, page in enumerate(extract_text_by_page(pdf_path), start=1):
        print(page_num)
        has_text = 'yes' if re.search('To Whom', page) else 'no'
        print(has_text)


extract_text(workdir+inputpdf_nm)
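For the final step of writing one PDF per letter length, here is a hedged sketch. It assumes the pypdf package (the successor to PyPDF2) is installed, and it relies only on copying whole pages, which does not depend on text extraction working. The input `pages_by_length` is a hypothetical mapping from letter length to the 0-based page indices belonging to letters of that length. The helper names and the output file naming scheme are illustrative choices, not anything from the script above.

```python
def batch_filename(root, length):
    """Build an output file name such as 'test1_2_page_letters.pdf' (illustrative scheme)."""
    return f"{root}_{length}_page_letters.pdf"


def write_batches(input_pdf, pages_by_length, out_root):
    """Write one PDF per letter length and return the list of paths written.

    pages_by_length maps a letter length (int) to the 0-based indices of
    every page that belongs to a letter of that length.
    """
    # Assumption: pypdf is installed (pip install pypdf). It is imported
    # here so that the rest of the module loads without it.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(input_pdf)
    written = []
    for length, page_indices in sorted(pages_by_length.items()):
        writer = PdfWriter()
        for idx in page_indices:
            # copy the page unchanged from the source document
            writer.add_page(reader.pages[idx])
        out_path = batch_filename(out_root, length)
        with open(out_path, "wb") as fh:
            writer.write(fh)
        written.append(out_path)
    return written
```

A call might look like `write_batches(workdir + inputpdf_nm, {1: [0, 6], 2: [1, 2]}, tempdir + inputpdf_nm_root)`, which under these assumptions would produce `test1_1_page_letters.pdf` and `test1_2_page_letters.pdf` in the temp directory.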

Source: Python Questions
