I am working on a project where I have to extract specific information from PDF files, such as Document ID, Amount, Processing Fees, Description, Dates, Organization, Authority Name, Department, and many similar fields. The description can run to a few lines, but the other fields are only a few characters long.
Challenge: No two PDFs have the same format, and field names appear under various synonyms, e.g. Document ID can be written as just ID, Form No, Request Seq, NIT, RFQ, or something else entirely. The same applies to other fields such as date and organization name.
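As an illustration of the synonym problem, below is a minimal sketch of label normalization with a synonym table and regular expressions. The table contents, the canonical field names, and the function names `build_label_pattern` and `extract_fields` are all hypothetical. It also assumes that each field sits on its own line as `label: value` in already-extracted text, which will not hold for every PDF, so it is a possible starting point rather than a complete solution.

```python
import re

# Hypothetical synonym table: canonical field name -> labels seen in documents.
# The entries are illustrative and would need to grow as new formats appear.
FIELD_SYNONYMS = {
    "document_id": ["Document ID", "ID", "Form No", "Request Seq", "NIT", "RFQ"],
    "date": ["Date", "Dated", "Issue Date"],
    "organization": ["Organization", "Organisation", "Org Name"],
    "amount": ["Amount", "Total Amount", "Estimated Cost"],
}


def build_label_pattern(synonyms):
    """Compile a regex matching any synonym at the start of a line,
    followed by a separator (':', '-' or '#') and the value."""
    # Try longer labels first so "Document ID" is preferred over "ID".
    labels = sorted(synonyms, key=len, reverse=True)
    alternation = "|".join(re.escape(label) for label in labels)
    return re.compile(
        rf"^[ \t]*(?:{alternation})\b\.?[ \t]*[:\-#][ \t]*(.+?)[ \t]*$",
        re.IGNORECASE | re.MULTILINE,
    )


PATTERNS = {name: build_label_pattern(syns) for name, syns in FIELD_SYNONYMS.items()}


def extract_fields(text):
    """Return {canonical_field: first matched value, or None if absent}."""
    result = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result


if __name__ == "__main__":
    sample = (
        "Form No: A-17\n"
        "Dated: 2023-05-12\n"
        "Organisation : Public Works Dept\n"
        "Remarks: none\n"
    )
    print(extract_fields(sample))
```

Keeping the synonyms in a plain dictionary means that supporting a new document format only requires adding labels, not changing code. Multi-line fields such as the description would need a different rule, for example reading until the next recognized label.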
Request: I would like to know what approach I should take to build a solution.
Methods tried: I am able to extract content from PDFs using PyMuPDF, Tesseract, fillpdf, and other packages.
I am clueless about the solution design. Please suggest an approach.