python3 – extract text from pdf or image

  image, ocr, pdf, python-3.x

I have a PDF file and I want to convert it to HTML or text.
First try:

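import PyPDF2

# Note: PdfFileReader, numPages, getPage and extractText are the pre-3.0 PyPDF2 names.
# Open the PDF, print the page count, then print the text of the first page.
with open('OR.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    print(pdfReader.numPages)
    pageObj = pdfReader.getPage(0)
    print(pageObj.extractText())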

This code does not work for my file: it cannot recognize any text. It does work for a random sample file from the internet.
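A likely reason (an assumption, not something confirmed for OR.pdf) is that the file is a scan: each page is just an image, so there is no text layer to extract. A rough way to check this with the standard library only is sketched below. The function name `looks_scanned` is made up for illustration, and the heuristic can be wrong in both directions: newer PDFs may keep their font dictionaries inside compressed object streams, and scans that already went through OCR do contain fonts.

```python
def looks_scanned(pdf_bytes: bytes) -> bool:
    """Rough guess: True if the raw file mentions images but no fonts.

    Heuristic only. Font dictionaries can be hidden inside compressed
    object streams, so a True result is a hint, not a proof.
    """
    has_font = b"/Font" in pdf_bytes    # any font resource or descriptor
    has_image = b"/Image" in pdf_bytes  # matches "/Subtype /Image"
    return has_image and not has_font
```

Reading OR.pdf in binary mode and passing the bytes to `looks_scanned` would then suggest whether an OCR step is needed at all.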

Second try:

with open("OR.pdf", "rb") as file:
    pdf = file.read()

startmark = b"\xff\xd8"  # JPEG start-of-image marker
startfix = 0
endmark = b"\xff\xd9"  # JPEG end-of-image marker
endfix = 2
i = 0

njpg = 0
while True:
    istream = pdf.find(b"stream", i)
    if istream < 0:
        break
    istart = pdf.find(startmark, istream, istream + 20)
    if istart < 0:
        i = istream + 20
        continue
    iend = pdf.find(b"endstream", istart)
    if iend < 0:
        raise Exception("Didn't find end of stream!")
    iend = pdf.find(endmark, iend - 20)
    if iend < 0:
        raise Exception("Didn't find end of JPG!")

    istart += startfix
    iend += endfix
    print("JPG %d from %d to %d" % (njpg, istart, iend))
    jpg = pdf[istart:iend]
    with open("jpg%d.jpg" % njpg, "wb") as jpgfile:
        jpgfile.write(jpg)

    njpg += 1
    i = iend

I wanted to convert the PDF file into images (one image per page, so 3 in my case).
This code works.
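For reference, the same scan can be wrapped in a function that works on bytes and returns the images instead of writing files. This is only a sketch of the loop above. The name `extract_jpegs` is hypothetical, and the `max(...)` guard on the search start is an addition that keeps the search for the end marker inside the current image. It only finds embedded JPEG streams, so the number of images matches the number of pages only when every page is stored as a single JPEG.

```python
def extract_jpegs(pdf: bytes) -> list:
    """Return the raw bytes of every JPEG that starts right after a 'stream' keyword."""
    start_mark = b"\xff\xd8"  # JPEG start-of-image marker
    end_mark = b"\xff\xd9"    # JPEG end-of-image marker
    images = []
    i = 0
    while True:
        istream = pdf.find(b"stream", i)
        if istream < 0:
            break
        # The JPEG must begin within 20 bytes of the 'stream' keyword.
        istart = pdf.find(start_mark, istream, istream + 20)
        if istart < 0:
            i = istream + 20
            continue
        iendstream = pdf.find(b"endstream", istart)
        if iendstream < 0:
            raise ValueError("stream has no matching endstream")
        # Look for the end marker close to 'endstream', but never before the image start.
        iend = pdf.find(end_mark, max(istart, iendstream - 20))
        if iend < 0:
            raise ValueError("JPEG has no end-of-image marker")
        iend += len(end_mark)
        images.append(pdf[istart:iend])
        i = iend
    return images
```

Each returned item can be written to disk with `open(name, "wb")` exactly as in the loop above.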

In the next step I want to convert an image into text with this code:

import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

image = cv2.imread('jpg1.jpg', 0)
thresh = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

data = pytesseract.image_to_string(thresh, lang='eng',config='--psm 6')
print(data)

This code also works, but not in the desired way, because it does not give me good text from the file:

Example:

SOeINrR Mir entice Margery iT 208 Yturri Bivd., Jordan Valley, 97910. Very Rev. Nnabuife
See Cree kel ee CMR cL a ee a ree ee oe a eed
eee STR ST BEIEC Oa SL ee Ree Pyaar
Foerster one Ice TST OV SR AT SOM eer aT
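One possible variation of the OCR step, as an untested sketch and not a confirmed fix for this file: recent Tesseract versions generally do better with dark text on a light background at roughly 300 DPI, so the version below drops the inversion (`THRESH_BINARY` instead of `THRESH_BINARY_INV`) and upscales small images first. The names `upscale_factor` and `ocr_image` are made up for illustration, and the 8.5 inch page width is an assumption (US Letter).

```python
def upscale_factor(width_px, page_width_in=8.5, target_dpi=300):
    """Factor that brings a page image up to roughly target_dpi (never below 1)."""
    return max(1.0, target_dpi * page_width_in / width_px)


def ocr_image(path, lang="eng", psm=6):
    """OCR one page image: grayscale, optional upscale, Otsu threshold, no inversion."""
    # Imported here so upscale_factor can be used without OpenCV or pytesseract.
    import cv2
    import pytesseract

    image = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if image is None:
        raise FileNotFoundError(path)
    scale = upscale_factor(image.shape[1])  # shape is (height, width)
    if scale > 1.0:
        image = cv2.resize(image, None, fx=scale, fy=scale,
                           interpolation=cv2.INTER_CUBIC)
    # THRESH_BINARY keeps dark text on a white background.
    thresh = cv2.threshold(image, 0, 255,
                           cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
    return pytesseract.image_to_string(thresh, lang=lang, config=f"--psm {psm}")
```

On Windows, `pytesseract.pytesseract.tesseract_cmd` still has to be set as above before calling `ocr_image('jpg1.jpg')`.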

Any help? What else can I use to convert my PDF into text or HTML?

Source: Python-3x Questions
