How to extract specific data from a Google OCR dump CSV file in Python

  data-cleaning, data-extraction, data-mining, nlp, python

I am fairly new to ML and NLP. I am doing a student project: extracting certain information from an OCR dump text file (csv) for EDA in Python. The file is as below:

Data file

I have about 2000 such observations, and the number of lines per receipt is not consistent (some have more, some fewer). The quality of the data also depends on the quality of the uploaded till receipt (damaged, torn, etc.), so in some places the OCR did not read the text well and returned a corrupt or illegible string (like the first line in the image).

**I want to extract 3 types of information:**

  • Total amount spent (the price always appears after the word TOTAL) – highlighted in red
  • Total number of items in the basket (the number always appears immediately before SUBTOTAL) – highlighted in blue
  • Items that are purchased most frequently – highlighted in green

How may I go about this? How can I automate this process in Python for receipts from other shops that may not have a similar structure?
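As a starting point, the three rules described above could be sketched with regular expressions. This is only a minimal sketch under stated assumptions: the sample receipts, the "name price" item-line layout, and the names `RECEIPTS`, `parse_receipt`, `NOT_ITEMS` and `item_counts` are all hypothetical, since the actual file layout is only visible in the image. Only the two rules "price after TOTAL" and "number before SUBTOTAL" come from the description.

```python
import re
from collections import Counter

# Hypothetical OCR text for two receipts; the real file layout is an assumption.
RECEIPTS = [
    "x#@ ST0RE\nMILK 1.50\nBREAD 0.99\nMILK 1.50\n3 SUBTOTAL 3.99\nTOTAL 3.99\n",
    "SHOP 12\nBREAD 0.99\nEGGS 2.10\n2 SUBTOTAL 3.09\nTOTAL 3.09\n",
]

PRICE = r"(\d+[.,]\d{2})"
# TOTAL must not be preceded by a letter, so SUBTOTAL is not matched by mistake.
TOTAL_RE = re.compile(r"(?<![A-Za-z])TOTAL\b\D*?" + PRICE, re.IGNORECASE)
# The integer immediately before SUBTOTAL is taken as the basket size.
COUNT_RE = re.compile(r"(\d+)\s+SUBTOTAL\b", re.IGNORECASE)
# Assumption: an item line looks like "<name> <price>", name made of letters and spaces.
ITEM_RE = re.compile(r"^([A-Za-z][A-Za-z ]*?)[ \t]+" + PRICE + r"$", re.MULTILINE)
# Keywords that fit the item-line pattern but are not products.
NOT_ITEMS = {"TOTAL", "SUBTOTAL"}


def parse_receipt(text):
    """Return total, item count and item names; None where a field is not found."""
    total = TOTAL_RE.search(text)
    count = COUNT_RE.search(text)
    items = [name.upper() for name, _price in ITEM_RE.findall(text)
             if name.upper() not in NOT_ITEMS]
    return {
        "total": float(total.group(1).replace(",", ".")) if total else None,
        "n_items": int(count.group(1)) if count else None,
        "items": items,
    }


parsed = [parse_receipt(text) for text in RECEIPTS]
# Frequency of each item name across all receipts.
item_counts = Counter(item for p in parsed for item in p["items"])
```

Returning `None` for a missing field, rather than raising, makes it possible to count how many of the ~2000 receipts were too damaged to parse. Receipts from other shops would most likely need their own patterns or a more tolerant matching step; this sketch does not attempt that.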

I found a similar question:
How to extract specific data from a text file in python
but it is quite different and doesn’t answer my question.

Could someone please help me or point me to some similar resources or examples?

Thanks in advance!

Source: Python Questions
