Tesseract and Pytesseract for OCR
Learn about Optical Character Recognition and how Tesseract can help you to perform OCR on an image.
Introduction to OCR
The term OCR stands for Optical Character Recognition. It deals with the problem of recognizing handwritten and printed characters in images and converting them into a machine-readable, digital format. OCR consists of several sub-processes that together perform this operation efficiently and accurately. The sub-processes are:
Preprocessing of the image
Text localization
Character segmentation
Character recognition
Post-processing
The exact steps can differ on a case-by-case basis, but these are the ones generally needed to perform OCR on printed and handwritten characters.
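To make the five stages concrete, here is a minimal toy sketch in plain Python. It is not how Tesseract works internally. The tiny "scan", the gray levels, the 3x3 glyph templates, the vocabulary, and all function names are hypothetical and chosen only for illustration, and exact template matching stands in for a real recognizer.

```python
import difflib

# Hypothetical toy "scan": '#' is dark ink, '.' is paper, ',' is light noise.
SCAN = [
    "..............",
    ".#...###.###..",
    ".#...#.#..#...",
    ".###.###..#...",
    "..........,...",
]
LEVELS = {"#": 30, ".": 250, ",": 190}  # assumed gray values (0 = black)

# Hypothetical 3x3 glyph templates used by the recognition step.
TEMPLATES = {
    ("#..", "#..", "###"): "L",
    ("###", "#.#", "###"): "O",
    ("###", ".#.", ".#."): "T",
}


def preprocess(gray, threshold=128):
    """1. Preprocessing: binarize so ink is 1 and background (and noise) is 0."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def localize(binary):
    """2. Text localization: crop to the bounding box that contains ink."""
    rows = [r for r, row in enumerate(binary) if any(row)]
    if not rows:
        return []
    cols = [c for c in range(len(binary[0])) if any(row[c] for row in binary)]
    return [row[cols[0]:cols[-1] + 1] for row in binary[rows[0]:rows[-1] + 1]]


def segment(region):
    """3. Character segmentation: split the region at ink-free columns."""
    glyphs, current = [], []
    width = len(region[0]) if region else 0
    for c in range(width):
        column = [row[c] for row in region]
        if any(column):
            current.append(column)
        elif current:
            glyphs.append(current)
            current = []
    if current:
        glyphs.append(current)
    # Transpose each run of columns back into rows of '#' and '.'.
    return [
        tuple("".join("#" if px else "." for px in row) for row in zip(*g))
        for g in glyphs
    ]


def recognize(glyph):
    """4. Character recognition: exact template lookup, '?' if unknown."""
    return TEMPLATES.get(glyph, "?")


def postprocess(chars, vocabulary=("LOT", "TOOL")):
    """5. Post-processing: snap the raw string to the closest known word."""
    word = "".join(chars)
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.6)
    return matches[0] if matches else word


def run_pipeline(scan):
    gray = [[LEVELS[ch] for ch in row] for row in scan]
    glyphs = segment(localize(preprocess(gray)))
    return postprocess([recognize(g) for g in glyphs])


print(run_pipeline(SCAN))  # LOT
```

Each function maps to one item in the list above, so a stage can be swapped out (for example, a smarter recognizer) without touching the others.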
Introduction to Tesseract
Tesseract is an open-source OCR engine that has gained popularity among OCR developers. Despite sometimes being painful to implement and modify, Tesseract was for a long time one of the best free and powerful OCR alternatives on the market. Tesseract began as a Ph.D. research project in HP Labs, Bristol. It was developed by HP between 1984 and 1994. In 2005, HP released Tesseract as open-source software. From 2006 until 2018, it was developed by Google. Tesseract can be used from a variety of programming languages and frameworks through wrappers, which are listed in the Tesseract project's documentation.
Pytesseract
Among the wrappers mentioned above is pytesseract, a Python wrapper for Tesseract OCR. Pytesseract cannot perform OCR on its own, because it only calls the Tesseract engine. We need to have the Tesseract software installed on our systems to perform OCR on digital data.
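Because the wrapper depends on a separately installed engine, it can help to check from Python whether the Tesseract executable is reachable. The sketch below assumes the executable is named tesseract and is on your PATH. The helper name locate_tesseract is hypothetical. The tesseract_cmd attribute is pytesseract's setting for the path of the executable it invokes.

```python
import shutil


def locate_tesseract(command="tesseract"):
    """Return the full path of the Tesseract executable, or None if it is not on PATH."""
    return shutil.which(command)


engine_path = locate_tesseract()
if engine_path is None:
    print("Tesseract engine not found; pytesseract would have nothing to call.")
else:
    print(f"Tesseract engine found at {engine_path}")
    try:
        import pytesseract

        # Point the wrapper explicitly at the executable we just located.
        pytesseract.pytesseract.tesseract_cmd = engine_path
    except ImportError:
        print("pytesseract is not installed yet.")
```

Setting tesseract_cmd explicitly is usually only needed when the engine is installed somewhere that is not on PATH, which is common on Windows.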
Install the Tesseract OCR engine
To install the Tesseract OCR engine, choose the option below that matches your operating system:
Ubuntu – For Ubuntu users, you can use the following command:
sudo apt-get install tesseract-ocr
Windows – For Windows users, download the Tesseract installer and select the 32-bit or 64-bit version as per your system.
Mac – For macOS users, you can use Homebrew to install Tesseract. The command is shown below:
brew install tesseract
After installing, you can validate the installation by running the following command in your terminal or command prompt:
tesseract -v
It will print the version of the Tesseract OCR engine installed on your system.
Install the pytesseract wrapper
Once you have installed the Tesseract OCR engine, you can run the following pip command to install pytesseract:
pip install pytesseract
You are all set to go ahead and perform some OCR operations on your local machine.
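As a quick smoke test of the whole setup, the sketch below asks pytesseract for the version of the engine it can reach. The function name check_ocr_setup and its messages are hypothetical. get_tesseract_version and TesseractNotFoundError are provided by pytesseract.

```python
def check_ocr_setup():
    """Return a short, human-readable status for the wrapper and the engine."""
    try:
        import pytesseract
    except ImportError:
        return "pytesseract is not installed; run: pip install pytesseract"

    try:
        # Internally this runs the Tesseract executable and parses its version.
        version = pytesseract.get_tesseract_version()
    except pytesseract.TesseractNotFoundError:
        return "pytesseract is installed, but the Tesseract engine was not found"

    return f"Ready: pytesseract can reach Tesseract {version}"


print(check_ocr_setup())
```

If the last message appears, both halves of the installation are in place: the wrapper imports correctly and it can find the engine.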
We will continue with the next steps in the next article in this series.
This series is just a snapshot of the course Build a REST API Using Python and Deploy it to Microsoft Azure, which covers a lot more, including FastAPI, Microsoft Azure, deploying FastAPI applications to Azure, monitoring applications using Azure, and more projects. Do check it out, and let me know if you have any questions.