Text Extraction in Perl

Friday, Mar 14, 2025 | Tags: Perl

DISCLAIMER: Image is generated using FREE version of ChatGPT.





I have always been fascinated by image processing but never got my head around it.

Even today, I barely know anything about it.

Extracting text from an image is something I have always wanted to try but never managed to do successfully.

I tried first in Perl and, when that attempt failed, in Python.

Neither worked for me; I would partly blame my local development environment.

Now I have a rather stable environment: Ubuntu 24.04.1 LTS running on WSL2.

The first task is to find a suitable OCR Engine that is easy to install in my local environment.

Luckily, it didn’t take long to find one, i.e. Tesseract OCR.

Time to install the Ubuntu package for the OCR Engine.


$ sudo apt install tesseract-ocr

First, let’s verify that everything installed properly:


$ tesseract -v
tesseract 5.3.4
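
The same check can be done from Perl before any OCR is attempted. This is only a minimal sketch using core modules: find_in_path is a hypothetical helper name, not part of Tesseract or any CPAN module, and it assumes a Unix-like PATH such as the one on Ubuntu.

```perl
#!/usr/bin/env perl
use v5.38;
use File::Spec;

# Hypothetical helper: return the full path of the first executable file
# called $name found in the PATH directories, or undef if there is none.
sub find_in_path ($name) {
    for my $dir (File::Spec->path) {
        my $candidate = File::Spec->catfile($dir, $name);
        return $candidate if -f $candidate && -x _;
    }
    return;
}

my $tesseract = find_in_path('tesseract');
if (defined $tesseract) {
    # Merge stderr into stdout (2>&1) in case the version banner is
    # written there, then keep only the first line of the output.
    my ($banner) = split /\n/, qx{"$tesseract" -v 2>&1};
    say "Found $tesseract: ", $banner // 'unknown version';
}
else {
    say STDERR 'tesseract not found in PATH; try: sudo apt install tesseract-ocr';
}
```

A script could run this check first and stop early with a friendly message instead of failing somewhere inside the OCR call.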

Having done the hard bit, we now want a Perl interface to the Tesseract OCR engine.

My good old friend CPAN came in handy, as always.

A quick search turned up the module Image::OCR::Tesseract.

Let’s quickly install the module using my favourite tool, cpanm (-v for verbose output, -S to install using sudo).


$ cpanm -vS Image::OCR::Tesseract

Having prepared the ground, it is time for some fun.

The Perl programmer inside me quickly jumped in and wrote this cute little script, extract-text.pl:


#!/usr/bin/env perl

use v5.38;
use Image::OCR::Tesseract qw/get_ocr/;
say get_ocr($ARGV[0]);
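
The script above trusts whatever it is given on the command line. A slightly more defensive variant might look like the sketch below. The helper name extract_text and the error messages are my own inventions for illustration. As far as I understand the module's documentation, get_ocr expects an absolute path, so the sketch converts the path with Cwd::abs_path; treat that as an assumption, since a relative path also works in the run shown in this post.

```perl
#!/usr/bin/env perl
use v5.38;
use Cwd qw/abs_path/;

# Hypothetical helper: validate the path first, then hand it to get_ocr().
sub extract_text ($path) {
    die "Usage: extract-text.pl <image-file>\n"
        unless defined $path && length $path;
    die "No such image file: $path\n"     unless -f $path;
    die "Cannot read image file: $path\n" unless -r $path;

    # Loaded at run time so the checks above run before the module is
    # needed; the original script loads it with `use` instead.
    require Image::OCR::Tesseract;

    # Assumption: an absolute path is the safest form to pass to get_ocr().
    my $text = Image::OCR::Tesseract::get_ocr(abs_path($path));

    # Fall back to an empty string if nothing was recognised.
    return $text // '';
}

if (@ARGV) {
    say extract_text($ARGV[0]);
}
else {
    say STDERR 'Usage: extract-text.pl <image-file>';
}
```

With this in place, a mistyped file name produces a short, readable error rather than whatever the OCR toolchain reports further down the line.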

To test it, I needed a simple image with some text in it.

I created a very basic image, tesseract.png, using the Paint app.


Sample Image


Now it is time for some action, fingers crossed:


$ perl extract-text.pl tesseract.png
Hello World from Perl

Having done this, I thought, why bother with a script? So I came up with this one-liner instead.


$ perl -MImage::OCR::Tesseract=get_ocr -E 'say get_ocr($ARGV[0])' tesseract.png
Hello World from Perl

Still not happy, too much typing.

Finally, I checked the documentation of the CPAN module Image::OCR::Tesseract and noticed it comes with a cute little program, ocr.

Now you can’t beat this, can you?


$ ocr tesseract.png
Hello World from Perl


Happy Hacking !!!
