Tesseract-ocr: convert scanned images into editable documents on Linux

Apr 24, 2011

Tesseract-ocr: how to convert scanned documents into editable text on Ubuntu or Debian. Original article by Gabriele, published on Gmstyle (an Italian blog).

From the requests that reach me by email, I have learned that some of my readers use Ubuntu (or Linux in general) for graphics and publishing work, some professionally and some as a hobby. This article was prompted by a request from a dear member of this little web space, and it tries to bring some clarity to a subject that, judging by what I saw during my research on the Internet, seems to have caused difficulties for many people.

The subject is OCR (Optical Character Recognition), a technology that recognizes the text characters in an image of a paper document, previously digitized with a scanner, and turns them into editable text.

In other words, with Tesseract-ocr (a program that uses this technology), if we take a newspaper clipping and scan it, we get an image file (JPEG, TIFF, etc.) from which we can extract the text and save it as a plain .txt file that we can edit as we see fit.

Hoping to do something useful, I tried to put together a procedure that is as simple and non-invasive as possible, drawing on some material from the web, so that anyone interested in the subject can do on Ubuntu, or Linux in general, the task that still keeps them tied to Windows.

In this guide I used Ubuntu 10.10. In addition to Tesseract-ocr and gImageReader, I also installed XSane, which I will use to scan the documents.

1 – Start the package manager, then select and install the following packages:

tesseract-ocr imagemagick xsane

You can also install a language pack for tesseract-ocr (in the image, I installed the Italian package).
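After installing the packages, a quick check that the programs are actually reachable can save some confusion later. The sketch below is not part of the original procedure. It assumes the packages provide the executables `tesseract`, `convert` (from ImageMagick) and `xsane`; the helper name `missing_tools` is made up for this example.

```python
import shutil

# Executable names assumed to be provided by the three packages above.
REQUIRED_TOOLS = ["tesseract", "convert", "xsane"]


def missing_tools(names):
    """Return the executables from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


missing = missing_tools(REQUIRED_TOOLS)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required tools are installed.")
```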

2 – Now it is time to install gImageReader, a graphical user interface (GUI) that makes Tesseract simple and intuitive to use. Download it from this link. It is a .deb package; install it by clicking on it. After installation, you will find its icon under Applications > Graphics.

3 – Now that we have all the software we need, let's move on to the actual procedure.
Start XSane from Applications > Graphics, wait for it to detect your scanner, and configure it before you scan. Set all the parameters so that the scan of the document is as accurate as possible. The parameters to set in XSane are the ones shown in the picture below.

With these settings we have chosen:

a- the destination folder and the name of the image file (my home folder here; the file will be called out.tif)

b- .TIFF as the format of the image (of the formats offered, this is the one that best preserves quality; JPG in particular uses lossy compression)

c- Lineart (translated literally, “Binary”) as the scan mode, which produces a black-and-white image of the document. This step is crucial for Tesseract to recognize all the scanned text.

d- 1200 dpi as the resolution. According to my tests, I suggest not going below 600 dpi, because lower values cause the text to be recognized only partially or not at all.
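To get a feel for what the resolution setting means in practice, the pixel dimensions and the uncompressed size of a 1-bit scan can be estimated with simple arithmetic. This is only a rough sketch: the page size (US Letter, 8.5 x 11 inches) is an assumption chosen for illustration, the function name `scan_size` is hypothetical, and real TIFF files carry headers and may be compressed, so actual sizes will differ.

```python
def scan_size(width_in, height_in, dpi, bits_per_pixel=1):
    """Estimate pixel dimensions and uncompressed size of a scan.

    Returns (width_px, height_px, size_bytes). Each pixel row is padded
    to a whole number of bytes, as in uncompressed bilevel image data.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    row_bytes = (width_px * bits_per_pixel + 7) // 8  # round up to full bytes
    return width_px, height_px, row_bytes * height_px


# US Letter page (an assumption for illustration) at the two resolutions
# discussed above, scanned in 1-bit black-and-white mode.
for dpi in (600, 1200):
    width_px, height_px, size_bytes = scan_size(8.5, 11, dpi)
    print(f"{dpi} dpi: {width_px} x {height_px} px, about {size_bytes / 1e6:.1f} MB")
```

Under these assumptions, going from 600 to 1200 dpi doubles each dimension and so roughly quadruples the amount of data, which is worth keeping in mind if a program struggles to open the file.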

4 – Now that everything is configured, click “Scan” and wait for the process to finish; it ends by saving the image out.tif in the destination folder we chose earlier (Home in this case).

Now that we have our digital document, we start Tesseract through gImageReader. Go to Applications > Graphics and launch the program.

As I said, the interface is very intuitive and easy to use. Click “Open Image” to open the file out.tif created earlier, then click “Recognize all” to start the OCR process and wait for it to finish. When it is done, as the screenshot below shows, the contents of out.tif appear on the right as text.

To extract only part of the text in the document, zoom in on the image and select the area of interest.

After the procedure, save everything as a text file and close the program.
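gImageReader is a graphical front end to Tesseract; the `tesseract` command-line program can also be called directly, and ImageMagick's `convert` can cut out a region beforehand. The following is a minimal sketch of that alternative, not something the procedure above requires. It assumes the scan is saved as out.tif and that the Italian language pack is installed (language code `ita`); the helper names and the pixel values in the usage note are invented for this example.

```python
import subprocess


def crop_command(src, dst, width, height, x, y):
    """Build an ImageMagick call that cuts a width x height region at offset (x, y)."""
    geometry = f"{width}x{height}+{x}+{y}"
    return ["convert", src, "-crop", geometry, "+repage", dst]


def tesseract_command(image, output_base, lang=None):
    """Build the call `tesseract IMAGE OUTPUTBASE [-l LANG]`."""
    cmd = ["tesseract", image, output_base]
    if lang:
        cmd += ["-l", lang]
    return cmd


def run(cmd):
    """Run a command and raise an error if it exits with a non-zero status."""
    subprocess.run(cmd, check=True)
```

For example, `run(tesseract_command("out.tif", "out", lang="ita"))` should leave the recognized text in out.txt, since Tesseract appends .txt to the output base name. To process only a region, `run(crop_command("out.tif", "part.tif", 2000, 1000, 300, 400))` would first write that area to part.tif, which can then be passed to Tesseract in the same way.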

CONCLUSIONS

My tests gave positive results. The most useful finding concerns the resolution of the scanned image: the higher the quality your scanner can deliver when scanning documents, the better the output of the OCR process, and the fewer errors you will find in the text file.

IMPORTANT NOTE: if the .TIFF file is not recognized by gImageReader, change its extension to .TIF (with one F), and the problem should be solved.
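The rename described in the note can be done by hand in the file manager; for several files, a small script may be more convenient. This is only a sketch with hypothetical helper names, and it assumes that nothing but the file extension needs to change.

```python
from pathlib import Path


def with_tif_extension(path):
    """Return `path` with a .tiff extension changed to .tif; other paths are unchanged."""
    path = Path(path)
    if path.suffix.lower() == ".tiff":
        return path.with_suffix(".tif")
    return path


def rename_to_tif(path):
    """Rename the file on disk if needed and return its (possibly new) path."""
    source = Path(path)
    target = with_tif_extension(source)
    if target != source:
        source.rename(target)
    return target
```

Calling `rename_to_tif("out.tiff")` in the folder that holds the scan would rename the file to out.tif and leave a file that already ends in .tif untouched.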

References:

http://linux.collectiontricks.it/wiki/OCR_con_tesseract_in_XSane
http://gimagereader.sourceforge.net/
http://doc.ubuntu-fr.org/xsane2tess

11 Responses to “Tesseract-ocr: convert scanned images into editable documents on Linux”

John Rose says:

Tuesday April 26th, 2011 at 11:31 AM

I do not have the Binario option in XSane. I had Lineart & Grey (as well as Colour). Which should I use?

I didn’t understand the resolution instruction. Should resolution be set to 1200? With 1200 resolution, XSane creates an image of just under 1MB. However, under Ubuntu 10.04 with gImageReader v0.9 gImageReader never finishes opening it. Any ideas?

- linuxari says:
  
  Tuesday April 26th, 2011 at 11:57 AM
  
  Hello John,
  
  Bad translation; the correct option is Lineart.
  The resolution should be set to 1200 DPI.
  
  Regarding the last problem, I have no idea. This was tested with Ubuntu 10.10; do you see any error on screen?
  
John Rose says:

Tuesday April 26th, 2011 at 06:13 PM

No error message is given. gImageReader hangs whilst ‘Loading…out.tiff’. I have about 9GB spare disk capacity and am using a 2GB memory PC.

Nautilus says that out.tiff is less than 1Mb; XSane says that out.tiff is 16.1MB.

I’m using a later version of gImageReader (from SourceForge) than the version for which you gave a link.

gabriele - gmstyle says:

Wednesday April 27th, 2011 at 08:26 AM

Hi to all! First, sorry for my English.

I think the problem can be solved by setting a resolution like 600 DPI, but not below 600 DPI. I think it is an “XSane with scanner” problem that creates a file from an incomplete process…

I await news about it.

bye

John Rose says:

Wednesday April 27th, 2011 at 08:54 AM

Gabriele,

Worked OK at 600dpi.

Thank you.

- gabriele - gmstyle says:
  
  Thursday April 28th, 2011 at 12:11 PM
  
  oooooh thanks to you! 😉
  
Joaquim Rocha says:

Thursday May 19th, 2011 at 11:16 AM

Hi, you might wanna check also OCRFeeder which is more complete and can make use of the Tesseract OCR engine as well:

http://live.gnome.org/OCRFeeder

Cheers,

- linuxari says:
  
  Thursday May 19th, 2011 at 10:30 PM
  
  I’ll check it for sure, Joaquim, and as soon as I have some time I’ll check your slide from Linux tag too.
  Thanks
  
Krisalyn says:

Sunday August 14th, 2011 at 01:33 PM

Whoa, whoa, get out the way with that good information.

meghana says:

Friday August 3rd, 2012 at 03:14 PM

I want some scanned images which can be converted into text. Please send them to me as soon as possible.

JonyGreen says:

Saturday September 5th, 2015 at 12:41 PM

If you like Tesseract OCR, you may like this free online OCR tool using Tesseract OCR 3.02.
