Tesseract-ocr: how to convert scanned documents into editable text on Ubuntu or Debian. Original article by Gabriele, published on Gmstyle (Italian blog).
From the requests that reach me by email, I have learned that some of my readers use Ubuntu (or Linux in general) for graphics and publishing work, some professionally and some as a hobby. This article was prompted by a request from a dear member of this little web space. It aims to bring some clarity to a subject that, judging by what I saw while researching on the Internet, seems to have caused difficulties for many people.
The subject is OCR (Optical Character Recognition), a technology that recognizes the characters in an image of a paper document, previously digitized with a scanner, and turns them into editable text.
In other words, with the program Tesseract-ocr (which uses this technology), if we take a newspaper clipping and scan it, we get an image file (JPEG, TIFF, etc.) from which we can extract the text and save it as a normal .txt file that we can edit to suit our needs.
Hoping to do something useful, I tried to arrive at a procedure that is as simple and non-invasive as possible, drawing on material from the web, so that anyone interested in the subject can do on Ubuntu, or Linux in general, what still keeps them tied to Windows.
In this guide I used Ubuntu 10.10. In addition to Tesseract-ocr and gImageReader, I also installed Xsane, which I will use to scan documents.
1 – Start the package manager, then select and install the following software:
tesseract-ocr imagemagick xsane
You can also install a language pack for tesseract-ocr (in the image I installed the Italian package).
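For readers who prefer a terminal to the graphical package manager, the same installation can be sketched as a single apt-get command. The small Python helper below only builds and prints that command. The function name and the "tesseract-ocr-<code>" naming of language packs (for example tesseract-ocr-ita for Italian) are assumptions for illustration, so check the package names available for your release before running anything.

```python
import shlex

# Packages named in step 1 of this guide.
BASE_PACKAGES = ["tesseract-ocr", "imagemagick", "xsane"]


def build_install_command(language_codes=()):
    """Return an apt-get command, as a list of arguments, for the OCR tools.

    language_codes holds Tesseract language codes such as "ita".
    The "tesseract-ocr-<code>" package naming is an assumption.
    """
    packages = list(BASE_PACKAGES)
    packages += [f"tesseract-ocr-{code}" for code in language_codes]
    return ["sudo", "apt-get", "install", "-y"] + packages


# Print the command instead of running it, so it can be reviewed first.
print(shlex.join(build_install_command(["ita"])))
```

The printed line can be pasted into a terminal once you are happy with the package list.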
2 – Now it is time to install a graphical user interface (GUI) that makes Tesseract simple and intuitive to use: gImageReader. Download it from this link. It is a .deb package; install it by clicking on it. After installation, you will find its icon in Applications > Graphics.
3 – Now that we have all the software we need, let's put it into practice with the actual procedure.
Start Xsane from Applications > Graphics, wait for it to recognize your scanner, and configure it before scanning. All the parameters need to be set so that the scan of the document is as accurate as possible. The parameters to set in Xsane are those shown in the picture below.
In this way we have set:
a - the destination folder and the name of the image file (my home folder here; the file will be called out.tif)
b - .TIFF as the image format (this is the format that guarantees the best quality compared with the others, JPG or PNG)
c - Binary, the parameter that tells Xsane to produce the image of the document in black and white. This step is crucial for Tesseract to RECOGNIZE all the scanned text.
d - 1200 dpi resolution. According to my tests, I suggest not going below 600 dpi, because lower values cause total or partial failure to recognize the text.
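The four settings above can be written down as a small checklist. The sketch below is only an illustration in Python: the ScanSettings class and the check_settings function are hypothetical names and are not part of Xsane. It simply compares a set of values against the recommendations in this guide (TIFF output, Binary mode, at least 600 dpi).

```python
from dataclasses import dataclass

MIN_DPI = 600           # below this, my tests showed recognition failures
RECOMMENDED_DPI = 1200  # the value used in this guide


@dataclass
class ScanSettings:
    """Hypothetical container for the values chosen in the Xsane window."""
    output_path: str
    mode: str
    dpi: int


def check_settings(settings):
    """Return a list of warnings; an empty list means the settings match the guide."""
    warnings = []
    if not settings.output_path.lower().endswith((".tif", ".tiff")):
        warnings.append("output file is not a TIFF")
    if settings.mode.lower() != "binary":
        warnings.append("scan mode is not Binary (black and white)")
    if settings.dpi < MIN_DPI:
        warnings.append(f"resolution {settings.dpi} dpi is below {MIN_DPI} dpi")
    return warnings


# The settings used in this guide produce no warnings.
print(check_settings(ScanSettings("out.tif", "Binary", RECOMMENDED_DPI)))
```

A colour JPEG scan at 300 dpi, by contrast, would return three warnings, one for each recommendation it misses.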
4 – Now that everything is configured, click on “Scan” and wait for the process to finish; it ends by saving the image out.tif to the destination folder we chose earlier (Home in this case).
Now that we have our digital document, we must start Tesseract through gImageReader. Go to Applications > Graphics and launch the program.
The interface is, as I said, very intuitive and easy to use. Just click on “Open Image” to open the file out.tif created earlier, then click on “Recognize all” to begin the OCR process and wait for it to end. At the end, as the screenshot below shows, the contents of out.tif appear on the right as text.
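gImageReader is only a front end, so the same recognition can also be started from a terminal with the tesseract command itself, which takes the image name, an output base name, and optionally a language with -l. The Python sketch below builds that command line; the helper name is hypothetical, and the "ita" language code assumes that the Italian language pack is installed.

```python
import shlex


def build_tesseract_command(image_path, output_base, language=None):
    """Return the tesseract command line as a list of arguments.

    Tesseract adds the ".txt" extension to output_base by itself.
    """
    command = ["tesseract", image_path, output_base]
    if language:
        command += ["-l", language]
    return command


# Print the command for the file scanned earlier, with Italian selected.
print(shlex.join(build_tesseract_command("out.tif", "out", language="ita")))
```

Running the printed command in the folder that contains out.tif should leave the recognized text in out.txt. From Python, the same list could be passed to subprocess.run with check=True.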
If we want only part of the text of our document, we just zoom in on the image and select the area of interest.
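Outside the GUI, a similar result can be approximated by cutting the area of interest out of the image with ImageMagick (installed in step 1) before running the OCR. This is a sketch under assumptions: the helper name, the file names and the pixel coordinates are hypothetical, and the convert options used are -crop with a WIDTHxHEIGHT+X+Y geometry followed by +repage.

```python
import shlex


def build_crop_command(source, target, width, height, x, y):
    """Return an ImageMagick convert command that cuts a rectangle out of source.

    width and height are the size of the area in pixels;
    x and y are the offsets of its top-left corner.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if x < 0 or y < 0:
        raise ValueError("x and y must not be negative")
    geometry = f"{width}x{height}+{x}+{y}"
    # +repage drops the original canvas offset from the cropped image.
    return ["convert", source, "-crop", geometry, "+repage", target]


# Example values only: a 2000x1500 pixel area starting at (300, 400).
print(shlex.join(build_crop_command("out.tif", "region.tif", 2000, 1500, 300, 400)))
```

The cropped file can then be opened and recognized like any other scan.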
After the procedure, save everything as a text file and close the program.
My tests gave positive results, but the most useful finding concerns the resolution of the image file: the higher the quality your scanner can deliver when scanning documents, the better the output of the OCR process, and the fewer errors you will have in the text file.
IMPORTANT NOTE: if the .TIFF file is not recognized by gImageReader, change its extension to .TIF (with one F), and the problem should be solved.
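If there are many files to fix, the rename can be scripted. The following is a minimal Python sketch of the idea; the function name is hypothetical. It only changes the extension and does not touch the image data.

```python
from pathlib import Path


def normalize_tiff_extension(path):
    """Rename "name.tiff" to "name.tif" and return the resulting path.

    Files with any other extension are left untouched.
    """
    path = Path(path)
    if path.suffix.lower() != ".tiff":
        return path
    target = path.with_suffix(".tif")
    if target.exists():
        # Refuse to overwrite an existing scan.
        raise FileExistsError(f"{target} already exists")
    path.rename(target)
    return target
```

For example, normalize_tiff_extension("out.tiff") would rename the file to out.tif in the same folder, while a file that already ends in .tif is returned unchanged.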