What's OCR?

Optical character recognition, or OCR, is a technology that converts scanned images of handwritten, typewritten, or printed text into machine-readable text.

A case that shows why OCR is used
In professional environments (such as legal offices), hundreds of documents are scanned regularly for backup and archival. A scanner merely takes a photograph of the original paper document, resulting in a PDF based on an image, without any real text that can be searched or edited. The major issue with processing and storing such large volumes of scanned documents is the inability to search for a specific phrase or name inside a file. Why can't words be searched? Simply because the text is nothing more than a 'photograph', so a computer cannot interpret it as actual text characters. For the same reason, no text can be highlighted, copied, or modified: the document contains one big image rather than individual text characters.

This is where OCR comes in. With OCR software, you can convert scanned documents to an editable and searchable format. Once the 'photograph' has been turned into text, it is easy to highlight, copy, modify, and search that text.

How does OCR work?
You can think of OCR software as a virtual eye that uses artificial intelligence to look at a scanned image and identify text characters through pattern recognition algorithms. Once the text has been recognized, the OCR software lets you save a new file in an editable format.
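
To make the idea of pattern recognition concrete, here is a minimal Python sketch that compares a small character bitmap against a few known templates and picks the closest match. It is a toy illustration only: the 3x5 templates, the names similarity and recognize, and the pixel-agreement score are all assumptions made for this example, and commercial OCR engines use far more sophisticated methods.

```python
# Toy pattern recognition: compare a 3x5 glyph bitmap with known templates.
# '#' is an ink pixel, '.' is background. Templates are hypothetical.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def similarity(glyph, template):
    """Return the fraction of pixels that agree between two same-sized bitmaps."""
    pairs = [
        (g, t)
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    ]
    return sum(g == t for g, t in pairs) / len(pairs)


def recognize(glyph):
    """Return (character, score) for the template that best matches the glyph."""
    scored = ((ch, similarity(glyph, tpl)) for ch, tpl in TEMPLATES.items())
    return max(scored, key=lambda pair: pair[1])


# A letter T with one stray ink pixel, as might come from a noisy scan.
noisy_t = ["###", ".#.", "##.", ".#.", ".#."]

char, score = recognize(noisy_t)
print(char, round(score, 2))  # T 0.93
```

The stray pixel lowers the score but does not change the answer, which hints at why clean scans matter: the more noise in the image, the closer the wrong templates come to winning.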

OCR limitations & some useful tips
OCR is a very advanced and useful technology, but its success also depends on the quality of the images it processes. The following list highlights some of the common scenarios that OCR finds problematic:
  • Images containing very small text (smaller than 10 points).
  • Images scanned from stained, crumpled, or colored paper.
  • Low quality images with grainy or faded text.
  • Images with skewed or warped text.
  • Images with mixed content (text, images, and graphics all on one page).
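
To see why very small text is a problem, it helps to estimate how many pixels a character actually gets in the scanned image. The sketch below assumes the standard typographic convention of 72 points per inch; the function name text_height_px is made up for this example, and the result is the nominal font height, so the visible body of a lowercase letter is smaller still.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 of an inch


def text_height_px(point_size, dpi):
    """Approximate nominal height, in pixels, of text scanned at a given resolution."""
    return point_size / POINTS_PER_INCH * dpi


# Compare 10-point and 6-point text at a few common scan resolutions.
for dpi in (150, 300, 600):
    ten = round(text_height_px(10, dpi), 1)
    six = round(text_height_px(6, dpi), 1)
    print(f"{dpi} dpi: 10 pt = {ten} px, 6 pt = {six} px")
```

Halving either the point size or the scan resolution halves the pixel height, leaving the recognizer fewer pixels with which to tell similar characters apart.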
The following list details some best practices and workarounds to ensure the final output is consistent with the scanned original:
  • Set the scanner color settings to Grayscale, or Black and White if the text is black against a white background.
  • If supported by your scanner, adjust brightness and contrast to achieve deep blacks and bright whites.
  • Set the scan quality (resolution) to 300 dpi or higher.
  • Start with a good original document. Wrinkles and creases might hinder OCR accuracy.
  • Ensure scanner glass is clean and free from smudges.
  • Keep your pages as straight as possible during scanning. Skewed pages require more processing in the OCR engine.
  • Depending on the quality of your scanner, you might need to scan the same document several times and process the best resulting image.
  • If your text is on a patterned or colored background, try to obtain another version on a plain white background. Text against colored backgrounds or gradients will require several attempts with different settings until the right configuration for successful OCR is found.
  • Some smudges can be manually repaired by using white correction fluid to cover unwanted artifacts.
  • If supported by your scanner, enable the despeckle setting to remove noise from your image.
  • If supported by your scanner, increase text smoothing to remove harsh blends and grain.
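
If your scanner does not offer these settings, similar clean-up can sometimes be done in software before running OCR. The following Python sketch illustrates two of the ideas above, black-and-white conversion and despeckling, on a tiny grayscale image stored as a list of rows. The function names, the threshold of 128, and the isolated-pixel rule are assumptions chosen for this example; a scanner's built-in despeckle setting may work differently.

```python
def binarize(gray, threshold=128):
    """Map grayscale values (0-255) to pure black (0) or pure white (255)."""
    return [[0 if px < threshold else 255 for px in row] for row in gray]


def despeckle(binary):
    """Turn a black pixel white when none of its neighbours is black."""
    height, width = len(binary), len(binary[0])
    cleaned = [row[:] for row in binary]
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 0:
                continue
            # Collect the up-to-8 surrounding pixels, clipped at the image edge.
            neighbours = [
                binary[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            ]
            if all(value == 255 for value in neighbours):
                cleaned[y][x] = 255
    return cleaned


# A tiny scan: a dark stroke on the left and one isolated dark speck on the right.
page = [
    [250, 240, 245, 250, 248],
    [245,  30,  40, 250,  20],
    [250,  35, 245, 250, 250],
    [248, 250, 250, 250, 250],
]

for row in despeckle(binarize(page)):
    print("".join("#" if px == 0 else "." for px in row))
```

The stroke survives because its pixels touch each other, while the lone speck is removed. A rule this simple would also erase genuinely isolated marks such as a tiny dot over an "i", so real tools apply more careful filters.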

OCR Software
There are several useful OCR tools on the market, such as ABBYY FineReader and OCRTerminal, and some of them are free.