
C# OCR (Optical Character Recognition)

OCR, as the title says, stands for Optical Character Recognition: the ability to extract characters as they appear in an image.


We will be using the MODI (Microsoft Office Document Imaging) type library, which .NET code reaches through COM Interop.


The MODI library ships with the Microsoft Office suites (2003 to 2007). Unfortunately, it is not included in the 2010 version.
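Since MODI may or may not be present on a given machine, it can help to check for it before running any OCR code. The following is only a minimal sketch: it assumes the library is registered under the ProgID "MODI.Document" (the name commonly used to create the object from script), and the class name ModiCheck is made up for illustration.

```csharp
using System;

class ModiCheck // hypothetical name, for illustration only
{
    static void Main()
    {
        // Assumption: MODI registers its document object under this ProgID.
        // GetTypeFromProgID returns null when the ProgID is not registered.
        Type modiType = Type.GetTypeFromProgID("MODI.Document");

        if (modiType != null)
        {
            Console.WriteLine("MODI appears to be registered on this machine.");
        }
        else
        {
            Console.WriteLine("MODI was not found; the OCR sample will not run here.");
        }
    }
}
```

This check needs no reference to the MODI type library, so it compiles even on a machine where the library is missing.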




Add a reference to the MODI type library (COM Interop) and convert images to text like this:
 
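using MODI;
using System;

class Program
{
    static void Main(string[] args)
    {
        DocumentClass myDoc = new DocumentClass();
        myDoc.Create(@"theDocumentName.tiff"); // load the image; this example uses a .tiff file
        try
        {
            // Run OCR in English, with auto-orientation and image straightening enabled.
            myDoc.OCR(MiLANGUAGES.miLANG_ENGLISH, true, true);

            // Each page of the document is exposed as an Image; print its recognized text.
            foreach (Image anImage in myDoc.Images)
            {
                Console.WriteLine(anImage.Layout.Text);
            }
        }
        finally
        {
            myDoc.Close(false); // release the file without saving changes
        }
    }
}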
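If you want to reuse this from other code, the same steps can be wrapped in a small helper. This is only a sketch under a few assumptions: OcrHelper and ExtractText are made-up names, not part of MODI, and it assumes OCR failures (for example, a page with nothing recognizable on it) surface in .NET as a COMException, as COM errors normally do. Errors from Create, such as a missing file, are left to the caller.

```csharp
using MODI;
using System;
using System.Runtime.InteropServices;
using System.Text;

static class OcrHelper // hypothetical helper, not part of MODI
{
    // Returns the recognized text of all pages, or null if OCR failed.
    public static string ExtractText(string tiffPath)
    {
        DocumentClass doc = new DocumentClass();
        doc.Create(tiffPath); // exceptions from Create propagate to the caller
        try
        {
            doc.OCR(MiLANGUAGES.miLANG_ENGLISH, true, true);

            StringBuilder text = new StringBuilder();
            foreach (Image page in doc.Images)
            {
                text.AppendLine(page.Layout.Text);
            }
            return text.ToString();
        }
        catch (COMException ex)
        {
            // Assumption: MODI reports recognition failures as COM errors.
            Console.Error.WriteLine("OCR failed: " + ex.Message);
            return null;
        }
        finally
        {
            doc.Close(false);              // do not save changes to the file
            Marshal.ReleaseComObject(doc); // release the underlying COM object
        }
    }
}

class Demo
{
    static void Main()
    {
        string result = OcrHelper.ExtractText(@"theDocumentName.tiff");
        Console.WriteLine(result ?? "(no text recognized)");
    }
}
```

Returning null on failure keeps the calling code simple; you could just as well let the exception propagate if you need to tell the failure cases apart.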






Leave me a comment if you need help with it.

3 comments:

  1. MODI is no longer packaged with MS Office starting with MS Office 2010. What is the alternative solution for now?

  2. @subi, Office 2010 has MODI http://support.microsoft.com/kb/982760. Don't lie..

  3. I want to convert images of Arabic text to text. MODI can't convert this. Is there any source that works without a third-party tool?
