OCR a (region of a) PDF using C# and "freeware"

Perform the following steps:

1) Install Ghostscript to make conversion of PDF possible:
https://code.google.com/p/ghostscript/downloads/list

2) Install Imagemagick to convert more easily from PDF to JPG:
http://www.imagemagick.org/script/binary-releases.php#windows

3) Install MODI using Sharepoint Designer 2007, if the Office version on the system is more recent then 2007:
Download Sharepoint Designer 2007
Start setup, custom, disable all
Then select all options under Microsoft Office Document Imaging

By default (on English Windows) only four English-alike languages can be recognized, so install other language packs, e.g. for Japanese or Chinese if needed. In Windows 7 Enterprise or Ultimate this can be done using the "optional updates" from the Windows Update tool In other versions additional languages are not available.

4) From C#, add COM reference to Microsoft Office Document Imaging

5a) Convert a PDF page to JPG from code by executing the following command line syntax:

Whole page:
convert -type grayscale -density 300 jp.pdf[0] jp.jpg

Region of the page:
convert -type grayscale -density 300 jp.pdf[0] -crop 600x600+50+50 jp_crop.jpg

5b) Convert the JPG image to text

Example C# code:
MODI.Document d = new MODI.Document();
d.Create(@"c:\tmp\image\jp_crop.jpg");
d.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, false, false);
MODI.Image i = d.Images[0];
Debug.WriteLine(i.Layout.Text.ToString());
Note:
  • If the OCR statement returns a "bad language" error, then install the requested language pack
  • If the OCR statement returns a "document not ready" error, then uninstall and reinstall MODI by using remove programs and rerunning the Sharepoint Designer setup
  • The language of the text that is parsed needs to be known and set in the OCR function