Locks & Latches: 07/01/2013

Perform the following steps:

1) Install Ghostscript, which ImageMagick relies on to read PDF files:
https://code.google.com/p/ghostscript/downloads/list

2) Install ImageMagick to convert more easily from PDF to JPG:
http://www.imagemagick.org/script/binary-releases.php#windows

3) Install MODI (Microsoft Office Document Imaging) from the SharePoint Designer 2007 setup, if the Office version on the system is more recent than 2007:
Download SharePoint Designer 2007.
Start setup, choose the custom installation, and disable all components.
Then select all options under Microsoft Office Document Imaging.

By default (on English Windows) only four English-like languages can be recognized, so install additional language packs, e.g. for Japanese or Chinese, if needed. In Windows 7 Enterprise or Ultimate this can be done through the "optional updates" in the Windows Update tool. In other Windows 7 editions additional languages are not available.

4) In the C# project, add a COM reference to Microsoft Office Document Imaging.

5a) Convert a PDF page to JPG from code by running one of the following ImageMagick commands ([0] selects the first page of the PDF):

Whole page:
convert -type grayscale -density 300 jp.pdf[0] jp.jpg

Region of the page:

convert -type grayscale -density 300 jp.pdf[0] -crop 600x600+50+50 jp_crop.jpg
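The commands above can be launched from C# with System.Diagnostics.Process. The following is a minimal sketch, not a definitive implementation: the class name PdfToJpg, its method names, and the convertExe parameter are hypothetical and not part of the original steps. It assumes that convertExe points to the ImageMagick convert executable.

```csharp
using System;
using System.Diagnostics;

static class PdfToJpg
{
    // Builds the argument string for ImageMagick's convert tool.
    // crop is optional, e.g. "600x600+50+50" (width x height + x offset + y offset).
    public static string BuildArguments(string pdfPath, int pageIndex, string jpgPath, string crop = null)
    {
        string args = string.Format("-type grayscale -density 300 \"{0}[{1}]\"", pdfPath, pageIndex);
        if (!string.IsNullOrEmpty(crop))
            args += " -crop " + crop;
        return args + " \"" + jpgPath + "\"";
    }

    // Starts convert, waits for it to finish, and returns its exit code.
    // convertExe is an assumption: the path to the ImageMagick convert executable.
    public static int Run(string convertExe, string arguments)
    {
        var psi = new ProcessStartInfo(convertExe, arguments)
        {
            UseShellExecute = false,
            CreateNoWindow = true
        };
        using (Process p = Process.Start(psi))
        {
            p.WaitForExit();
            return p.ExitCode;
        }
    }
}
```

For example, PdfToJpg.BuildArguments("jp.pdf", 0, "jp_crop.jpg", "600x600+50+50") reproduces the arguments of the region command above, with the file names quoted so that paths containing spaces also work. On Windows it is probably safest to pass the full path to ImageMagick's convert.exe, because a bare "convert" may resolve to the Windows disk utility of the same name.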

5b) Convert the JPG image to text

Example C# code:

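// Requires "using System.Diagnostics;" and the MODI COM reference from step 4.
MODI.Document d = new MODI.Document();
d.Create(@"c:\tmp\image\jp_crop.jpg");
// Arguments: language, auto-orient the image, straighten the image.
d.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, false, false);
// The explicit cast is needed when the Images indexer is typed as object.
MODI.Image i = (MODI.Image)d.Images[0];
Debug.WriteLine(i.Layout.Text);
d.Close(false); // close the document without saving changes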

Note:

If the OCR statement returns a "bad language" error, install the requested language pack.
If the OCR statement returns a "document not ready" error, uninstall MODI through Remove Programs, then rerun the SharePoint Designer setup to reinstall it.
The language of the text being parsed must be known in advance and passed to the OCR function.
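The notes above can be folded into a small wrapper that takes the language as a parameter and reports OCR failures instead of crashing. This is a sketch under assumptions: the OcrHelper class and Recognize method are hypothetical names, and it assumes that MODI errors such as the ones described above surface in C# as COMException.

```csharp
using System;
using System.Runtime.InteropServices;

static class OcrHelper
{
    // Runs MODI OCR on an image file and returns the recognized text,
    // or null if OCR failed. Requires the MODI COM reference from step 4.
    public static string Recognize(string imagePath, MODI.MiLANGUAGES language)
    {
        MODI.Document doc = new MODI.Document();
        bool opened = false;
        try
        {
            doc.Create(imagePath);
            opened = true;
            doc.OCR(language, false, false);
            MODI.Image image = (MODI.Image)doc.Images[0];
            return image.Layout.Text;
        }
        catch (COMException ex)
        {
            // Assumption: a missing language pack or a broken MODI install ends up here.
            Console.Error.WriteLine("OCR failed (0x{0:X8}): {1}", ex.ErrorCode, ex.Message);
            return null;
        }
        finally
        {
            // Close without saving so the image file is released.
            if (opened)
                doc.Close(false);
        }
    }
}
```

A call for Japanese text might look like OcrHelper.Recognize(@"c:\tmp\image\jp_crop.jpg", MODI.MiLANGUAGES.miLANG_JAPANESE), provided the Japanese language pack is installed.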

OCR a (region of a) PDF using C# and "freeware"