1) Install Ghostscript so that PDF files can be converted (ImageMagick relies on it to read PDFs):
https://code.google.com/p/ghostscript/downloads/list
2) Install Imagemagick to convert more easily from PDF to JPG:
http://www.imagemagick.org/script/binary-releases.php#windows
3) Install MODI using SharePoint Designer 2007, if the Office version on the system is more recent than 2007:
Download SharePoint Designer 2007
Start setup, choose the custom installation, and disable all components
Then select all options under Microsoft Office Document Imaging
By default (on English Windows) only four English-like languages can be recognized, so install additional language packs, e.g. for Japanese or Chinese, if needed. In Windows 7 Enterprise or Ultimate this can be done through the "optional updates" in the Windows Update tool. In other versions additional languages are not available.
4) In the C# project, add a COM reference to Microsoft Office Document Imaging
5a) Convert a PDF page to JPG from code by running one of the following commands:
Whole page:
convert -type grayscale -density 300 jp.pdf[0] jp.jpg
Region of the page:
convert -type grayscale -density 300 jp.pdf[0] -crop 600x600+50+50 jp_crop.jpg
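A minimal sketch of how the commands above might be launched from C#. The class name PdfToJpg, its method names, and the convertExe parameter are hypothetical, not part of ImageMagick. Passing the full path to ImageMagick's convert.exe is probably safer than relying on PATH, because Windows ships its own unrelated convert.exe; in ImageMagick 7 the command is named magick instead.

```csharp
using System;
using System.Diagnostics;

static class PdfToJpg
{
    // Builds the same argument string as the commands above.
    // crop is an ImageMagick geometry such as "600x600+50+50", or null for the whole page.
    public static string BuildArguments(string pdfPath, int pageIndex, string jpgPath, string crop)
    {
        string args = "-type grayscale -density 300 \"" + pdfPath + "[" + pageIndex + "]\"";
        if (!string.IsNullOrEmpty(crop))
            args += " -crop " + crop;
        return args + " \"" + jpgPath + "\"";
    }

    // convertExe: full path to ImageMagick's convert.exe (assumption: depends on your install).
    public static void Run(string convertExe, string pdfPath, int pageIndex, string jpgPath, string crop)
    {
        var psi = new ProcessStartInfo
        {
            FileName = convertExe,
            Arguments = BuildArguments(pdfPath, pageIndex, jpgPath, crop),
            UseShellExecute = false,        // required for stream redirection
            RedirectStandardError = true,   // capture ImageMagick/Ghostscript errors
            CreateNoWindow = true
        };
        using (Process p = Process.Start(psi))
        {
            string errors = p.StandardError.ReadToEnd();
            p.WaitForExit();
            if (p.ExitCode != 0)
                throw new InvalidOperationException("convert failed: " + errors);
        }
    }
}
```

For example, PdfToJpg.Run(convertExe, @"c:\tmp\image\jp.pdf", 0, @"c:\tmp\image\jp_crop.jpg", "600x600+50+50") would correspond to the "region of the page" command, with convertExe set to wherever convert.exe was installed.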
5b) Convert the JPG image to text
Example C# code:
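using System.Diagnostics;                                // for Debug.WriteLine

MODI.Document d = new MODI.Document();
d.Create(@"c:\tmp\image\jp_crop.jpg");                   // load the image produced in step 5a
d.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, false, false);    // set this to the language of the text
MODI.Image i = (MODI.Image)d.Images[0];                  // first page; cast needed if Images[0] is typed as object
Debug.WriteLine(i.Layout.Text);                          // recognized text
d.Close(false);                                          // release the document without saving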
Note:
- If the OCR call fails with a "bad language" error, install the language pack for the requested language
- If the OCR call fails with a "document not ready" error, uninstall MODI through Add/Remove Programs and reinstall it by rerunning the SharePoint Designer setup
- The language of the text to be recognized must be known in advance and passed to the OCR function
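The notes above could be folded into a small wrapper, sketched below. It assumes the MODI COM reference from step 4 and that MODI failures surface in .NET as COMException. The class OcrHelper and its Hint method are hypothetical, and matching on the message text "bad language" or "not ready" is only an assumption based on the error wording quoted in the notes.

```csharp
using System;
using System.Runtime.InteropServices;

static class OcrHelper
{
    // Maps an error message to the advice from the notes above.
    // Assumption: the message contains the wording quoted in the notes.
    public static string Hint(string message)
    {
        string m = (message ?? "").ToLowerInvariant();
        if (m.Contains("bad language"))
            return "Install the language pack for the requested OCR language.";
        if (m.Contains("not ready"))
            return "Uninstall MODI and rerun the SharePoint Designer 2007 setup.";
        return "No hint available.";
    }

    // Runs OCR on one image and returns the recognized text.
    // The caller must pass the language of the text explicitly.
    public static string Recognize(string imagePath, MODI.MiLANGUAGES language)
    {
        MODI.Document d = new MODI.Document();
        try
        {
            d.Create(imagePath);
            d.OCR(language, false, false);
            MODI.Image i = (MODI.Image)d.Images[0];
            return i.Layout.Text;
        }
        catch (COMException ex)
        {
            // Attach the hint and keep the original exception for diagnosis.
            throw new InvalidOperationException(Hint(ex.Message), ex);
        }
        finally
        {
            // Close without saving; ignore a failure here so it does not mask the OCR error.
            try { d.Close(false); } catch (COMException) { }
        }
    }
}
```

A call such as OcrHelper.Recognize(@"c:\tmp\image\jp_crop.jpg", MODI.MiLANGUAGES.miLANG_JAPANESE) would then request Japanese recognition, provided the Japanese language pack is installed.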