OCR a (region of a) PDF using C# and "freeware"

Perform the following steps:

1) Install Ghostscript to make conversion of PDF possible:
https://code.google.com/p/ghostscript/downloads/list

2) Install Imagemagick to convert more easily from PDF to JPG:
http://www.imagemagick.org/script/binary-releases.php#windows

3) Install MODI using Sharepoint Designer 2007, if the Office version on the system is more recent then 2007:
Download Sharepoint Designer 2007
Start setup, custom, disable all
Then select all options under Microsoft Office Document Imaging

By default (on English Windows) only four English-alike languages can be recognized, so install other language packs, e.g. for Japanese or Chinese if needed. In Windows 7 Enterprise or Ultimate this can be done using the "optional updates" from the Windows Update tool In other versions additional languages are not available.

4) From C#, add COM reference to Microsoft Office Document Imaging

5a) Convert a PDF page to JPG from code by executing the following command line syntax:

Whole page:
convert -type grayscale -density 300 jp.pdf[0] jp.jpg

Region of the page:
convert -type grayscale -density 300 jp.pdf[0] -crop 600x600+50+50 jp_crop.jpg

5b) Convert the JPG image to text

Example C# code:
MODI.Document d = new MODI.Document();
d.Create(@"c:\tmp\image\jp_crop.jpg");
d.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, false, false);
MODI.Image i = d.Images[0];
Debug.WriteLine(i.Layout.Text.ToString());
Note:
  • If the OCR statement returns a "bad language" error, then install the requested language pack
  • If the OCR statement returns a "document not ready" error, then uninstall and reinstall MODI by using remove programs and rerunning the Sharepoint Designer setup
  • The language of the text that is parsed needs to be known and set in the OCR function

Analysis Services fails with "attribute key cannot be found" on Unicode field

ERROR
The attribute key cannot be found when processing: Table: 'dbo_viu_isb', Column: 'city', Value: ' '. The attribute is 'City'.

OR

The attribute key cannot be found when processing: Table: 'dbo_viu_isb', Column: 'customername', Value: 'パナソニック(株)'. The attribute is 'Customername'.

CAUSE
The problem is caused by the Unicode nature of the SPACE character in the text field in the database. In the field are for example Kanji characters and unicode has about 18 characters for SPACE:

http://www.fileformat.info/info/unicode/category/Zs/list.htm

There is no way to tell Analysis Services or SQL Server the difference between these spaces. In short: collation, case sensitivity, accent sensitivity, width sensitivity or Kana sensitivity have no influence.

WORKAROUND
Convert all the kind of SPACE characters to one SPACE character, e.g:
replace(ad_.ad_name,' ',' ') as ad_name

This appears to do nothing, but the first empty space finds ALL space characters, e.g. ASCII 32 and Unicode 12288 and the second parameter converts it to ONE space (Unicode 32)