Install Tesseract with German language support in Gentoo
I needed to scan some German text documents and wanted to convert them to actual text. I came across Tesseract, “the industry-standard free, open-source Optical Character Recognition engine”, according to its website.
I installed the software and tried to convert my scanned document with an adapted default command I found on the program’s website, and used ger instead of eng, since my document was written in German and not English:
tesseract mydocument.png outputbase -l ger --psm 3
However, I got the following error:
Error opening data file /usr/share/tessdata/ger.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'ger'
Tesseract couldn't load any languages!
Could not initialize tesseract.
It took me a while and some searching the web to fix the error, so I thought I'd share my solution in case someone else runs into the same problem.
Solution
It turned out that there were two separate problems. First, the folder /usr/share/tessdata/ only had the English traineddata file in it, and no other language files.
app-text/tesseract depends on app-text/tessdata_fast, which provides a set of language-related USE flags. Additional languages can be installed by enabling the matching USE flag. Each flag consists of the prefix l10n_ followed by the language code, for example l10n_de for German.
Since I needed German, I added the following USE flag:
app-text/tessdata_fast l10n_de
Re-emerge the package, and the additional language data is installed.
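The two steps above can be sketched as a small shell helper. This is only a sketch under some assumptions: /etc/portage/package.use is a directory on your system, the file name tesseract inside it is an arbitrary choice, and add_use_flag is a hypothetical helper name, not something Portage provides.

```shell
#!/bin/sh
# Hypothetical helper: append "<atom> <flag>" to a package.use file,
# unless exactly that line is already there.
#   $1: package atom, $2: USE flag, $3: package.use file to write to
add_use_flag() {
    atom=$1
    flag=$2
    file=$3
    mkdir -p "$(dirname "$file")"
    touch "$file"
    # -x matches the whole line, -F treats the pattern as a fixed string
    if ! grep -qxF "$atom $flag" "$file"; then
        printf '%s %s\n' "$atom" "$flag" >> "$file"
    fi
}

# Usage, as root (the file name "tesseract" is an assumption):
#   add_use_flag app-text/tessdata_fast l10n_de /etc/portage/package.use/tesseract
#   emerge --ask --oneshot app-text/tessdata_fast
```

Using --oneshot here keeps tessdata_fast out of the world file, since it is only pulled in as a dependency of app-text/tesseract.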
Unfortunately, I still got the error above. So there was another problem. However, ls -al /usr/share/tessdata/ brought clarity:
-rw-r--r-- 1 root root 1525436 May 25 18:24 deu.traineddata
Turns out ger is the wrong language code. Tesseract expects deu instead:
tesseract mydocument.png outputbase -l deu --psm 3
Now you have Tesseract up and running, converting your German documents.
If you need a language other than German, add the corresponding USE flag and replace deu with the code of that language.
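To find out which codes the -l option will accept, it may help to derive them from the file names in the tessdata directory, since each code is simply the name of a .traineddata file. The following is a minimal sketch. list_tess_langs is a hypothetical helper name, and the default path is the one from the error message above.

```shell
#!/bin/sh
# Hypothetical helper: print the language codes found in a tessdata
# directory, one per line, by stripping the .traineddata suffix.
#   $1: tessdata directory (defaults to /usr/share/tessdata)
list_tess_langs() {
    dir=${1:-/usr/share/tessdata}
    for f in "$dir"/*.traineddata; do
        [ -e "$f" ] || continue   # the glob matched nothing
        basename "$f" .traineddata
    done
}

# Usage:
#   list_tess_langs
# Tesseract can also report this itself:
#   tesseract --list-langs
```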