I am looking for a utility that guesses the language and character encoding of plain-text documents,
much like the 'Auto-detect' function in some browsers. I've heard about n-gram based methods, but other approaches may be available.
The utility must accept a file or a string as an argument and return the language and the encoding. If the document contains two or more languages, it should return the most heavily used one, like 'Mostly English' or 'Mostly Russian'.
It must be able to 'learn' new languages and encodings.
It must be written in Java and encapsulated as a separate class, so it can be easily plugged into any Java program. Detailed Javadoc is required.
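To make the requirements above more concrete, here is a rough sketch of the kind of class I have in mind. It is only an illustration under my own assumptions, not a required design. The names LanguageGuesser, Guess, learn and guess are hypothetical, and the scoring (cosine similarity over byte-trigram counts) is just one simple n-gram approach. Working on raw bytes lets one profile capture language and encoding together, and the dominant language in a mixed document should tend to dominate the trigram counts.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch: guesses a (language, encoding) pair from raw bytes
 * by comparing byte-trigram counts against previously learned profiles.
 */
final class LanguageGuesser {

    /** Result holder: the labels of the best-matching learned profile. */
    public static final class Guess {
        public final String language;
        public final String encoding;

        Guess(String language, String encoding) {
            this.language = language;
            this.encoding = encoding;
        }

        @Override
        public String toString() {
            return "Mostly " + language + " (" + encoding + ")";
        }
    }

    // One trigram-count profile per "language<TAB>encoding" key (assumed key format).
    private final Map<String, Map<Integer, Integer>> profiles = new HashMap<>();

    /** Counts overlapping byte trigrams, packing each trigram into one int. */
    private static Map<Integer, Integer> trigrams(byte[] data) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int i = 0; i + 2 < data.length; i++) {
            int key = ((data[i] & 0xFF) << 16) | ((data[i + 1] & 0xFF) << 8) | (data[i + 2] & 0xFF);
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    /** "Learns" a language/encoding pair by adding a sample's trigram counts to its profile. */
    public void learn(String language, String encoding, byte[] sample) {
        Map<Integer, Integer> profile =
                profiles.computeIfAbsent(language + "\t" + encoding, k -> new HashMap<>());
        trigrams(sample).forEach((k, v) -> profile.merge(k, v, Integer::sum));
    }

    /** Cosine similarity between two sparse count vectors. */
    private static double cosine(Map<Integer, Integer> a, Map<Integer, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<Integer, Integer> e : a.entrySet()) {
            na += (double) e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            nb += (double) v * v;
        }
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    /** Returns the best-scoring learned pair, or null if nothing has been learned yet. */
    public Guess guess(byte[] data) {
        Map<Integer, Integer> doc = trigrams(data);
        String bestKey = null;
        double bestScore = -1;
        for (Map.Entry<String, Map<Integer, Integer>> e : profiles.entrySet()) {
            double score = cosine(doc, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                bestKey = e.getKey();
            }
        }
        if (bestKey == null) {
            return null;
        }
        String[] parts = bestKey.split("\t", 2);
        return new Guess(parts[0], parts[1]);
    }
}
```

With something like this, a caller would first call learn("English", "US-ASCII", sampleBytes) for each known pair, then pass guess() the bytes of a document, for example from java.nio.file.Files.readAllBytes(path). A real solution would need much larger training samples and probably a more robust ranking than this sketch.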
1) Complete and fully-functional working program(s) in executable form as well as complete source code of all work done.
2) Exclusive and complete copyright to all work purchased. (No GPL code, third-party components, etc., unless all copyright ramifications are explained AND AGREED TO by the buyer on the site.)