= Chared =

Chared is a tool for detecting the character encoding of a text in a known language. The language of the text has to be specified as an input parameter so that the corresponding language model can be used. The package contains models for a wide range of languages. In general, it should be more accurate than character encoding detection algorithms that have no language constraints.
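
To illustrate why knowing the language can help, here is a toy sketch. It is not the algorithm Chared uses; the scoring function, the candidate list, and the sample sentence are all assumptions made for illustration. It decodes the same bytes with several candidate encodings and keeps the one whose result looks most like Czech text.

```python
# -*- coding: utf-8 -*-
# Toy illustration only: NOT Chared's detection method.
CZECH_LETTERS = set(u"áčďéěíňóřšťúůýžÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ")


def plausibility(text):
    """Count characters that are spaces, ASCII letters or Czech letters."""
    return sum(
        1 for ch in text
        if ch == u" " or ch in CZECH_LETTERS or (ord(ch) < 128 and ch.isalpha())
    )


def guess_encoding(data, candidates):
    """Return the candidate whose decoding looks most like Czech text."""
    best_encoding, best_score = None, -1
    for encoding in candidates:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
        score = plausibility(text)
        if score > best_score:
            best_encoding, best_score = encoding, score
    return best_encoding


# Hypothetical sample: a Czech phrase stored as Windows-1250 bytes.
sample = u"Příliš žluťoučký kůň".encode("cp1250")
guess = guess_encoding(sample, ["utf_8", "iso8859_2", "cp1250"])
```

Without the Czech letter set, the ISO 8859-2 and Windows-1250 decodings would be hard to tell apart, because both succeed without raising an error.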

== Installation ==
1. Make sure you have Python 2.6 or later and the lxml library, version 2.2.4 or later, installed.
2. Download the sources:
{{{
wget http://chared.googlecode.com/files/chared-1.2.tar.gz
}}}
3. Extract the downloaded file:
{{{
tar xzvf chared-1.2.tar.gz
}}}
4. Install the package (you may need sudo or a root shell for the latter command):
{{{
cd chared-1.2/
python setup.py install
}}}
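
The prerequisites from step 1 can be checked before installing. The following is a minimal sketch, not part of Chared: the helper names `meets_minimum` and `check_prerequisites` are made up for this example, and it relies only on `sys.version_info` and lxml's `etree.LXML_VERSION` tuple.

```python
import sys

# Minimum versions taken from the installation notes.
MIN_PYTHON = (2, 6)
MIN_LXML = (2, 2, 4)


def meets_minimum(found, required):
    """Return True if the version tuple `found` is at least `required`."""
    return tuple(found[:len(required)]) >= tuple(required)


def check_prerequisites():
    """Return a list of problems found; the list is empty if all is well."""
    problems = []
    if not meets_minimum(sys.version_info, MIN_PYTHON):
        problems.append("Python 2.6 or later is required")
    try:
        from lxml import etree
    except ImportError:
        problems.append("lxml is not installed")
    else:
        # LXML_VERSION is a tuple of integers such as (2, 2, 4, 0).
        if not meets_minimum(etree.LXML_VERSION, MIN_LXML):
            problems.append("lxml 2.2.4 or later is required")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(problem)
```

An empty result only means the version requirements are met; it does not verify that Chared itself installs or runs correctly.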

== Quick start ==
Detect the character encoding of a file or URL:
{{{
chared -m czech http://nlp.fi.muni.cz/cs/nlplab
}}}
Create a custom character encoding detection model from a collection of HTML pages (e.g. for Swahili):
{{{
chared-learn -o swahili.edm swahili_pages/*.html
}}}
Or, if you have a sample text in Swahili (plain text, UTF-8) and want to apply language filtering to the input HTML files (recommended):
{{{
chared-learn -o swahili.edm -S swahili_sample.txt swahili_pages/*.html
}}}
For usage information, see:
{{{
chared --help
chared-learn --help
}}}
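
To run the detection command over several files or URLs from a script, something like the following sketch might work. The helper names are hypothetical, `subprocess.check_output` needs Python 2.7 or later, and because the output format of `chared` is not described here, the raw output is returned unparsed.

```python
import subprocess


def build_chared_command(model, target):
    """Build the argument list for `chared -m MODEL TARGET`."""
    return ["chared", "-m", model, target]


def detect_many(model, targets):
    """Run chared once per target and collect its raw standard output."""
    results = {}
    for target in targets:
        output = subprocess.check_output(build_chared_command(model, target))
        # Keep the text as printed by the tool; no parsing is attempted.
        results[target] = output.decode("utf-8", "replace").strip()
    return results
```

For example, `detect_many("czech", ["a.html", "b.html"])` would return a dictionary mapping each target to whatever `chared` printed for it, assuming the `chared` script is on the `PATH`.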

== Python API ==

{{{
>>> import urllib2
>>> import chared.detector
>>> # download the page as raw (undecoded) bytes
>>> page = urllib2.urlopen('http://nlp.fi.muni.cz/cs/nlplab').read()
>>> # locate and load the bundled Czech model
>>> cz_model_path = chared.detector.get_model_path('czech')
>>> cz_model = chared.detector.EncodingDetector.load(cz_model_path)
>>> # classify() returns a list of detected encodings
>>> cz_model.classify(page)
['utf_8']
}}}
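
Once `classify` has returned its list, the page can be decoded with the detected encoding. The helper below is a sketch and not part of the Chared API. It assumes the returned names are valid Python codec names, as `'utf_8'` is in the session above, and that the list may be empty or contain more than one entry; the fallback encoding is also an assumption.

```python
def decode_with_candidates(data, candidates, fallback="utf_8"):
    """Decode `data` with the first candidate encoding that works.

    Returns a (text, encoding) pair. If no candidate succeeds, the
    fallback is used and undecodable bytes are replaced.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name or bytes invalid in this encoding.
            continue
    return data.decode(fallback, "replace"), fallback


# Stand-in for downloaded page bytes and a classify() result.
raw = u"Brno je m\u011bsto".encode("utf_8")
text, used = decode_with_candidates(raw, ["utf_8"])
```

In the session above, the call would be `decode_with_candidates(page, cz_model.classify(page))`.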

== Acknowledgements ==
This software has been developed at the [http://nlp.fi.muni.cz/en/nlpc Natural Language Processing Centre] of [http://www.muni.cz/ Masaryk University in Brno] with financial support from [http://presemt.eu PRESEMT] and [http://www.sketchengine.co.uk Lexical Computing Ltd.]

== See also ==
[http://googleblog.blogspot.com/2012/02/unicode-over-60-percent-of-web.html Unicode over 60 percent of the web] on the Google blog