Changes between Version 6 and Version 7 of Unitok

Unitok

-              v6
+              v7
 = unitok =
+Universal text tokenizer for scripts that use whitespace to separate tokens:
 * splits input text into tokens (one token per line)
 * recognizes URLs, e-mail addresses, DNS domains, IP addresses
 …
 * adds glue (<g/>) tags between tokens not separated by space
+[[span(style=color:#FF0000;font-weight:bold, Python 2.7 required )]]; Python 3 compatibility will be added soon.
+Requires a configuration file defining tokens in the target language.
+Configuration files are provided in the configs directory.
+configs/other.py is the default configuration, which can be used for any language
+written in a script that uses whitespace to separate tokens.
+Language-specific configuration files contain language-specific token regexps,
+e.g. abbreviations common in the language.
+Uninormed input is expected, i.e. the input has to be character-normalized using
+the standalone script uninorm.py; see usage below.
 {{{
 …
 }}}
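The one-token-per-line output with glue tags described above can be illustrated with a minimal sketch. It is not unitok's implementation: the token pattern `TOKEN_RE` and the function `tokenize_with_glue` are hypothetical names chosen for illustration, and the pattern is far simpler than the regexps in the real configuration files (it does not recognize URLs, e-mail addresses, or abbreviations).

```python
import re

# Hypothetical, deliberately simple token pattern (an assumption, not unitok's):
# a run of word characters, or a single non-space, non-word symbol.
TOKEN_RE = re.compile(r"\w+|[^\w\s]", re.UNICODE)


def tokenize_with_glue(text):
    """Return output lines: one token per line, with a <g/> line
    between two tokens that were not separated by whitespace."""
    lines = []
    prev_end = None
    for match in TOKEN_RE.finditer(text):
        # Adjacent tokens (no whitespace in between) get a glue tag.
        if prev_end is not None and match.start() == prev_end:
            lines.append("<g/>")
        lines.append(match.group())
        prev_end = match.end()
    return lines


if __name__ == "__main__":
    print("\n".join(tokenize_with_glue("Hello, world!")))
```

In this sketch, the comma and the exclamation mark are each preceded by a `<g/>` line because no space separates them from the preceding word, while `world` is not, since a space comes before it.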
+== Usage example (English) ==
+{{{
+python uninorm.py < input.txt | python unitok.py --trim 100 configs/english.py > output.vert
+}}}
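The exact rules applied by uninorm.py are not described on this page. As a rough illustration of what character normalization can mean, the sketch below uses the standard-library `unicodedata` module with the NFKC form; both the choice of NFKC and the helper name `normalize_text` are assumptions made for this example, not a description of uninorm.py.

```python
# -*- coding: utf-8 -*-
import unicodedata


def normalize_text(text):
    """Apply Unicode NFKC normalization (an assumed stand-in for the
    normalization step, not the actual behavior of uninorm.py)."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    # "e" + combining acute accent is composed into a single code point.
    composed = normalize_text(u"e\u0301")
    print(len(composed))
    # The "fi" ligature (U+FB01) is expanded into two plain letters.
    print(normalize_text(u"\ufb01") == u"fi")
```

Normalizing before tokenization means that visually identical strings are represented by the same code points, so token regexps only need to match one form.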
 == Get unitok ==
 See [wiki:Downloads] for the latest version.