Changes between Version 6 and Version 7 of Unitok
Timestamp: 12/16/25 16:24:04 (2 weeks ago)
Unitok
v6  v7
 1   1    = unitok =
 2   2
     3  + Universal text tokenizer for scripts using whitespace to separate tokens:
 3   4    * splits input text into tokens (one token per line)
 4   5    * recognizes URLs, e-mail addresses, DNS domains, IP addresses
 …   …
 8   9    * adds glue (<g/>) tags between tokens not separated by space
 9  10
10      - [[span(style=color:#FF0000;font-weight:bold, Python 2.7 required )]], Python 3 compatibility will be added soon
    11  + Requires a configuration file defining tokens in the target language.
    12  + Configuration files are provided in directory configs.
    13  + configs/other.py is the default configuration that can be used for any language
    14  + written in a script using whitespace to separate tokens.
    15  + Language-specific configuration files contain language-specific token regexps,
    16  + e.g. abbreviations common in the language.
    17  +
    18  + Uninormed input is expected, i.e. the input has to be character-normalized using
    19  + a standalone script uninorm.py, see usage below.
11  20
12  21    {{{
 …   …
19  28    }}}
20  29
    30  +
    31  + == Usage example (English) ==
    32  + {{{
    33  + python uninorm.py < input.txt | python unitok.py --trim 100 configs/english.py > output.vert
    34  + }}}
    35  +
    36  +
21  37    == Get unitok ==
22  38    See [wiki:Downloads] for the latest version.
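The output format described on this page (one token per line, with a glue tag between tokens not separated by space) can be illustrated with a minimal sketch. This is not Unitok's implementation and does not use its configuration files: the pattern TOKEN_RE and the function tokenize_with_glue are hypothetical names chosen for illustration, and the pattern is far simpler than the language-specific regexps the page mentions (it does not recognize URLs, e-mail addresses or abbreviations).

```python
import re

# Hypothetical token pattern, NOT taken from any Unitok config file:
# a run of word characters, or a single non-space, non-word character.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

GLUE = "<g/>"


def tokenize_with_glue(text):
    """Return one token per list item, inserting a glue tag between
    adjacent tokens that had no whitespace between them in the input."""
    lines = []
    prev_end = None
    for match in TOKEN_RE.finditer(text):
        # Tokens touch when the new match starts where the last one ended.
        if prev_end is not None and match.start() == prev_end:
            lines.append(GLUE)
        lines.append(match.group())
        prev_end = match.end()
    return lines


if __name__ == "__main__":
    # Prints each token (and each glue tag) on its own line.
    print("\n".join(tokenize_with_glue("Hello, world!")))
```

For the input "Hello, world!" the sketch yields the lines Hello, <g/>, a comma, world, <g/> and an exclamation mark: the glue tags record that the punctuation was attached to the preceding word, so the original spacing can be recovered from the one-token-per-line output.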

