Changes between Version 6 and Version 7 of Unitok


Timestamp:
12/16/25 16:24:04 (2 weeks ago)
Author:
admin
Comment:

unitok v. 4

  • Unitok

    v6 v7  
    11= unitok =
    22
     3Universal text tokenizer for scripts using whitespace to separate tokens:
    34* splits input text into tokens (one token per line)
    45* recognizes URLs, e-mail addresses, DNS domains, IP addresses
     
    89* adds glue (<g/>) tags between tokens not separated by space
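
The output format described in the list above (one token per line, with a `<g/>` tag between tokens that were not separated by whitespace) can be illustrated with a minimal sketch. This is not unitok's implementation: the function name `to_vertical` and the pattern `TOKEN_RE` are hypothetical, and the pattern is far simpler than the token regexps defined in unitok's configuration files (it does not recognize URLs, e-mail addresses, or abbreviations).

```python
import re

# Hypothetical, simplified token pattern (an assumption for illustration):
# a run of word characters, or a single non-space, non-word character.
# Unitok's real patterns come from its configuration file.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def to_vertical(text):
    """Return output lines: one token per line, with <g/> inserted between
    adjacent tokens that had no whitespace between them in the input."""
    lines = []
    prev_end = None
    for match in TOKEN_RE.finditer(text):
        # Tokens are "glued" when this one starts where the previous one ended.
        if prev_end is not None and match.start() == prev_end:
            lines.append("<g/>")
        lines.append(match.group())
        prev_end = match.end()
    return lines


if __name__ == "__main__":
    print("\n".join(to_vertical("Hello, world!")))
```

For the input `Hello, world!` the sketch emits `Hello`, `<g/>`, `,`, `world`, `<g/>`, `!` on separate lines: the comma and the exclamation mark are glued to the preceding words, while `world` is not glued to the comma because a space separates them.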
    910
    10 [[span(style=color:#FF0000;font-weight:bold, Python 2.7 required )]], Python 3 compatibility will be added soon
     11Unitok requires a configuration file defining tokens in the target language.
     12Configuration files are provided in the configs directory.
     13configs/other.py is the default configuration that can be used for any language
     14written in a script using whitespace to separate tokens.
     15Language-specific configuration files contain language-specific token regexps,
     16e.g. abbreviations common in the language.
     17
     18The input is expected to be uninormed, i.e. character-normalized with the
     19standalone script uninorm.py; see the usage example below.
    1120
    1221{{{
     
    1928}}}
    2029
     30
     31== Usage example (English) ==
     32{{{
     33python uninorm.py < input.txt | python unitok.py --trim 100 configs/english.py > output.vert
     34}}}
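
When the two scripts need to be driven from another Python program rather than from a shell, the same pipeline can be sketched with the standard `subprocess` module. The helper names `pipeline_commands` and `tokenize_file` are hypothetical; the script names, the `--trim 100` option, and the configuration path are copied from the usage example above, and the sketch assumes that `uninorm.py` and `unitok.py` are in the current working directory and that a `python` interpreter is on the PATH.

```python
import subprocess


def pipeline_commands(config="configs/english.py", trim=100, python="python"):
    """Build the two commands from the usage example as argument lists."""
    uninorm = [python, "uninorm.py"]
    unitok = [python, "unitok.py", "--trim", str(trim), config]
    return uninorm, unitok


def tokenize_file(input_path, output_path, **kwargs):
    """Run `uninorm.py | unitok.py` on input_path, writing to output_path."""
    uninorm_cmd, unitok_cmd = pipeline_commands(**kwargs)
    with open(input_path, "rb") as src, open(output_path, "wb") as dst:
        norm = subprocess.Popen(uninorm_cmd, stdin=src, stdout=subprocess.PIPE)
        tok = subprocess.Popen(unitok_cmd, stdin=norm.stdout, stdout=dst)
        # Close our copy of the pipe so uninorm sees a broken pipe
        # if unitok exits early.
        norm.stdout.close()
        tok_status = tok.wait()
        norm_status = norm.wait()
    if norm_status != 0 or tok_status != 0:
        raise RuntimeError(
            "pipeline failed: uninorm=%d, unitok=%d" % (norm_status, tok_status)
        )


if __name__ == "__main__":
    tokenize_file("input.txt", "output.vert")
```

Passing the commands as argument lists avoids shell quoting issues with file names; a different configuration, such as `configs/other.py`, can be selected through the `config` argument.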
     35
     36
    2137== Get unitok ==
    2238See [wiki:Downloads] for the latest version.