unitok v. 4

unitok

Universal text tokenizer for scripts that use whitespace to separate tokens:

  • splits input text into tokens (one token per line)
  • recognizes URLs, e-mail addresses, DNS domains, IP addresses
  • for specified languages recognizes abbreviations and clitics (such as 've or n't in English)
  • preserves XML-like tags
  • replaces entities with their Unicode equivalents
  • adds glue (<g/>) tags between tokens not separated by a space
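
The behaviour listed above can be illustrated with a minimal sketch. It is not unitok's implementation: the token pattern, the function name tokenize and the decision to emit glue between any two adjacent tokens (tags included) are assumptions made for illustration, and language-specific handling of abbreviations and clitics is left out.

```python
import html
import re

# Hypothetical token pattern (NOT unitok's real configuration).
# Alternatives are tried in order at each position.
TOKEN_RE = re.compile(
    r"""
      <[^<>\s][^<>]*>                                   # XML-like tag, kept whole
    | &(?:\#\d+|\#x[0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);  # character entity
    | https?://[^\s<>]+                                 # URL (very rough)
    | [\w.+-]+@[\w-]+(?:\.[\w-]+)+                      # e-mail address (very rough)
    | \w+(?:[-']\w+)*                                   # word, may contain - or '
    | \S                                                # any other single character
    """,
    re.VERBOSE,
)

GLUE = "<g/>"


def tokenize(text):
    """Return one token per list item, with <g/> between tokens
    that were not separated by whitespace in the input."""
    lines = []
    prev_end = None
    for match in TOKEN_RE.finditer(text):
        # No whitespace between this token and the previous one: emit glue.
        if prev_end is not None and match.start() == prev_end:
            lines.append(GLUE)
        token = match.group()
        # Replace an entity token with its Unicode equivalent.
        if token.startswith("&") and token.endswith(";"):
            token = html.unescape(token)
        lines.append(token)
        prev_end = match.end()
    return lines


if __name__ == "__main__":
    print("\n".join(tokenize("Hello, world!")))
```

In this sketch the comma and the exclamation mark are each preceded by a glue line, because no space separated them from the preceding word in the input.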

Unitok requires a configuration file defining tokens in the target language. Configuration files are provided in the configs directory. configs/other.py is the default configuration and can be used for any language written in a script that uses whitespace to separate tokens. Language-specific configuration files add token regexps for that language, e.g. for abbreviations common in the language.
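
As an illustration of what a language-specific token regexp for abbreviations might look like, here is a small sketch. It does not reproduce the format of the files in configs; the names ABBREVIATIONS, build_abbreviation_regex and ABBREV_RE, and the abbreviation list itself, are hypothetical.

```python
import re

# Hypothetical sample list; the real configuration files define their own.
ABBREVIATIONS = ["e.g.", "i.e.", "etc.", "Dr.", "Mr."]


def build_abbreviation_regex(abbreviations):
    """Compile one alternation that matches any listed abbreviation."""
    # Longest first, so a longer abbreviation wins over a shorter prefix.
    ordered = sorted(abbreviations, key=len, reverse=True)
    return re.compile("|".join(re.escape(a) for a in ordered))


ABBREV_RE = build_abbreviation_regex(ABBREVIATIONS)

if __name__ == "__main__":
    match = ABBREV_RE.match("e.g. this")
    print(match.group() if match else None)
```

A tokenizer can try such a pattern before its generic word pattern, so that the final period stays attached to the abbreviation instead of becoming a separate token.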

The input is expected to be character-normalized beforehand with the standalone script uninorm.py; see the usage example below.
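
What uninorm.py does exactly is not described here. As a rough idea of what character normalization can mean, the sketch below applies Unicode normalization with Python's standard unicodedata module. The choice of the NFC form and the function name normalize_line are assumptions; the real script may normalize differently or do more.

```python
import sys
import unicodedata


def normalize_line(line, form="NFC"):
    """Return the line in the given Unicode normalization form.

    NFC is an assumed default, not necessarily what uninorm.py uses.
    """
    return unicodedata.normalize(form, line)


if __name__ == "__main__":
    # Filter stdin to stdout, one line at a time.
    for raw_line in sys.stdin:
        sys.stdout.write(normalize_line(raw_line))
```

With NFC, a letter followed by a combining accent is composed into a single precomposed character where one exists, so visually identical strings end up with the same code points before tokenization.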


Usage example (English)

python uninorm.py < input.txt | python unitok.py --trim 100 configs/english.py > output.vert
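
Since the output has one token per line and glue tags mark tokens that were not separated by a space, the running text can be rebuilt from it. The sketch below assumes only those two properties; the function name detokenize is hypothetical, and any other XML-like tags in the file are passed through as ordinary tokens.

```python
GLUE = "<g/>"


def detokenize(lines):
    """Rebuild running text from one-token-per-line output with glue tags."""
    parts = []
    glue_next = True  # no space before the very first token
    for line in lines:
        token = line.rstrip("\n")
        if not token:
            continue  # skip empty lines
        if token == GLUE:
            glue_next = True  # the next token attaches directly
            continue
        if not glue_next:
            parts.append(" ")
        parts.append(token)
        glue_next = False
    return "".join(parts)


if __name__ == "__main__":
    # The file name is taken from the usage example above.
    with open("output.vert", encoding="utf-8") as vert:
        print(detokenize(vert))
```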

Get unitok

See Downloads for the latest version.

Licence

Unitok is licensed under the Mozilla Public License Version 2.0.