| 16 | | <a href="/wiki/Unitok"> |
| 17 | | Unitok is a universal text tokeniser with specific settings for many languages. It can turn plain text into a sequence of newline-separated tokens (“vertical” format), while preserving XML-like tags containing metadata. |
| | 22 | <p><a href="/wiki/Unitok"> |
| | 23 | Unitok is a universal text tokeniser with specific settings for many languages. It can turn plain text into a sequence of newline-separated tokens (“vertical” format), while preserving XML-like tags containing metadata.</a></p> |
| | 24 | <p> |
| | 25 | <a class="lnk" href="http://nlp.fi.muni.cz/raslan/raslan14.pdf#page=79">Paper</a> |
| | 26 | | |
| | 27 | <a class="lnk" href="">Cite</a> |
| | 28 | | |
| | 29 | <a class="lnk" href="https://www.mozilla.org/MPL/2.0/">Licence</a> |
| | 30 | </p> |
| | 31 | </td> |
| 22 | | <a href="/wiki/Justext"> |
| 23 | | JusText is an HTML boilerplate removal tool. It can strip navigation links, headers, footers, etc. from HTML pages and leave just regular text containing full sentences. |
| | 36 | <p><a href="/wiki/Justext"> |
| | 37 | JusText is an HTML boilerplate removal tool. It can strip navigation links, headers, footers, etc. from HTML pages and leave just regular text containing full sentences.</a></p> |
| | 38 | <p> |
| | 39 | <a class="lnk" href="http://is.muni.cz/th/45523/fi_d/phdthesis.pdf">Paper</a> |
| | 40 | | |
| | 41 | <a class="lnk" href="">Cite</a> |
| | 42 | | |
| | 43 | <a class="lnk" href="http://opensource.org/licenses/BSD-3-Clause">Licence</a> |
| | 44 | </p> |
| | 45 | </td> |