Changes between Version 22 and Version 23 of Justext


Timestamp:
02/10/26 15:16:34 (5 days ago)
Author:
admin
Comment:

Justext 5.0

  • Justext

    v22  v23
      1       - = jusText 4 =
           1  + = jusText 5 =
      2    2
      3    3    jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers from HTML pages. It is designed to preserve mainly text containing full sentences and it is therefore well suited for creating linguistic resources such as Web corpora.
      …    …
     20   20
     21   21    == Installation ==
     22       - 1. Make sure you have Python 3.11 or newer. Required packages are installed via pip automatically (or system-wide, e.g. python3-lxml and python3-html5-parser in Fedora).
          22  + 1. Make sure you have Python 3.12 or newer. Required packages are installed via pip automatically (or system-wide, e.g. python3-lxml and python3-html5-parser in Fedora).
     23   23    2. Download, extract, install:
     24   24    {{{
     25       - wget https://corpus.tools/raw-attachment/wiki/Downloads/justext-4.3.tar.gz
     26       - tar xzvf justext-4.3.tar.gz
     27       - cd justext-4.3/
     28       - pip install --user . #omit --user to install for all users
          25  + pip install --user https://corpus.tools/raw-attachment/wiki/Downloads/justext-5.0.tar.gz
     29   26    }}}
     30   27
      …    …
     32   29    Python 3.6 & Python 2.7 compatible
     33   30    {{{
     34       - wget https://corpus.tools/raw-attachment/wiki/Downloads/justext-4.2.5.tar.gz
     35       - tar xzvf justext-4.2.5.tar.gz
     36       - cd justext-4.2.5/
          31  + wget https://corpus.tools/raw-attachment/wiki/Downloads/justext-5.0.tar.gz
          32  + tar xzvf justext-5.0.tar.gz
          33  + cd justext-5.0/
     37   34    python3 setup.py install --user #omit --user to install for all users
     38   35    }}}
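
Version 23 raises the stated minimum interpreter from Python 3.11 to Python 3.12. A minimal sketch of a check that could be run before installing is given here; the helper name `meets_minimum` and its default arguments are assumptions made for illustration and are not part of jusText.

```python
import sys


# Hypothetical helper, not part of jusText: compares an interpreter version
# against the minimum named in the v23 installation step (Python 3.12).
def meets_minimum(version=None, minimum=(3, 12)):
    """Return True if `version` (major, minor, ...) is at least `minimum`."""
    if version is None:
        # Default to the interpreter that is running this script.
        version = sys.version_info
    # Only major and minor are compared; patch level is ignored.
    return tuple(version[:2]) >= tuple(minimum)


if __name__ == "__main__":
    if meets_minimum():
        print("Python version is sufficient for jusText 5.0")
    else:
        print("Python 3.12 or newer is required for jusText 5.0")
```

Running the script with the same `python3` that pip will use shows whether the requirement from step 1 is met; it does not check for lxml or the other required packages.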