Changes between Version 1 and Version 2 of languagefilter

Timestamp:: 04/25/20 09:40:15
Author:: admin
Comment:: --

languagefilter

-              v1
+              v2
 = Web Corpora Wordlist Based Language Filter =
+Summary
 * Separates documents and paragraphs by language using word frequency lists.
 * All languages to be recognised must be specified, and a frequency wordlist must be supplied for each of them.
+The method
 * A score (a logarithm of relative corpus frequency) is calculated for each word form and language.
 * The sum of scores of all words in paragraphs and documents is calculated for all languages.
 * If the ratio of the scores of the two top-scoring languages is above a threshold, the top-scoring language is recorded in the header of the respective paragraph/document.
 * A multi-language document is split into separate documents, each containing only paragraphs in one recognised language.
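The scoring steps above can be sketched in Python as follows. This is a minimal illustration, not the code of `lang_filter.py`: the names `build_scores`, `classify` and `SCALE` are hypothetical, and the log base (10), the per-billion scaling, the lowercasing of tokens and the handling of a zero runner-up score are all assumptions.

```python
import math

SCALE = 1e9  # assumption: relative frequency expressed per billion tokens


def build_scores(counts):
    """Turn {word: corpus frequency} into {word: log10 of scaled relative frequency}."""
    total = sum(counts.values())
    # Scores are floored at 0 so that very rare words never count against a language.
    return {w: max(0.0, math.log10(c / total * SCALE)) for w, c in counts.items()}


def classify(tokens, scores_by_lang, threshold):
    """Sum word scores per language and apply the ratio threshold.

    Returns (language or None, {language: summed score}).
    Needs at least two languages in scores_by_lang.
    """
    totals = {
        lang: sum(scores.get(tok.lower(), 0.0) for tok in tokens)
        for lang, scores in scores_by_lang.items()
    }
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_score), (_, second_score) = ranked[0], ranked[1]
    # Assumption: a zero runner-up score counts as an unbounded ratio.
    if best_score > 0 and (second_score == 0 or best_score / second_score > threshold):
        return best, totals
    return None, totals


# Toy wordlists, invented for illustration only.
toy_scores = {
    "english": build_scores({"the": 50, "dog": 5, "to": 20}),
    "czech": build_scores({"to": 30, "pes": 5, "a": 40}),
}
language, totals = classify(["The", "dog", "to"], toy_scores, threshold=1.5)
print(language, totals)
```

With these toy counts the English total is roughly three times the Czech one, so a threshold of 1.5 accepts English while a threshold of 5 leaves the text unlabelled.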
+Frequency wordlists
 * Frequency wordlists from big web corpora for more than 40 languages are included with the script.
 * The size of the wordlists included with the script is limited. User-produced wordlists without a size limit can be used to improve performance, especially for very similar languages.
+* It is important to [Unitok tokenise] all wordlists (used in a single run of the filter) the same way.
+=== Installation ===
+{{{
+wget http://corpus.tools/raw-attachment/wiki/Downloads/wcwb_lang_filter_1.0.tar.gz
+tar -xzvf wcwb_lang_filter_1.0.tar.gz
+cd wcwb_lang_filter_1.0
+make test/out.vert.lang_czech
+}}}
+== Examples ==
+=== Examples ===
+English frequency wordlist (top 10 lines)
+=== Sample English frequency wordlist (top 10 lines) ===
 {{{
 the     789476980
 …
 }}}
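A wordlist in this two-column form (word form, corpus frequency) could be read as in the sketch below. It assumes UTF-8 text with the count as the last whitespace-separated field, and it handles the `.gz` and `.xz` suffixes that the usage line accepts. The function names `open_text` and `parse_wordlist` are hypothetical and are not part of the distributed script.

```python
import gzip
import lzma


def open_text(path):
    """Open a plain, gzip-packed or xz-packed wordlist as UTF-8 text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    if path.endswith(".xz"):
        return lzma.open(path, "rt", encoding="utf-8")
    return open(path, encoding="utf-8")


def parse_wordlist(lines):
    """Parse 'word<whitespace>count' lines into {word: count}.

    Lines without a trailing integer count are skipped.
    """
    counts = {}
    for line in lines:
        parts = line.rstrip("\n").rsplit(None, 1)
        if len(parts) != 2 or not parts[1].isdecimal():
            continue
        word, count = parts[0], int(parts[1])
        # Repeated entries for the same word are summed.
        counts[word] = counts.get(word, 0) + count
    return counts


print(parse_wordlist(["the\t789476980\n"]))
```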
+Sample output:
+=== Sample input ===
 {{{
+<doc source="https://en.wikipedia.org/wiki/Dog" lang="english" lang_scores="czech: 408.43, slovak: 415.74, english: 1359.91">
+#wordform     English Czech   Slovak score for each word
+The           5.26    5.33    7.82
+dog           0.00    0.00    4.89
+was           0.00    0.00    6.73
+the           5.26    5.33    7.82
+first         0.00    0.00    6.14
+species       0.00    0.00    5.14
+to            7.05    7.15    7.48
+be            0.00    0.00    6.77
+domesticated  0.00    0.00    0.00
+[...]
+<doc source="https://en.wikipedia.org/wiki/Dog">
+<p>
+Linnaeus
+considered
+the
+dog
+to
+be
+a
+separate
+species
+<g/>
+.
+</p>
 </doc>
 }}}
+Usage:
+=== Sample output ===
+{{{
+<doc source="https://en.wikipedia.org/wiki/Dog" lang="english"
+     lang_scores="english: 49.56, czech: 19.86, slovak: 20.15">
+<par_langs lang="english" lang_scores="english: 49.56, czech: 19.86, slovak: 20.15"/>
+<p>
+#wordform  English  Czech Slovak score for each word
+Linnaeus      0.00   0.00   0.00   #unknown to all sample wordlists
+considered    5.18   0.00   0.00   #English only
+the           7.82   5.26   5.33   #English word, ~100 x more frequent in the English wl
+dog           4.89   0.00   0.00
+to            7.48   7.05   7.15   #a valid word in all three languages
+be            6.77   0.00   0.00
+a             7.37   7.56   7.66   #a valid word in all three languages
+separate      4.91   0.00   0.00
+species       5.14   0.00   0.00
+<g/>
+.             0.00   0.00   0.00   #punctuation is omitted from wordlists
+</p>
+</doc>
+}}}
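For downstream processing, the `lang_scores` attribute shown in the sample output can be read back with a few lines of Python. This is a sketch that assumes the attribute always has the `language: score, ...` shape seen above; `parse_lang_scores` and `top_two_ratio` are hypothetical helper names.

```python
import re


def parse_lang_scores(header):
    """Extract {language: score} from a header carrying a lang_scores attribute."""
    match = re.search(r'lang_scores="([^"]*)"', header)
    if not match:
        return {}
    scores = {}
    for item in match.group(1).split(","):
        if not item.strip():
            continue
        lang, _, value = item.partition(":")
        scores[lang.strip()] = float(value)
    return scores


def top_two_ratio(scores):
    """Ratio of the best score to the runner-up score (inf if the runner-up is 0)."""
    best, second = sorted(scores.values(), reverse=True)[:2]
    return float("inf") if second == 0 else best / second


header = ('<doc source="https://en.wikipedia.org/wiki/Dog" lang="english"\n'
          '     lang_scores="english: 49.56, czech: 19.86, slovak: 20.15">')
scores = parse_lang_scores(header)
print(scores, round(top_two_ratio(scores), 2))
```

For the sample header the runner-up is Slovak, so the ratio compared against the threshold is 49.56 / 20.15.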
+== Installation ==
+{{{
+wget http://corpus.tools/raw-attachment/wiki/Downloads/wcwb_lang_filter_1.0.tar.gz
+tar -xzvf wcwb_lang_filter_1.0.tar.gz
+cd wcwb_lang_filter_1.0
+make test/out.vert.lang_czech
+}}}
+== Usage ==
 {{{
 ./lang_filter.py (LANGUAGE FRQ_WORDLIST[.gz|.xz])+ ACCEPTED_LANGS REJECTED_OUT LANG_RATIO_THRESHOLD
 …
 }}}
+== To build your own frequency wordlist ==
+{{{
+#Get corpus frequencies of lowercased words from a corpus compiled by [https://nlp.fi.muni.cz/trac/noske Sketch Engine]
+lsclex -f /corpora/registry/english_web_corpus lc | cut -f2,3 | ./uninorm_4.py | perl -pe 's, (\d+)$,\t$1,' > en.wl1
+lsclex -f /corpora/registry/czech_web_corpus   lc | cut -f2,3 | ./uninorm_4.py | perl -pe 's, (\d+)$,\t$1,' > cs.wl1
+lsclex -f /corpora/registry/slovak_web_corpus  lc | cut -f2,3 | ./uninorm_4.py | perl -pe 's, (\d+)$,\t$1,' > sk.wl1
+#Or get the same from a vertical file
+cut -f1 english_web_corpus.vert | grep -v '^<' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | perl -pe 's,^\s*(\d+) (.*)$,$2\t$1,' > en.wl1
+cut -f1 czech_web_corpus.vert   | grep -v '^<' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | perl -pe 's,^\s*(\d+) (.*)$,$2\t$1,' > cs.wl1
+cut -f1 slovak_web_corpus.vert  | grep -v '^<' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | perl -pe 's,^\s*(\d+) (.*)$,$2\t$1,' > sk.wl1
+#Filter the wordlist -- allow just characters valid for the language and a reasonable word length
+grep '[abcdefghijklmnopqrstuvwxyz]' en.wl1                  | grep -v -P "['.-]{2}" | ./wl_grep.py "[#@]?[abcdefghijklmnopqrstuvwxyzéè0-9'][abcdefghijklmnopqrstuvwxyzéè0-9'.-]{0,29}"                               > en.wl2
+grep '[aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž]' cs.wl1   | grep -v -P "['.-]{2}" | ./wl_grep.py "[#@]?[aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž0-9'][aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž0-9'.-]{0,29}"     > cs.wl2
+grep '[aáäbcčdďeéfghiíjklĺľmnňoóôpqrŕsštťuúvwxyýzž]' sk.wl1 | grep -v -P "['.-]{2}" | ./wl_grep.py "[#@]?[aáäbcčdďeéfghiíjklĺľmnňoóôpqrŕsštťuúvwxyýzž0-9'][aáäbcčdďeéfghiíjklĺľmnňoóôpqrŕsštťuúvwxyýzž0-9'.-]{0,29}" > sk.wl2
+#Sort (not necessary) and pack
+for f in {en,cs,sk}.wl2; do sort -k2,2rg -k1,1 "$f" | gzip > "${f}.frqwl.gz"; done
+}}}
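The vertical-file variant of the pipeline above can also be approximated in Python, which may be convenient where the shell tools are unavailable. The sketch below is not part of the distribution: `count_vertical`, `keep_word` and `format_wordlist` are hypothetical names, the filter only covers the English pattern, and it assumes that `wl_grep.py` requires the pattern to match the whole word. Note that `str.lower()` is Unicode-aware, so results can differ slightly from `tr` in some locales.

```python
import re
from collections import Counter

# English pattern copied from the shell example; full-word matching is an assumption.
EN_WORD = re.compile(r"[#@]?[a-zéè0-9'][a-zéè0-9'.-]{0,29}")
EN_LETTER = re.compile(r"[a-z]")
DOUBLE_PUNCT = re.compile(r"['.-]{2}")


def count_vertical(lines):
    """Count lowercased word forms in the first column, skipping structure lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("<"):
            continue
        word = line.rstrip("\n").split("\t", 1)[0].lower()
        if word:
            counts[word] += 1
    return counts


def keep_word(word):
    """Keep words with at least one letter, no doubled punctuation, and a valid shape."""
    return (
        EN_LETTER.search(word) is not None
        and DOUBLE_PUNCT.search(word) is None
        and EN_WORD.fullmatch(word) is not None
    )


def format_wordlist(counts):
    """Return 'word<TAB>count' lines, most frequent first, ties in alphabetical order."""
    kept = [(w, c) for w, c in counts.items() if keep_word(w)]
    kept.sort(key=lambda wc: (-wc[1], wc[0]))
    return [f"{w}\t{c}" for w, c in kept]


sample = ["<doc>\n", "<p>\n", "The\n", "dog\n", "the\n", "<g/>\n", ".\n", "</p>\n", "</doc>\n"]
print(format_wordlist(count_vertical(sample)))
```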
 == Get Language Filter ==