
== Publications ==
We presented our results at the following venues:

[http://nlp.fi.muni.cz/~xsuchom2/papers/PomikalekSuchomel_SpiderlingEfficiency.pdf Efficient Web Crawling for Large Text Corpora]\\
by Jan Pomikálek, Vít Suchomel\\
at [http://sigwac.org.uk/wiki/WAC7 ACL SIGWAC Web as Corpus], a workshop at the WWW conference, Lyon, April 2012

[http://nlp.fi.muni.cz/~xsuchom2/papers/BaisaSuchomel_TurkicResources.pdf Large Corpora for Turkic Languages and Unsupervised Morphological Analysis]\\
by Vít Baisa, Vít Suchomel\\
at [http://multisaund.eu/lrec2012_turkiclanguage.php Language Resources and Technologies for Turkic Languages], a workshop at the LREC conference, Istanbul, May 2012

[http://trac.sketchengine.co.uk/raw-attachment/wiki/AK/Papers/tentens_14may2013.docx The TenTen Corpus Family]\\
by Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel\\
at the [http://ucrel.lancs.ac.uk/cl2013/ 7th International Corpus Linguistics Conference], Lancaster, July 2013.


== Large textual corpora built using !SpiderLing ==
=== Since 2017 ===
Corpora totalling ca. 200 billion tokens in various languages (mostly English) were built from data crawled by !SpiderLing between 2017 and March 2020.

=== From 2011 to 2014 ===
||= language =||= raw data size [GB] =||= cleaned data size [GB] =||= yield rate =||= corpus size [billion tokens] =||= crawling duration [days] =||
||American Spanish || 1874|| 44|| 2.36%|| 8.7|| 14||
||Arabic || 2015|| 58|| 2.89%|| 6.6|| 14||
||Bulgarian || || || || 0.9|| 8||
||Czech || ~4000|| || || 5.8|| ~40||
||English || 2859|| 108|| 3.78%|| 17.8|| 17||
||Estonian || 100|| 3|| 2.67%|| 0.3|| 14||
||French || 3273|| 72|| 2.19%|| 12.4|| 15||
||German || 5554|| 145|| 2.61%|| 19.7|| 30||
||Hungarian || || || || 3.1|| 20||
||Japanese || 2806|| 61|| 2.19%|| 11.1|| 28||
||Korean || || || || 0.5|| 20||
||Polish || || || || 9.5|| 17||
||Russian || 4142|| 198|| 4.77%|| 20.2|| 14||
||Turkish || 2700|| 26|| 0.97%|| 4.1|| 14||


== Requires ==
- python >= 3.6,
- pypy3 >= 5.5 (optional),
- justext >= 3.0 (http://corpus.tools/wiki/Justext),
- chared >= 2.0 (http://corpus.tools/wiki/Chared),
- lxml >= 4.2 (http://lxml.de/),
- openssl >= 1.1,
- pyre2 >= 0.2.23 (https://github.com/andreasvc/pyre2),
- text processing tools (if binary format conversion is on):
  - pdftotext (from poppler-utils),
  - ps2ascii (from ghostscript-core),
  - antiword (from antiword),
- nice (coreutils) (optional),
- ionice (util-linux) (optional),
- gzip (optional).
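A quick way to verify these dependencies before a crawl is sketched below. This helper is not part of !SpiderLing; it only reports which of the Python modules and external tools listed above are importable or on PATH.
{{{
#!/usr/bin/env python3
# Hypothetical pre-flight check (not shipped with SpiderLing): report which
# Python modules and external converters/tools from the list above are missing.
import importlib
import shutil

for module in ('justext.core', 'chared.detector', 'lxml', 're2', 'ssl'):
    try:
        importlib.import_module(module)
        print('%-7s %s' % ('ok', module))
    except ImportError:
        print('%-7s %s' % ('MISSING', module))

for tool in ('pdftotext', 'ps2ascii', 'antiword', 'nice', 'ionice', 'gzip'):
    print('%-7s %s' % ('ok' if shutil.which(tool) else 'MISSING', tool))
}}}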

Runs in Linux, tested in Fedora and Ubuntu.
Minimum hardware configuration (very small crawls):
- 4 core CPU,
- 8 GB system memory,
- some storage space,
- broadband internet connection.
Recommended hardware configuration (crawling ~30 bn words of English text):
- 8-32 core CPU (the more cores, the faster the crawled data is processed),
- 32-256 GB system memory
  (the more RAM, the more domains can be kept in memory and thus the more websites visited),
- lots of storage space,
- connection to an internet backbone line.

== Includes ==
A robot exclusion rules parser for Python (v. 1.6.2)
- by Philip Semanchuk, BSD Licence
- see util/robotparser.py
Language detection using character trigrams
- by Douglas Bagnall, Python Software Foundation Licence
- see util/trigrams.py
docx2txt
- by Sandeep Kumar, GNU GPL 3+
- see util/doc2txt.pl

== Installation ==
- unpack,
- install the required tools (see install_rpm.sh for RPM based systems),
- check that the following dependencies can be imported by python3/pypy3:
    python3 -c 'import justext.core, chared.detector, ssl, lxml, re2'
    pypy3 -c 'import ssl; from ssl import PROTOCOL_TLS'
- make sure the crawler can write to config.RUN_DIR and config.PIPE_DIR.

== Settings -- edit util/config.py ==
- !!!IMPORTANT!!! Set AGENT, AGENT_URL, USER_AGENT;
- set MAX_RUN_TIME to specify the maximum crawling time in seconds;
- set DOC_PROCESSOR_COUNT to (partially) control CPU usage;
- set MAX_OPEN_CONNS, IP_CONN_INTERVAL, HOST_CONN_INTERVAL;
- raise `ulimit -n` according to MAX_OPEN_CONNS;
- then increase MAX_OPEN_CONNS and OPEN_AT_ONCE;
- configure language dependent settings (an illustrative sketch follows below).
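The sketch below shows what such a configuration might look like. The variable names are the ones listed above; all values (and the assumption that the intervals are in seconds) are illustrative only -- the comments in util/config.py are authoritative.
{{{
# Illustrative util/config.py values for a mid-sized crawl (not the defaults).
AGENT = 'MyCrawler'                       # name presented to webmasters
AGENT_URL = 'http://example.org/crawler'  # page describing your crawl
USER_AGENT = 'Mozilla/5.0 (compatible; MyCrawler/1.0; +http://example.org/crawler)'

MAX_RUN_TIME = 14 * 24 * 3600   # stop crawling after two weeks (seconds)
DOC_PROCESSOR_COUNT = 8         # more processors => faster processing, more CPU used
MAX_OPEN_CONNS = 400            # keep below `ulimit -n`
OPEN_AT_ONCE = 100              # raised together with MAX_OPEN_CONNS
IP_CONN_INTERVAL = 10           # politeness delay per IP (assumed to be seconds)
HOST_CONN_INTERVAL = 20         # politeness delay per host (assumed to be seconds)
}}}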

== Language models for all recognised languages ==
- Plaintext in util/lang_samples/,
  e.g. put manually checked plaintext from several dozen good web documents and
  Wikipedia articles in util/lang_samples/{Czech,Slovak,English};
- jusText wordlists in util/justext_wordlists/,
  e.g. use the jusText default or the 2000 most frequent manually cleaned words,
  one per line, in util/justext_wordlists/{Czech,Slovak,English};
- chared model in util/chared_models/,
  e.g. copy the default chared models {czech,slovak,english}.edm to
  util/chared_models/{Czech,Slovak,English}.
See the default English resources in the respective directories.
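To make sure nothing is missing before a long crawl, a small check like the one below can help. It is not part of !SpiderLing and the language list is an assumption -- adapt it to the languages configured in util/config.py.
{{{
#!/usr/bin/env python3
# Hypothetical helper: verify that each target language has all three
# resources described above (language sample, jusText wordlist, chared model).
import os

LANGUAGES = ['Czech', 'Slovak', 'English']   # adapt to config.LANGUAGES
for lang in LANGUAGES:
    for path in ('util/lang_samples/' + lang,
                 'util/justext_wordlists/' + lang,
                 'util/chared_models/' + lang):
        print('%-7s %s' % ('ok' if os.path.exists(path) else 'MISSING', path))
}}}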

== Usage ==
See {{{./spiderling.py -h}}}.

It is recommended to run the crawler in `screen`.
Example:
{{{
screen -S crawling
./spiderling.py < seed_urls &> run/out &
}}}

Files created by the crawler in run/:
- *.log.* .. log & debug files,
- arc/*.arc.gz .. gzipped arc files (raw HTTP responses),
- prevert/*.prevert_d .. preverticals with duplicate documents,
- prevert/duplicate_ids .. IDs of duplicate documents,
- ignored/* .. ignored URLs (binary files (pdf, ps, doc, docx) that were
  not processed and URLs rejected by the domain blacklist or the TLD filter),
- save/* .. savepoints that can be used for a new run,
- other directories can be erased after stopping the crawler.

To remove duplicate documents from the preverticals, run
{{{
for pdup in run/prevert/*.prevert_d
do
  p=`echo $pdup | sed -r 's,prevert_d$,prevert,'`
  pypy3 util/remove_duplicates.py run/prevert/duplicate_ids < $pdup > $p
done
}}}
Files run/prevert/*.prevert are the final output.

Onion (http://corpus.tools/wiki/Onion) is recommended to remove near-duplicate
paragraphs of text.

To stop the crawler before MAX_RUN_TIME, send SIGTERM to the main process
(pypy3/python3 spiderling.py).
Example:
{{{
ps aux | grep 'spiderling\.py'   # find the PID of the main process
kill -s SIGTERM <PID>
}}}

To re-process arc files with current process.py and util/config.py, run
{{{
for arcf in run/arc/*.arc.gz
do
  p=`echo $arcf | sed -r 's,run/arc/([0-9]+)\.arc\.gz$,\1.prevert_re_d,'`
  zcat $arcf | pypy3 reprocess.py > $p
done
}}}

To re-start crawling from the last saved state:
{{{
mv -iv run old   # rename the old `run' directory
mkdir run
for d in docmeta dompath domrobot domsleep robots
do
  ln -s ../old/$d/ run/$d
  # make the new run continue from the previous state by symlinking old data
done
screen -r crawling
./spiderling.py \
  --state-files=old/save/domains_T,old/save/raw_hashes_T,old/save/txt_hashes_T \
  --old-tuples < old/url/urls_waiting &> run/out &
}}}
(Assuming T is the timestamp of the last save, e.g. 20190616151500.)

== Performance tips ==
- Start with thousands of seed URLs and provide multiple URLs per domain.
  It is possible to start with tens of millions of seed URLs.
  If you need to start with a small number of seed URLs, set
  VERY_SMALL_START = True in util/config.py.
- Using !PyPy reduces CPU usage but may increase RAM usage.
- Set TLD_WHITELIST_RE to avoid crawling domains not in the target language
  (it takes some resources to detect it otherwise); an illustrative pattern is
  sketched below.
- Set LOG_LEVEL to debug and set INFO_PERIOD to check the debug output in
  *.log.crawl and *.log.eval to see where the bottleneck is, then modify the
  settings accordingly, e.g. add doc processors if the doc queue is always full.
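The following is only a sketch of what such a whitelist pattern might look like (here for a Czech/Slovak crawl); the exact string the pattern is applied to is defined in util/config.py, so adapt it accordingly.
{{{
# Illustrative TLD whitelist pattern -- assumed to be matched against hostnames.
import re

TLD_WHITELIST_RE = re.compile(r'\.(cz|sk)$')

assert TLD_WHITELIST_RE.search('www.example.cz')
assert not TLD_WHITELIST_RE.search('www.example.com')
}}}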

== Known bugs ==
- Domain distances should be made part of document metadata instead of being
  stored in a separate file. This will be resolved in the next version.
- Processing binary files (pdf, ps, doc, docx) is disabled by default since it
  was not tested and may slow processing down significantly.
  Also, the text output of these converters may suffer from
  various problems: headers, footers, lines broken by hyphenated words, etc.
- Compressed connections are not accepted/processed. Some servers might be
  discouraged from sending an uncompressed response (not tested).
- Some advanced features of robots.txt are not observed, e.g. Crawl-delay.
  Observing them would require major changes in the design of the download scheduler.
  A warning is emitted when the crawl delay is less than config.HOST_CONN_INTERVAL.

== Support ==
There is no guarantee of support. (The author may help you a bit in his free
time.) Please note the tool is distributed as is; it may not work under your
conditions.

== Acknowledgements ==
The author would like to express many thanks to Jan Pomikálek, Pavel Rychlý
and Miloš Jakubíček for guidance, key design advice and help with debugging.
Thanks also to Vlado Benko and Nikola Ljubešić for ideas for improvement.

This software is developed at the [http://nlp.fi.muni.cz/en/nlpc Natural Language Processing Centre] of Masaryk University in Brno, Czech Republic, in cooperation with [http://lexicalcomputing.com/ Lexical Computing], a [http://www.sketchengine.eu/ corpus tool] company.

This work was partly supported by the Ministry of Education of the Czech Republic
within the LINDAT-Clarin project LM2015071.
It was also partly supported by the Norwegian Financial Mechanism 2009–2014
and the Ministry of Education, Youth and Sports under Project Contract no.
MSMT-28477/2014 within the HaBiT Project 7F14047.

== Contact ==
{{{'zc.inum.if@2mohcusx'[::-1]}}}

== Licence ==
This software is the result of project LM2010013 (LINDAT-Clarin -
Vybudování a provoz českého uzlu pan-evropské infrastruktury pro
výzkum). This result is consistent with the expected objectives of the
project. The owner of the result is Masaryk University, a public
university, ID: 00216224. Masaryk University allows other companies
and individuals to use this software free of charge and without
territorial restrictions under the terms of the
[http://www.gnu.org/licenses/gpl.txt GPL license].

This permission is granted for the duration of property rights.

This software is not subject to special information treatment
according to Act No. 412/2005 Coll., as amended. If a person
who uses the software under this license offer violates the
license terms, the permission to use the software terminates.

== Changelog ==
=== Changelog 1.0 → 1.1 ===
* Important bug fix: Store the best extracted paragraph data in process.py. (Fixes a bug that caused the Chared model for the last tested language, rather than the best matching language, to be used for character decoding. E.g. koi8-r encoding could have been assumed for Czech documents in some cases when the last language in config.LANGUAGES was Russian.) Also, chared encodings were renamed to canonical encoding names.
* Encoding detection simplified, Chared is preferred now
* Session ids (and similar strings) removed from paths in domains to prevent downloading the same content again
* Path scheduling priority by path length
* Memory consumption improvements
* More robust error handling, e.g. socket errors in crawl.py updated to OSError
* reprocess.py can work with wpage files too
* config.py: some default values changed for better performance (e.g. increasing the maximum open connections limit helps a lot), added a switch for the case of starting with a small count of seed URLs, tidied up
* Debug logging of memory size of data structures

=== Changelog 0.95 → 1.0 ===
* Python 3.6+ compatible
* Domain scheduling priority by domain distance and hostname length
* Bug fixes: domain distance, domain loading, Chared returns "utf_8" instead of "utf-8", '<', '>' and "'" in doc.title/doc.url, robots.txt redirected from http to https, missing content-type and more
* More robust handling of some issues
* Less used features such as state file loading and data reprocessing better explained
* Global domain blacklist added
* English models and sample added
* Program run examples in README

=== Changelog 0.82 → 0.95 ===
* Interprocess communication rewritten to files
* Write URLs that cannot be downloaded soon into a "wpage" file -- speeds up the downloader
* New reprocess.py allowing reprocessing of arc files
* Https hosts separated from http hosts with the same hostname
* Many features, e.g. redirections, made in a more robust way
* More options exposed to allow configuration, more logging and debug info
* Big/small crawling profiles setting multiple variables in config
* Performance and memory saving improvements
* Bugfixes: chunked HTTP, XML parsing, double quotes in URL, HTTP redirection to the same URL, SSL layer not ready and more