== README ==
{{{
== Requires ==
pypy >= 2.2.1,
2.7 <= python < 3,
justext >= 1.2,
chared >= 1.3,
lxml >= 2.2.4,
text processing tools: pdftotext, ps2ascii, antiword,
works on Ubuntu (Debian Linux); does not work on Windows.
Minimum hardware configuration (very small crawls):
- 2 core CPU,
- 4 GB system memory,
- some storage space,
- broadband internet connection.
Recommended hardware configuration (crawling ~30 bn words of English text):
- 4-24 core CPU (the more cores, the faster the crawled data is processed),
- 8-250 GB system memory
  (the more RAM, the more domains are kept in memory and thus the more websites are visited),
- lots of storage space,
- connection to an internet backbone line.

== Includes ==
A robot exclusion rules parser for Python by Philip Semanchuk (v. 1.6.2)
... see util/robotparser.py
Language detection using character trigrams by Douglas Bagnall
... see util/trigrams.py
docx2txt by Sandeep Kumar
... see util/doc2txt.pl

== Installation ==
- unpack,
- install the required tools,
- check that justext.core and chared.detector can be imported by pypy,
- make sure the crawler can write to its directory and to config.PIPE_DIR.
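To verify the imports, a minimal helper along these lines can be run with pypy
(check_imports.py is a hypothetical name, not part of the crawler):
# check_imports.py -- run with: pypy check_imports.py
import justext.core
import chared.detector
import lxml.etree
print('imports OK, lxml ' + lxml.etree.__version__)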

== Settings -- edit util/config.py ==
- !!!IMPORTANT!!! Set AGENT, AGENT_URL, USER_AGENT,
- raise ulimit -n according to MAX_OPEN_CONNS,
- set MAX_RUN_TIME to specify the max crawling time in seconds,
- set DOC_PROCESSOR_COUNT to (partially) control CPU usage,
- configure language dependent settings,
- set MAX_DOMS_READY to (partially) control memory usage,
- set MAX_DOMS_WAITING_FOR_SINGLE_IP, MAX_IP_REPEAT,
- set MAX_OPEN_CONNS, IP_CONN_INTERVAL, HOST_CONN_INTERVAL,
- set and mkdir PIPE_DIR (pipes for communication of subprocesses).
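For illustration, the relevant part of util/config.py might look like this
(all values below are made-up examples, not recommendations; the comments only
paraphrase the setting names -- see util/config.py for the authoritative docs):
AGENT = 'MyCrawler'
AGENT_URL = 'http://example.com/crawler-info.html'
USER_AGENT = 'MyCrawler (+http://example.com/crawler-info.html)'
MAX_RUN_TIME = 7 * 24 * 3600          # max crawling time in seconds (one week)
DOC_PROCESSOR_COUNT = 4               # partially controls CPU usage
MAX_DOMS_READY = 20000                # partially controls memory usage
MAX_DOMS_WAITING_FOR_SINGLE_IP = 10
MAX_IP_REPEAT = 10
MAX_OPEN_CONNS = 400                  # raise `ulimit -n` accordingly
IP_CONN_INTERVAL = 10                 # throttling of connections per IP
HOST_CONN_INTERVAL = 30               # throttling of connections per host
PIPE_DIR = '/tmp/spiderling_pipes/'   # must exist and be writable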

== Language models ==
- plaintext in the target language in util/lang_samples/,
  e.g. put plaintexts from several dozen English web documents and
  English Wikipedia articles in ./util/lang_samples/English
- jusText stoplist for that language in the jusText stoplist path,
  e.g. <justext directory>/stoplists/English.txt
- chared model for that language,
  e.g. <chared directory>/models/English
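A quick way to check that the three resources are in place (a sketch only; the
justext and chared paths below are placeholders for the real installation dirs):
# check_lang_resources.py -- hypothetical helper, adjust the paths to your setup
import os.path
LANG = 'English'
paths = [
    './util/lang_samples/' + LANG,
    '/usr/lib/python2.7/dist-packages/justext/stoplists/' + LANG + '.txt',
    '/usr/lib/python2.7/dist-packages/chared/models/' + LANG,
]
for p in paths:
    print('%-4s %s' % ('OK' if os.path.exists(p) else 'MISS', p))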

== Usage ==
pypy spiderling.py SEED_URLS [SAVEPOINT_TIMESTAMP]
SEED_URLS is a text file containing seed URLs (the crawling starts there),
one per line; specify at least 50 URLs.
SAVEPOINT_TIMESTAMP causes the state from the specified savepoint to be loaded,
e.g. '121111224202' causes loading files 'spiderling.state-121111224202-*'.
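The seed file is plain text, e.g. (the URLs below are placeholders only; a real
file should contain at least 50 of them):
http://en.wikipedia.org/wiki/Main_Page
http://www.gutenberg.org/
http://example.org/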
Running the crawler in the background is recommended.
The crawler creates
- *.log.* .. log & debug files,
- *.arc.gz .. gzipped arc files (raw http responses),
- *.prevert_d .. preverticals with duplicate documents,
- *.duplicates .. files with IDs of duplicate documents,
- *.unproc_urls .. URLs of non-HTML documents that were not processed (known bug).
To remove duplicate documents from preverticals, run
rm spiderling.prevert
for i in $(seq 0 15)
do
    pypy util/remove_duplicates.py spiderling.${i}.duplicates \
        < spiderling.${i}.prevert_d >> spiderling.prevert
done
File spiderling.prevert is the final output.
To stop the crawler before MAX_RUN_TIME elapses, send SIGTERM to the main process
(spiderling.py).
To re-process the arc files with the current process.py and util/config.py, run
zcat spiderling.*.arc.gz | pypy reprocess.py

== Performance tips ==
- Using PyPy reduces CPU and memory cost (saves approx. 1/4 of RAM).
- Set ALLOWED_TLDS_RE to avoid crawling domains not in the target language
  (detecting the language of such pages otherwise wastes resources);
  see the example after this list.
- Set LOG_LEVEL to debug and INFO_PERIOD to a suitable value, check the debug
  output in *.log.crawl and *.log.eval to see where the bottleneck is, and
  modify the settings accordingly, e.g. add doc processors if the doc queue is
  always full.
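For example, an English crawl could be restricted to a few TLDs along these
lines (a sketch only; whether ALLOWED_TLDS_RE is a string or a compiled pattern
and what it is matched against is documented in util/config.py):
import re
ALLOWED_TLDS_RE = re.compile(r'^(uk|us|ie|ca|au|nz|com|org|net)$')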

== Known bugs ==
- Non-HTML documents are not processed (their URLs are stored in *.unproc_urls
  instead).
- DNS resolvers are implemented as blocking threads, so it is useless to have
  more than one; they will be changed to separate processes in the future.
- Compressed connections are not accepted/processed. Some servers might be
  discouraged from sending an uncompressed response (not tested).
- Some advanced features of robots.txt are not observed, e.g. Crawl-delay.
  Supporting them would require major changes in the design of the download
  scheduler.
- HTTPS requests are not implemented properly and may not work at all.
- Path bloating, e.g. http://example.com/a/ yielding the same content as
  http://example.com/a/a/ etc., should be avoided; it might be a bot trap.
}}}