=== Changelog 0.77 → 0.81 ===
* Improvements proposed by Vlado Benko or Nikola Ljubešić:
 - escape square brackets and backslashes in URLs (see the escaping sketch below)
 - document attributes: timestamp with hours, IP address, meta/chared encoding
 - document ID added to the arc output
 - MAX_DOCS_CLEANED limit per domain (see the per-domain limit sketch below)
 - create the temp dir if needed
* Support for processing doc/docx/ps/pdf (not working well yet; URLs of such documents are saved to a separate file for manual download and processing)
* Crawling multiple languages (inspired by Nikola's contribution; not tested yet)
* Stop crawling by sending SIGTERM to the main process (see the shutdown sketch below)
* Domain distances (distance of web domains from seed web domains; will be used in scheduling in the future)
* Config values tweaked
 - MAX_URL_QUEUE, MAX_URL_SELECT greatly increased
 - better spread of domains in the crawling queue => faster crawling
* Python → !PyPy
 - scheduler and crawler processes are dynamically compiled by !PyPy
 - saves approx. 1/4 of RAM
 - the gain in CPU efficiency is less visible (waiting for host/IP timeouts, waiting for doc processors)
 - process.py requires lxml, which does not work with !PyPy (it will be replaced by lxml-cffi in the future)
* Readme updated (more information, known bugs)
* Bug fixes
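
For the URL escaping item above, here is a minimal sketch, assuming the brackets and the backslash are simply percent-encoded before the URL is stored; the helper name and the exact set of characters are assumptions, not SpiderLing's actual code.

{{{#!python
# Minimal sketch of escaping square brackets and backslashes in a URL.
# The choice of percent-encoding and the function name are assumptions.
def escape_url(url):
    # Map each problematic character to its percent-encoded form;
    # the backslash is handled first.
    return (url.replace('\\', '%5C')
               .replace('[', '%5B')
               .replace(']', '%5D'))

print(escape_url('http://example.com/a[1]\\x'))
# http://example.com/a%5B1%5D%5Cx
}}}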
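The per-domain MAX_DOCS_CLEANED limit could be enforced with a plain counter keyed by domain, as in this sketch; the limit value and the function name are illustrative assumptions.

{{{#!python
# Sketch of a per-domain cap on cleaned documents.
# MAX_DOCS_CLEANED is an assumed value; the shipped default may differ.
from collections import defaultdict

MAX_DOCS_CLEANED = 1000  # assumed limit

docs_cleaned = defaultdict(int)

def may_clean(domain):
    # Refuse further documents once the domain has reached the cap.
    if docs_cleaned[domain] >= MAX_DOCS_CLEANED:
        return False
    docs_cleaned[domain] += 1
    return True
}}}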
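Stopping the crawl on SIGTERM typically means the handler only sets a flag, so the current unit of work finishes and cleanup runs before exit. A minimal sketch of that pattern follows; the loop body is a placeholder, not SpiderLing's crawler.

{{{#!python
# Sketch of a SIGTERM-triggered graceful stop of a crawl loop.
import signal
import time

stop_requested = False

def request_stop(signum, frame):
    # Only set a flag; the main loop exits after the current iteration.
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGTERM, request_stop)

while not stop_requested:
    time.sleep(1)  # placeholder for one unit of crawling work

# Flush queues and close output files here, then exit cleanly.
}}}

Sending `kill -TERM <pid>` to the main process sets the flag and lets the crawl wind down instead of dying mid-write.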