=== Changelog 0.77 → 0.81 ===
* Improvements proposed by Vlado Benko or Nikola Ljubešić:
 - escape square brackets and backslashes in URLs (see the escaping sketch below)
 - document attributes: timestamp with hours, IP address, meta/chared encoding
 - document ID added to the arc output
 - MAX_DOCS_CLEANED limit per domain (see the per-domain limit sketch below)
 - create the temp dir if needed
* Support for processing doc/docx/ps/pdf (not working well yet; URLs of such documents are saved to a separate file for manual download and processing)
* Crawling multiple languages (inspired by Nikola's contribution; not tested yet)
* Stop crawling by sending SIGTERM to the main process (see the shutdown sketch below)
* Domain distances (distance of web domains from seed web domains; will be used in scheduling in the future)
* Config values tweaked
 - MAX_URL_QUEUE, MAX_URL_SELECT greatly increased
 - better spread of domains in the crawling queue => faster crawling
* Python → !PyPy
 - scheduler and crawler processes are dynamically compiled by !PyPy
 - saves approx. 1/4 of RAM
 - the gain in CPU efficiency is less visible (waiting for host/IP timeouts, waiting for doc processors)
 - process.py requires lxml, which does not work with !PyPy (it will be replaced by lxml-cffi in the future)
* Readme updated (more information, known bugs)
* Bug fixes
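
For the URL escaping item above, here is a minimal sketch, assuming the brackets and the backslash are simply percent-encoded before the URL is stored; the helper name and the exact set of characters are assumptions, not SpiderLing's actual code.

{{{#!python
# Minimal sketch of escaping square brackets and backslashes in a URL.
# The choice of percent-encoding and the function name are assumptions.
def escape_url(url):
    # Map each problematic character to its percent-encoded form;
    # the backslash is handled first.
    return (url.replace('\\', '%5C')
               .replace('[', '%5B')
               .replace(']', '%5D'))

print(escape_url('http://example.com/a[1]\\x'))
# http://example.com/a%5B1%5D%5Cx
}}}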
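The per-domain MAX_DOCS_CLEANED limit could be enforced with a plain counter keyed by domain, as in this sketch; the limit value and the function name are illustrative assumptions.

{{{#!python
# Sketch of a per-domain cap on cleaned documents.
# MAX_DOCS_CLEANED is an assumed value; the shipped default may differ.
from collections import defaultdict

MAX_DOCS_CLEANED = 1000  # assumed limit

docs_cleaned = defaultdict(int)

def may_clean(domain):
    # Refuse further documents once the domain has reached the cap.
    if docs_cleaned[domain] >= MAX_DOCS_CLEANED:
        return False
    docs_cleaned[domain] += 1
    return True
}}}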
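Stopping the crawl on SIGTERM typically means the handler only sets a flag, so the current unit of work finishes and cleanup runs before exit. A minimal sketch of that pattern follows; the loop body is a placeholder, not SpiderLing's crawler.

{{{#!python
# Sketch of a SIGTERM-triggered graceful stop of a crawl loop.
import signal
import time

stop_requested = False

def request_stop(signum, frame):
    # Only set a flag; the main loop exits after the current iteration.
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGTERM, request_stop)

while not stop_requested:
    time.sleep(1)  # placeholder for one unit of crawling work

# Flush queues and close output files here, then exit cleanly.
}}}

Sending `kill -TERM <pid>` to the main process sets the flag and lets the crawl wind down instead of dying mid-write.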