= !SpiderLing =
!SpiderLing, a web spider for linguistics, is software for obtaining
text from the web that is useful for building text corpora. Many
documents on the web contain only material unsuitable for text corpora,
such as site navigation, lists of links, lists of products, and other
kinds of text that do not consist of full sentences. In fact, such pages
represent the vast majority of the web. Therefore, unrestricted web
crawls typically download a lot of data that gets filtered out during
post-processing, which makes web corpus collection inefficient. The aim
of our work is to focus the crawling on the text-rich parts of the web
and to maximize the number of words in the final corpus per downloaded
megabyte.

== Publications ==
We presented our results at the following venues:

[http://trac.sketchengine.co.uk/raw-attachment/wiki/AK/Papers/tentens_14may2013.docx The TenTen Corpus Family]\\
by Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel\\
at the [http://ucrel.lancs.ac.uk/cl2013/ 7th International Corpus Linguistics Conference], Lancaster, July 2013.

[http://nlp.fi.muni.cz/~xsuchom2/papers/BaisaSuchomel_TurkicResources.pdf Large Corpora for Turkic Languages and Unsupervised Morphological Analysis]\\
by Vít Baisa, Vít Suchomel\\
at [http://multisaund.eu/lrec2012_turkiclanguage.php Language Resources and Technologies for Turkic Languages] (at the LREC conference), Istanbul, May 2012.

[http://nlp.fi.muni.cz/~xsuchom2/papers/PomikalekSuchomel_SpiderlingEfficiency.pdf Efficient Web Crawling for Large Text Corpora]\\
by Jan Pomikálek, Vít Suchomel\\
at [http://sigwac.org.uk/wiki/WAC7 ACL SIGWAC Web as Corpus] (at the WWW conference), Lyon, April 2012.


== Large textual corpora built using !SpiderLing since October 2011 ==
||= language =||= raw data size [GB] =||= cleaned data size [GB] =||= yield rate =||= corpus size [billion tokens] =||= crawling duration [days] =||
||American Spanish || 1874|| 44|| 2.36%|| 8.7|| 14||
||Arabic || 2015|| 58|| 2.89%|| 6.6|| 14||
||Bulgarian || || || || 0.9|| 8||
||Czech || ~4000|| || || 5.8|| ~40||
||English || 2859|| 108|| 3.78%|| 17.8|| 17||
||Estonian || 100|| 3|| 2.67%|| 0.3|| 14||
||French || 3273|| 72|| 2.19%|| 12.4|| 15||
||German || 5554|| 145|| 2.61%|| 19.7|| 30||
||Hungarian || || || || 3.1|| 20||
||Japanese || 2806|| 61|| 2.19%|| 11.1|| 28||
||Korean || || || || 0.5|| 20||
||Polish || || || || 9.5|| 17||
||Russian || 4142|| 198|| 4.77%|| 20.2|| 14||
||Turkish || 2700|| 26|| 0.97%|| 4.1|| 14||
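
This page does not define the yield rate column. As a rough sketch, assuming it is the cleaned data size divided by the raw data size, the figures can be approximately recomputed from the table above. The function names and the 1 GB = 1000 MB convention are assumptions made for illustration; they are not part of !SpiderLing.

```python
def yield_rate(raw_gb: float, cleaned_gb: float) -> float:
    """Fraction of downloaded data that survives cleaning (assumed definition)."""
    if raw_gb <= 0:
        raise ValueError("raw_gb must be positive")
    return cleaned_gb / raw_gb


def tokens_per_mb(tokens_billion: float, raw_gb: float, mb_per_gb: int = 1000) -> float:
    """Corpus tokens per downloaded megabyte (assumes 1 GB = 1000 MB by default)."""
    if raw_gb <= 0:
        raise ValueError("raw_gb must be positive")
    return tokens_billion * 1e9 / (raw_gb * mb_per_gb)


# Rows copied from the table above: (raw GB, cleaned GB, billion tokens).
rows = {
    "English": (2859, 108, 17.8),
    "German": (5554, 145, 19.7),
    "Turkish": (2700, 26, 4.1),
}

for language, (raw, cleaned, tokens) in rows.items():
    print(f"{language}: yield {yield_rate(raw, cleaned):.2%}, "
          f"{tokens_per_mb(tokens, raw):.0f} tokens/MB")
```

Values recomputed this way can differ slightly from the published column, presumably because the sizes in the table are rounded to whole gigabytes.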


== Acknowledgements ==
This software is developed at the [http://nlp.fi.muni.cz/en/nlpc Natural Language Processing Centre] of Masaryk University in Brno,
Czech Republic,\\
in cooperation with [http://lexicalcomputing.com/ Lexical Computing Ltd.], UK, a [http://www.sketchengine.co.uk/ corpus tool] company.

== Contact ==
{{{
#!html
<SCRIPT TYPE="text/javascript">
emailE='fi.muni.cz';
emailE=('xsuchom2' + '@' + emailE);
document.write('Vít Suchomel: <A href="mailto:' + emailE + '">' + emailE + '</a>');
</SCRIPT>
}}}

== Licence ==
This software is the result of project LM2010013 (LINDAT-Clarin -
Vybudování a provoz českého uzlu pan-evropské infrastruktury pro
výzkum). This result is consistent with the expected objectives of the
project. The owner of the result is Masaryk University, a public
university, ID: 00216224. Masaryk University allows other companies
and individuals to use this software free of charge and without
territorial restrictions under the terms of the
[http://www.gnu.org/licenses/gpl.txt GPL license].

This permission is granted for the duration of property rights.

This software is not subject to special information treatment
according to Act No. 412/2005 Coll., as amended. If a person using the
software under this license offer violates the license terms, the
permission to use the software terminates.

== Source code ==
[attachment:spiderling-src-0.77.tar.xz]