PAISÀ Corpus
Description
The Paisà corpus is a large collection of Italian web texts, created within the PAISÀ project. It contains approximately 380,000 documents drawn from about 1,000 different websites, for a total of about 250 million words. Approximately 260,000 documents are from Wikipedia and about 5,600 from other Wikimedia Foundation projects. About 9,300 documents come from Indymedia, and an estimated 65,000 come from blog services. Each document is marked in the corpus by an XML "text" tag with "id" and "url" attributes: the first is a unique numeric code assigned to the document, and the second gives the document's original URL.
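The document markup described above can be illustrated with a minimal Python sketch that collects the "id" and "url" attributes of each "text" tag. This is an illustration under stated assumptions, not a tool distributed with the corpus: the names `DocHeader` and `iter_doc_headers` are hypothetical, the sample lines are invented, and the code assumes that each opening tag sits on a single line with double-quoted attribute values.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class DocHeader(NamedTuple):
    """Hypothetical container for the two attributes of a "text" tag."""
    doc_id: str
    url: str


# Matches an opening <text ...> tag and captures its attribute string.
# A closing </text> tag does not match because of the leading slash.
_TEXT_TAG = re.compile(r"<text\b([^>]*)>")
# Matches name="value" pairs; double-quoted values are an assumption.
_ATTR = re.compile(r'(\w+)\s*=\s*"([^"]*)"')


def iter_doc_headers(lines: Iterable[str]) -> Iterator[DocHeader]:
    """Yield one DocHeader per opening "text" tag found in the lines."""
    for line in lines:
        for match in _TEXT_TAG.finditer(line):
            # Build a dict so that attribute order does not matter.
            attrs = dict(_ATTR.findall(match.group(1)))
            if "id" in attrs and "url" in attrs:
                yield DocHeader(attrs["id"], attrs["url"])


# Invented sample input; real corpus lines may be laid out differently.
sample = [
    '<text id="1" url="http://example.org/a">',
    "Prima frase.",
    "</text>",
    '<text url="http://example.org/b" id="2">',
    "Seconda frase.",
    "</text>",
]

headers = list(iter_doc_headers(sample))
for header in headers:
    print(header.doc_id, header.url)
```

Because the function accepts any iterable of lines, it could also be fed a file object opened with `gzip.open(path, "rt", encoding="utf-8")` on a locally downloaded copy of the raw corpus, where `path` is a placeholder for the local file name. Streaming line by line avoids loading the whole corpus into memory.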
Release Date
2013-01-01
Downloads
https://clarin.eurac.edu/repository/xmlui/bitstream/handle/20.500.12124/3/lemma-frequencies-paisa.txt.gz?sequence=7&isAllowed=y
https://clarin.eurac.edu/repository/xmlui/bitstream/handle/20.500.12124/3/lemma-WITHOUTnumberssymbols-frequencies-paisa.txt.gz?sequence=6&isAllowed=y
https://clarin.eurac.edu/repository/xmlui/bitstream/handle/20.500.12124/3/paisa.annotated.CoNLL.utf8.gz?sequence=2&isAllowed=y
https://clarin.eurac.edu/repository/xmlui/bitstream/handle/20.500.12124/3/paisa.raw.utf8.gz?sequence=1&isAllowed=y
Publisher
Creator
Type
Primary Subjects
Bibliographic References
Landing Page
https://clarin.eurac.edu/repository/xmlui/handle/20.500.12124/3
Data & Metadata Languages
External Identifiers
License
Digital Object
This record catalogues a digital scholarly object as an instance of the DCAT Dataset class. These objects do not include raw data but rather collections of information that has already been structured in some way. See the Documentation page for more information.
Permalink
http://purl.org/knot/data/paisa-corpus