PAISÀ Corpus

Description

The PAISÀ corpus is a large collection of Italian web texts, created in the context of the PAISÀ project. The corpus contains approximately 380,000 documents from about 1,000 different websites, for a total of about 250 million words. Approximately 260,000 documents are from Wikipedia, and approximately 5,600 are from other Wikimedia Foundation projects. About 9,300 documents come from Indymedia, and we estimate that about 65,000 documents come from blog services. Each document is marked in the corpus by an XML "text" tag with "id" and "url" attributes: "id" is a unique numeric code assigned to the document, and "url" is the original URL of the document.
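
As an illustration of the document markup described above, the following Python sketch walks over lines of text and yields one (id, url, body) triple per "text" element. It is a minimal sketch under stated assumptions, not a description of the official tooling: the function name `iter_documents`, the sample data, and the assumption that each opening and closing "text" tag sits on its own line with "id" before "url" are all hypothetical and should be checked against the downloaded file.

```python
import re
from typing import Iterable, Iterator, Tuple

# Assumed shape of an opening tag: <text id="123" url="http://...">
# (attribute order and one-tag-per-line layout are assumptions).
TEXT_TAG = re.compile(r'<text\s+id="(?P<id>\d+)"\s+url="(?P<url>[^"]*)"\s*>')


def iter_documents(lines: Iterable[str]) -> Iterator[Tuple[int, str, str]]:
    """Yield (id, url, body) for each <text ...> ... </text> block."""
    doc_id = None
    url = None
    body = []
    for line in lines:
        stripped = line.strip()
        match = TEXT_TAG.match(stripped)
        if match:
            # A new document starts: remember its attributes, reset the body.
            doc_id = int(match.group("id"))
            url = match.group("url")
            body = []
        elif stripped == "</text>":
            # The current document ends: emit it if one was open.
            if doc_id is not None:
                yield doc_id, url, "\n".join(body)
            doc_id = None
        elif doc_id is not None:
            # Any other line inside an open document belongs to its body.
            body.append(line.rstrip("\n"))


# Hypothetical sample data in the assumed layout.
sample = [
    '<text id="1" url="http://example.org/a">\n',
    "Prima riga.\n",
    "Seconda riga.\n",
    "</text>\n",
    '<text id="2" url="http://example.org/b">\n',
    "Altro testo.\n",
    "</text>\n",
]

docs = list(iter_documents(sample))
```

To run the same function over a downloaded file, the lines could be supplied by Python's standard `gzip.open(path, "rt", encoding="utf-8")`, which streams the file without decompressing it to disk first.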

Release Date

2013-01-01

Downloads

https://clarin.eurac.edu/repository/xmlui/bitstream/handle/20.500.12124/3/lemma-frequencies-paisa.txt.gz?sequence=7&isAllowed=y
https://clarin.eurac.edu/repository/xmlui/bitstream/handle/20.500.12124/3/lemma-WITHOUTnumberssymbols-frequencies-paisa.txt.gz?sequence=6&isAllowed=y
https://clarin.eurac.edu/repository/xmlui/bitstream/handle/20.500.12124/3/paisa.annotated.CoNLL.utf8.gz?sequence=2&isAllowed=y
https://clarin.eurac.edu/repository/xmlui/bitstream/handle/20.500.12124/3/paisa.raw.utf8.gz?sequence=1&isAllowed=y

Publisher

Institute for Applied Linguistics, Eurac Research

Creator

Verena Lyding, Claudia Borghetti, Egon Stemle.

Type

Corpus

Primary Subjects

Linguistics, Science and technology, Machine Learning.

Bibliographic References

https://www.corpusitaliano.it/en/contents/publications.html

Landing Page

https://clarin.eurac.edu/repository/xmlui/handle/20.500.12124/3

Data & Metadata Languages

Italian, English.

External Identifiers

http://hdl.handle.net/20.500.12124/3

License

CC BY-NC-SA 4.0.

Digital Object

This record catalogues a digital scholarly object as an instance of the DCAT Dataset class. These objects do not include raw data but rather collections of information that has already been structured in some way. See the Documentation page for more information.

Permalink

http://purl.org/knot/data/paisa-corpus