KNOT

PAISÀ Corpus

Description

The Paisà corpus is a large collection of Italian web texts, created within the PAISÀ project. It contains approximately 380,000 documents drawn from about 1,000 different websites, for a total of about 250 million words. Approximately 260,000 documents come from Wikipedia and approximately 5,600 from other Wikimedia Foundation projects. About 9,300 documents come from Indymedia, and we estimate that about 65,000 come from blog services. Each document is marked in the corpus by an XML "text" tag with two attributes: "id", a unique numeric code assigned to the document, and "url", the original URL of the document.
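The document markup described above can be read with a few lines of Python. This is a minimal sketch, not the project's own tooling: it assumes that "id" precedes "url" in each tag, that attribute values are double-quoted, and that the element body is plain text. The `Document` type, the `iter_documents` helper, and the sample documents with their example.org URLs are hypothetical and serve only as illustration.

```python
import re
from typing import Iterator, NamedTuple


class Document(NamedTuple):
    """One corpus document (hypothetical container for this sketch)."""
    doc_id: int
    url: str
    body: str


# Matches one <text id="..." url="..."> ... </text> element.
# Assumption: "id" comes before "url" and both values are double-quoted.
TEXT_ELEMENT = re.compile(
    r'<text\s+id="(?P<id>\d+)"\s+url="(?P<url>[^"]*)"\s*>(?P<body>.*?)</text>',
    re.DOTALL,
)


def iter_documents(raw: str) -> Iterator[Document]:
    """Yield every <text> element found in a chunk of corpus text."""
    for match in TEXT_ELEMENT.finditer(raw):
        yield Document(int(match["id"]), match["url"], match["body"].strip())


# Invented sample data, not taken from the corpus.
SAMPLE = """<text id="1" url="http://example.org/a">
Primo documento di esempio.
</text>
<text id="2" url="http://example.org/b">
Secondo documento di esempio.
</text>"""

docs = list(iter_documents(SAMPLE))
for doc in docs:
    print(doc.doc_id, doc.url)
```

A regular expression is used instead of an XML parser because a file made of consecutive "text" elements has no single root element, so a strict parser would reject it as a whole. For a corpus of this size, reading the file in chunks or line by line would be preferable to loading it into memory at once.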

Release Date

2013-01-01

Digital Object

This record catalogues a scholarly digital object as an instance of the DCAT Dataset class. Such objects are collections of information that has already been structured in some way, as opposed to raw, unstructured data. See the Documentation page for more information.

Permalink

http://w3id.org/knot/data/paisa-corpus
