De Gasperi's Corpus

Description

A collection of Alcide De Gasperi's public documents with gold and silver annotation. The corpus comprises 2,762 documents issued between 1901 and 1954, which were previously published in four volumes by Il Mulino but were not available in machine-readable form. Our repository contains all documents in three formats: txt, XML and tab-separated. Raw txt files contain only the body of the documents and can be used directly to extract embeddings or topics. XML files include metadata covering not only the title, date and place of publication, but also key concepts automatically extracted from each text (with the corresponding relevance score) and genre labels manually assigned by domain experts. In addition, the release includes silver annotation for lemma, part of speech, person names and place names with associated coordinates, in a CoNLL-like format.
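
As an illustration of how the tab-separated, CoNLL-like files might be read, the following minimal Python sketch groups token rows into sentences. The column order (`token`, `lemma`, `pos`, `ner`), the use of blank lines as sentence boundaries, and the sample rows are assumptions made for this example and are not taken from the release; check them against the files in conll-files.zip before relying on them.

```python
import csv
import io

# Hypothetical column order: verify against the actual files in conll-files.zip.
ASSUMED_COLUMNS = ["token", "lemma", "pos", "ner"]


def read_conll_like(handle, columns=ASSUMED_COLUMNS):
    """Yield sentences as lists of dicts from a tab-separated stream.

    Blank lines are treated as sentence boundaries (an assumption, as in
    common CoNLL variants). Columns beyond those named are ignored.
    """
    sentence = []
    # QUOTE_NONE keeps quotation marks in the text as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or all(not cell.strip() for cell in row):
            # Blank line: close the current sentence, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(dict(zip(columns, row)))
    if sentence:
        yield sentence


# Invented sample rows, for illustration only (not taken from the corpus).
SAMPLE = (
    "De\tDe\tSP\tB-PER\n"
    "Gasperi\tGasperi\tSP\tI-PER\n"
    "parla\tparlare\tV\tO\n"
    "\n"
    "Trento\tTrento\tSP\tB-LOC\n"
)

sentences = list(read_conll_like(io.StringIO(SAMPLE)))
for sent in sentences:
    print([(tok["token"], tok["ner"]) for tok in sent])
```

For a real file, the same function can be given a handle opened with `open(path, encoding="utf-8", newline="")`; if the files carry additional columns such as coordinates, extend the column list accordingly.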

Release Date

2019-07-16

Related Datasets

KIND (Kessler Italian Named-entities Dataset)

Downloads

https://github.com/StefanoMenini/De-Gasperi-s-Corpus/raw/master/conll-files.zip
https://github.com/StefanoMenini/De-Gasperi-s-Corpus/raw/master/txt-files.zip
https://github.com/StefanoMenini/De-Gasperi-s-Corpus/raw/master/xml-files.zip

Publisher

DH@FBK

Creator

Sara Tonelli.

Type

Corpus

Primary Subjects

Alcide De Gasperi, Education, culture and sport.

Temporal Coverage

20th Century.

Geographical Coverage

Italy.

Bibliographic References

http://ceur-ws.org/Vol-2481/paper71.pdf

Landing Page

https://github.com/StefanoMenini/De-Gasperi-s-Corpus

Data & Metadata Languages

Italian, English.

License

CC BY-NC-SA 4.0

Digital Object

This record catalogues a digital scholarly object as an instance of the DCAT Dataset class. These objects do not include raw data but rather collections of information that has already been structured in some way. See the Documentation page for more information.

Permalink

http://purl.org/knot/data/alcide-corpus