Documentation Criteria

On this page you will find details of the criteria and rules used to document the research projects and digital scholarly objects included in the KNOT Catalogue.

These projects and objects come from a census that began in the first year of the project and is still ongoing, supplemented by recommendations of existing or concluded projects to add. The information was taken from official websites, documentation, and public repositories in two steps. First, as part of the census, the focus was on key information that helped elaborate the data model. Second, once the data model had been developed and while the catalogue was being populated, the focus was on any missing information that was relevant or essential, including looking for additional sources in public repositories such as GitHub and CLARIN-IT. Any information that could not be ascertained without extensive additional work was left out.

The KNOT Catalogue uses the CLEF application, and this choice brought some constraints in aligning our data model with the functionalities available in the application. As such, KNOT Catalogue entries follow our data model with some minor adjustments and a simplified approach, though our goal is eventually to adjust the functionalities so that the catalogue and its knowledge graph align more fully with the data model. The primary change is that the catalogue uses only three of the key classes: prov:Activity, dcat:Dataset, and dcat:DataService. This in turn limits the use of properties and introduces some new usages (for example, adding a homepage to the research project), as well as an additional property for the status of the project that makes use of the built-in status property of the application's records (currentStatus from the DBpedia ontology).

The following legend applies to the textual highlights based on the "Graphical Framework for OWL Ontologies": a yellow highlight indicates a class, a blue highlight indicates an object property, and a green highlight indicates a data property.

Research Project

Research projects are catalogued as instances of the prov:Activity class. These are examples of digital scholarly activity that has produced digital scholarly objects, which are in turn catalogued as instances of dcat:Dataset or dcat:DataService (see below).

The project title (dct:title) is the one most evident in the public documentation with the understanding that it may differ from how the project is referred to elsewhere. The goal is to be accurate and consistent.

The project description (dct:description) is summarized into English if it is over 200 words or not clear enough; otherwise it is copied as is. A description available only in Italian is not translated unless one of these two conditions applies.
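The description rule above can be sketched as a small check. This is only an illustration: the function name needs_summary and the is_clear flag are hypothetical, and clarity remains an editorial judgement that the sketch simply takes as an input.

```python
def needs_summary(description: str, is_clear: bool = True, max_words: int = 200) -> bool:
    """Return True when a description should be summarized into English.

    Assumption: words are counted by splitting on whitespace, and `is_clear`
    records the cataloguer's own judgement rather than anything computed.
    """
    word_count = len(description.split())
    return word_count > max_words or not is_clear


short_text = "A digital edition of medieval letters."
long_text = " ".join(["word"] * 201)

print(needs_summary(short_text))                  # False: copied as is
print(needs_summary(long_text))                   # True: over 200 words
print(needs_summary(short_text, is_clear=False))  # True: judged unclear
```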

The primary subjects (dct:subject) include at least 1 academic discipline (up to 3), taken from the Vocabolario delle Aree CUN, dei Macrosettori, dei Settori Concorsuali e dei Settori Scientifico-Disciplinari delle Università Italiane vocabulary available within the OntoPia network. Due to the rigidity of the Italian SSD system, we try to match the multidisciplinary approach of humanities projects by assigning more than one discipline to each project where possible, using both the project's focus and the contributors' official positions to guide these choices. This does lead to an over-representation of certain fields, e.g. Informatics. In addition, wherever possible, we add 1 generic subject (for example a historical figure or time period) referenced via Wikidata.

The start and end dates (prov:startedAtTime, prov:endedAtTime) are taken from existing documentation if available, though they can be notoriously difficult to ascertain beyond the specific year.

Associated entities (prov:associatedWith) are used for non-human agents (organisations) while contributors (dct:contributor) are used for human agents. The former includes universities, research centers, laboratories, and any other relevant entity that is credited in the project. When a project names a specific department within a university, the department is listed as a separate entity with reference to the parent institution. The latter includes credited human agents with a limit of 6, after which the list is limited to coordinators, directors, or similar positions. These two properties are more limited in their usage than the data model allows, because CLEF can only reference individuals of classes for which the application has a template.
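One possible reading of the contributor limit is sketched below. It assumes that when more than 6 people are credited, only those in leading roles are kept; the role labels in LEAD_ROLES and the function name select_contributors are hypothetical and not part of the catalogue.

```python
# Assumption: illustrative labels for "coordinators and directors or similar positions".
LEAD_ROLES = {"coordinator", "director", "principal investigator"}


def select_contributors(people, limit=6):
    """Return the names to record as dct:contributor.

    `people` is a list of (name, role) pairs taken from project credits.
    Up to `limit` people are all listed; beyond that only lead roles are kept.
    """
    if len(people) <= limit:
        return [name for name, _ in people]
    return [name for name, role in people if role.lower() in LEAD_ROLES]


credits = [
    ("A", "Editor"), ("B", "Coordinator"), ("C", "Developer"), ("D", "Director"),
    ("E", "Editor"), ("F", "Editor"), ("G", "Editor"),
]
print(select_contributors(credits[:6]))  # six credited people: all are listed
print(select_contributors(credits))      # seven credited people: only B and D
```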

Related projects (prov:wasInformedBy) is used to create a connection between two projects when there has been some sort of information exchange between the activities, most commonly reusing data or objects (such as software).

Location (prov:atLocation) is that of the associated entities and reflects the idea that the activity took place in a specific location, though it is of course an approximation rather than exact.

Research activities (prov:wasInfluencedBy) refers to the type of research activity a project undertook and makes use of the TaDiRAH: Taxonomy of Digital Research Activities in the Humanities vocabulary, limited to 3 in order to keep the usage manageable and meaningful. The activities are taken from the documentation wherever possible or inferred from the outputs. As with the SSD vocabulary, some degree of over-representation is unavoidable.

Technologies used (prov:used) are taken from documentation where available and use concepts from the KNOT Technology Thesaurus whenever possible, or else link to external websites. Where a project uses software created by another project (in our catalogue most commonly EVT), this software is linked via the related projects property.

Outputs (prov:generated) are the digital scholarly objects created by a project activity and should include at least 1 dataset and/or web service.
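The numeric limits stated above for research projects (1 to 3 disciplines, at most 3 research activities, at least 1 output) can be checked with a short validation sketch. The dictionary keys and the function name check_project are hypothetical; they do not reflect CLEF's internal record format.

```python
def check_project(record):
    """Return a list of problems found in a project record (empty if none).

    Assumption: `record` is a plain dict whose keys are illustrative only.
    """
    problems = []
    if not 1 <= len(record.get("disciplines", [])) <= 3:
        problems.append("expected 1 to 3 academic disciplines (dct:subject)")
    if len(record.get("research_activities", [])) > 3:
        problems.append("expected at most 3 research activities")
    if len(record.get("outputs", [])) < 1:
        problems.append("expected at least 1 dataset or web service (prov:generated)")
    return problems


example = {
    "disciplines": ["discipline-1", "discipline-2"],
    "research_activities": ["activity-1"],
    "outputs": [],
}
for problem in check_project(example):
    print(problem)
```

Returning a list of messages rather than raising on the first failure lets a cataloguer see every gap in a record at once.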

Bibliographic references (dct:isReferencedBy) are taken from available documentation; if the project has a dedicated publications page, it is prioritised over individual publications.

The homepage (foaf:homepage) is self-explanatory, providing a URL to the project's website.

Lastly, the status field (dbo:currentStatus) uses two values: completed or ongoing. This is based on documentation, though it can be particularly difficult to ascertain and in many cases is inferred from the date of the last update.
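To show how the properties described in this section fit together, here is a minimal sketch that writes a project record as Turtle using only the Python standard library. The example.org URIs, the title, and the helper to_turtle are invented for illustration, and this is not how CLEF itself stores or exports records.

```python
# Namespace IRIs for the prefixes used in this section.
PREFIXES = {
    "prov": "http://www.w3.org/ns/prov#",
    "dct": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dbo": "http://dbpedia.org/ontology/",
}


def to_turtle(subject, rdf_type, statements):
    """Serialise one record as Turtle.

    Assumption: each value in `statements` is already a valid Turtle term
    (a quoted literal or an <IRI>), and at least one statement is given.
    """
    header = [f"@prefix {prefix}: <{iri}> ." for prefix, iri in PREFIXES.items()]
    body = " ;\n".join(f"    {prop} {value}" for prop, value in statements)
    return "\n".join(header) + f"\n\n<{subject}> a {rdf_type} ;\n{body} .\n"


# Hypothetical project; every URI below is a placeholder.
project_turtle = to_turtle(
    "https://example.org/project/example",
    "prov:Activity",
    [
        ("dct:title", '"Example Project"'),
        ("prov:generated", "<https://example.org/dataset/example>"),
        ("foaf:homepage", "<https://example.org/>"),
        ("dbo:currentStatus", '"ongoing"'),
    ],
)
print(project_turtle)
```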

Digital Object

Digital objects are catalogued as instances of the dcat:Dataset class when these objects represent collections of information that has already been structured in some way (rather than raw, unstructured data).

The object title (dct:title) is taken from documentation when available or combines the project title and the type of object when necessary. The goal is to be accurate and consistent, especially as objects are not always given a clear name within a project.

The description (dct:description) is summarized into English if it is over 200 words or not clear enough; otherwise it is copied as is. A description available only in Italian is not translated unless one of these two conditions applies. Whenever the description of a project also includes that of the objects it created, the text is split where logical and assigned to each object.

The release date (dct:issued) is that on which the object was first published while the modified date (dct:modified) is that on which it was last updated.

Related datasets (dct:relation) link to other digital scholarly objects that are versions of this object or incorporate the whole or parts of it.

Downloads (dcat:distribution) are URLs where the object can be downloaded directly (in the full data model these would in turn be described as instances of dcat:Distribution).

Publishers (dct:publisher) are non-human agents that host or publish the dataset via a public repository or hosting solution. As with associated entities at the project level, when a department is responsible both it and the hosting university are included. Creators (dct:creator) are human agents responsible for the creation or development of the object. As with research projects, this field is limited to 6 agents, after which it is limited to coordinators, directors, or similar positions. The level of description of agents is likewise more limited than in the full data model.

The type (dct:type) of the dataset is taken from the KNOT Taxonomy vocabulary, limited to 1.

The primary subjects (dcat:theme) always include 1 concept from the Data Theme EU vocabulary alongside 1 generic subject, most often similar to that of the project and referenced via Wikidata.

Temporal and geographical coverage (dct:temporal, dct:spatial) use concepts referenced via Wikidata when possible. For geographical coverage, if there are too many specific locations, a broader location is used to keep things manageable (e.g. Italy instead of specific Italian cities).
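The fallback to a broader location can be sketched as follows. The text does not say how many locations count as "too many", so the threshold of 5 used here is purely an assumption, as are the function name and the plain-string place names (real entries would reference Wikidata concepts).

```python
def spatial_coverage(places, broader_place, max_places=5):
    """Return the locations to record under dct:spatial.

    Assumption: `max_places` is a hypothetical cut-off; in practice the
    decision is an editorial judgement rather than a fixed number.
    """
    if len(places) <= max_places:
        return list(places)
    return [broader_place]


cities = ["Bologna", "Pisa", "Rome", "Venice", "Turin", "Naples"]
print(spatial_coverage(cities[:2], "Italy"))  # few places: kept as they are
print(spatial_coverage(cities, "Italy"))      # too many: replaced by "Italy"
```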

Bibliographic references (dct:isReferencedBy) are taken from available documentation; if the project has a dedicated publications page, it is prioritised over individual publications.

Landing page (dcat:landingPage) is the URL where the object can be accessed, rather than downloaded, while additional information (foaf:page) includes URLs where additional information about the object can be found, such as how it was assembled, and which is not available at the landing page.

Data and metadata languages (dct:language) use concepts from the EU Language vocabulary to detail which languages can be found in both data and metadata.

External identifiers (adms:identifier) include any relevant identifier for the object from an external, recognised source such as DOI, DataCite, or others, and are included as URLs.

Licenses (dct:license) are included when explicit in documentation and, where possible, state which element of the object the license refers to, e.g. the metadata. Licenses are taken from the Licenze controlled vocabulary from AgID. This vocabulary notably does not include some key license types, such as GNU licenses, which are instead linked via their official URL.

Web Service

Digital objects are catalogued as instances of the dcat:DataService class when these objects represent digital forms that enable interaction with information via user interfaces.

The web service title (dct:title) is taken from documentation when available or combines the project title and the type of object when necessary. The goal is to be accurate and consistent, especially as services are not always given a clear name within a project.

The description (dct:description) is summarized into English if it is over 200 words or not clear enough; otherwise it is copied as is. A description available only in Italian is not translated unless one of these two conditions applies. Whenever the description of a project also includes that of the objects it created, the text is split where logical and assigned to each object.

Access URL (dcat:endpointURL) includes the primary location(s) at which the service can be accessed.

Additional information (dcat:endpointDescription) provides one or more URLs that offer descriptions of the service(s) including operations, parameters, usage, etc.

The type (dct:type) of the web service is taken from the KNOT Taxonomy vocabulary, limited to 1.

Research activities (dcat:theme) refers to the type of research activity a service makes available to users, and makes use of concepts from the TaDiRAH: Taxonomy of Digital Research Activities in the Humanities vocabulary, limited to 3 in order to keep the usage manageable and meaningful. These are chosen based on the type of service and the specifics of the interaction(s) it offers users.

Licenses (dct:license) are included when explicit in documentation and, where possible, state which element of the object the license refers to, e.g. the metadata. Licenses are taken from the Licenze controlled vocabulary from AgID.

Uses dataset (dcat:dataset) refers to the scholarly digital object that the service makes available.