Documentation Criteria

On this page you will find details of the criteria and rules used to document the research projects and scholarly digital objects included in the KNOT Catalogue.

These projects and objects come from a census that began in the first year of the project and is ongoing, alongside recommendations of existing or concluded projects to be added. The information used to create the records was taken from official websites, documentation, and/or public repositories in two steps. The first step, carried out as part of the census, focused on key information that was used to help develop the data model. The second step, carried out after the data model was developed and while the catalogue was being populated, focused on any missing information that was relevant and/or essential, and included looking for additional sources in public repositories such as GitHub and CLARIN-IT. Any information that could not be ascertained without extensive additional work was left out.

The KNOT Catalogue uses the CLEF application and with this choice came some constraints on aligning our full data model with the functionalities and limitations of the application (for example its requirement for each class instance to be attached to a template). As such, we developed a simplified version of the KNOT Data Model aligned with these constraints yet still able to describe scholarly digital objects and activities meaningfully and with the goals and specificities of our project in mind.

Therefore, the KNOT Catalogue makes use of three key KNOT-DM classes - prov:Activity, dcat:Dataset, and dcat:DataService - to create records that represent research projects and connect them to the objects they have created. These objects include both collections of information that has already been structured in some way (rather than raw data) and digital forms that enable interaction with information via user interfaces. As in the full data model, the three classes can be connected to each other using properties, and these connections are represented in the catalogue's interface, allowing for navigation between records. This focus on three classes does, however, limit the use of properties and introduces some new usages, while many of the classes that cannot be used due to the limitations of CLEF are replaced by data types (primarily URLs and text strings).

The figure below offers a visual summary of the simplified data model the catalogue uses, while details of the full model can be found on our website.

A summary of the simplified KNOT-DM used by the KNOT Catalogue.

What follows is a detailed breakdown of the simplified data model used by the catalogue including the properties used by each class, which appear on the catalogue records, and where the information comes from. For more information about why properties are used refer to the full data model documentation. The following legend applies to the textual highlights based on the "Graphical Framework for OWL Ontologies": a yellow highlight indicates a class, a blue highlight indicates an object property, and a green highlight indicates a data property.

Research Project

Research projects are catalogued as instances of the prov:Activity class. These are examples of scholarly digital activity that has produced one or more scholarly digital objects, which are in turn catalogued as instances of dcat:Dataset or dcat:DataService (see below).

The project title (dct:title) is the one most evident in the public documentation with the understanding that it may differ from how the project is referred to elsewhere. The goal is to be accurate and consistent.

The project description (dct:description) is summarized in English if it is over 200 words or not clear enough; otherwise it is included as is. A description that is only available in Italian is not translated unless one of these two conditions applies.
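The description rule can be sketched as a small check. This is only an illustration, not part of the catalogue's tooling: the function name and its parameters are hypothetical, words are counted by splitting on whitespace, and whether a description is "clear enough" remains a human judgement that is passed in as a flag.

```python
def needs_english_summary(description: str, is_clear: bool = True,
                          word_limit: int = 200) -> bool:
    """Return True when a description should be summarized in English.

    Hypothetical helper: `is_clear` stands in for the cataloguer's own
    judgement, and `word_limit` defaults to the 200-word threshold.
    """
    word_count = len(description.split())  # whitespace-separated words
    return word_count > word_limit or not is_clear


# A 201-word description exceeds the threshold; a short, clear one does not.
long_text = " ".join(["parola"] * 201)
print(needs_english_summary(long_text))
print(needs_english_summary("A short and clear description."))
```

Under this reading, a description of exactly 200 words is kept as is, because the rule applies only to texts over 200 words.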

The primary subjects (dct:subject) include at least 1 academic discipline (up to 3), taken from the Vocabolario delle Aree CUN, dei Macrosettori, dei Settori Concorsuali e dei Settori Scientifico-Disciplinari delle Università Italiane vocabulary available within the OntoPia network. Due to the rigidity of the Italian SSD system, we try to match the multidisciplinary approach of humanities projects by assigning more than one discipline to each project, using both the project’s focus and the contributors’ official positions to guide these choices. This does lead to an overrepresentation of certain fields, e.g. Informatics. In addition, and wherever possible, we also add 1 generic subject (for example a historical figure or time period) referenced via WikiData.
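The cardinality limits on subjects can be expressed as a simple check. The sketch below is an assumption about how such a rule could be tested, not a description of CLEF's behaviour: the function name is hypothetical, and "1 generic subject" is read here as "at most one".

```python
def check_subjects(disciplines: list[str], generic_subjects: list[str]) -> list[str]:
    """Return a list of problems found in a record's dct:subject values.

    Hypothetical helper: `disciplines` holds SSD vocabulary concepts and
    `generic_subjects` holds WikiData references.
    """
    problems = []
    if not 1 <= len(disciplines) <= 3:
        problems.append("expected between 1 and 3 academic disciplines")
    if len(generic_subjects) > 1:
        problems.append("expected at most 1 generic subject")
    return problems


# Placeholder labels, not real vocabulary identifiers.
print(check_subjects(["discipline-a", "discipline-b"], ["generic-subject"]))
print(check_subjects([], []))
```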

The start and end dates (prov:startedAtTime, prov:endedAtTime) are taken from existing documentation if available, though they can be notoriously difficult to ascertain beyond the specific year.

Associated entities (prov:associatedWith) are used for non-human agents (organisations) while contributors (dct:contributor) are used for human agents. The former includes universities, research centers, laboratories, and any other relevant entity that is credited in the project. When a project details a specific department within a university, this is listed as a separate entity with reference to the parent institution in the descriptive text string. The latter includes credited human agents up to a limit of 6; when more people are credited, the list is limited to coordinators, directors, or similar positions. In the full data model agents would in turn be described as instances of specific classes; however, due to the limitations of the application, these properties point directly to authority controls.
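One possible reading of the contributor limit is sketched below. It assumes that when more than 6 people are credited only those in leading roles are kept; the role labels and the function name are hypothetical, and "similar positions" would in practice need a cataloguer's judgement.

```python
# Hypothetical role labels standing in for "coordinators and directors or similar".
LEAD_ROLES = {"coordinator", "director"}


def select_contributors(people: list[tuple[str, str]], limit: int = 6) -> list[str]:
    """Pick the names to list as dct:contributor from (name, role) pairs.

    Assumed rule: list everyone when there are at most `limit` people,
    otherwise list only people whose role is in LEAD_ROLES.
    """
    if len(people) <= limit:
        return [name for name, _ in people]
    return [name for name, role in people if role.lower() in LEAD_ROLES]


team = [("Person A", "director"), ("Person B", "coordinator")]
team += [(f"Person {i}", "researcher") for i in range(5)]  # 7 people in total
print(select_contributors(team))
```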

Related projects (prov:wasInformedBy) is used to create a connection between two projects when there has been some sort of information exchange between the activities, most commonly reusing data or objects (such as software).

Location (prov:atLocation) is that of the associated entities and reflects the idea that the activity took place in a specific location, though it is of course an approximation rather than exact.

Research activities (prov:wasInfluencedBy) refers to the type of research activity a project undertook and makes use of the TaDiRAH: Taxonomy of Digital Research Activities in the Humanities vocabulary, limited to 3 in order to keep the usage manageable and meaningful. The activities are taken from the documentation wherever possible or inferred from outputs. As with the SSD vocabulary, some degree of overrepresentation is unavoidable.

Technologies used (prov:used) are taken from documentation where available and use concepts from the KNOT Technology Thesaurus whenever possible, or else link to external websites. Where a project uses software created by another project, which in our catalogue is most common in the case of EVT, this is linked via the related projects property.

Outputs (prov:generated) are the scholarly digital objects created by a project activity and should include at least 1 dataset and/or web service.

Bibliographic references (dct:isReferencedBy) are taken from available documentation; if the project has a dedicated publications page, that page is prioritised over individual publications.

The homepage (foaf:homepage) is self-explanatory, providing a URL to the project's website. This is an update from the full data model where the property is attached to the dcat:Catalog class.

Lastly, the status field (dbo:currentStatus) uses two values: completed or ongoing. This is based on documentation though it can be particularly difficult to ascertain and in many cases is guessed based on last updates. This is an addition specific to the simplified data model based on CLEF's existing usage of the property for noting the status of graphs.
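Some of the constraints on a research project record described above (two status values, at least one output, at most three research activities) can be collected into a minimal sketch. The class, field, and function names are hypothetical and cover only a handful of the properties; this is not how CLEF stores or validates records.

```python
from dataclasses import dataclass, field

ALLOWED_STATUSES = {"completed", "ongoing"}


@dataclass
class ProjectRecord:
    """Hypothetical, reduced view of a research project record."""
    title: str                                                    # dct:title
    status: str                                                   # dbo:currentStatus
    outputs: list[str] = field(default_factory=list)              # prov:generated
    research_activities: list[str] = field(default_factory=list)  # TaDiRAH concepts


def validate(record: ProjectRecord) -> list[str]:
    """Return the list of rule violations found in a record."""
    problems = []
    if record.status not in ALLOWED_STATUSES:
        problems.append("status must be 'completed' or 'ongoing'")
    if not record.outputs:
        problems.append("at least 1 dataset or web service is expected")
    if len(record.research_activities) > 3:
        problems.append("research activities are limited to 3")
    return problems


# Placeholder values only; "ex:" identifiers are not real catalogue records.
record = ProjectRecord(title="Example Project", status="ongoing",
                       outputs=["ex:example-dataset"])
print(validate(record))
```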

Digital Object

Digital objects are catalogued as instances of the dcat:Dataset class when these objects represent collections of information that has already been structured in some way (rather than raw, unstructured data).

The object title (dct:title) is taken from documentation when available or combines the project title and the type of object if a clear title is unavailable. The goal is to be accurate and consistent, especially as objects are not always given a clear title within a project.

The description (dct:description) is summarized in English if it is over 200 words or not clear enough; otherwise it is copied as is. A description that is only available in Italian is not translated unless one of these two conditions applies. Whenever the description of a project also includes that of the objects it created, these are split where logical and assigned to each.

The release date (dct:issued) is that on which the object was first published while the modified date (dct:modified) is that on which it was last updated.

Related datasets (dct:relation) link to other scholarly digital objects that are versions of this object or incorporate the whole or parts of it.

Downloads (dcat:distribution) are URLs where the object can be downloaded directly. In the full data model these would in turn be described as instances of dcat:Distribution.

Publishers (dct:publisher) are non-human agents that host or publish the dataset via a public repository or hosting solution. As with associated entities at the project level, when a department is responsible both it and the hosting university are included. Creators (dct:creator) are human agents responsible for the creation or development of the object. As with the research projects, this field is limited to 6; when more people are credited, it is limited to coordinators, directors, or similar positions. As with the research project, these properties point to authority controls.

The type (dct:type) of the dataset is taken from the KNOT Taxonomy vocabulary, limited to 1.

The primary subjects (dcat:theme) always include 1 concept from the Data Theme EU vocabulary (based on DCAT specifications) alongside 1 generic subject, most often similar to that of the project and referenced via WikiData.

Temporal and geographical coverage (dct:temporal, dct:spatial) use concepts referenced via WikiData when possible, though in the latter case if there are too many specific locations a broader location is used to keep things manageable (e.g. Italy instead of specific Italian cities).

Bibliographic references (dct:isReferencedBy) are taken from available documentation; if the project has a dedicated publications page, that page is prioritised over individual publications.

Landing page (dcat:landingPage) is the URL where the object can be accessed, rather than downloaded, while additional information (foaf:page) includes URLs where additional information about the object can be found, such as how it was assembled, which is not available at the landing page.

Data and metadata languages (dct:language) use concepts from the EU Language vocabulary to detail which languages can be found in both the data and metadata of the object.

External identifiers (adms:identifier) include any relevant identifier for the object from an external, recognised source such as DOI, DataCite, or others.

Licenses (dct:license) are included when explicit in documentation and where possible state what element of the object the license refers to, e.g. metadata. Licenses are taken from the Licenze controlled vocabulary from AgID. This vocabulary notably does not include some key license types, such as the GNU licenses; in those cases the license is linked via its official URL.
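The fallback from the controlled vocabulary to an official URL can be sketched as a two-step lookup. The function name and the vocabulary entry below are hypothetical placeholders (the example.org URI is not a real AgID identifier); the GNU address is given only as an example of an official license page.

```python
from typing import Optional


def license_reference(name: str, vocabulary: dict[str, str],
                      official_urls: dict[str, str]) -> Optional[str]:
    """Resolve a license name to the reference stored in dct:license.

    Hypothetical helper: prefer the controlled vocabulary, fall back to
    the license's official URL, and return None when neither is known.
    """
    if name in vocabulary:
        return vocabulary[name]
    return official_urls.get(name)


# Placeholder vocabulary entry, not a real AgID concept URI.
vocabulary = {"CC BY 4.0": "https://example.org/licences/cc-by-4.0"}
official_urls = {"GNU GPL 3.0": "https://www.gnu.org/licenses/gpl-3.0.html"}

print(license_reference("CC BY 4.0", vocabulary, official_urls))
print(license_reference("GNU GPL 3.0", vocabulary, official_urls))
```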

Web Service

Digital objects are catalogued as instances of the dcat:DataService class when these objects represent digital forms that enable interaction with information via user interfaces.

The web service title (dct:title) is taken from documentation when available or combines the project title and the type of object if a clear title is unavailable. The goal is to be accurate and consistent, especially as services are not always given a clear title within a project.

The description (dct:description) is summarized in English if it is over 200 words or not clear enough; otherwise it is copied as is. A description that is only available in Italian is not translated unless one of these two conditions applies. Whenever the description of a project also includes that of the objects it created, these are split where logical and assigned to each.

Access URL (dcat:endpointURL) includes the primary location(s) at which the service can be accessed.

Additional information (dcat:endpointDescription) provides one or more URLs that offer descriptions of the service(s) including operations, parameters, usage, etc.

The type (dct:type) of the web service is taken from the KNOT Taxonomy vocabulary, limited to 1.

Research activities (dcat:theme) refers to the type of research activity a service makes available to users, and makes use of concepts from the TaDiRAH: Taxonomy of Digital Research Activities in the Humanities vocabulary, limited to 3 in order to keep the usage manageable and meaningful. These are chosen based on the type of service and the specifics of the interaction(s) it offers users.

Licenses (dct:license) are included when explicit in documentation and where possible state what element of the object the license refers to, e.g. metadata. Licenses are taken from the Licenze controlled vocabulary from AgID.

Uses dataset (dcat:dataset) refers to the digital scholarly object that the service makes available.
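The way the three record types connect, with a project pointing to its outputs via prov:generated and a web service pointing to the dataset it exposes via dcat:dataset, can be illustrated with plain subject-predicate-object triples. This is a minimal sketch under stated assumptions: the "ex:" identifiers and the helper function are hypothetical, and the catalogue itself manages these links through CLEF rather than through code like this.

```python
Triple = tuple[str, str, str]

# Placeholder records; the "ex:" identifiers are not real catalogue entries.
TRIPLES: list[Triple] = [
    ("ex:project", "rdf:type", "prov:Activity"),
    ("ex:edition-data", "rdf:type", "dcat:Dataset"),
    ("ex:edition-site", "rdf:type", "dcat:DataService"),
    ("ex:project", "prov:generated", "ex:edition-data"),
    ("ex:project", "prov:generated", "ex:edition-site"),
    ("ex:edition-site", "dcat:dataset", "ex:edition-data"),
]


def linked_records(triples: list[Triple], subject: str, predicate: str) -> list[str]:
    """Return the records that `subject` points to through `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


# Navigate from the project to its outputs, then from the service to its dataset.
print(linked_records(TRIPLES, "ex:project", "prov:generated"))
print(linked_records(TRIPLES, "ex:edition-site", "dcat:dataset"))
```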