Crossroads, shortcuts, detours, bypasses or dead ends? Attempts at standardization and interoperability within the context of the CLARIAH-DE project

1. Abstract

In recent years, both DARIAH-DE[1] and CLARIN-D[2] have established themselves as digital research infrastructures for tools and research data in Germany, and both have developed materials for teaching and further education in the digital humanities[3]. Whereas DARIAH-DE focused clearly on textual data and its representation, e.g. in the form of digital editions, CLARIN-D concentrated on providing data and tools for linguistic analysis. Funded by the German Federal Ministry of Education and Research[4], the two infrastructures will be merged within CLARIAH-DE (2019-2021), building on earlier cooperation and coordination processes and aiming at future interoperability (also in accordance with the FAIR data principles[5]).

In CLARIAH-DE Work Package (WP) 1 “Research Data, Standards and Procedures”[6], one potential way to enable interoperability is to use prevalent TEI customizations such as the German Text Archive’s base format (DTABf)[7] as an exchange or target format. While the DTABf is already established as a pivot format for text collections (especially for annotating the full texts of historical prints, and for newspapers and simply structured manuscripts), it will now be evaluated to what extent the DTABf can be applied to the diverse field of digital scholarly editing. Although such an exchange or pivot format may not be able to represent all previously encoded information in its original depth, the underlying idea is that it can still be understood as a core data set and can therefore function as a common denominator.

The evaluation is based on case studies of selected digital editions and is guided by the following questions:

Which phenomena in the digital editions are currently not covered by the DTABf tag set?
Which phenomena are encoded differently in the DTABf than in the digital editions? Can they be mapped without major problems, or are there fundamental (semantic) differences?
Which information is required by the DTABf, but has not been encoded in the digital editions?
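
The first and third questions can be approximated mechanically by comparing element inventories; the second still needs editorial judgement, since identical element names can carry different semantics. The following is a minimal sketch under stated assumptions: the sets PERMITTED and REQUIRED are hypothetical placeholders and do not reproduce the actual DTABf tag set, and the sample document is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Placeholder inventories for illustration only; NOT the actual DTABf tag set.
PERMITTED = {"TEI", "text", "body", "div", "head", "p", "pb", "lb", "hi"}
REQUIRED = {"pb"}


def element_names(tei_xml):
    """Collect the local names of all elements in a TEI document."""
    root = ET.fromstring(tei_xml)
    # ElementTree writes namespaced tags as "{uri}name"; keep the name only.
    return {el.tag.split("}")[-1] for el in root.iter()}


def compare(tei_xml):
    """Return elements outside the permitted set and required ones not found."""
    used = element_names(tei_xml)
    return {
        "not_covered": used - PERMITTED,
        "required_but_absent": REQUIRED - used,
    }


# Invented sample edition fragment with a critical apparatus entry.
SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><div>'
    "<p>Some <app><lem>reading</lem><rdg>readinge</rdg></app></p>"
    "</div></body></text></TEI>"
)

result = compare(SAMPLE)
print(sorted(result["not_covered"]))
print(sorted(result["required_but_absent"]))
```

With these placeholder sets, the apparatus elements of the sample are reported as not covered and the page break as required but absent. A real evaluation would also have to compare attributes and their values, not only element names.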

The encoded phenomena can be classified as loss-free transferable, partially transferable, or missing or lost. Depending on the effort and the outcome, a conversion may result in one of two kinds of text: a DTABf-valid text, in which all necessary information has been encoded accordingly but the encoding is limited to the elements and attribute-value pairs permitted in the DTABf; or a DTABf-compliant text, which contains a core encoding according to the DTABf but may also carry annotations that go beyond it.
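
The classification and the two possible conversion outcomes can be sketched as a small report. This is an illustrative sketch, not the project's procedure: the MAPPING table, its entries and the function names are hypothetical, and the transfer status assigned to each element is invented rather than taken from the DTABf documentation.

```python
from enum import Enum


class Transfer(Enum):
    LOSS_FREE = "loss-free transferable"
    PARTIAL = "partially transferable"
    LOST = "missing or lost"


# Hypothetical mapping from source elements to a transfer status.
MAPPING = {
    "p": Transfer.LOSS_FREE,
    "hi": Transfer.LOSS_FREE,
    "add": Transfer.PARTIAL,
}


def classify(element_names):
    """Group element names by transfer status; unmapped names count as lost."""
    report = {status: [] for status in Transfer}
    for name in sorted(element_names):
        report[MAPPING.get(name, Transfer.LOST)].append(name)
    return report


def outcome(converted_elements, permitted):
    """Label a converted text, assuming its core encoding already follows the DTABf.

    Only permitted elements remain: "DTABf-valid".
    Additional annotations were kept: "DTABf-compliant".
    """
    extra = set(converted_elements) - set(permitted)
    return "DTABf-valid" if not extra else "DTABf-compliant"


report = classify({"p", "hi", "add", "app"})
for status, names in report.items():
    print(status.value, names)
print(outcome({"p", "hi"}, permitted={"p", "hi", "div"}))
print(outcome({"p", "app"}, permitted={"p", "hi", "div"}))
```

Treating every unmapped element as lost is a deliberately conservative choice for the sketch; in practice each such element would be reviewed individually before deciding whether to drop it or keep it as an annotation beyond the DTABf.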

To ensure the diversity and thus the representativeness of the underlying data, the editions chosen to serve as case studies for the evaluation should differ significantly from each other. Therefore, seven criteria were developed based on a review of relevant literature and existing editions. These criteria, for example the editorial model applied, the TEI modules used or the source material, resulted in an edition matrix (EdMa), which, beyond the purposes of the CLARIAH-DE project, should enable a rough categorisation of digital editions in general.

The paper aims to outline and discuss the problems in harmonizing standards between the two formerly separate infrastructures and the actions that have been taken so far to bypass these obstacles. Using CLARIAH-DE as a concrete example, the presentation will thus contribute to the general discussion on standardization within the context of national research infrastructures.

[1] https://de.dariah.eu/en/startseite. All links have been accessed in June 2020.

[2] https://www.clarin-d.net/en/.

[3] List of services developed and/or maintained in the context of DARIAH-DE: https://de.dariah.eu/en/list-services, for CLARIN-D see https://www.clarin-d.net/en/.

[4] https://www.bmbf.de/.

[5] https://www.force11.org/group/fairgroup/fairprinciples.

[6] In the WP members of the following institutions are cooperating: Chair of Media Informatics (University of Bamberg), Berlin-Brandenburg Academy of Sciences and Humanities, Göttingen State and University Library, Leibniz Institute for the German Language (Mannheim), and the Herzog August Library Wolfenbüttel.

[7] http://www.deutschestextarchiv.de/doku/basisformat/.