Training Algorithms to Read Complex Collections: Handwriting Classification for Improved HTR Models

1. Abstract

This paper will present a new handwriting grouping algorithm that has been developed to decrease the Character Error Rate (CER) for a collection of manuscript documents written in various hands and in multiple languages.

The Moravian Lives project (moravianlives.org), an international, collaborative DH project housed at Bucknell University, takes as its starting point a vast collection of archival materials held by the Moravian Church in the United Kingdom, Germany, and the United States. The materials include tens of thousands of handwritten ego-documents, written in a variety of handwriting styles. As the documents are held in archives in a variety of international locations, one of the goals of the Moravian Lives Project is to digitize and transcribe these memoirs, to make them accessible to a broader audience.

Initially, transcriptions for the memoirs were crowdsourced. However, crowdsourcing is replete with problems including varying accuracy of transcriptions, length of time needed to produce transcriptions, and a dearth of individuals who can read the handwriting styles of the documents, particularly those written in old German script.

To facilitate the transcription process, in early 2019 the Moravian Lives team began using Transkribus (transkribus.eu). The platform allows for creation of custom handwritten text recognition (HTR) models, which are based on previously transcribed memoirs and used to machine transcribe new documents (Muehlberger et al., 2019). With adequate training data (i.e., several hundred pages or 50,000+ words), models with a CER of five percent or less can be developed, which is sufficient for expediting archival work. Extant projects which have so far achieved this success rate may be based on multiple hands, drawing on significant data from each hand. For example, the University of Greifswald has trained successful models with a 5% CER on a corpus of 250,000 words written in three different hands. Similarly, the Bentham Project trained a highly accurate English-language model on 50,000 words written in a small number of hands (Muehlberger et al., 2019).

The numerous and varying handwriting styles found in the Moravian memoirs present multi-facted challenges to creating highly accurate models. We do not know how many scribes there were, or in most cases their identities, and we are continually coming across new handwriting styles. Memoir documents are between two to 50 pages in length; most documents we are working with are 10 pages or fewer, meaning there is not a lot of data per document. While we have had some success creating models via human identification of similarities in handwriting, we believe that automated scribe identification and/or automated grouping of handwriting by similarities in style could result in much more accurate models. To address this problem, an undergraduate computer science major and professor of computer science joined the Moravian Lives team and are experimenting with deep learning to author a grouping model, designed to group or sort memoirs by handwriting styles. These groupings should enable the creation of more accurate models in Transkribus, as well as more accurate transcription outputs.

Bibliography

Fischer, Andreas, Micheal Baechler, Angelika Garz, Marcus Liwicki, and Rolf Ingold. "A Combined System for Text Line Extraction and Handwriting Recognition in Historical Documents." In 2014 11th IAPR International Workshop on Document Analysis Systems. IEEE, 2014.

Kahle, Philip, Sebastian Colutto, Günter Hackl, and Günter Mühlberger. "Transkribus-A Service Platform for Transcription, Recognition and Retrieval of Historical Documents." In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2017.

Mohammed, Hussein, Volker Märgner, and H. Siegfried Stiehl. "Writer Identification for Historical Manuscripts: Analysis and Optimisation of a Classifier as an Easy-to-Use Tool for Scholars from the Humanities." In 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR). IEEE, 2018.

Moravian Lives. Bucknell University and the University of Gothenburg, 2019, http://moravianlives.org/ Accessed 20 October 2019.

Muehlberger, Guenter, Louise Seaward, Melissa Terras, Sofia Ares Oliveira, Vicente Bosch, Maximilian Bryan, Sebastian Colutto et al. "Transforming Scholarship in the Archives Through Handwritten Text Recognition." Journal of Documentation, vol. 75 no. 5, 2019, pp. 954-976.

Senatsprotokolle der Universität Greifswald. University of Greifswald, 2019, http://www.digitale-bibliothek-mv.de/viewer/toc/PPNUAG_0_1/1/LOG_0000/ Accessed 20 October 2019.

Transcribe Bentham. University College, London, 2019, http://transcribe-bentham.ucl.ac.uk/td/Transcribe_Bentham Accessed 20 October 2019.

Transkribus, University of Innsbruck, 2019, https://transkribus.eu/ Accessed 20 October 2019.

Ventresque, Vincent, Arianna Sforzini, and Marie-Laure Massot. "Transcribing Foucault’s handwriting with Transkribus." Journal of Data Mining & Digital Humanities, 2019. https://jdmdh.episciences.org/5218/pdf