Semi-Automatic Identification of Travelogues

1. Abstract

Travel literature represents a rich source of information about the past, and has been of increasing interest in the scholarly community (c.f. Salzani & Tötösy de Zepetnek, 2010; Belgum et al., 2018). The Travelogues[1] project aims to study what we can learn from past views of foreign regions, cultures and religions in the light of present-day challenges such as mass tourism, migration and globalization. Comprising a team of historians, librarians and data scientists, Travelogues applies a transdisciplinary approach, combining quantitative and qualitative analytical methods to study a large-scale corpus of German language travelogues.

The project focuses on German-language holdings of the Austrian National Library printed between 1500 and 1876, including 167,570 digitized volumes. Those volumes have previously been digitized and processed by Optical Character Recognition (OCR) in the Austrian Books Online project—a public-private partnership of the Library and Google Books. In order to facilitate analysis of this vast and heterogenous collection, the project faces a number of challenges. The first challenge is to compile a corpus that includes as many travelogues from the inventory as possible. The second challenge is the profiling of the corpus at scale, analyzing it specifically for aspects of geographical coverage, salient terms over time and intertextuality. Finally, the key challenge lies in the identification of depictions of otherness in the corpus, and evolution of those depictions over time.

Travelogues is a work in progress. In this paper we will describe our results so far—how we created the corpus and approached the task of profiling—and present our plans for the upcoming steps required for a detailed analysis of the corpus.

As the intellectual basis for the project, historians first established a definition for this project’s use of the term travelogue. In the context of the project, a travelogue is defined as a specific form of media that records the experiences of a factually undertaken journey. Applying this definition, we created a balanced ground truth of digitized travelogues and non-travelogues (works that could belong to any other genre). To account for variations in the data such as document length, OCR quality or orthographic differences, we created separate ground-truth datasets for different time periods: the 16th, 17th and 18th centuries and 1800–1876. This is a manual and time-consuming process involving several steps, including keyword and metadata searches of the collection, cleansing and enrichment of heterogeneous metadata and comparisons with both contemporary and modern travelogue bibliographies (e.g. Chatzipanagioti-Sangmeister, 2006; Griep & Luber, 1990; Treue, 2014; Yerasimos, 1991). Every book we identified using this method was independently verified by a historian and a librarian.

Based on the ground truth for each period, we trained and evaluated different machine learning algorithms to classify works as either travelogues or non-travelogues. This evaluation was done using five-fold cross-evaluation on a training set and a validation set. Using the best-performing approach, we applied the models for each time period and classified all documents not part of the ground truth (a total of 161,522 books). Our model returned a confidence score, essentially quantifying the likelihood that a given document is or includes a travelogue. The top 200 findings for each time period (800 in total) were then manually evaluated by our domain experts in order to confirm the validity of the automated results. Our process revealed 345 previously-unknown volumes of travelogues that were not listed as travelogues in any bibliography we consulted so far, nor could they be found using conventional metadata search methods (e.g. searching for different spelling variations of the German word for travel). Although the 345 newly-discovered travelogues did not noticeably differ in their content from the previously-known canon, their materiality was particular. A large number of them were originally published as part of larger documents (such as serial publications, collected volumes or diaries) that, due to the lack of metadata in the library system, usually cannot be found with the traditional methods of the humanities as described above. Although we did not segment the documents into smaller entities (e.g. pages or chapters) for classification (Underwood et al., 2013), this shows that our methodology leads to robust results concerning documents that are only partially considered part of a genre, as in our case with travelogues. Additionally, we successfully proved that our methodology can expand traditional bibliographic research and help save time for domain experts. We have already described our classification task in detail (Rörden et al., 2020).

Our next steps concern the analysis and historical contextualization of the corpus. We are currently creating a searchable index of the entire document inventory (travelogues as well as non-travelogues) to enable exploratory searches. Key exploration scenarios include plotting the number of travelogues published over time, optionally while filtering by various facets (such as authors, publishers, keywords or catalogue subject classifications); and exploring salient terms that feature more prominently in travelogues over non-travelogues, or in travelogues of a particular period vs. in those of another. We have also begun to generate maps of travelogues’ geographical coverage by performing Named Entity Recognition, and resolving coordinates against the GeoNames gazetteer.

Furthermore, we have taken the first steps to analyze intertextual relations between documents (e.g. Dörr & Kurwinkel, 2014; Rajewsky, 2002), experimenting with a mix of approaches including n-gram fingerprinting (c.f. Stein, 2007) and paragraph vectors for document representation (Le & Mikolov, 2014). This has been combined with clustering and text passage alignment using the BLAST-based text reuse algorithm developed by Vierthaler and Gelein (2019). By applying the algorithm corpus-wide, we hope to learn more about the relationships between the documents and their authors, as well as how descriptions of and references to people, places and customs propagate through literature over extended time periods. Preliminary results seem promising, and have revealed what appear to be potential candidate cases of previously undocumented text-reuse. However, deeper analysis of the candidates and refinement of the methods are still ongoing. This method for the detection of intertextual relations is also a promising tool for clustering and relating works, not only based on literal title-strings but also on indexed full-texts, thus following the suggested implementation of the International Federation of Library Associations and Institutions Library Reference Model (IFLA LRM) into library catalogs (Decourselle et al., 2015; Rafferty 2015; Riva et al., 2017).

With our method, we were able to create the largest curated corpus of German-language travelogues to date—3,595 volumes, 345 of which were, to the best of our knowledge, not previously identified or findable as travelogues—thus proving that methods like ours can successfully expand the classic bibliographic methodology of the humanities and library sciences. We have made first steps towards enabling the interactive exploration of a number of relevant properties of the corpus. The next year will be dedicated to addressing the main research goals: the identification of intertextual relations between the travelogues in our corpus to deepen our understanding of how they depended on each other, and what this tells us about the circulation of knowledge, stereotypes and prejudices. This will ultimately lead to the question of how notions of otherness were depicted, how and why they changed over time, and what conclusions this allows concerning today’s perceptions and biases. We have already published large parts of our corpus under a Creative Commons license.[2] Beyond the goals of our own project, we feel that the open availability of the corpus marks a significant contribution to the research community at large, and will invite further scholarship and collaboration around this exciting resource.

Acknowledgments

The work in the Travelogues project (http://www.travelogues-project.info) is funded through an international project grant by the Austrian Science Fund (FWF, Austria: I 3795) and the German Research Foundation (DFG, Germany: 398697847).

References

Belgum, Kirsten et al. “Mapping Travel Writing: A Digital Humanities Project to Visualise Change in Nineteenth-Century Published Travel Texts.” Studies in Travel Writing, vol. 22, no. 3, July 2018, 306–24.

Chatzipanagioti-Sangmeister, Julia. Griechenland, Zypern, Balkan und Levante. Eine kommentierte Bibliographie der Reiseliteratur des 18. Jahrhunderts. 2 vols., Eutin: Lumpeter & Lasel 2006.

Decourselle Joffrey et al. “A Survey of FRBRization Techniques.” Kapidakis Sarantos et al. (eds.). Research and Advanced Technology for Digital Libraries. TPDL 2015. Cham: Springer International Publishing 2015, 185–196.

Dörr, Volker C., Tobias Kurwinkel, editors. Intertextualität, Intermedialität, Transmedialität. Zur Beziehung zwischen Literatur und anderen Medien. Würzburg: Königshausen & Neumann 2014.

Le, Quoc, Tomas Mikolov. “Distributed representations of sentences and documents.” Proceedings of the 31st International Conference on International Conference on Machine Learning (ICML’14), vol. 32, no. 2, 2014, 1188–96.

Rafferty, Pauline. “FRBR, Information, and Intertextuality.”Library Trends, vol. 63, no. 3, 2015, 487–511.

Rajewsky, Irina O. Intermedialität. Tübingen, Basel: Francke 2002.

Riva, Pat et al. IFLA Library Reference Model. A Conceptual Model for Bibliographic Information, edited by the International Federation of Library Associations and Institutions. Den Haag: IFLA 2017.

Rörden, Jan et al. “Identifying Historical Travelogues in Large Text Corpora Using Machine Learning.” Anneli Sundqvist et al. (eds.). Sustainable Digital Communities. 15th International Conference, iConference 2020. Boras, Sweden, March 23–26, 2020. Proceedings, Cham: Springer International Publishing 2020, 801–15.

Salzani, Carlo, Steven Tötösy de Zepetnek. "Bibliography for Work in Travel Studies.“ Library Series, CLCWeb: Comparative Literature and Culture, July 2010 (updated on 14.09.2016), https://docs.lib.purdue.edu/clcweblibrary/travelstudiesbibliography.

Stein, Benno. “Principles of hash-based text retrieval.” Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval 2007 (SIGIR ’07). New York: ACM 2007, 527–34.

Treue, Wolfgang. Abenteuer und Anerkennung. Reisende und Gereiste in Spätmittelalter und Frühneuzeit (1400–1700). Paderborn: Ferdinand Schöningh 2014.

Underwood, Ted et al. “Mapping Mutable Genres in Structurally Complex Volumes.” Proceedings of the 2013 IEEE International Conference on Big Data, December 2013, 95–103.

Vierthaler, Paul, Mees Gelein. “A BLAST-based, Language-agnostic Text Reuse Algorithm with a MARKUS Implementation and Sequence Alignment Optimized for Large Chinese Corpora.” Journal of Cultural Analytics, March 2019, DOI: 10.22148/16.034, Dataverse DOI: 10.7910/DVN/2YYJ2B.

Yerasimos, Stephane. Les voyageurs dans l’Empire Ottoman (XIVe–XVIe siècles). Bibliographie, itinéraires et inventaire des lieux habités. Ankara: Imprimerie de la Société Turque d’Histoire 1991.

[1] https://travelogues-project.info/.

[2] http://github.com/travelogues/travelogues-corpus.