Serving the city: an automatic information extraction for mapping Amsterdam nightlife (1820-1940)

1. Abstract

This paper discusses how automatic information extraction and linked data can facilitate long-term socio-spatial analyses in urban history, using as a case study a long-term spatial analysis of the development of urban nightlife and entertainment industry in Amsterdam. We demonstrate how automatic information extraction of civil and trade registries can corroborate and validate information found in these sources, using duplicate information from the sources themselves as well as other published corpora [1] that contain overlapping information. We present a technique that greatly reduces the need for manual revision of the data. Such a technique is dearly necessary - often research projects resort to hiring people to manually go through all the data or resort to crowdsourcing, which is expensive, slow, and has the problem that the data has to be looked at by multiple non-experts in order to find ‘likely correct’ data (see e.g. Zheng et al. 2017). Previous efforts to do such a thing have focused on cleaning OCR results by using contextual probabilities or word frequency probabilities (e.g. Reynaert 2014; Mei et al. 2018; Jatowt and Nguyen 2018; Doush, Alkhateeb and Gharaibeh 2018), at times combined with semantic data (e.g. Woodfield et al 2018), which works on texts that are representative of written vocabulary but is not very apt for lists of names and places. Our technique combines and compares several archival sources that include similar information on, for instance, person and place names. Thereafter, a probability estimation disambiguates between one or multiple occurences of references to what is most likely the same entity. To keep track of and to document this process we model and publish the data in the new roar ontology, [2] which allows us to model archival data, while keeping the provenance chain of decision making and entity disambiguation.

Informed by recent developments in the field of digital history as well as sociological literature on urban nightlife, this paper applies these techniques to address the underrepresentation of small, short-lived, and informal cultural venues in quantitative studies on cultural life. Even though the nineteenth and twentieth century witnessed the rise of nightlife industries as we know them today (Baldwin 2012; Schlör 1998; Nasaw 1999, Erenberg 1981), the organization of small entertainment venues and their effects on the production and consumption of urban cultural life have received little systematic analysis beyond case studies on individuals and single venues, especially in the Dutch context. This is partly due to the large number of yet undigitized archival periodicals such as address books, almanacs and program listings. However, the advent of new methods for digitizing and analyzing documents allowed us to develop a new and less labour intensive technique for uncovering and structuring these valuable sources.

Figure 1 - Address book example (1824). Usual structure is: [name] [(initials/prefix)] [street name] [house number] [occupation] [either telephone number or neighbourhood number]

In this paper we help to redress this issue of underrepresentation by using automatic information extraction and entity disambiguation to systematically trace and analyse the spatial and diachronic distribution of clubs, restaurants and bars in relation to more formal cultural venues such as theatres and concert halls, as well as the rise of new entertainment venues such as cinemas. To reconstruct this urban pleasurescape, we contrast established historiographical narratives with a diverse set of historical sources. We reconstruct the public sphere of food and drink consumption by digitizing address books, the ‘yellow pages’ of those times (see Figure 1 for an excerpt of such a book), that listed most businesses in Amsterdam, their locations, and proprietors’ names. The data in the address books is semi-structured and rather consistent over time, whereby it was possible to programmatically extract information. We extracted all persons that had occupations in the food and drink service industry such as ‘koffijhuishouders’ [coffee shop holders] and ‘tappers’ [tavern keepers], but also theatres, cinemas and music venues. Once digitized, an automated comparison of these books significantly reduces the need for manual correction of OCR errors, spelling variation, other artifacts and errors. [3] Other sources, such as the Amsterdam Citizen Registry, are used to validate the automatic extraction from the address books.

We validate the method by gathering as many sources with overlapping information as possible, normalizing its contents slightly (e.g. orthography of occupations and street names) and scoring each bit of information to deduce which data is likely correct and which is not by taking into account source reliability, temporal and geographic distance, and number of sources giving the same and dissenting information. This also provides an order of probability of necessity of manual revision that makes it possible to address the problems in an efficient manner; the data is most quickly improved correcting the certainly wrong information first, and leaving the possibly wrong for later. For address books of subsequent years alone, the match rate is around 50%, which means that half the data does not need checking. For larger gaps between subsequent address books the match rate drops, since the information in the books differs increasingly over time. Adding a second address book from the year preceding the earliest increases the number to 60%. After linking the street names in the books with georeferenced data on Adamlink, we could plot their locations and see the changing patterns over time, demonstrating how patterns of urban expansion impacted on the organisation of urban nightlife in the city of Amsterdam between 1820 and 1940. Finally we fit and publish the data in the roar model, thereby creating transparent linked data reconstructions of historic persona and businesses for others to use.

This systematic digital historical approach has, moreover, the added value of making source bias explicit, as the comparisons show which information present in one is not present in another. The bottom-up approach reveals significant gaps in our current knowledge of urban nightlife. It also shows the value of using multiple sources that overlap in the information they carry in reducing labour and increasing accuracy, and introduces roar as data model for storing this type of information.

[1] See primary sources.

[2] https://w3id.org/roar.

[3] We built on our experiences in the DIGIFIL project, where we developed other techniques to extract data from Dutch film listings over a 50-year period (Kisjes et al. 2019).

Bibliography

Primary sources

Amsterdam City Archive, collection 30274: Adresboeken [address books], inventory 1-96 (coverage 1821-1939). https://archief.amsterdam/inventarissen/overzicht/30274.nl.html
Amsterdam City Archive, collection 30398: Koopmansboekjes [merchant books], inventory 32-37 (coverage 1821-1838). https://archief.amsterdam/inventarissen/overzicht/30398.nl.html
Amsterdam City Archive, Index of the Bevolkingsregisters [citizen registries] (coverage 1851-1893). https://archief.amsterdam/inventarissen/overzicht/5000.nl.html
HisGis, Fryske Akademy, https://hisgis.nl/projecten/amsterdam/

Datasets, tools and ontologies

Adamlink - https://adamlink.nl/
roar, Reconstructions and Observations in Archival Resources - https://w3id.org/roar
Tesseract OCR - https://github.com/tesseract-ocr/tesseract

Secondary sources

Baldwin, P. C. (2012). In the Watches of the Night: life in the nocturnal city, 1820-1930. University of Chicago Press.
Doush, I. A., Alkhateeb, F., & Gharaibeh, A. H. (2018). A novel Arabic OCR post-processing using rule-based and word context techniques. International Journal on Document Analysis and Recognition (IJDAR), 21(1-2), 77-89.
Erenberg, L. A. (1981). Steppin'Out: New York Nightlife and the Transformation of American Culture. University of Chicago Press.
Coustaty, M., Doucet, A., Jatowt, A., & Nguyen, N. V. (2018, November). Adaptive Edit-Distance and Regression Approach for Post-OCR Text Correction. In International Conference on Asian Digital Libraries (pp. 278-289). Springer, Cham.
Kisjes, Ivan, K. Beelen, T. van Oort, K. Lotze: Mining the Movie Landscape. Extracting Film Listings from Digital Newspapers. September 2019. Presented at DHBenelux 2019, 11-13 September. Liège, Belgium.
Mei, J., Islam, A., Moh’d, A., Wu, Y., & Milios, E. (2018). Statistical learning for OCR error correction. Information Processing & Management, 54(6), 874-887.
Nasaw, D. (1999). Going out: The rise and fall of public amusements. Harvard University Press.
Reynaert, M. (2014, August). TICCLops: Text-Induced Corpus Clean-up as online processing system. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations (pp. 52-56).
Schlör, J. (1998). Nights in the big city: Paris, Berlin, London 1840-1930. Reaktion Books.
Zheng, Y., Li, G., Li, Y., Shan, C., & Cheng, R. (2017). Truth inference in crowdsourcing: Is the problem solved?. Proceedings of the VLDB Endowment, 10(5), 541-552.
Woodfield, S. N., Seeger, S., Litster, S., Liddle, S. W., Grace, B., & Embley, D. W. (2018, October). Ontological Deep Data Cleaning. In International Conference on Conceptual Modeling (pp. 100-108). Springer, Cham.