Linking Time, Space, and Statements in One GIS System: A Use Case of Studying Individuals' Biographies

1. Abstract

Linking Time, Space, and Statements in One GIS System: A Use
Case of Studying Individuals’ Biographies

Yi-Fan Peng, Pi-Ling Pai and Chao-Lin Liu[1]

Introduction

Digital humanities research methods have been applied to many research fields in recent years. The content contained in text corpora can be roughly summarized into five Ws[2], especially for biographies. In this research, we focus on spatial analysis and natural language processing (NLP) techniques to find two elements: “Where (Location)” and “When (Time)” in the biographical data. By linking and showing the extracted information via a GIS system, we built a spatiotemporal system for scholars to study historical issues from different perspectives.

Research material and methods

Mr. Chiang Ching-Kuo was a former president of the Republic of China. During his tenure, Taiwan enjoyed a long period of economic growth. The corpus we use is the Chronological History of Chiang Ching-Kuo (the Chronicle, henceforth) which was compiled by the Chiang Ching-Kuo Foundation.[3] The Chronicle records Chiang’s life year by year, and the total number of characters exceeds 600,000. The numbers of characters for different periods are shown in Table 1.

Table 1. Statistics for each volume

Figure 1 shows the three main steps of data processing, each of which will be explained below.

Figure 1. Main steps in data processing

Data preprocess

We converted source texts that are stored as the MS Word files (Figure 2) into pure text files, and we extracted information about the daily activities and their time labels simultaneously.

Figure 2. A page of the original material

Data extraction

With the NLP techniques, we can capture the related information for future use. We employ the CKIP[4] tools to segment words and extract the part-of-speech (POS) information. In the corpus, the POS label for the placenames is Nc[5], which is one of the important elements in spatial information. Figure 3 shows the statistics about the quantity of placenames of different lengths.

Figure 3. Length distribution of Nc words

Integration process

After extracting the placenames, we geocoded them. Since the Chronicle records events after 1910CE, the placenames are modern names. Therefore, we may use the Google Maps API[6] for geocoding. By further combining the temporal information, the placenames can be shown on a temporal scale.

Spatiotemporal System

Next, we construct a WebGIS system that links statements in the Chronicle that mentioned the placenames in the GIS subsystem. To visualize the relevance of the corpus to spatial information, the system will show the corresponding original statements in Figure 4 when scholars click on the icons for the placenames in Figure 5.

Figure 4. Original statements in the Chronicle for a chosen placename will be shown for reading

Figure 5. Placenames shown in the WebGIS system and can be clicked to query

Figure 6 shows an example, where the relevant statements were displayed on the left side of the interface. The map shows the spatial locations of the placenames. The time described by the text content is indicated in the timeline at the bottom of the system.

Figure 6. An integration of time, place, and statements

Also, we provide an analysis of the frequency-time relation of placenames. It shows the year in which the placename appeared in the corpus and the frequencies of their appearances. In general, the more frequent a placename is mentioned in a particular period of the text, the more attention it should be paid to the place. Next, we use a practical example to illustrate how to analyze the text content through the system we have built.

Spatial information aspect

Figure 7. Placenames near the Central Cross-island Highway

Temporal information aspect

By analyzing the frequency and time distribution of the placenames, shown in Figure 8. We may find that the peak frequency of the co-occurrence of these placenames falls between 1956 and 1962.

Figure 8. The number of occurrences of placename and its years. (a) is for the placename “東西橫貫公路”. (b) is for the placename “天祥”. (c) is for the placename “梨山”

Corpus aspect

With the spatial and temporal information derived from the previous analysis, we can find out the time period for the road construction and its effects. It can be seen from the text that the period from 1956 to 1962 was the time for the construction of the Central Cross-island Highway. The Tianxiang Scenic Area was established immediately, and the farms in Lishan were gradually developed. With the establishment of traffic routes, these locations along the highway became frequent visits by Mr. Chiang. As such, the researchers can read the statements in the Chronicle for specific times to learn about the relevant events.

The aforementioned example shows that our system allows researchers to conduct a study by first identifying the relevant placesnames via our WebGIS interface. Our system links the placenames to the statements in the Chronicle. Since we have associated the statements with time labels, we could analyze the temporal distributions of the placenames. Naturally, researchers can read the statements on our WebGIS system on a temporal scale as well.

Conclusion

The two elements of time and space play a very important role in the study of biographical data. Through NLP techniques and spatial information methods, we can extract important information from the corpora and build a spatiotemporal system. In this proposal, we illustrate an example of finding the time span for a specific event by analyzing the temporal frequencies of the placenames in the corpus. Researchers can discover the hidden information among the statements via our system. We believe that showing placenames on a GIS interface and relevant statements about the placenames on a time-related scale can provide helpful hints to complex research work.

[1] Yi-Fan Peng is a doctoral student of National Chengchi University, and a project manager at the Academia Sinica in Taiwan. He can be reached at yfpeng@gate.sinica.edu.tw. Pi-Ling Pai is a post-doctoral researcher at the Academia Sinica in Taiwan, and she can be reached at lingpai@gate.sinica.edu.tw. Chao-Lin Liu is a professor at the National Chengchi University, and can be reached at chaolin@g.nccu.edu.tw.

[2] Five W's means Who, Why, What, Where, and When.

[3] Information about the Chiang Ching-Kuo Foundation is available at http://www.cckf.org/en/.

[4] Ma, W. and Chen, K. (2003). Introduction to CKIP Chinese word segmentation system for the first international Chinese Word Segmentation Bakeoff, In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, 17:168-171. DOI: https://doi.org/10.3115/1119250.1119276

[5] Huang, C. and Chen, K. (1998). Sinica Corpus. DOI:https://doi.org/10.13140/RG.2.1.2357.2963

[6] Google. Developer Guide | Geocoding API [cited 2019 October 12]. Available from: https://developers.google.com/maps/documentation/geocoding/intro.