Mapping Topic Evolution Across Poetic Traditions

Abstract

Poetic traditions across languages evolved differently, but we find that certain semantic topics occur in several of them, albeit sometimes with temporal delay or with diverging trajectories over time. We apply Latent Dirichlet Allocation (LDA) to poetry corpora of four languages: German (74k poems), English (85k poems), Russian (18k poems), and Czech (80k poems). We manually align and interpret salient topics and their trends over time (1600–1925 CE), showing similarities and disparities across poetic traditions for a few select topics, and use their trajectories to pinpoint specific literary periods, with a focus on Romanticism and Modernism.

1 Corpora & Model

To determine the evolution of topics across poetic traditions, we collect four poetry corpora in Czech, Russian, German, and English. See Table 1 for an overview of their sizes and sources. Our corpora cover a wide range of genres and authors, but they are mildly contaminated with foreign-language poems.

To learn semantic topics, Latent Dirichlet Allocation (LDA) (Blei et al., 2003) has proved useful. We use the vanilla LdaMulticore implementation as provided in gensim (Rehurek and Sojka, 2011; https://radimrehurek.com/gensim/models/ldamulticore.html). LDA assumes that a particular document contains a mixture of a few salient topics, each consisting of semantically related words.

We transform our documents to a bag-of-words representation, set the number of topics to 100, and train for 100 epochs (passes) to attain a reasonable distinctness of topics. We choose 100 topics, as previous research on poetic topics (Haider, 2019; Navarro-Colorado, 2018) determined this parameter to be optimal for distant reading. Since we deal with highly inflected languages (Czech, Russian), lemmas were used instead of word forms. For lemmatization and POS-tagging of English and German texts we use TreeTagger (Schmid, 1994), for lemmatization and POS-tagging of Czech texts we use MorphoDiTa (Straková et al., 2014), and for lemmatization of Russian texts we use MyStem (Segalovich, 2003). In Czech, German, and English, all parts of speech (POS) except for nouns, adjectives, and verbs were filtered out. In Russian, stopwords were filtered using the list provided by the NLTK library, which we manually extended.
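The preprocessing and training setup described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the code used for the experiments: the function names `filter_by_pos` and `train_lda`, the coarse tag labels (`NOUN`, `ADJ`, `VERB`), and the `workers` value are hypothetical, and the actual tagsets of TreeTagger and MorphoDiTa would first need to be mapped to such labels. Only the settings `num_topics=100` and `passes=100` are taken from the text.

```python
# Hypothetical coarse POS labels; real tagger output must be mapped to these.
CONTENT_POS = {"NOUN", "ADJ", "VERB"}


def filter_by_pos(tagged_poem, keep=CONTENT_POS):
    """Keep the lemmas of content words from a list of (lemma, pos) pairs."""
    return [lemma.lower() for lemma, pos in tagged_poem if pos in keep]


def train_lda(lemmatized_poems, num_topics=100, passes=100, workers=2):
    """Train LDA on lists of lemmas; return (model, dictionary, corpus)."""
    # Imported here so that filter_by_pos stays usable without gensim.
    from gensim.corpora import Dictionary
    from gensim.models import LdaMulticore

    # Map each lemma to an integer id, then count lemmas per poem.
    dictionary = Dictionary(lemmatized_poems)
    corpus = [dictionary.doc2bow(poem) for poem in lemmatized_poems]
    model = LdaMulticore(
        corpus,
        id2word=dictionary,
        num_topics=num_topics,
        passes=passes,
        workers=workers,  # assumption: number of worker processes
    )
    return model, dictionary, corpus


# Toy input: one "poem" as (lemma, pos) pairs.
tagged = [("the", "DET"), ("dark", "ADJ"), ("sea", "NOUN"), ("sleep", "VERB")]
lemmas = filter_by_pos(tagged)
```

Calling `train_lda` on the filtered lemma lists of one language's corpus would yield one model for that language; since LDA training is stochastic, the resulting topics vary between runs unless a random seed is fixed.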

2 Experiment Setup

We approach diachronic variation in poetry as a distant reading task to visualize the development of interpretable topics over time and across languages. We retrieve the most important (most probable) words for each topic and interpret these sorted word lists as aggregated topics. We are then able to manually translate several topics that align across all four corpora.
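Retrieving the most probable words of a topic amounts to sorting its word distribution. The following is a minimal sketch with toy numbers; the function name `top_words` and the example probabilities are illustrative assumptions. With a trained gensim model, the (word, probability) pairs returned by `show_topic` could be converted to such a dictionary.

```python
def top_words(topic_word_probs, n=10):
    """Return the n most probable words of a topic, most probable first."""
    ranked = sorted(topic_word_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:n]]


# Toy topic: word -> probability under the topic (invented values).
toy_topic = {"sea": 0.12, "wave": 0.09, "ship": 0.07, "bread": 0.001}
label_words = top_words(toy_topic, n=3)
```

The resulting short word list is what a human reader would then interpret and label, for instance as "Sea".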

To discover trends over time, we bin our documents into time slots of 25 years each, except for early English, where two large slots (1600–1674 and 1675–1749) were used due to sparse data. See Figures 1 and 2 for a plot of the number of documents per bin. To visualize trends of individual topics over time, we follow the strategy of Haider (2019): we aggregate all documents d in slot s, sum the probabilities of topic t given d, and divide by the number of documents d in s. This gives us the average probability of a topic per time slot. We then plot the trajectory of each topic.
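The averaging step described above can be sketched in a few lines. This is an illustrative sketch, not the original implementation: the names `time_slot` and `average_topic_per_slot`, the input layout (a year plus a mapping from topic id to probability per document), and the toy values are assumptions. It uses uniform 25-year slots only and does not reproduce the two wider slots used for early English.

```python
from collections import defaultdict


def time_slot(year, start=1600, width=25):
    """Return the inclusive (first_year, last_year) slot containing year."""
    first = start + ((year - start) // width) * width
    return (first, first + width - 1)


def average_topic_per_slot(docs, topic_id):
    """Average P(topic | document) over all documents in each time slot.

    docs: iterable of (year, topic_probs), where topic_probs maps a
    topic id to the probability of that topic given the document.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for year, topic_probs in docs:
        slot = time_slot(year)
        # A topic missing from the mapping counts as probability 0.
        sums[slot] += topic_probs.get(topic_id, 0.0)
        counts[slot] += 1
    return {slot: sums[slot] / counts[slot] for slot in sorted(counts)}


# Toy documents: (year, {topic id: probability}); values are invented.
toy_docs = [
    (1801, {7: 0.25}),
    (1810, {7: 0.75}),
    (1830, {7: 0.5, 3: 0.5}),
    (1849, {3: 1.0}),
]
trajectory = average_topic_per_slot(toy_docs, topic_id=7)
```

Plotting the values of `trajectory` against the slots gives one curve per topic and language. Note that documents in which the topic does not occur still count towards the denominator, so the measure is sensitive to how many documents fall into each slot.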

3 Alignment and Interpretation of Topic Trajectories

Based on a few selected topics, we can trace similarities and disparities over poetic traditions. See Figures 3–8 for a selection of interpretable topic trends where the four languages align.

Figure 3 shows the topic "Nation", which has a similar trend in German, Czech, and Russian, but is not present in the English corpus (cf. the completely different geopolitical situation of the British Empire). In the German corpus it emerges in the second half of the 18th century and peaks around 1825 to 1850 (outlining the period of ‘Vormärz’). The same peak can be found in the Czech corpus (late National Revival), and slightly delayed in Russian. It loses importance in all three corpora after 1850/60, but gains traction once again at the beginning of the 20th century.

Figure 4 shows the topic "Sea", which similarly rises towards the second half of the 19th century and then stays stable into Modernism. In Russian and German, this topic is most pronounced around the period of Romanticism, after which it appears to taper off, whereas in English and Czech it still follows an upward trajectory into the 20th century.

The topic "Sleep" (Figure 5) appears quite correlated with the topic "Sea" in English, German, and Russian. While it is basically non-existent in the Early Modern Age (Baroque, Renaissance), it becomes increasingly popular toward late Romanticism and then Modernism, broadly delineating the long 19th century. Yet, it is rather marginal in the Czech corpus.

Figure 6 shows the topic "Sorrow", illustrating distinctly separate trends, with English and German on the one hand and Czech and Russian on the other. For the first language pair it is associated with the period of Romanticism (although it becomes prominent much earlier in English). For the second language pair, we find this topic in late 19th century Modernism (although in Russian it already emerges in the period of Romanticism, around 1825 to 1850).

Figure 7 shows the topic "Stars", which is pronounced in English and German High Romanticism (1800 to 1825) and in Russian Late Romanticism (1825 to 1850). In Czech the peak occurs later, in the generation of "Máj" (period 1850 to 1875). Note that these authors declared themselves followers of Karel Hynek Mácha (1810–1836), who in turn is well known for bringing themes of English Romanticism into Czech poetry.

Lastly, Figure 8 shows the topic "Wine", which is clearly associated with the Anacreontics. This topic is already quite visible before the onset of Romanticism, with accents in early 18th century English poetry, in German poetry of the second half of the 18th century and later (High Romanticism), and in late 18th century Czech poetry (almanacs edited by A. J. Puchmajer). In Russian poetry it surprisingly peaks in the period of Romanticism (1825 to 1850).

4 Conclusion & Future Work

In this paper we used Latent Dirichlet Allocation for a visualization of topic trends across languages, illustrating the similarities and disparities between different poetic traditions. Our method is largely based on reading and translating topic distributions and finally interpreting the trajectories of relative topic importance against the backdrop of literary history. We find that some topics, especially the selected examples, do align across languages, sometimes with temporal delay (as they were picked up later in another language), while other topics were not as heavily discussed in other poetic discourses (such as "Nation" in English). In future work, we intend to look into cross-lingual alignment methods, e.g., through multi-lingual word embeddings or poly-lingual topic models without the need for parallel data. Finally, the over- or underrepresentation of certain authors or the presence of (near-)duplicates of poems (from different editions) can lead to corpus imbalance. This in turn affects our measure of the relative importance of a topic in a given time slot and should be addressed in future work.

References

David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

Thomas N Haider. 2019. Diachronic topics in New High German poetry. In Proceedings of the International Digital Humanities Conference DH2019 in Utrecht.

Borja Navarro-Colorado. 2018. On poetic topic modeling: extracting themes and motifs from a corpus of Spanish poetry. Frontiers in Digital Humanities, 5:15.

Radim Rehurek and Petr Sojka. 2011. Gensim - statistical semantics in Python.

Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of International Conference on New Methods in Language Processing, Manchester, UK.

Ilya Segalovich. 2003. A fast morphological algorithm with unknown word guessing induced by a dictionary for a web search engine. In MLMTA.

Jana Straková, Milan Straka, and Jan Hajič. 2014. Open-source tools for morphology, lemmatization, POS tagging and named entity recognition. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 13–18, Baltimore, Maryland, June. Association for Computational Linguistics.