Automatic Extraction of Poetry from Digitally Scanned Books

Abstract

We present an automatic, learned model for extracting poetry from digitally scanned books. This abstract highlights our recent work on poetry identification in Internet Archive books and the public resources (code, data, and models) that resulted from it. We hope this work marks the beginning of deeper and richer research into poetry in the digital humanities, because curating custom collections of poetry should now be less expensive.

Poetry in Digital Libraries

Digital libraries have expanded rapidly in both the quantity and quality of their content over the past decade. Out-of-copyright and public domain works are available from the invention of the printing press through the early twentieth century.

Unfortunately, this explosion in content has not reached all genres equally: large collections of poetry are not available because they are typically curated manually.

The intersection of poetry and digital methods is actually fairly common and has been studied in a diverse set of languages and cultures, e.g., Bangla (Rakshit et al., 2015), Arabic (Ahmed and Trausan-Matu, 2017), and Thai (Promrit and Waijanya, 2017). Features of poetry have also been studied using computational methods, e.g., meter (Hamidi et al., 2009), style (Baumann et al., 2018), authorship and time (Can et al., 2011), emotion (Alsharif et al., 2013; Barros et al., 2013; Kumar and Minz, 2014), and even content (Jamal et al., 2012; Choi et al., 2016; Lou et al., 2015; Kesarwani, 2018). Kaur and Saini’s (2017) recent work on classifying Punjabi poems into four categories is not a survey, but it does provide a table of recent work, the languages targeted, and the features discussed.

However, most of these works use small datasets (tens to hundreds of poems) because the cost of collecting and curating poetry is so high. A great deal of poetry is available in digital libraries, but it is effectively hidden inside books.

Automatic Extraction of Poetry

Underwood et al. (2013) present a study of genre in HathiTrust books, and one of their genres is poetry, which they extend to page-level labels in later work (Underwood, 2014). Other recent work uses image classification approaches (Lorang et al., 2015), focuses on Australian newspapers (Kilner and Fitch, 2017), or is language-specific and evaluated on a small collection (Tizhoosh et al., 2008).

These existing approaches cannot be cleanly applied to discover poems such as the one about “Sweet Peas” that our algorithm identified in the middle of a gardening guide (Figure 1).

Figure 1: A poem printed in the middle of a gardening guide (Rockwell et al., 1917). This is the kind of “hidden” poetry our algorithm was designed to target.

Drawing inspiration and ideas from these works, we formulated the poetry identification problem: does a given scanned book page contain poetry on it?

Using a few thousand labeled pages as training data and only language-independent features, we developed a new model for poetry identification. This model is both effective (F1 = 0.83) and efficient (500,000 books per hour on a single machine). It runs on DJVU-XML books from the Internet Archive.
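To make the idea of language-independent page features concrete, the following is a minimal sketch and not the released model. It assumes a page is available as an XML element containing LINE and WORD elements, a layout commonly seen in Internet Archive DjVu XML files. The sample page text, the function names (page_lines, page_features), and the five features are all hypothetical choices made for illustration. The features actually used by our model are described in the dissertation.

```python
import statistics
import unicodedata
import xml.etree.ElementTree as ET

# Invented sample page; the LINE/WORD nesting is an assumption about the input.
SAMPLE_PAGE = """
<OBJECT>
  <HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH>
    <LINE><WORD>The</WORD><WORD>garden</WORD><WORD>wakes</WORD><WORD>at</WORD><WORD>dawn,</WORD></LINE>
    <LINE><WORD>And</WORD><WORD>petals</WORD><WORD>greet</WORD><WORD>the</WORD><WORD>light;</WORD></LINE>
    <LINE><WORD>The</WORD><WORD>bees</WORD><WORD>begin</WORD><WORD>their</WORD><WORD>song,</WORD></LINE>
    <LINE><WORD>Then</WORD><WORD>rest</WORD><WORD>again</WORD><WORD>at</WORD><WORD>night.</WORD></LINE>
  </PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT>
</OBJECT>
"""


def page_lines(page_elem):
    """Return the page as a list of lines, each a list of non-empty word strings."""
    lines = []
    for line in page_elem.iter("LINE"):
        words = [w.text.strip() for w in line.iter("WORD") if w.text and w.text.strip()]
        if words:
            lines.append(words)
    return lines


def page_features(lines):
    """Compute a few illustrative features that use no vocabulary or dictionary."""
    if not lines:
        return {"num_lines": 0, "mean_words": 0.0, "stdev_words": 0.0,
                "frac_capitalized": 0.0, "frac_end_punct": 0.0}
    counts = [len(words) for words in lines]
    # Lines whose first character is uppercase (only meaningful for cased scripts).
    capitalized = sum(1 for words in lines if words[0][0].isupper())
    # Lines whose last character falls in a Unicode punctuation category.
    end_punct = sum(
        1 for words in lines
        if unicodedata.category(words[-1][-1]).startswith("P")
    )
    return {
        "num_lines": len(lines),
        "mean_words": statistics.fmean(counts),
        "stdev_words": statistics.pstdev(counts),
        "frac_capitalized": capitalized / len(lines),
        "frac_end_punct": end_punct / len(lines),
    }


root = ET.fromstring(SAMPLE_PAGE.strip())
features = page_features(page_lines(root))
print(features)
```

A feature dictionary like this could be passed to any standard classifier trained on labeled pages. Short, regular, capitalized lines are one plausible signal of verse, but which signals matter in practice is an empirical question.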

Public Resources, Code, & Open Data

We released a variety of public resources. There is a dataset for our identification task as well as a JSON-formatted collection of 600,000 pages identified as containing poetry, drawn from a random selection of 50,000 books. Our model is available, and our methodology is described in more detail in my dissertation (Foley, 2019).


Ahmed, M. A. and Trausan-Matu, S. (2017). Using natural language processing for analyzing Arabic poetry rhythm. In Networking in Education and Research (RoEduNet), 2017 16th RoEduNet Conference, pages 1–5. IEEE.

Alsharif, O., Alshamaa, D., and Ghneim, N. (2013). Emotion classification in Arabic poetry using machine learning. International Journal of Computer Applications, 65(16).

Barros, L., Rodriguez, P., and Ortigosa, A. (2013). Automatic classification of literature pieces by emotion detection: A study on Quevedo’s poetry. In Affective Computing and Intelligent Interaction (ACII), 2013 Humaine Association Conference on, pages 141–146. IEEE.

Baumann, T., Hussein, H., and Meyer-Sickendiek, B. (2018). Style detection for free verse poetry from text and speech. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1929–1940.

Can, E. F., Can, F., Duygulu, P., and Kalpakli, M. (2011). Automatic categorization of Ottoman literary texts by poet and time period. In Computer and Information Sciences II, pages 51–57. Springer.

Choi, K., Lee, J. H., Hu, X., and Downie, J. S. (2016). Music subject classification based on lyrics and user interpretations. Proceedings of the Association for Information Science and Technology, 53(1):1–10.

Foley, J. (2019). Poetry: Identification, Entity Recognition, and Retrieval. PhD thesis, University of Massachusetts.

Hamidi, S., Razzazi, F., and Ghaemmaghami, M. P. (2009). Automatic meter classification in Persian poetries using support vector machines. In Signal Processing and Information Technology, 2009 IEEE International Symposium on, pages 563–567. IEEE.

Jamal, N., Mohd, M., and Noah, S. A. (2012). Poetry classification using support vector machines. Journal of Computer Science, 8(9):1441.

Kaur, J. and Saini, J. R. (2017). Punjabi poetry classification: The test of 10 machine learning algorithms. In Proceedings of the 9th International Conference on Machine Learning and Computing, pages 1–5. ACM.

Kesarwani, V. (2018). Automatic Poetry Classification Using Natural Language Processing. PhD thesis, Université d’Ottawa/University of Ottawa.

Kilner, K. and Fitch, K. (2017). Searching for My Lady’s Bonnet: discovering poetry in the National Library of Australia’s newspapers database. Digital Scholarship in the Humanities.

Kumar, V. and Minz, S. (2014). Multi-view ensemble learning for poem data classification using SentiWordNet. In Advanced Computing, Networking and Informatics-Volume 1, pages 57–66. Springer.

Lorang, E. M., Soh, L.-K., Datla, M. V., and Kulwicki, S. (2015). Developing an image-based classifier for detecting poetic content in historic newspaper collections. Technical report, University of Nebraska - Lincoln.

Lou, A., Inkpen, D., and Tanasescu, C. (2015). Multilabel subject-based classification of poetry. In The Twenty-Eighth International FLAIRS Conference.

Promrit, N. and Waijanya, S. (2017). Convolutional neural networks for Thai poem classification. In International Symposium on Neural Networks, pages 449–456. Springer.

Rakshit, G., Ghosh, A., Bhattacharyya, P., and Haffari, G. (2015). Automated Analysis of Bangla Poetry for Classification and Poet Identification. In Proceedings of the 12th International Conference on Natural Language Processing, pages 247–253.

Rockwell, F., Loveless, A., and Hottes, A. (1917). Garden Guide: The Amateur Gardener’s Handbook. Internet Archive:

Tizhoosh, H. R., Sahba, F., and Dara, R. (2008). Poetic features for poem recognition: A comparative study. Journal of Pattern Recognition Research, 3(1):24–39.

Underwood, T. (2014). Understanding Genre in a Collection of a Million Volumes. Technical report, University of Illinois, Urbana-Champaign.

Underwood, T., Black, M. L., Auvil, L., and Capitanu, B. (2013). Mapping mutable genres in structurally complex volumes. In IEEE Big Data, pages 95–103.

John Foley, Smith College, United States of America