Taming the Data: Web-Scraping and De-Duplicating Messy Multilingual Philosophy Corpora

1. Abstract

This poster presents a technical report and a method for corpus expansion in the humanities, applied to early modern philosophy, together with a case study on handling heavy data redundancy in several Latin, English, and French title corpora. It details the steps taken during the initial stages of a data-intensive research project that aims to look beyond the established writers and views in natural philosophy between 1600 and 1800. It also reflects on the collaboration between a humanist and a data scientist in web-scraping and in taming redundant multilingual data with Python.
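As a rough illustration of what taming redundant title data can involve, the sketch below collapses near-identical titles that differ only in case, diacritics, punctuation, or spacing. It is a minimal sketch under stated assumptions, not the project's actual pipeline: the function names `normalize_title` and `deduplicate`, the normalization rules, and the sample titles are all hypothetical and chosen only for demonstration.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Reduce a title to a comparison key (hypothetical normalization rules)."""
    # Decompose accented characters and drop the combining marks (é -> e).
    decomposed = unicodedata.normalize("NFKD", title)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-fold, replace punctuation with spaces, then collapse whitespace.
    lowered = stripped.casefold()
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    return " ".join(no_punct.split())


def deduplicate(titles):
    """Keep the first occurrence of each normalized title, preserving order."""
    seen = set()
    unique = []
    for title in titles:
        key = normalize_title(title)
        if key and key not in seen:
            seen.add(key)
            unique.append(title)
    return unique


# Illustrative sample only; not drawn from the project's corpora.
sample = [
    "Traité de la lumière",
    "TRAITE DE LA LUMIERE.",
    "Principia philosophiae",
    "Principia  Philosophiae",
    "An Essay Concerning Human Understanding",
]
unique_titles = deduplicate(sample)
```

Exact matching on a normalized key only catches surface variation. Spelling variants, abbreviations, and translated titles would need fuzzy or cross-lingual matching, which this sketch does not attempt.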

Raluca A. Tanasescu (r.a.tanasescu@rug.nl), University of Groningen, The Netherlands, and Cristian A. Marocico (c.a.marocico@rug.nl), University of Groningen, The Netherlands

