<title type="main">Taming the Data

<title type="main">Taming the Data: Web-Scraping and De-Duplicating Messy Multilingual Philosophy Corpora

Raluca A. Tanasescu, University of Groningen, The Netherlands, r.a.tanasescu@rug.nl
Cristian A. Marocico, University of Groningen, The Netherlands, c.a.marocico@rug.nl

43921



Poster. Keywords: web crawling; text analysis; data cleaning; deduplication; corpus expansion; Europe; English; 15th-17th Century; 18th Century; information retrieval and querying algorithms and methods; text mining and analysis; Humanities computing; Philosophy

This poster presents a technical report and a method for corpus expansion in the humanities, applied to early modern philosophy, together with a case study of handling heavy data redundancy in several corpora of Latin, English, and French titles. It describes the steps taken during the initial stages of a data-intensive research project that aims to go beyond the established writers and views in natural philosophy between 1600 and 1800, and it reflects on the collaboration between a humanist and a data scientist on web scraping and on taming redundant multilingual data in Python.
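To make the idea of title de-duplication concrete, the following is a minimal Python sketch, not the project's actual pipeline. It assumes that redundant records differ only in case, diacritics, punctuation, or spacing; the function names `normalize_title` and `deduplicate`, the normalization rules, and the sample titles are all illustrative assumptions.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Reduce a title to a comparison key (hypothetical normalization rules)."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", title)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Casefold, replace punctuation with spaces, then collapse whitespace.
    lowered = stripped.casefold()
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    return " ".join(no_punct.split())


def deduplicate(titles):
    """Keep the first occurrence of each normalized title, preserving order."""
    seen = set()
    unique = []
    for title in titles:
        key = normalize_title(title)
        if key and key not in seen:
            seen.add(key)
            unique.append(title)
    return unique


# Illustrative sample only; not data from the project.
sample = [
    "Principes de la philosophie",
    "PRINCIPES DE LA PHILOSOPHIE.",
    "Traité de l'homme",
    "Traite de l’homme",
    "Opticks",
]
unique_titles = deduplicate(sample)
```

Exact matching on a normalized key is the simplest possible choice. Spelling variants, abbreviated titles, and translations of the same work would need fuzzy or cross-lingual matching, which this sketch does not attempt.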