<?xml version="1.0" encoding="UTF-8"?><TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader><fileDesc><titleStmt><title type="full"><title type="main">Taming the Data</title><title type="sub">Web-Scraping and De-Duplicating Messy Multilingual Philosophy Corpora</title></title></titleStmt><author><persName><surname>Tanasescu</surname><forename>Raluca A.</forename></persName><affiliation>University of Groningen, Netherlands, The</affiliation><email>r.a.tanasescu@rug.nl</email></author><author><persName><surname>Marocico</surname><forename>Cristian A.</forename></persName><affiliation>University of Groningen, Netherlands, The</affiliation><email>c.a.marocico@rug.nl</email></author><editionStmt><edition><date>43921</date></edition></editionStmt><publicationStmt><publisher>Name, Institution</publisher><address><addrLine>Street</addrLine><addrLine>City</addrLine><addrLine>Country</addrLine><addrLine>Name</addrLine></address></publicationStmt><sourceDesc><p>Converted from an OASIS Open Document</p></sourceDesc></fileDesc><encodingDesc><appInfo><application ident="DHCONVALIDATOR" version="1.22"><label>DHConvalidator</label></application></appInfo></encodingDesc><profileDesc><textClass><keywords scheme="ConfTool" n="category"><term>Paper</term></keywords><keywords scheme="ConfTool" n="subcategory"><term>Poster</term></keywords><keywords scheme="ConfTool" n="keywords"><term>web crawling</term><term>text analysis</term><term>data cleaning</term><term>deduplication</term><term>corpus expansion</term></keywords><keywords scheme="ConfTool" n="topics"><term>Europe</term><term>English</term><term>15th-17th Century</term><term>18th Century</term><term>information retrieval and querying algorithms and methods</term><term>text mining and analysis</term><term>Humanities computing</term><term>Philosophy</term></keywords></textClass></profileDesc></teiHeader><text><body><p>This poster presents a technical report and a method for corpus expansion in the humanities, with an application to 
early modern philosophy, alongside a case study of handling heavy data redundancy in several Latin, English, and French title corpora. It details the steps taken during the initial stages of a data-intensive research project that aims to go beyond established writers and views in natural philosophy between 1600 and 1800, and it reflects on the collaboration between a humanist and a data scientist on web-scraping and on taming redundant multilingual data in Python.</p></body></text></TEI>