Crowdsourcing Training Data: Efficacy and Ethics

1. Abstract

Paid crowdsourcing is a useful tool for building large-scale datasets of many kinds, including for humanities and cultural heritage work, but the use of crowd labour is not without logistical and ethical challenges. This paper summarizes how the Visibility of Knowledge (VOK) Project is using Mechanical Turk to develop training data and relates our best practices to broader challenges in the ethics and efficacy of paid crowdsourcing. The experience of the VOK team suggests that devising itinerant communication tactics is necessary for any digital research project that wishes to use paid crowdsourcing of training data in a manner that is both effective and ethical. Unfortunately, the nature of crowdsourcing work and the design of paid platforms are such that the ethics of using crowd-labeled training data will almost certainly remain fraught, even as the need for large training datasets increases in many knowledge fields.