Legal Literacies for Text Data Mining

1. Abstract

Our presentation, “Legal Literacies for Text Data Mining,” will introduce digital humanities (DH) researchers and professionals to the core skills needed to navigate law, policy, ethics, and risk in DH text data mining (TDM) projects.

Computational text analysis and text data mining are mainstays of DH research, but scholars often shy away from building and openly sharing diverse and representative corpora due to uncertainty about copyright and licensing restrictions on the materials they use. While misunderstanding and confusion abound, there are few guidelines or training programs to help researchers pilot these waters. In “Legal Literacies for Text Data Mining,” we will highlight five essential legal literacies that TDM researchers can develop to confidently compile and publish text corpora: copyright, licensing, privacy, ethics and policy, and special use cases such as international collaborations. Our presentation will build on the research we have done in creating an NEH-funded, four-day Institute (https://buildinglltdm.org/) hosted at UC Berkeley in June 2020. The focus of our work is United States law, with an eye toward cross-boundary research.

Helping researchers and professionals build these literacies is essential for advancing knowledge in the humanities. In a recent study of the text analysis needs of humanities scholars (Green et al., 2016), participants noted that access to in-copyright texts was a “frequent obstacle” to their ability to select appropriate texts for data mining. The perception of legal obstacles does not just deter research; it biases research toward particular topics and sources of data. In response to content provider resistance, confusing license terms, and other perceived legal roadblocks, some researchers have gravitated toward low-friction research questions and corpora to avoid decisions about rights-protected data. Yet their concern about working with copyrighted materials may be unfounded, as courts have found TDM methodologies that make use of copyright-protected texts to be fair uses.

When researchers artificially limit their research scope to texts without access restrictions, research may be skewed to leave important questions unanswered, and the resulting TDM findings are less broadly applicable. A growing body of research also demonstrates how race, gender, and other biases found in openly available corpora have contributed to and exacerbated bias in the development of artificial intelligence tools (Barocas & Selbst, 2016; Larson et al., 2016; Levendowski, 2018). At the same time, DH scholars and professionals who exercise their fair use rights in conducting TDM research may inadvertently misinterpret the scope of those rights and other legal considerations that affect how they access, store, and disseminate rights-protected works. While fair use gives clear protection to the core activity of TDM analysis, copyright and other legal regimes may nevertheless limit the inputs and outputs of that activity in important ways. Undue caution leads to missed research opportunities, but undue confidence can lead to needless risk for the researcher and her institution.

We have examined the legal contours of text data mining and identified five legal literacies for researchers and professional staff who conduct or support TDM research. With an understanding of these literacies, researchers performing text data mining are better positioned to make legal and ethical decisions in building their corpora without fear. Currently, though, few trainings or resources integrate these legal literacies into DH TDM outreach and instruction. Moreover, our own experiences suggest that digital humanities scholars and professionals field many of the legal questions surrounding TDM only at moments of crisis (e.g., when university access to a database is suspended due to unlawful downloading). This places undue stress on researchers’ ability to conduct DH TDM research and may lead institutions to unduly restrict such research through policy.

To address this gap, we created a four-day, NEH-funded institute, Building Legal Literacies for Text Data Mining (Building LLTDM), hosted by UC Berkeley on June 23-26, 2020. The goals of the Institute are for participants to: (1) understand how law, policy, and risk management interact with digital humanities TDM projects; (2) integrate workflows for TDM research and professional support so they can confidently pursue valuable research; (3) practice sharing their new tools and knowledge with others through exercises based on authentic consultations; (4) prototype plans for more broadly disseminating their knowledge; and (5) develop communities of practice and coordinate, where practicable, cross-institutional outreach about the TDM legal landscape. To maximize impact, all instructional materials (including sample lesson plans and exercises) will be shared publicly as an open educational resource (OER) under a CC0 (Creative Commons Zero) waiver. The hands-on curriculum supports 32 participants (16 DH researchers and 16 professional DH support staff, including librarians) and will be taught by a combination of experienced legal scholars, librarians, and researchers, all of whom are immersed in these subject literacies and workflows.

Our DH2020 presentation will elucidate these five legal literacies with reference to common DH use cases. We will also reflect briefly on the institute model and its efficacy in helping DH scholars and professionals chart a course forward. Equipped with the legal literacies we have developed, text data mining researchers will be better positioned to make legal and ethical decisions in building their corpora with confidence.

The project team is composed of Rachael G. Samberg (PI, UC Berkeley), Timothy Vollmer (PM, UC Berkeley), Scott Althaus (University of Illinois), David Bamman (UC Berkeley), Brandon Butler (University of Virginia), Beth Cate (Indiana University Bloomington), Kyle K. Courtney (Harvard University), Sean Flynn (American University Washington College of Law), Maria Gould (California Digital Library), Cody Hennesy (University of Minnesota), Eleanor Dickson Koehl (University of Michigan), Thomas Padilla (University of Nevada Las Vegas), Stacy Reardon (UC Berkeley), Matthew Sag (Loyola University Chicago School of Law), Brianna L. Schofield (Authors Alliance), Megan Senseney (University of Arizona), and Glen Worthey (HathiTrust Research Center).

For more information about the Institute, see https://buildinglltdm.org.

Works Cited

Barocas, S., & Selbst, A. D. (2016). Big data’s disparate impact. California Law Review, 104: 671-730. Retrieved from http://www.californialawreview.org/wp-content/uploads/2016/06/2Barocas-Selbst.pdf

Green, H. E., Dickson, E. F., Nay, L. R., & Zegler-Poleska, E. (2016). Scholarly needs for text analysis resources: A user assessment study for the HathiTrust Research Center. Proceedings of the Charleston Library Conference. Charleston, SC. https://doi.org/10.5703/1288284316464

Larson, J., Angwin, J., & Parris, T. (2016, October 19). Breaking the black box: How machines learn to be racist. ProPublica. Retrieved from https://www.propublica.org/article/breaking-the-black-box-how-machines-learn-to-be-racist

Levendowski, A. (2018). How copyright can fix artificial intelligence’s implicit bias problem. Washington Law Review, 93(2): 579-630. Retrieved from http://digital.law.washington.edu/dspace-law/bitstream/handle/1773.1/1804/93WLR0579.pdf

Stacy Reardon (sreardon@berkeley.edu), UC Berkeley, United States of America, Rachael Samberg (rsamberg@berkeley.edu), UC Berkeley, United States of America and Glen Worthey (gworthey@illinois.edu), Stanford University, United States of America