Project Twitter Literature (TwitLit) seeks to address a growing gap in the literary-historical record[1] by establishing a consistent, rigorous, and ethical method for scraping and cleaning Twitter data for use by humanities scholars. In particular, my project explores the growing community of amateur writers who use Twitter as a means of publication and dissemination for their literary output. The project has three parts: the research findings related to the global literary community on Twitter, the tools and resources developed as part of the project and made openly available to other scholars, and a partnership with a university library to ensure the long-term preservation of the collected data.
The data that I have collected show that social media is altering literary practice by providing a space where amateur writers can publish, disseminate, and receive feedback from a global community of writers. Preliminary figures put the number of active Anglophone writers using Twitter as a publication platform for their literary output at over one million per year since 2015, and writers working in languages other than English raise these numbers still higher. This practice is changing how literature is produced, published, and shared. Readerships, too, are changing: rather than depending on print subscriptions or access to physical books, audiences for social media literature form around online communities and are tied instead to the costs of devices and internet access.
My presentation will showcase these research findings in order to highlight the importance and necessity of social media archival work. In doing so, I will discuss how I collected the data using a Python script (co-developed by myself and several other scholars), the challenges of cleaning and visualizing these data (using ArcGIS and tools developed by Documenting the Now), and ethical best practices for using social media data in research. Information related to this process, including detailed instructions, the Python scripts used to collect Twitter data, and a list of resources, is freely and openly accessible on the project's website (www.twit-lit.com) and GitHub repository (https://github.com/TwitLit/TwitLitSource). Other scholars are invited to use these scripts and other resources to collect their own social media data.
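By way of illustration, the sketch below shows the general shape such a collection script might take. It is not the project's own script: it uses twarc, a Twitter-collection library maintained by Documenting the Now, and the API credentials, hashtag, and file name are all placeholders.

```python
import json

from twarc import Twarc

# Placeholder credentials, obtained by registering an application
# through Twitter's developer program.
twarc = Twarc(
    consumer_key="...",
    consumer_secret="...",
    access_token="...",
    access_token_secret="...",
)

# Search for tweets carrying an illustrative literary hashtag and write
# each tweet's full JSON record on its own line (JSONL), a common
# format for tweet archives.
with open("twitlit_tweets.jsonl", "w") as outfile:
    for tweet in twarc.search("#haiku"):
        outfile.write(json.dumps(tweet) + "\n")
```

Storing one JSON object per line keeps the archive streamable, so a collection of millions of tweets can later be cleaned or filtered without loading the whole file into memory.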
Additionally, my project has planned for the long-term preservation of the more than eight million tweets that I have collected. This preservation is complicated by Twitter's strict Developer Policy and Agreement, which prevents individuals from keeping or disseminating large data sets for more than 30 days. The only exception to this policy is made for academic institutions, which may store Twitter data indefinitely on behalf of academic research.[2] Project TwitLit thus presents best practices for establishing a working relationship with university libraries to store and disseminate Twitter data in a way that is both in accord with Twitter's legal restrictions and responsive to the needs of scholars. In short, Project TwitLit provides a case study of a growing community on Twitter while simultaneously developing a set of tools and guidelines for other scholars seeking to engage in similar work.
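One common, policy-compliant way of disseminating such a collection, assumed purely for illustration in the sketch below rather than drawn from the project itself, is to share tweet IDs rather than full tweet records; other researchers can then "rehydrate" those IDs (for example, with twarc's hydrate command) to retrieve whatever tweets remain public. The file names here are placeholders.

```python
import json

# A minimal "dehydration" sketch: reduce a JSONL tweet archive to a
# plain list of tweet IDs, which can be shared more freely than the
# full tweet data under Twitter's Developer Agreement.
with open("twitlit_tweets.jsonl") as infile, open("tweet_ids.txt", "w") as outfile:
    for line in infile:
        tweet = json.loads(line)
        # "id_str" is the tweet's unique identifier in Twitter's JSON.
        outfile.write(tweet["id_str"] + "\n")
```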
[1] In December 2017, the Library of Congress, which began archiving Twitter in 2010, announced that it would no longer collect all Tweets; instead, Tweets produced after December 2017 would be collected only on a selective basis. There are no other ongoing, systematic efforts to collect and preserve this digital material. See Library of Congress, “Update on the Twitter Archive at the Library of Congress” (December 2017).
[2] For more information related to the challenges of collecting and storing Twitter data, please see Christian Howard, “Studying and Preserving the Global Networks of Twitter Literature,” in Post-45.