Extending the Utility of the HTRC Extracted Features Dataset Through Linked Data

1. Abstract

This poster describes the latest version of the Extracted Features (EF) dataset, derived from the HathiTrust Digital Library's corpus of more than 17 million volumes. This version employs Linked Data standards both to make the dataset more accessible and to incorporate richer metadata describing the volumes from which the data was derived. The dataset is organized by volume: each volume's data (tokens, part-of-speech tags, language tags, line counts, etc.) is directly associated with the metadata describing that volume in an individual JSON-LD document. The EF dataset provides a ready means of interacting with volumes whose intellectual content remains under copyright, and it allows a variety of analytics, such as visualizing word usage over time, to be carried out on data that would not otherwise be accessible.
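As a rough illustration of the per-volume arrangement described above, the following Python sketch sums token counts from one volume document and groups a word's frequency by publication year. The field names (`metadata`, `pubDate`, `features`, `pages`, `body`, `tokenPosCount`), the sample values, and the helper functions are assumptions made for illustration; they are not taken from this abstract and should be checked against the dataset's published schema.

```python
import json
from collections import Counter

# Hypothetical, heavily trimmed stand-in for one per-volume JSON-LD document.
# Field names and values are illustrative assumptions, not the official schema.
SAMPLE_VOLUME = json.loads("""
{
  "metadata": {"title": "An Example Volume", "pubDate": 1895},
  "features": {
    "pages": [
      {"seq": "00000001",
       "body": {"tokenPosCount": {"whale": {"NN": 2}, "the": {"DT": 5}}}},
      {"seq": "00000002",
       "body": {"tokenPosCount": {"whale": {"NN": 1}, "sea": {"NN": 3}}}}
    ]
  }
}
""")


def volume_token_counts(volume):
    """Sum token counts over all pages, collapsing part-of-speech tags."""
    counts = Counter()
    for page in volume["features"]["pages"]:
        for token, pos_counts in page["body"]["tokenPosCount"].items():
            counts[token.lower()] += sum(pos_counts.values())
    return counts


def usage_by_year(volumes, word):
    """Total occurrences of `word` per publication year across volumes."""
    by_year = Counter()
    for volume in volumes:
        year = volume["metadata"]["pubDate"]
        by_year[year] += volume_token_counts(volume)[word.lower()]
    return dict(by_year)


counts = volume_token_counts(SAMPLE_VOLUME)
print(counts["whale"])
print(usage_by_year([SAMPLE_VOLUME], "whale"))
```

Because only counts are stored, not the running text, this kind of aggregation works the same way for in-copyright volumes; the per-year totals could then be passed to any plotting library to visualize word usage over time.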

Jacob Jett (jjett2@illinois.edu), University of Illinois at Urbana-Champaign, United States of America; Boris Capitanu (jdownie@illinois.edu), University of Illinois at Urbana-Champaign, United States of America; Deren Kudeki, University of Illinois at Urbana-Champaign, United States of America; Timothy W. Cole, University of Illinois at Urbana-Champaign, United States of America; and J. Stephen Downie, University of Illinois at Urbana-Champaign, United States of America
