From Silo to Repo Enforcing File Structure to Improve Workflow and Access

1. Abstract

From Silo to Repo:
Enforcing File Structure to Improve Workflow and Access

Dussault, Jessica; Weakly, Laura K.
University of Nebraska-Lincoln

The University of Nebraska-Lincoln’s Center for Digital Research in the Humanities (CDRH) has created more than 80 projects since 1998.1 Because of the collaborative and experimental nature of Digital Humanities projects, a variety of technologies and structures were utilized, which has proved challenging to support as we continue to develop projects.2 Recently the CDRH has implemented policies that separate data from websites and standardize file locations. These policies have been reinforced by building software with expectations about data location, which makes creating, maintaining, and upgrading sites more manageable. This paper examines the structural changes of organizing files, the impact on workflow, and the increased accessibility of data for researchers.

File Organization

The CDRH’s largest and most well-known projects are document-driven digital archives, producing tens of thousands of encoded TEI-XML3 files. Historically, these files were part of a website's code base, but projects shared no common structure. The first step towards managing data for dozens of projects was to move the documents out of website code into "data repositories."4 Datura5, CDRH-developed software, provides customizable scripts for transforming a data repository's documents into other formats, such as HTML or JSON for search indexes. The location of documents is enforced by Datura's expectations of file types.

Figure 1: The basic structure of a data repository. The source documents, stored in subdirectories by type, are transformed by scripts which apply overrides and configuration settings into files in the output directory. Most commonly, TEI files are transformed into HTML for display or JSON for populating search indexes.

Images are stored separately from the website served with an International Image Interoperability Framework-compatible image server.6 These changes to data and image locations allow us to focus on standardizing other aspects of our websites, such as publishing documents with a center-wide API.

Workflow

This Datura-driven standardization has led to improvements in project workflow. Students and other researchers can be commonly trained on file naming and data structures. They are trained to use a version control system, currently git hosted on GitHub, allowing easier access to file sharing and virtually eliminating data being lost or overwritten. Previously students working on multiple projects needed training on each project as to where metadata files and images were stored, scan specifications, file naming, and file sharing. Now students can work across projects with little to no learning curve for each new project.

Increased Accessibility

The CDRH has a commitment to making data available for scholars. We have long exposed XML source files powering individual pages7, but the new data repository system makes downloading and working with project files even easier. The CDRH publishes data repositories on the code-sharing platform GitHub, tagged with "tei-xml" to aid discoverability.8 These public repositories can be downloaded or referenced remotely. An artist has already used repository documents to power a small site.9 The public repositories also provide student contributors with a resource to showcase their project work.10

Conclusion

Though this system has drawbacks that we will discuss during our presentation, our data-first approach has dramatically decreased development time spent on creating websites and on student training. Gathering documents together has improved the internal workflow of collaborators, laid the groundwork for future initiatives to renovate aging websites, and aided in the dissemination of data and scholarship to researchers around the globe.

Footnotes

1 https://cdrh.unl.edu/about/creatingcdrh#establishment

2 Discussion around designing Digital Humanities projects to enable future support is increasingly important in the discipline. See http://www.digitalhumanities.org/dhq/vol/13/1/000411/000411.html

3 https://tei-c.org/

4 Typically, each project has a single data repository, but in the case of larger and more complex projects like The Walt Whitman Archive, a project may have multiple data repositories to group documents with different concerns like marginalia or scribal documents.

5 https://github.com/CDRH/datura

6 We use the Cantaloupe server (https://cantaloupe-project.github.io/) with International Image Interoperability Framework Image API (https://iiif.io/api/image/2.1/)

7 The Willa Cather Archive is an early example of this practice and continues to provide links to source files on individual document pages: https://cather.unl.edu/

8 https://github.com/search?q=topic%3Atei-xml+org%3ACDRH&type=Repositories

9 https://love-and-seduction.glitch.me/ by John Skiles Skinner

10 For more on the complexities of properly crediting student work, see http://www.digitalhumanities.org/dhq/vol/11/3/000322/000322.html

Works Cited

Cantaloupe Project (2019). "Cantaloupe." University of Illinois. https://cantaloupe-project.github.io/

Center for Digital Research in the Humanities (2019). "CDRH Organization." GitHub. https://github.com/cdrh/

Center for Digital Research in the Humanities (2019). "Creating the CDRH." University of Nebraska-Lincoln. https://cdrh.unl.edu/about/creatingcdrh#establishment

Center for Digital Research in the Humanities (2019). "Datura." GitHub. https://github.com/CDRH/datura

Dalziel, Karin; Dussault, Jessica; Tunink, Greg (2018). "Legacy No Longer: Designing Sustainable Systems for Website Development." DH2018. https://dh2018.adho.org/en/legacy-no-longer-designing-sustainable-systems-for-website-development/

GitHub (2019). "About." GitHub, Inc. https://github.com/about

Git Project, The (2019). "git." The Git Project. https://git-scm.com/

International Image Interoperability Framework (2019). "International Image Interoperability Framework." IIIF Consortium. https://iiif.io/
Mauro, Aaron, et al. (2017) “Towards a Seamful Design of Networked Knowledge: Practical Pedagogies in Collaborative Teams.” Digital Humanities Quarterly 11:3. http://www.digitalhumanities.org/dhq/vol/11/3/000322/000322.html

Skinner, John Skiles (2019). "love-and-seduction." Glitch. https://love-and-seduction.glitch.me/

Smithies, James, et al. (2019). "Managing 100 Digital Humanities Projects: Digital Scholarship & Archiving in King's Digital Lab." Digital Humanities Quarterly 13:1. http://www.digitalhumanities.org/dhq/vol/13/1/000411/000411.html

Text Encoding Initiative (2019). "TEI: Text Encoding Initiative", Text Encoding Initiative. https://tei-c.org/

Walt Whitman Archive, The (2019). "The Walt Whitman Archive." Center for Digital Research in the Humanities. https://whitmanarchive.org/

Willa Cather Archive, The (2019). "The Willa Cather Archive." Center for Digital Research in the Humanities. https://cather.unl.edu/

Jessica Dussault (jdussault@unl.edu), University of Nebraska-Lincoln, United States of America and Laura K Weakly (lweakly2@unl.edu), University of Nebraska-Lincoln, United States of America

Theme: Lux by Bootswatch.