Large digital libraries such as the HathiTrust Digital Library (HTDL) provide texts of historical, cultural, and literary significance at unprecedented scale. However, their size and the consortial approach used to build them can frustrate computational attempts to model the collection, owing to issues such as uneven duplication and incomplete metadata. This paper presents the technical workflow of a project that seeks to address those challenges.