Creating a Meaningful Genre Schema and Metadata using IMDb data for a Large-Scale Digital Humanities Project in Media Studies

1. Abstract

We are currently engaged in a long-term DH project examining the social networks of TV and film actors and crews across the more than 32,500 media items they have worked on, starting in 1938. The primary source is the Internet Movie Database (IMDb). IMDb is one of the most robust databases available and provides free downloadable data about the actors and crew who worked on various media items, including TV, movies, and video games. However, it is problematic in a number of ways, as we presented at DH2018. IMDb appears to be under-researched as a source for media studies in DH: many scholars focus on fan activity, or on film actors and directors only, and most do not question IMDb genres. This presentation will explore the contrast between IMDb’s methods and the schema we have had to create. It will also discuss some of the challenges we faced and how we think this work could aid other researchers creating public digital humanities databases.

Steve Neale writes in Genre and Contemporary Hollywood, “genres can be approached from the point of view of the industry and its infrastructure, from the point of view of their aesthetic traditions, from the point of view of the broader socio-cultural environment upon which they draw and into which they feed, and from the point of view of audience understanding and response.” We must consider all of these perspectives, as our work encompasses not only Hollywood film, the subject of most genre analysis, but also media of every type, including much that is obscure or forgotten. IMDb’s genre methodology was inadequate for this, so we turned to industry-focused websites to enhance our understanding.

Other schemas use very vague macro descriptors, or idiosyncratic descriptors that allow a media item to be included in multiple “lists.” A movie in the Library of Congress is simply Comedy, Drama, Action, etc. AFI divides films into “best” but also “most thrilling” (including action, horror, and adventure). The Telegraph writes that Netflix’s “genres, based on a complicated algorithm that uses reams of data about users' viewing habits . . . number in the tens of thousands” including commonly accepted genres like “Action” but also “Family Watch Together TV.” We attempted to build a taxonomy, both expanding and limiting the options available, so that the categories encompass the full range of programming without becoming so esoteric that they cannot be clustered together and measured. We have created a vocabulary/schema to allow for clear communication as part of Public Humanities. We cannot, for example, talk about the representation of women in a particular subgenre if we do not have a shared understanding of it. If other scholars also use this schema (or work with us to adapt it), each media item can be described in a way that allows for effective and relatively consistent coding by multiple scholars.
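One way to check whether coding by multiple scholars is in fact consistent is to compute an agreement statistic over items that two coders labeled independently. The Python sketch below uses Cohen's kappa, which is one common choice; this abstract does not say which measure, if any, the project uses, and the labels in the example are hypothetical.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who each assigned one label per item.

    Assumption: one label per item per coder. Multi-label coding would
    need a different statistic.
    """
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(coder_a)
    # Share of items on which the two coders chose the same label.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, from each coder's label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical example: two coders labeling the same four media items.
coder_1 = ["Western", "Western", "Comedy", "Comedy"]
coder_2 = ["Western", "Comedy", "Comedy", "Comedy"]
kappa = cohens_kappa(coder_1, coder_2)
```

A value near 1 indicates agreement well beyond chance, while a value near 0 indicates agreement no better than chance, which would suggest that the category definitions need tightening.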

Genre on IMDb is handled in ways that are not useful for analysis because terms are used inconsistently. There is not enough inter-rater reliability, the tags are misleading, and scholars do not all agree on how to use them. As Deb Verhoeven states in “Mapping the Movies,” a project like her team’s “only works if the existing data collection is both sufficiently comprehensive and thoroughly reliable, since it will have to be accepted by all partners” (Verhoeven). There is little consistent agreement on genre among scholars or fans, as borne out in genre theory. What IMDb designates as “genres” actually combines traditional genres, subgenres, and target audience categories. Because IMDb allows those who enter the data to select any number of these terms, and many fans enjoy labeling media items to fit multiple lists, it becomes impossible to analyze media items using IMDb’s categories. In large part this is because the database relies heavily on users (sometimes cast/crew members, agents, producers, and fans) for its data and for much of its editing. As Wasserman et al. point out, “Although user editing allows a reference website such as IMDb to be up-to-date, it diffuses the responsibility for fact-checking, leading to greater uncertainty about accuracy and objectivity of information” (Wasserman).
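To make the mixing of term types concrete, the following minimal Python sketch splits a multi-valued IMDb genre field and groups its terms by what they actually describe. It assumes the comma-separated `genres` string used in IMDb's downloadable `title.basics` file, with `\N` marking a missing value. The `TERM_KIND` mapping is a hypothetical illustration, not the project's schema.

```python
# Hypothetical classification of a few IMDb "genre" terms by what they
# actually describe. The kinds follow the distinction drawn above.
TERM_KIND = {
    "Western": "genre",
    "Sci-Fi": "genre",
    "Comedy": "genre",
    "Drama": "genre",
    "Romance": "genre",
    "Film-Noir": "subgenre",
    "Family": "target audience",
}

def split_genres(field):
    """Split an IMDb-style comma-separated genre string into a list of terms."""
    if not field or field == r"\N":
        return []
    return [term.strip() for term in field.split(",") if term.strip()]

def group_by_kind(field):
    """Group the terms of one genre field by kind; unknown terms are flagged."""
    grouped = {}
    for term in split_genres(field):
        kind = TERM_KIND.get(term, "unclassified")
        grouped.setdefault(kind, []).append(term)
    return grouped

example = group_by_kind("Comedy,Family,Romance")
```

In this example a single field yields two genre terms and one target audience term, so counting it as "three genres" would conflate different kinds of description.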

It has taken significant additional research and reorganization to use the data effectively because, as media researchers Marsden et al. explain, there is not enough agreement about metadata. While most people can tell a Western from Science Fiction, IMDb makes it more difficult to deal with hybrid genres such as Dramedies or Family movies, or with cases where a particular movie or show combines genres, such as superimposing Western generic concepts onto Science Fiction, or an Action/Adventure movie with a strong romantic plot. The schemas used by the Library of Congress, Netflix, Amazon, and others are too reductive or imprecise for our purposes.

Therefore, we not only had to create a taxonomy with a variety of categories, including subjects, styles, modes, and purposes, but also had to write our own concise definitions for these categories. We will share parts of our schema in this presentation.
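A taxonomy of this kind can be represented as a controlled vocabulary in which every term belongs to one category and carries a definition. The Python sketch below is an illustration under stated assumptions, not the project's implementation: the four facet names come from the categories listed above, while the class names, example terms, and placeholder definitions are all hypothetical.

```python
from dataclasses import dataclass, field

# Facet names follow the categories named in the text.
FACETS = ("subject", "style", "mode", "purpose")

@dataclass
class Term:
    facet: str
    label: str
    definition: str

    def __post_init__(self):
        if self.facet not in FACETS:
            raise ValueError(f"unknown facet: {self.facet!r}")

@dataclass
class Schema:
    terms: dict = field(default_factory=dict)

    def add(self, term):
        self.terms[term.label] = term

    def code_item(self, labels):
        """Group an item's labels by facet; reject labels with no definition."""
        coded = {facet: [] for facet in FACETS}
        for label in labels:
            if label not in self.terms:
                raise KeyError(f"{label!r} is not defined in the schema")
            coded[self.terms[label].facet].append(label)
        return coded

# Hypothetical terms with placeholder definitions, for illustration only.
schema = Schema()
schema.add(Term("subject", "Western", "placeholder definition"))
schema.add(Term("mode", "Comedy", "placeholder definition"))
coded = schema.code_item(["Western", "Comedy"])
```

Rejecting undefined labels is a deliberate choice in this sketch: it forces every coder to work from the shared definitions instead of adding ad hoc tags.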

Conaway, Cindy and Diane Shichtman. “Seinfeld at The Nexus of the Universe: Using IMDb Data and Social Network Theory to Create a Digital Humanities Project.” DH2018. Mexico City. MX. June 24-30, 2018.

Lutter, Mark. "Do women suffer from network closure? The moderating effect of social capital on gender inequality in a project-based labor market, 1929 to 2010." American Sociological Review 80.2 (2015): 329-358.

Lutter, Mark. "Creative success and network embeddedness: Explaining critical recognition of film directors in Hollywood, 1900–2010." Max Planck Institute for the Study of Societies. (2014).

Marsden, Alan et al. “Tools for Searching, Annotation and Analysis of Speech, Music, Film and Video--A Survey.” Literary and Linguistic Computing 22.4 (2007): 469–488. Web.

Neale, Stephen. Genre and Contemporary Hollywood. British Film Inst, 2002.

Rossman, Gabriel, Nicole Esparza, and Phillip Bonacich. "I’d Like to Thank The Academy, Team Spillovers, and Network Centrality." American Sociological Review 75.1 (2010): 31-51.

Titcomb, James. "Netflix Codes: The Secret Numbers that Unlock Thousands of Hidden Films and TV Shows." The Telegraph, 20 Dec. 2017. Web. 3 Jan. 2020.

Verhoeven, Deb, Kate Bowles, and Colin Arrowsmith. "Mapping the Movies." Digital Tools in Media Studies. 1. 2015.

Wasserman, Max, et al. "Correlations between User Voting Data, Budget, and Box Office for Films in the Internet Movie Database." Journal of the Association for Information Science & Technology 66.4 (2015): 858-68.

Cindy Conaway, SUNY-Empire State College, United States of America, and Diane Shichtman, SUNY-Empire State College, United States of America
