HANDLE: Get a Grip on MALLET

1. Abstract

Large collections of digital text from various sources are now available to researchers and the public, and topic modeling is one of the most common approaches to mining that text. However, existing topic modeling tools are often inaccessible to less technically adept users: the most popular tools require either using the command line or writing code [1]. While a GUI [2] exists, researchers have expressed frustration with its limited feature set. HANDLE (Heuristic Analytical Digital Language Environment), developed at UCLA’s Scholarly Innovation Lab, is a GUI for MALLET [3] that offers easier text importing, topic model building, visualization, and data exporting. Making text mining more accessible facilitates research at all levels, from undergraduate to faculty, by making it easier to perform research and communicate it to a wider audience.

One-step Importing

HANDLE removes one of the stumbling blocks for first-time users by simplifying text import in four ways. First, a drag-and-drop interface for importing files means that students do not need to learn MALLET’s command-line tools; HANDLE can import plain text and Word documents. Second, a preview window shows the text as MALLET will understand it: punctuation is removed, and stopwords are shown struck through. Third, the defaults preserve accent marks and non-Latin characters; more advanced users can adjust these settings. Fourth, stopword settings can be adjusted: a user can drag and drop stopword list files into the program, and new stopwords can be added through a contextual menu.
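To make the preview idea concrete, the following is a minimal Python sketch of how such a view could be produced. It is not HANDLE's actual code: the token pattern, the function names `preview` and `kept_tokens`, and the use of the combining character U+0336 for strike-through are all assumptions chosen for illustration, and MALLET's own tokenizer is configurable and may behave differently.

```python
import re
import unicodedata

# Assumption: a token is a run of letters in any script. In Python 3 the
# class [^\W\d_] matches Unicode letters, so accents and non-Latin text survive.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Lowercase the text and split it into letter-only tokens."""
    # NFC keeps accented letters as single code points (e.g. "e" + accent -> "é").
    normalized = unicodedata.normalize("NFC", text)
    return TOKEN_RE.findall(normalized.lower())


def kept_tokens(text, stopwords):
    """Tokens that would remain after punctuation and stopwords are dropped."""
    return [token for token in tokenize(text) if token not in stopwords]


def preview(text, stopwords):
    """Render all tokens, striking through the ones that are stopwords."""
    shown = []
    for token in tokenize(text):
        if token in stopwords:
            # U+0336 after each character displays as strike-through.
            shown.append("".join(ch + "\u0336" for ch in token))
        else:
            shown.append(token)
    return " ".join(shown)


if __name__ == "__main__":
    print(preview("The caf\u00e9, the na\u00efve tourist!", {"the"}))
```

Showing the struck-through stopwords next to the kept tokens lets a user see what a stopword list removes before any model is trained.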

Convenient Experimentation

Most users have to try several topic models with different settings before they find one that works, but MALLET only accepts input from the command line, leaving users with the tedious task of retyping a command to change a single setting. HANDLE addresses this by providing a graphical interface for creating topic models, which reduces the opportunities for error. It also lists all the topic models a user has created and remembers the settings used for each, making comparison and documentation easier. Finally, a user can save a project (a collection of documents and topic models) and come back to it later.
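For readers who have not used command-line MALLET, the sketch below shows the kind of repetition involved. It only builds `train-topics` command lines for several settings and records which settings produced which output files; it does not run MALLET and is not how HANDLE works internally. The flags shown are MALLET command-line options, while the file names, the `models` directory, and the helper `train_topics_command` are hypothetical.

```python
from itertools import product


def train_topics_command(corpus, num_topics, iterations, out_dir, mallet="bin/mallet"):
    """Build one MALLET train-topics command as an argument list."""
    tag = f"k{num_topics}_i{iterations}"  # hypothetical naming scheme
    return [
        mallet, "train-topics",
        "--input", corpus,
        "--num-topics", str(num_topics),
        "--num-iterations", str(iterations),
        "--output-topic-keys", f"{out_dir}/{tag}_keys.txt",
        "--output-doc-topics", f"{out_dir}/{tag}_doc_topics.txt",
    ]


# One command per combination of settings; only the numbers change.
commands = [
    train_topics_command("corpus.mallet", k, n, "models")
    for k, n in product([10, 20, 40], [1000])
]

if __name__ == "__main__":
    for command in commands:
        print(" ".join(command))
```

Each command differs from the last in a single number, and this is the retyping, along with the record of settings per model, that a graphical interface can take over.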

Easy Visualization

HANDLE has integrated visualizations. For each topic model, an LDAvis-inspired [4] interface can be used to explore topic similarity and the top words in each topic. Another interface displays the topics that make up a specific document, and their proportions. These visualizations can be exported as images for articles or papers.
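As an illustration of what "topic similarity" and "top words" can mean computationally, here is a small Python sketch. It assumes each topic is a probability distribution over a shared vocabulary and compares topics with Jensen-Shannon divergence, the default measure in LDAvis; whether HANDLE uses the same measure is an assumption, and the function names and toy numbers are hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions.

    Returns 0 for identical distributions and 1 for non-overlapping ones.
    """
    midpoint = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, midpoint) + 0.5 * kl(q, midpoint)


def top_words(distribution, vocabulary, n=3):
    """Return the n most probable words in a topic."""
    ranked = sorted(zip(vocabulary, distribution), key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:n]]


if __name__ == "__main__":
    vocabulary = ["ship", "sea", "court", "law"]
    topic_a = [0.5, 0.4, 0.05, 0.05]
    topic_b = [0.05, 0.05, 0.5, 0.4]
    print(top_words(topic_a, vocabulary, n=2))
    print(js_divergence(topic_a, topic_b))
```

A visualization in the style of LDAvis would then project these pairwise distances onto two dimensions so that similar topics appear close together.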

Export Data into Familiar Tools

Because no tool can anticipate all use cases, HANDLE can export its data in formats that are easy to manipulate. Summaries of all topics within all documents, called “document topic reports” in MALLET, can be exported as spreadsheets. Topic model files can be exported for use with command-line MALLET, or for other computers running HANDLE. HANDLE allows users to focus on the data rather than getting the tool to run, and to work with their data in environments more familiar to them.
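The sketch below shows one way a document topic report could be turned into a spreadsheet-friendly CSV file. It assumes the dense tab-separated layout written by recent MALLET versions (document index, document name, then one proportion per topic); older versions use a different layout that this code does not handle. The function name and column names are hypothetical, and this is not HANDLE's own export code.

```python
import csv
import io


def doc_topics_to_csv(report_lines, out):
    """Write a dense doc-topics report to `out` as CSV with a header row."""
    writer = csv.writer(out)
    header_written = False
    for line in report_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        doc_id, name, *proportions = line.split("\t")
        if not header_written:
            topics = [f"topic_{i}" for i in range(len(proportions))]
            writer.writerow(["doc_id", "document"] + topics)
            header_written = True
        writer.writerow([doc_id, name] + [float(p) for p in proportions])


if __name__ == "__main__":
    sample = ["0\ta.txt\t0.25\t0.75\n", "1\tb.txt\t0.5\t0.5\n"]
    buffer = io.StringIO()
    doc_topics_to_csv(sample, buffer)
    print(buffer.getvalue())
```

With one row per document and one column per topic, the result can be sorted, filtered, or charted in any spreadsheet program.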

HANDLE will be released as an open-source project on GitHub by summer 2020.

Works Cited

Grün, Bettina, and Kurt Hornik, ‘Topicmodels: An R Package for Fitting Topic Models’, Journal of Statistical Software, 40.13 (2011)

Enderle, Jonathan Scott, Arun Balagopalan, Xiaojing Li, and David Newman, Senderle/Topic-Modeling-Tool: First Stable Release (Zenodo, 2017)

McCallum, Andrew Kachites, MALLET: A Machine Learning for Language Toolkit (Amherst, MA: UMass Amherst, 2002)

Řehůřek, Radim, and Petr Sojka, ‘Software Framework for Topic Modelling with Large Corpora’, in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks (Valletta, Malta: ELRA, 2010), pp. 45–50

Sievert, Carson, and Kenny Shirley, LDAvis: Interactive Visualization of Topic Models, version 0.3.2, 2015 [accessed 9 October 2019]

[1] For example, see Bettina Grün and Kurt Hornik, ‘Topicmodels: An R Package for Fitting Topic Models’, Journal of Statistical Software, 40.13 (2011), and Radim Řehůřek and Petr Sojka, ‘Software Framework for Topic Modelling with Large Corpora’, in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks (Valletta, Malta: ELRA, 2010), pp. 45–50.

[2] Jonathan Scott Enderle and others, Senderle/Topic-Modeling-Tool: First Stable Release (Zenodo, 2017).

[3] Andrew Kachites McCallum, MALLET: A Machine Learning for Language Toolkit (Amherst, MA: UMass Amherst, 2002).

[4] Carson Sievert and Kenny Shirley, LDAvis: Interactive Visualization of Topic Models, version 0.3.2, 2015 [accessed 9 October 2019].

David Lawrence Shepard (shepard.david@gmail.com), UCLA, United States of America
