Over the last ten years, topic modeling has increasingly become a tool used by historians and digital humanists. A topic model “is a type of statistical model for discovering the abstract ‘topics’ that occur in a collection of documents.”1 Plain text documents are run through topic modeling software, such as MALLET, to generate the “relative importance of topics in the composition of each document.”2 Additionally, the software generates a “topic proportion,” which describes the relative weight of each topic in the body of texts. This project used MALLET, the Machine Learning for Language Toolkit, to generate topics and Tesseract to convert the articles to plain text.
Topic modeling is often described as “distant reading,” a concept coined by Franco Moretti.3 The idea is that, with an ever-expanding corpus of digital texts, computers and statistical modeling can aid the historian by “reading” the documents algorithmically. The computer then returns models composed of groups of words, or “topics,” that have a statistical relation to each other. Robert K. Nelson has stated that the potential of topic modeling “isn’t at the level of the individual document. Topic modeling, instead, allows us to step back from individual documents and look at larger patterns among all the documents, to practice not close but distant reading, to borrow [Moretti’s] memorable phrase.”4 Topic modeling allows historians to look at a group of documents as a whole rather than performing a close reading of each one, which makes it possible to analyze thousands of documents without spending years reading them individually. Furthermore, topic modeling may surface relationships between words that a human reader would not have noticed. Topic modeling is in no way a replacement for traditional historical techniques, but it is one example of how digital tools are helping historians do research in new ways and learn things that may not have been possible without technology.
Using this framework, I have run a selection of Ullback’s articles through MALLET and generated some visualizations based on the resulting topic models. Clay Templeton has identified five elements of a topic modeling project that describe the process (Clay Templeton, “Topic Modeling in the Humanities: An Overview,” Maryland Institute for Technology in the Humanities, August 1, 2011):
- Corpus
- Technique
- Unit of Analysis
- Post Processing
- Visualization
Below is a description of each of these five elements as they relate to this project.
- Corpus: Sylvia Ullback’s articles from Photoplay magazine.5 The corpus includes a total of 41 articles.
- Technique: Basic LDA using MALLET
- Unit of Analysis: Article (one per month over two periods)
- Post Processing: I reorganized the data in Excel, sorting it by date and topic rather than by highest occurrence.
- Visualization: I graphed the output on several basic charts.
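The post-processing step described above was done in Excel, but the same reordering can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes a document-topics file in which each row holds a document index, a file name, and then (topic, proportion) pairs ranked from highest to lowest proportion, and it assumes hypothetical file names that begin with a year and month (for example `1932-02.txt`). The function names and the sample rows are invented for the example and are not taken from the project.

```python
def parse_doc_topics_line(line, num_topics):
    """Turn one row of ranked (topic, proportion) pairs into a list indexed by topic."""
    fields = line.split()
    doc_name = fields[1]              # fields[0] is the document index
    pairs = fields[2:]                # topic, proportion, topic, proportion, ...
    proportions = [0.0] * num_topics
    for i in range(0, len(pairs) - 1, 2):
        topic = int(pairs[i])
        proportions[topic] = float(pairs[i + 1])
    return doc_name, proportions


def date_from_name(doc_name):
    """Hypothetical naming scheme: the file name starts with YYYY-MM."""
    return doc_name.rsplit("/", 1)[-1][:7]


def reorder_by_date(lines, num_topics, date_of=date_from_name):
    """Return (date, proportions) rows sorted by date, with topics in fixed order."""
    rows = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue                  # skip blank lines and the header row
        name, proportions = parse_doc_topics_line(line, num_topics)
        rows.append((date_of(name), proportions))
    rows.sort(key=lambda row: row[0])
    return rows


# Invented sample rows, for illustration only.
sample = [
    "#doc name topic proportion ...",
    "0 file:/corpus/1932-03.txt 2 0.6 0 0.3 1 0.1",
    "1 file:/corpus/1932-02.txt 1 0.7 2 0.2 0 0.1",
]
for date, proportions in reorder_by_date(sample, num_topics=3):
    print(date, proportions)
```

Each output row then has one column per topic in a fixed order, which is the shape needed to chart how a single topic rises and falls over time.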
One issue I had with the data was OCR accuracy. Although I was able to clean up the OCR output significantly, some errors remained. I made a custom stoplist to omit stray characters such as \, &, and ~. Nevertheless, some OCR errors, such as ‘si’, still appear in the topics. These errors result from the limits of OCR accuracy and the quality of the scanned articles.
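One way to find candidates for such a stoplist is to count the tokens that look like OCR debris. The sketch below is a rough illustration, not the method used in this project: the function name, the punctuation set, and the three-letter threshold are all assumptions, and the sample sentence is invented. Short real words will be flagged alongside fragments like ‘si’, so the output is meant to be reviewed by hand before anything is added to a stoplist.

```python
from collections import Counter


def stoplist_candidates(text, min_len=3):
    """Count tokens that contain non-letters or are shorter than min_len."""
    counts = Counter()
    for token in text.lower().split():
        token = token.strip(".,;:!?\"'()")   # drop ordinary punctuation at the edges
        if not token:
            continue
        if not token.isalpha() or len(token) < min_len:
            counts[token] += 1
    return counts


# Invented sample text containing the kinds of debris mentioned above.
sample_text = "The star si smiled \\ & ~ at the camera"
for token, count in sorted(stoplist_candidates(sample_text).items()):
    print(repr(token), count)
```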
- “Topic Model,” Wikipedia, the Free Encyclopedia, November 23, 2013, http://en.wikipedia.org/w/index.php?title=Topic_model&oldid=582855285.
- Ibid.
- Franco Moretti, Graphs, Maps, Trees: Abstract Models for Literary History (London; New York: Verso, 2007).
- Clay Templeton, “Topic Modeling in the Humanities: An Overview,” Blog, Maryland Institute for Technology in the Humanities, August 1, 2011, http://mith.umd.edu/topic-modeling-in-the-humanities-an-overview/.
- She wrote for the magazine during two periods: February 1932-July 1935 and October 1936-September 1937. From August 1935 to September 1936 she wrote for Modern Screen Magazine. These articles, as well as the Photoplay articles for July 1933 to January 1934, are not included in the analysis because they are not digitized.