Text Analysis with Voyant Tools

A Workshop at the University of California, Riverside

Glossary

  • Black Box: A device, system or object which can be viewed in terms of its inputs and outputs without any knowledge of its internal workings. In Digital Humanities, projects and tools are often critiqued for being “black boxes” which prevent scholars from studying and analyzing the methodological choices made when composing the project.
  • Collocation: A sequence of words or terms that co-occur more often than would be expected by chance.
  • Corpus: A collection of documents.
  • Optical Character Recognition (OCR): Software that turns images of text into machine readable plain text documents.
  • Reproducible Research: This term originates from fields such as computer science and engineering that rely on computers to generate scholarly work. It refers to the idea that the ultimate product of academic research is the paper along with the full computational environment used to produce its results (i.e., code, data, etc.), so that anyone can reproduce the results and create new work based on the research.
  • Stop list or Stop Words: A list of words that is automatically omitted from an index of the most frequent words in a corpus.
  • Text Mining: Often also referred to as data mining. The overarching goal is usually to turn text into data for analysis, and it frequently relies heavily on natural language processing. For an excellent description of the history of the field and its various applications, see the Wikipedia entry for text mining.
  • TF-IDF: Short for term frequency-inverse document frequency. It is a numerical statistic that is intended to reflect how important a word is to a document in a corpus. This statistic is frequently used to weight terms by relevance or importance within a corpus.

These definitions have been drawn from Wikipedia, with contextualization added where necessary.
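
To make the stop list and TF-IDF entries more concrete, here is a minimal Python sketch that counts word frequencies with and without a stop list and then computes a textbook TF-IDF score. The three-sentence corpus, the five-word stop list, and the function names are all invented for illustration. The formula used (relative term frequency multiplied by the natural log of total documents over documents containing the term) is one common variant and is not necessarily the one Voyant Tools applies.

```python
import math
import re
from collections import Counter

# Hypothetical three-document corpus, invented for this illustration.
CORPUS = {
    "doc1": "The whale is the great white whale of the sea",
    "doc2": "The ship sailed on the sea and the crew watched the sea",
    "doc3": "The captain of the ship hunted the whale",
}

# A tiny illustrative stop list; real stop lists are much longer.
STOP_WORDS = {"the", "is", "of", "on", "and"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_counts(text, stop_words=frozenset()):
    """Count how often each word occurs, skipping words on the stop list."""
    return Counter(t for t in tokenize(text) if t not in stop_words)


def tf_idf(corpus, stop_words=frozenset()):
    """Return {document: {term: score}} using tf * ln(N / df).

    tf is the term's share of the words in the document, N is the number
    of documents, and df is the number of documents containing the term.
    """
    counts = {name: term_counts(text, stop_words) for name, text in corpus.items()}
    n_docs = len(counts)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for doc_counts in counts.values():
        doc_freq.update(doc_counts.keys())

    scores = {}
    for name, doc_counts in counts.items():
        total = sum(doc_counts.values())
        scores[name] = {
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in doc_counts.items()
        }
    return scores


if __name__ == "__main__":
    print(term_counts(CORPUS["doc1"]).most_common(3))
    print(term_counts(CORPUS["doc1"], STOP_WORDS).most_common(3))
    for term, score in sorted(tf_idf(CORPUS, STOP_WORDS)["doc1"].items()):
        print(f"{term}: {score:.3f}")
```

Without the stop list, "the" is the most frequent word in the first document. With it, "whale" rises to the top. Under this variant of TF-IDF, a word that appears in every document scores zero, and a word unique to one document ("great") can outrank a more frequent word shared with other documents ("whale").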