
DH Boot Camp for Librarians

Resources for the Digital Scholarship Services Librarians' Boot Camp.

Bibliography, Resources, and Glossary

The first iteration of the Boot Camp didn't cover topic modeling directly, but future instruction will include work with MALLET and other tools for mining corpora of texts.  The glossary and resources below comprise a learning guide for the topic modeling workshop.

 

Key Terms

Corpus: A collection of documents.  Selecting and cleaning a corpus can be the most time- and labor-intensive parts of topic modeling, and the outcomes depend directly on the quality and volume of data in the corpus.  A very small corpus is unlikely to yield many useful or specific topics; larger corpora (250,000+ words) usually generate better results.  Topic modeling is, at bottom, a machine-learning process: the more data the model has to work with, the more coherent and specific its topics tend to be.

Document: A discrete logical unit of text.  For topic modeling, documents can vary in length depending on the nature of the corpus or the kind of data you hope to surface.  Because topics arise from documents, it is wise to think carefully about how to segment your data.  For example, if you have 25,000 emails, do you treat each one as a document?  All emails by a given author as a single document?  The choices you make at this stage will directly affect your outcomes.

In a more functional sense for topic modeling, a document is a probability distribution over topics.

Latent Dirichlet Allocation: The underlying algorithm that MALLET (and other programs) uses for topic modeling.  Developed by David Blei, Andrew Ng, and Michael I. Jordan, LDA is a probabilistic technique for inferring which topics are likely to be present in a given document.  Links to more technical and detailed discussions of LDA are included below.  For the purposes of topic modeling, however, you can simplify the algorithm like this:

  1. Randomly assign each word in each document to one of N topics (you specify N).
  2. For any word W in document D, look at each topic Z and figure out
    1. how often W appears in topic Z elsewhere in the corpus; and
    2. how prevalent topic Z is in the rest of document D (that is, how many of D's other words are currently assigned to Z).
  3. For each possible topic Z, multiply the frequency of word W in Z by the number of other words in D that already belong to Z.

The result of (3) is (proportional to) the probability that W came from topic Z.  Reassign W to a topic drawn according to those probabilities, then move on to the next word.
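
To make those steps concrete, here is a toy Python sketch of the per-word update.  It is a simplified stand-in for the sampling step MALLET performs internally, not MALLET's actual code; the miniature corpus, the smoothing constants, and all names are invented for illustration.

    import random
    from collections import defaultdict

    # Toy corpus: each "document" is just a list of word tokens.
    docs = [
        ["river", "bank", "water", "stream"],
        ["bank", "money", "loan", "interest"],
        ["water", "stream", "fish", "river"],
    ]
    N_TOPICS = 2  # step 1: you choose N

    # Step 1: assign every word in every document to a topic at random,
    # and build the running counts that steps 2-3 consult.
    assignments = [[random.randrange(N_TOPICS) for _ in doc] for doc in docs]
    topic_word = defaultdict(lambda: defaultdict(int))   # topic -> word -> count
    doc_topic = [defaultdict(int) for _ in docs]         # per-document topic counts
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            z = assignments[d][i]
            topic_word[z][w] += 1
            doc_topic[d][z] += 1

    def resample(d, i):
        """Steps 2-3 for a single word: score every topic, then reassign the word."""
        w, old_z = docs[d][i], assignments[d][i]
        topic_word[old_z][w] -= 1          # set the word aside ("elsewhere" / "rest of D")
        doc_topic[d][old_z] -= 1
        scores = []
        for z in range(N_TOPICS):
            w_in_z = topic_word[z][w]      # step 2.1: how often W appears in topic Z elsewhere
            z_in_d = doc_topic[d][z]       # step 2.2: how prevalent Z is in the rest of D
            # step 3: multiply the two (the small constants play the role of the
            # beta and alpha smoothing parameters discussed below)
            scores.append((w_in_z + 0.01) * (z_in_d + 1.0))
        new_z = random.choices(range(N_TOPICS), weights=scores)[0]  # scores ~ P(W came from Z)
        assignments[d][i] = new_z
        topic_word[new_z][w] += 1
        doc_topic[d][new_z] += 1

    # Iterate: sweep over every word many times; the topics sharpen with each pass.
    for _ in range(50):
        for d, doc in enumerate(docs):
            for i in range(len(doc)):
                resample(d, i)

    for z in range(N_TOPICS):
        print(z, sorted(topic_word[z], key=topic_word[z].get, reverse=True))

Run a few times, this toy corpus tends to separate its "river/water" words from its "money/loan" words; MALLET does the same thing at scale, with far more data and many more iterations.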

Because topic modeling is iterative, the topic assignments gradually sharpen over repeated passes through the corpus.  The hyperparameters of LDA affect the quality of topics, too: alpha governs document-topic density (higher values let each document draw on more topics; MALLET's default is 5.0), and beta governs topic-word density (higher values spread each topic's probability over more words; MALLET's default is 0.01).
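
For reference, these parameters are set as options when you train a model from the command line.  A typical MALLET run looks something like the following; the directory and file names are placeholders.

    bin/mallet import-dir --input corpus_dir/ --output corpus.mallet \
        --keep-sequence --remove-stopwords
    bin/mallet train-topics --input corpus.mallet --num-topics 20 \
        --alpha 5.0 --beta 0.01 --num-iterations 1000 \
        --output-topic-keys topic_keys.txt --output-doc-topics doc_topics.txt

The --output-topic-keys file lists each topic's top words, and --output-doc-topics gives each document's topic proportions.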

Text mining: A general term encompassing several different types of automated discovery from a corpus of texts.  In common use, "text mining" often refers to topic modeling.  

Topic: A group of words that have a high likelihood of clustering together.  ("High" is relative, of course; we might also say "any likelihood," depending on the number of topics and the size of the corpus.)  For the purposes of topic modeling, a topic is a probability distribution over words.
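
In data-structure terms, the two distributions look something like this toy illustration (the words, labels, and numbers are invented):

    # A topic is a probability distribution over words (every word in the
    # vocabulary gets some probability; only the top few are shown here)...
    topic_7 = {"river": 0.12, "water": 0.10, "stream": 0.08, "fish": 0.05}
    # ...and, as noted above, a document is a probability distribution over topics.
    doc_42 = {"topic_7": 0.61, "topic_2": 0.27, "topic_15": 0.12}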

Topic modeling: A probabilistic technique for determining (and locating) words that are likely to cluster together as "topics" across a set of documents. It's a method of analyzing a collection of documents to infer what topics (aka discourses, or themes) might have generated them.

Sample Corpora

  • DH Toychest (Alan Liu): Liu has compiled varied textual data for modeling and other forms of computational analysis.  He's done a good bit of cleaning and organization of texts that can be quickly deployed for exploration and experimentation:  http://dhresourcesforprojectbuilding.pbworks.com/w/page/69244469/Data%20Collections%20and%20Datasets.  
  • Early English Books Online (EEBO): The Text Creation Partnership (http://www.textcreationpartnership.org/) has made over 25,000 texts in the EEBO collection freely available via GitHub:  https://github.com/textcreationpartnership.  The texts are encoded in XML; you can strip the markup with MALLET's --skip-html flag (an option for the import-dir command).  A more flexible and structured approach is to extract the text with Python (via the xml.etree.ElementTree library or BeautifulSoup); see the first sketch after this list.
  • HathiTrust: The HathiTrust Digital Library makes out-of-copyright texts available for download.  There are limits to what you can retrieve via the API, but details about how to download larger volumes of text for analysis are at https://www.hathitrust.org/datasets.
  • The Internet Archive (https://archive.org/details/texts): With over 15.5 million texts (as of 3/2018), the Internet Archive is a rich source of materials across disciplines and historical periods.  Digitization/OCR quality varies from work to work but is often excellent.  Programming Historian has an informative lesson about how to download IA materials in bulk with a simple Python program; see the second sketch after this list.
  • The Oxford Text Archive (https://ota.ox.ac.uk/): While these texts may require some processing before they're broken up into reasonable "documents," they represent a large and varied corpus for computational analysis.  (Caveat: many texts here are restricted-access.)
  • Project Gutenberg (http://www.gutenberg.org/) is a standard source of free, plaintext (UTF-8/ASCII) electronic books.  Note that the PG website will block attempts to scrape the site or otherwise download it wholesale; your best bet for getting a large volume of PG etexts is to download an ISO image via BitTorrent.  The most recent image contains about 30,000 texts: http://www.gutenberg.org/wiki/Gutenberg:The_CD_and_DVD_Project
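
For the EEBO-TCP files mentioned above, a minimal Python sketch of the strip-the-markup step might look like this.  The file name is a placeholder, and TCP element names vary from file to file, so treat it as a starting point rather than a finished script.

    import xml.etree.ElementTree as ET

    def tcp_plain_text(path):
        """Return the running text of a TCP XML file, with all markup discarded."""
        root = ET.parse(path).getroot()
        # itertext() walks the whole tree and yields only the text between the tags.
        return " ".join(" ".join(root.itertext()).split())

    # Placeholder file name.  BeautifulSoup(open(path), "xml").get_text() is an
    # equivalent route if you prefer that library (it requires lxml to be installed).
    print(tcp_plain_text("A00001.xml")[:500])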
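
And for bulk downloads from the Internet Archive, the internetarchive Python package (pip install internetarchive) is one possible route, roughly along these lines; the search query below is an invented placeholder, and you should check the package's documentation for the current API.

    from internetarchive import download, search_items

    # Placeholder query: substitute the collection or search you actually want.
    for result in search_items("collection:exampletexts"):
        item_id = result["identifier"]
        # Fetch only the plain-text files for each item into ./ia_texts/<item_id>/
        download(item_id, glob_pattern="*.txt", destdir="ia_texts")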

Online Resources & Citations