The first iteration of the Boot Camp didn't cover topic modeling directly, but future instruction will include work with MALLET and other tools for mining corpora of texts. The glossary and resources below comprise a learning guide for the topic modeling workshop.
Corpus: A collection of documents. Selection and cleaning of corpora can be the most time- and labor-intensive aspects of topic modeling, and topic modeling outcomes directly depend on the quality and volume of data in the corpus. A very small corpus is unlikely to yield many useful or specific topics; larger corpora (250,000+ words) usually generate better results. This is because topic modeling is essentially a machine learning process: the more text the model has to learn from, the more coherent and distinct its topics tend to be.
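If you want a rough sense of your corpus's size and cleanliness before modeling, a few lines of Python can help. This is a minimal sketch assuming a hypothetical corpus/ directory of plain-text files and a toy stopword list; tools like MALLET perform their own tokenization and stopword removal during import, so treat this as illustration rather than required preprocessing.

```python
import re
from pathlib import Path

# Toy stopword list; real lists (e.g., MALLET's) are far longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}

def clean(text):
    """Lowercase, strip punctuation, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

# Hypothetical layout: one plain-text file per document in corpus/.
corpus = {p.name: clean(p.read_text(encoding="utf-8"))
          for p in Path("corpus").glob("*.txt")}

# A rough size check: small corpora rarely yield coherent topics.
total = sum(len(tokens) for tokens in corpus.values())
print(f"{len(corpus)} documents, {total} tokens")
if total < 250_000:
    print("Warning: this corpus may be too small for useful topics.")
```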
Document: A discrete logical unit of text. For topic modeling, documents can vary in length depending on the nature of the corpus or the kind of data you hope to surface. Because topics arise from documents, it is wise to think carefully about how to segment your data. For example, if you have 25,000 emails, do you treat each one as a document? All emails by a given author as a single document? The choices you make at this stage will directly affect your outcomes.
In a more functional sense for topic modeling, a document is a probability distribution over topics.
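To make the segmentation question concrete, here is a minimal Python sketch of the two email strategies mentioned above. The Email class and sample messages are hypothetical stand-ins for however your mail is actually stored; the point is only that the same data yields a different document count, and therefore different topics, under each choice.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Email:
    author: str
    body: str

emails = [
    Email("alice", "Budget figures attached for the third quarter."),
    Email("bob", "Can we move the meeting to Tuesday afternoon?"),
    Email("alice", "Revised budget with travel costs included."),
]

# Option 1: each email is its own document.
docs_per_email = [e.body for e in emails]

# Option 2: all emails by a given author form a single document.
by_author = defaultdict(list)
for e in emails:
    by_author[e.author].append(e.body)
docs_per_author = [" ".join(bodies) for bodies in by_author.values()]

print(len(docs_per_email), "documents when segmented per email")    # 3
print(len(docs_per_author), "documents when segmented per author")  # 2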
Latent Dirichlet Allocation: The underlying algorithm that MALLET (and other programs) use for topic modeling. Developed by David Blei, Andrew Ng, and Michael I. Jordan, LDA represents a probabilistic technique for inferring which topics are likely to be present in a given document. Links to more technical and detailed discussions of LDA are included below. For the purposes of topic modeling, however, you might simplify the algorithm like this:
1. Temporarily assign every word in every document to one of the topics at random.
2. For each word W in each document D, and for each topic Z, compute (a) the proportion of words in D currently assigned to Z and (b) the proportion of assignments to Z, across all documents, that come from W.
3. Multiply (a) by (b).

The result of (3) is the probability that W came from topic Z. W is reassigned on that basis, and the process repeats for every word in the corpus.
Because topic modeling is iterative, the model's assignments gradually improve with each pass over the corpus. The hyperparameters of LDA affect the quality of topics, too: alpha governs document-topic density (a higher alpha spreads each document across more topics; MALLET's default is 5.0) and beta governs topic-word density (a higher beta spreads each topic across more words; MALLET's default is 0.01).
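For readers who want to see those steps in motion, below is a toy Python implementation of the sampling loop described above, assuming an invented five-word-per-document corpus and two topics. The alpha and beta values mirror MALLET's defaults, but this is a sketch of the general technique, not MALLET's actual code.

```python
import random
from collections import defaultdict

# Invented toy corpus: each document is a list of word tokens.
docs = [
    "river bank water stream fish".split(),
    "money bank loan interest cash".split(),
    "water fish stream river flow".split(),
    "loan cash money interest bank".split(),
]
K = 2                # number of topics
alpha = 5.0 / K      # per-topic smoothing (MALLET's 5.0 is the sum over topics)
beta = 0.01          # topic-word smoothing (MALLET's default)
V = len({w for d in docs for w in d})  # vocabulary size

# Step (1): randomly assign every word token to a topic.
assign = [[random.randrange(K) for _ in d] for d in docs]

# Count tables behind the two proportions in step (2).
doc_topic = [[0] * K for _ in docs]
topic_word = [defaultdict(int) for _ in range(K)]
topic_total = [0] * K
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        z = assign[d][i]
        doc_topic[d][z] += 1
        topic_word[z][w] += 1
        topic_total[z] += 1

for _ in range(200):  # each pass re-samples every word in the corpus
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            z = assign[d][i]
            # Withdraw the current assignment before re-sampling.
            doc_topic[d][z] -= 1
            topic_word[z][w] -= 1
            topic_total[z] -= 1
            # Steps (2)-(3): p(topic Z | doc D) * p(word W | topic Z),
            # smoothed by alpha and beta.
            weights = [
                (doc_topic[d][k] + alpha)
                * (topic_word[k][w] + beta) / (topic_total[k] + V * beta)
                for k in range(K)
            ]
            z = random.choices(range(K), weights=weights)[0]
            assign[d][i] = z
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1

# Each topic is now a (rough) probability distribution over words.
for k in range(K):
    top = sorted(topic_word[k], key=topic_word[k].get, reverse=True)[:5]
    print(f"topic {k}:", " ".join(top))
```

On most runs the water-related words and the finance-related words drift into separate topics, which is exactly the clustering behavior the Topic entry below describes.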
Text mining: A general term encompassing several different types of automated discovery from a corpus of texts. In common use, "text mining" often refers to topic modeling.
Topic: A group of words that have a high likelihood of clustering together. ("High" is relative, of course; we might also say "any likelihood," depending on the number of topics and the size of the corpus.) For the purposes of topic modeling, a topic is a probability distribution over words.
Topic modeling: A probabilistic technique for determining (and locating) words that are likely to cluster together as "topics" across a set of documents. It's a method of analyzing a collection of documents to infer what topics (aka discourses or themes) might have generated them.