Basic Text Summaries and Analyses
- Word frequency (lists of words and their frequencies)
(See also: Ted Underwood, “Word counts are amazing.”)
- Collocation (words commonly appearing near each other)
- Concordance (the contexts of a given word or set of words)
- N-grams (common two-word, three-word, and longer phrases)
- Entity recognition (identifying names, places, time periods, etc.)
- Dictionary tagging (locating a specific set of words in the texts)
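Several of the summaries above can be sketched with nothing but the Python standard library. The example below is a minimal illustration, not a recommended pipeline: the tokenizer is a deliberately crude regular expression, the sample sentence is made up for the demonstration, and the function names (tokenize, word_frequencies, ngrams, concordance, dictionary_tag) are hypothetical. Collocation and entity recognition are left out because they usually call for statistical measures or a trained model.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes (a crude rule)."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(tokens):
    """Word frequency: map each word to the number of times it occurs."""
    return Counter(tokens)


def ngrams(tokens, n):
    """N-grams: count every run of n consecutive tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


def concordance(tokens, keyword, width=3):
    """Concordance: show each occurrence of keyword with `width` words of context."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines


def dictionary_tag(tokens, vocabulary):
    """Dictionary tagging: return (position, word) for tokens found in a fixed word set."""
    return [(i, tok) for i, tok in enumerate(tokens) if tok in vocabulary]


# Made-up sample text, used only to exercise the functions.
sample = ("The whale surfaced. The white whale dived, "
          "and the crew watched the white whale vanish.")
tokens = tokenize(sample)

print(word_frequencies(tokens).most_common(3))
print(ngrams(tokens, 2).most_common(2))
for line in concordance(tokens, "whale", width=2):
    print(line)
print(dictionary_tag(tokens, {"crew", "vanish"}))
```

Real projects typically swap the regular expression for a proper tokenizer and add stop-word handling, but the counting logic stays much the same.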
High-level Goals for Text Analysis
(From Underwood, T. (2012). Where to start with text mining.)
- Document categorization
  - Information retrieval (e.g., search engines)
  - Supervised classification (e.g., guessing genres)
  - Unsupervised clustering (e.g., alternative “genres”)
- Corpora comparison (e.g., political speeches)
- Language use over time (e.g., Google ngram viewer)
- Detecting clusters of document features (i.e., topic modeling)
- Entity recognition/extraction (e.g., geoparsing)
- Visualization
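As one concrete illustration of the goals above, corpora comparison can be approximated by contrasting relative word frequencies. The sketch below is only one simple way to do it, under stated assumptions: the two "speeches" are invented one-line examples, the function names (relative_frequencies, distinctive_words) are hypothetical, and a plain difference in relative frequency stands in for the more robust statistics used in practice.

```python
import re
from collections import Counter


def relative_frequencies(text):
    """Map each word to its share of all tokens in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    return {word: count / total for word, count in Counter(tokens).items()}


def distinctive_words(text_a, text_b, top=3):
    """Return the words most over-represented in text_a relative to text_b."""
    freq_a = relative_frequencies(text_a)
    freq_b = relative_frequencies(text_b)
    diffs = {
        word: freq_a.get(word, 0.0) - freq_b.get(word, 0.0)
        for word in set(freq_a) | set(freq_b)
    }
    # Largest difference first; ties are broken alphabetically for stable output.
    return sorted(diffs, key=lambda word: (-diffs[word], word))[:top]


# Invented miniature "corpora", used only for illustration.
speech_a = "We will cut taxes and cut spending"
speech_b = "We will fund schools and fund hospitals"

print(distinctive_words(speech_a, speech_b))
print(distinctive_words(speech_b, speech_a))
```

Words shared equally by both texts, such as "we" and "will", score zero and drop out, which is the point of comparing rather than simply counting.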