
Introduction to Text Analysis: Analysis Methods and Tools

Types of Text Analysis

Basic Text Summaries and Analyses

  • Word frequency (lists of words and their frequencies)
    (See also: Ted Underwood, “Word counts are amazing.”)
  • Collocation (words commonly appearing near each other)
  • Concordance (the contexts of a given word or set of words)
  • N-grams (common two-word, three-word, and longer phrases)
  • Entity recognition (identifying names, places, time periods, etc.)
  • Dictionary tagging (locating a specific set of words in the texts)
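
To make these summaries concrete, here is a minimal sketch in Python using only the standard library. The function names (tokenize, word_frequencies, ngrams, concordance, collocates, dictionary_tag), the sample sentence, and the very simple tokenization rule are illustrative assumptions, not the method of any tool listed in this guide. Entity recognition is left out because it generally needs a trained model or a dedicated library.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())

def word_frequencies(tokens):
    """Word frequency: count how often each token occurs."""
    return Counter(tokens)

def ngrams(tokens, n):
    """N-grams: count every run of n consecutive tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

def concordance(tokens, keyword, width=3):
    """Concordance: show each occurrence of keyword with `width` tokens of context."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

def collocates(tokens, keyword, window=2):
    """Collocation: count the tokens found within `window` positions of keyword."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts

def dictionary_tag(tokens, dictionary):
    """Dictionary tagging: return (position, token) for tokens in a chosen word set."""
    return [(i, tok) for i, tok in enumerate(tokens) if tok in dictionary]

# Hypothetical sample text, for illustration only.
sample = "The cat sat on the mat. The cat saw the dog, and the dog saw the cat."
tokens = tokenize(sample)

print(word_frequencies(tokens).most_common(3))
print(ngrams(tokens, 2).most_common(2))
print(concordance(tokens, "dog", width=2))
print(collocates(tokens, "cat", window=1))
print(dictionary_tag(tokens, {"cat", "dog"}))
```

Raw collocate counts like these are dominated by very common words such as "the"; dedicated tools usually apply a statistical association measure instead of plain counts.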

High-level Goals for Text Analysis

(From Underwood, T. (2012). Where to start with text mining.)

  • Document categorization
    • Information retrieval (e.g., search engines)
    • Supervised classification (e.g., guessing genres)
    • Unsupervised clustering (e.g., alternative “genres”)
  • Corpora comparison (e.g., political speeches)
  • Language use over time (e.g., Google Ngram Viewer)
  • Detecting clusters of document features (i.e., topic modeling)
  • Entity recognition/extraction (e.g., geoparsing)
  • Visualization
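
As one illustration of the information retrieval goal above, the following sketch ranks a few documents against a query using TF-IDF weights and cosine similarity. It is a toy example under stated assumptions: the three documents, the query, and the function names (tokenize, tfidf_vectors, cosine, search) are hypothetical, and the weighting shown (term count times log(N / document frequency)) is only one common variant. It is not taken from Underwood's post and is far simpler than what real search engines do.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())

def tfidf_vectors(docs):
    """Build one {term: weight} dict per document, with weight = count * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors, df, n_docs

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def search(query, docs):
    """Return (score, document index) pairs, best match first."""
    vectors, df, n_docs = tfidf_vectors(docs)
    q_tf = Counter(tokenize(query))
    # Query terms that never occur in the collection are ignored.
    q_vec = {t: c * math.log(n_docs / df[t]) for t, c in q_tf.items() if t in df}
    scores = [(cosine(q_vec, v), i) for i, v in enumerate(vectors)]
    return sorted(scores, reverse=True)

# Hypothetical mini-collection, for illustration only.
docs = [
    "The senator gave a speech about taxes and the budget.",
    "The detective followed the suspect through the foggy streets.",
    "A budget speech on taxes was delivered in parliament.",
]
ranking = search("taxes budget", docs)
for score, index in ranking:
    print(f"{score:.3f}  {docs[index]}")
```

The two documents that mention taxes and the budget score above zero, while the detective sentence scores zero because it shares no terms with the query.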

Tools with Their Analysis Methods

Web Tools

  • Voyant Tools – word frequencies, concordance, word clouds, visualizations
  • TAPorWare – various data cleaning, annotating, and summarizing tools in a web interface
  • Netlytic – word frequencies, concordance, dictionary tagging, network analysis
  • Wmatrix – frequency profiles, concordances, frequency list comparison, n-grams and c-grams, collocations
  • Natural Language Processor & Analyzer – word frequencies, collocations, concordance, tokenizer, etc.
  • ManyEyes – interactive text visualizations (network diagram, word tree, phrase net, tag cloud, word cloud)
  • Overview – automatic topic tagging and visualization
  • Monk Workbench – corpus selection from library holdings, frequencies and corpora comparisons, supervised classification
  • LIWC – the web version outputs a few linguistic dimensions; the full version can be licensed for ~$100

Downloadable Applications
(no programming required)

Other Lists of Tools