LibGuides: Introduction to Text Analysis: Analysis Methods and Tools

Types of Text Analysis

Basic Text Summaries and Analyses

Word frequency (lists of words and their frequencies)
(See also: Word counts are amazing, Ted Underwood)
Collocation (words commonly appearing near each other)
Concordance (the contexts of a given word or set of words)
N-grams (common phrases of two, three, or more words)
Entity recognition (identifying names, places, time periods, etc.)
Dictionary tagging (locating a specific set of words in the texts)
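
Several of the basic analyses listed above can be sketched with Python's standard library alone. The following is a minimal illustration and not the method used by any tool in this guide: the function names, the simple tokenizing rule, and the sample sentence are all assumptions made for the example. Entity recognition is left out because it usually depends on a trained model or a gazetteer.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())

def word_frequencies(tokens):
    """Word frequency: count how often each token occurs."""
    return Counter(tokens)

def ngrams(tokens, n):
    """N-grams: count each run of n consecutive tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

def concordance(tokens, keyword, width=3):
    """Concordance: show each occurrence of keyword with `width` tokens of context."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

def collocates(tokens, keyword, window=2):
    """Collocation: count tokens appearing within `window` positions of keyword."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts

def dictionary_tag(tokens, dictionary):
    """Dictionary tagging: return (position, token) for tokens found in a word list."""
    return [(i, tok) for i, tok in enumerate(tokens) if tok in dictionary]

# Hypothetical sample text, used only to exercise the functions.
sample = "The whale surfaced. The white whale dived, and the crew watched the white whale."
tokens = tokenize(sample)

print(word_frequencies(tokens).most_common(3))
print(ngrams(tokens, 2).most_common(2))
print(concordance(tokens, "whale", width=2))
print(collocates(tokens, "whale").most_common(2))
print(dictionary_tag(tokens, {"crew", "whale"}))
```

Raw co-occurrence counts like these favor very common words such as "the"; dedicated tools typically rank collocates with an association measure instead.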

High-level Goals for Text Analysis

(From Underwood, T. (2012). Where to start with text mining.)

Document categorization

Information retrieval (e.g., search engines)
Supervised classification (e.g., guessing genres)
Unsupervised clustering (e.g., alternative “genres”)

Corpora comparison (e.g., political speeches)
Language use over time (e.g., Google Ngram Viewer)
Detecting clusters of document features (i.e., topic modeling)
Entity recognition/extraction (e.g., geoparsing)
Visualization
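
To make one of these goals concrete, the sketch below illustrates information retrieval: it ranks a few documents against a query using TF-IDF weights and cosine similarity. It is a minimal example under stated assumptions, not how any particular search engine works. The function names, the smoothed IDF formula (log of the inverse document frequency plus one), and the three sample documents are all choices made for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per document, plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for toks in tokenized for term in set(toks))
    # Assumed weighting: log(N / df) + 1, so terms found in every document keep some weight.
    idf = {term: math.log(n_docs / count) + 1.0 for term, count in df.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(toks).items()} for toks in tokenized]
    return vectors, idf

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def search(query, docs):
    """Rank documents by similarity to the query; returns (score, index) pairs, best first."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms that never occur in the collection get weight 0.
    q = {t: c * idf.get(t, 0.0) for t, c in Counter(tokenize(query)).items()}
    return sorted(((cosine(q, v), i) for i, v in enumerate(vectors)), reverse=True)

# Hypothetical mini-collection, used only to exercise the functions.
docs = [
    "The senator spoke about taxes and the budget.",
    "The whale hunt lasted three days at sea.",
    "Budget cuts and new taxes dominated the debate.",
]
ranking = search("whale at sea", docs)
print(ranking)
```

The same document vectors could also feed the clustering or classification goals listed above, since both start from a numeric representation of each text.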

Advanced Text Analysis

Text Annotation Tools

Natural Language Processing

GATE
NLTK (Natural Language Toolkit for Python)
Stanford NLP Group Software
National Centre for Text Mining (includes some tools for medical texts)
Reporters' Lab Reviews: Entity Extraction
Michael Collins' notes on NLP
Natural (natural language facilities for Node.js)

Sentiment Analysis

Most powerful open source sentiment analysis tools
Bing Liu's Resources on Opinion Mining (including a sentiment lexicon)
NaCTeM Sentiment Analysis Test Site (web form)
Pattern web mining module (Python)
SentiWordNet
Umigon (for tweets, etc.)
List of sentiment analysis tools for Twitter
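
Many sentiment tools build on a sentiment lexicon, that is, a list of words with positive or negative scores. The sketch below shows the basic idea of lexicon-based scoring with a one-word negation rule. It is an illustration only: the tiny lexicon, the negator list, and the function name are hypothetical and are not taken from any resource listed above, whose lexicons are far larger and may use different formats.

```python
import re

# Hypothetical toy lexicon for illustration; real sentiment lexicons hold thousands of entries.
TOY_LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "terrible": -2, "hate": -2}

# Assumed negation rule: a negator directly before a lexicon word flips its sign.
NEGATORS = {"not", "no", "never"}

def sentiment_score(text, lexicon=TOY_LEXICON):
    """Sum the lexicon scores of the words in text, flipping the sign after a negator."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, tok in enumerate(tokens):
        if tok in lexicon:
            value = lexicon[tok]
            if i > 0 and tokens[i - 1] in NEGATORS:
                value = -value
            score += value
    return score

for sentence in ["I love this great book", "This is not good", "It arrived on Tuesday"]:
    print(sentence, "->", sentiment_score(sentence))
```

A score above zero suggests positive wording, below zero negative wording, and zero means no lexicon words were found. Approaches like this miss sarcasm, context, and domain-specific vocabulary, which is one reason to compare several of the tools above.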

Programming Resources

Tools with Their Analysis Methods

Web Tools

Voyant Tools – word frequencies, concordance, word clouds, visualizations
TAPorWare – various data cleaning, annotating, and summarizing tools in a web interface
Netlytic – word frequencies, concordance, dictionary tagging, network analysis
Wmatrix – frequency profiles, concordances, frequency list comparison, n-grams and c-grams, collocations
Natural Language Processor & Analyzer – word frequencies, collocations, concordance, tokenizer, etc.
ManyEyes – interactive text visualizations (network diagram, word tree, phrase net, tag cloud, word cloud)
Overview – automatic topic tagging and visualization
Monk Workbench – corpus selection from library holdings, frequencies and corpora comparisons, supervised classification
LIWC – web version outputs a few linguistic dimensions; full version can be licensed for ~$100

Downloadable Applications (no programming required)

AntWord – word frequencies
AntConc – frequency lists, concordances, collocations, keywords, n-grams
TextSTAT – word frequencies, concordances
Concordance – word frequencies, concordances, indexes
Cowo – semantic network
WordHoard – word frequencies, concordances, collocations, scripting (includes tagged literary corpora)
CasualConc – KWIC concordance lines, word clusters, collocation analysis, and word count
NVivo (Duke info) – can cluster sources based on text; also produces phrase nets and tag clouds
Tableau (LibGuide) – word clouds

Other Lists of Tools

TAPoR 2

TAPoRware recipes (tutorials)

DiRT – digital research tools
