
Introduction to Text Analysis: Cleaning/Parsing

Cleaning Text for Analysis

Before you can start a text analysis project, you often need to do a lot of cleaning and parsing of the text. Most text is created and stored so that humans can understand it, and that form is not always easy for a computer to process.

Computers work well when a data source has structure or, at least, some regular patterns that they can identify. Most cleaning and parsing for text analysis involves either increasing the regularity (for example, fixing typos) or adding structure (tagging certain words as important, or splitting documents into sections that have special meaning, such as title, authors, and chapters).
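As a rough illustration of both ideas, the sketch below tidies irregular spacing (more regularity) and splits a document into labelled parts (more structure). The sample text, the "Title:" / "Author:" / "CHAPTER n" layout, and the function names are all assumptions made up for this example; real documents will need patterns written for their own layout.

```python
import re

# Hypothetical raw document. The "Title:", "Author:" and "CHAPTER n"
# markers are an assumed layout, not a standard format.
RAW = """Title:  Sample   Report
Author: A. Writer

CHAPTER 1
This is the   first chapter.

CHAPTER 2
This is the second	chapter.
"""

def normalize_whitespace(text):
    """Collapse runs of spaces and tabs into one space (adds regularity)."""
    return re.sub(r"[ \t]+", " ", text).strip()

def parse_document(raw):
    """Split a raw string into title, author and chapters (adds structure)."""
    title = re.search(r"^Title:\s*(.+)$", raw, flags=re.MULTILINE)
    author = re.search(r"^Author:\s*(.+)$", raw, flags=re.MULTILINE)
    # Splitting on a pattern with a capture group keeps the chapter
    # numbers in the result: [preamble, "1", body1, "2", body2, ...]
    parts = re.split(r"^CHAPTER (\d+)\s*$", raw, flags=re.MULTILINE)
    chapters = {
        int(number): normalize_whitespace(body)
        for number, body in zip(parts[1::2], parts[2::2])
    }
    return {
        "title": normalize_whitespace(title.group(1)) if title else None,
        "author": normalize_whitespace(author.group(1)) if author else None,
        "chapters": chapters,
    }

doc = parse_document(RAW)
print(doc["title"])
print(doc["chapters"][2])
```

Once the text is held in a structure like this, each part can be analyzed on its own, for example counting words per chapter rather than across the whole file.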

The major ways of analyzing texts are listed under Analysis Methods. You may need to know a bit about your analysis methods and the tools you'll be using before you know what type of cleaning you need to do. For example, some techniques and tools count individual words very literally, so they may count a lower-case and an upper-case version of the same word (such as "The" and "the") separately. Here are some other cleaning and parsing techniques you might need to look into:

  • Removing stop words (deleting very common words like "a", "the", "and", etc.)
  • Stemming or lemmatization (ways of combining words that have the same linguistic root or stem)
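The effect of case and stop words on word counts can be seen in a small sketch like the one below. It uses only the Python standard library; the stop word list is a tiny illustrative one and the function names are assumptions for this example. Stemming and lemmatization are not shown here, since they usually call for a dedicated tool.

```python
import re
from collections import Counter

# A tiny illustrative stop word list; real projects usually use a
# much longer list supplied by their analysis tool.
STOP_WORDS = {"a", "an", "the", "and", "of", "to", "in", "is"}

def tokenize(text):
    """Pull out runs of letters (and apostrophes) as words."""
    return re.findall(r"[A-Za-z']+", text)

def count_words(text, lowercase=True, remove_stop_words=True):
    """Count words, optionally folding case and dropping stop words."""
    tokens = tokenize(text)
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if remove_stop_words:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return Counter(tokens)

sample = "The cat and the dog. The Dog chased a cat."

# Without cleaning, "The"/"the" and "dog"/"Dog" are separate entries.
raw_counts = count_words(sample, lowercase=False, remove_stop_words=False)
print(raw_counts["The"], raw_counts["the"], raw_counts["Dog"], raw_counts["dog"])

# With cleaning, only the content words remain, merged across case.
clean_counts = count_words(sample)
print(clean_counts)
```

Whether to apply either step depends on the analysis: stop words matter for some methods (for example, studies of writing style), so removing them is a choice rather than a default.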

Tip: Tools like Wordle may remove stop words, but they will likely count a word and its plural separately, or preserve differences in case as mentioned above. Try converting everything to lower case and running a quick stemming tool before loading your text into a word cloud generator.
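One way to prepare text along the lines of this tip is sketched below. The `crude_stem` function is a deliberately simple, made-up stand-in that only strips common plural endings; it is not a real stemming algorithm and will get some words wrong (for example, it turns "this" into "thi"), so an established stemmer is the better choice for real work.

```python
import re

def crude_stem(word):
    """Very rough plural stripping; a stand-in for a real stemmer."""
    if len(word) > 4 and word.endswith("ies"):
        return word[:-3] + "y"          # libraries -> library
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                # books -> book, lends -> lend
    return word

def prepare_for_word_cloud(text):
    """Lower-case the text, keep only words, and merge simple plurals."""
    words = re.findall(r"[a-z']+", text.lower())
    return " ".join(crude_stem(w) for w in words)

cleaned = prepare_for_word_cloud("Libraries lend Books; a library lends a book.")
print(cleaned)  # library lend book a library lend a book
```

The cleaned string can then be pasted or loaded into the word cloud tool, which will now see "library" and "book" as single words rather than several variants.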

File Conversion

Correcting/Standardizing Text