LibGuides: DH Boot Camp for Librarians: Data and Regular Expressions

Key Terms

ASCII text: American Standard Code for Information Interchange text format. An aging plaintext format. Standard ASCII defines 128 characters; 8-bit "extended ASCII" encodings add another 128 (diacritics, certain mathematical symbols). It is an entrenched standard, but it encodes the basic Latin alphabet and not much else. New files should typically use a Unicode encoding such as UTF-8.

character: An individual letter, glyph, or punctuation mark. For example, a, b, c, 1, 8, @, ?, +, etc.

character set (regex): A special element of regular expressions that matches any of a set of given characters. You can use sets such as [A-Z] (all upper-case letters), [a-z] (all lower-case letters), or [ABC] (matches either A, B, or C). See below for a reference sheet listing common character sets and metacharacters.
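
The three sets named above can be tried out in a few lines. This is a minimal sketch in Python (chosen only for illustration); the sample text is invented.

```python
import re

# Invented sample text for illustration.
text = "Dewey B2 and Cutter A1"

# [A-Z] matches any single upper-case letter from A to Z.
upper = re.findall(r"[A-Z]", text)    # ['D', 'B', 'C', 'A']

# [a-z] matches any single lower-case letter from a to z.
lower = re.findall(r"[a-z]", text)    # 12 lower-case letters in total

# [ABC] matches exactly one character, which must be A, B, or C.
abc = re.findall(r"[ABC]", text)      # ['B', 'C', 'A']

print(upper, len(lower), abc)
```

Note that each set matches one character at a time, so findall returns single letters rather than whole words.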

conceptual layer of data: In our model of computational data, the conceptual layer is the "top": it's what data looks like in an application that displays the information in a way that makes sense to users.

logical layer of data: The logical layer of data renders physical inscriptions as bits and bytes. It's one way of looking at the information that's physically inscribed on a support, and it's the way data is processed at a low level by software.
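
To make "bits and bytes" concrete, the following sketch (in Python, assuming UTF-8 as the encoding) shows how a two-letter string looks at the logical layer.

```python
# A short string as a user sees it (the conceptual layer).
text = "DH"

# The same string as bytes, assuming it is stored as UTF-8.
raw = text.encode("utf-8")

# Each byte as a number, then as eight bits.
byte_values = list(raw)                    # [68, 72]
bits = [format(b, "08b") for b in raw]     # ['01000100', '01001000']

print(byte_values, bits)
```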

metacharacter: A character that has a special meaning in the context of a regular expression. For example, \b is a metacharacter indicating a word boundary. See below for a fuller list of metacharacters.
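
A quick way to see what \b does is to compare a pattern with and without it. The sketch below uses Python and an invented sentence.

```python
import re

# Invented sample sentence for illustration.
text = "The cat sat in a category of cats."

# Without \b, "cat" also matches inside "category" and "cats".
anywhere = re.findall(r"cat", text)

# With \b on both sides, only the standalone word "cat" matches.
whole_word = re.findall(r"\bcat\b", text)

print(len(anywhere), len(whole_word))   # 3 1
```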

physical layer of data: Magnetic inscriptions on a disk, optical media "pits and lands," punched cards, etc.: the basic physical support on which data are recorded.

plaintext: Text without any special formatting or other non-textual features (bold, italics, links, images, etc.). Traditionally, "plaintext" refers to either ASCII or Unicode (UTF-8, UTF-16) text. It's preferable for most programming and data applications because:

It's computationally tractable: software and programming languages process it without special libraries or add-ons
It's widely legible: nobody needs special word processing software or other tools to read it
It's sustainable: UTF-8 is almost certainly not going away, and plaintext can be converted into other formats easily if needed
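
As a rough illustration of the first point, the sketch below writes and reads a small UTF-8 file using only Python's standard library. The file name and sample text are invented for the example.

```python
import tempfile
from pathlib import Path

# Write and read a plaintext file in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "transcription.txt"   # hypothetical file name
    path.write_text("Line one\nLine two\n", encoding="utf-8")
    content = path.read_text(encoding="utf-8")

# Built-in string methods are enough for simple processing.
lines = content.splitlines()        # ['Line one', 'Line two']
word_count = len(content.split())   # 4

print(lines, word_count)
```

Stating the encoding explicitly, as above, avoids depending on whatever default the operating system happens to use.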

regular expression: Essentially, regular expressions are a way of using a sequence of characters to match strings. They're sometimes abbreviated "regex" or "regexp." A familiar relative is the search wildcard, e.g. "librar*" to match "library" or "libraries" or "librarian" or "librarians" or "librarianship." (In regular expression syntax itself, * means "zero or more of the preceding element," so the equivalent pattern would be librar\w*.) Regular expressions are more powerful than simple wildcards, though; they can match types of characters and positions, such as upper case, lower case, digits, whitespace, word boundaries (the edges of words, typically next to spaces or punctuation), tabs, newlines, and others. They also allow you to specify the number of times something must appear to constitute a match: three upper case letters, twelve digits, at least one space, etc.
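
The examples in this definition can be written out as working patterns. The sketch below uses Python's re module; the sample strings are invented, and \w* is used as the regex counterpart of the search wildcard.

```python
import re

# Invented sample strings for illustration.
words = "library libraries librarian librarianship liberty"
codes = "ABC 123456789012 AB 12345"

# "librar" followed by zero or more word characters.
librar = re.findall(r"\blibrar\w*", words)
# ['library', 'libraries', 'librarian', 'librarianship']

# Exactly three upper-case letters standing alone as a word.
three_upper = re.findall(r"\b[A-Z]{3}\b", codes)     # ['ABC']

# Exactly twelve digits standing alone as a word.
twelve_digits = re.findall(r"\b\d{12}\b", codes)     # ['123456789012']

print(librar, three_upper, twelve_digits)
```

The {3} and {12} quantifiers set the number of repetitions; the surrounding \b markers keep the patterns from matching inside longer runs of letters or digits.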

string: Any sequence of characters, possibly including whitespace.

structured data: The key qualities of structured data are consistency and predictability. If plain text data adhere to a known structure, it's easier to work with them in automated, generalized ways.
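
For instance, when every line of a plaintext file follows the same comma-separated layout, a few lines of code can process all of it. This is a sketch only: the column names and records below are invented sample data.

```python
import csv
import io

# Invented sample data: the same three fields on every line.
sample = (
    "title,author,year\n"
    "Beloved,Toni Morrison,1987\n"
    "Kindred,Octavia Butler,1979\n"
)

# Because the structure is predictable, each row becomes a dictionary
# keyed by the column names in the first line.
rows = list(csv.DictReader(io.StringIO(sample)))

titles = [row["title"] for row in rows]       # ['Beloved', 'Kindred']
years = [int(row["year"]) for row in rows]    # [1987, 1979]

print(titles, years)
```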

Unicode: Text encoded with the Unicode standard. UTF-8 (Unicode Transformation Format - 8 bit) is the most widely used encoding. Unicode can encode over 1 million code points (mostly characters) and can represent non-Latin writing systems. It's a superset of ASCII, so it's backwards-compatible in important ways. (You will typically want to use UTF-8 for creating new plaintext transcriptions/OCR.)
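
The backwards compatibility with ASCII, and its limits, can be checked directly. The sketch below uses Python with two invented sample words.

```python
# Invented sample words; \u00e8 is the single code point for "e with grave".
plain = "library"
accented = "biblioth\u00e8que"

# ASCII-only text produces identical bytes in ASCII and UTF-8.
same_bytes = plain.encode("ascii") == plain.encode("utf-8")   # True

# UTF-8 stores the accented letter in two bytes:
# 12 characters become 13 bytes.
char_count = len(accented)                    # 12
byte_count = len(accented.encode("utf-8"))    # 13

# ASCII has no way to represent the accented letter at all.
try:
    accented.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False

print(same_bytes, char_count, byte_count, fits_in_ascii)
```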

unstructured data: Data that lack a predictable or repeating structure, e.g. free-form text files.

whitespace: Empty space in a text, either horizontal or vertical. Unicode allows several types of horizontal space; tabs, carriage returns, and line feeds (the line breaks created by pressing Enter) also create whitespace.
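
In regular expressions, the metacharacter \s matches any of these kinds of whitespace. The sketch below uses Python, where \s also covers Unicode spaces such as the non-breaking space; the sample text is invented.

```python
import re

# Invented sample: a tab, a newline, a double space,
# and a non-breaking space (\u00a0).
text = "Name\tTitle\nAda  Lovelace\u00a0Countess"

# \s+ matches one or more whitespace characters of any kind.
tokens = re.split(r"\s+", text)
# ['Name', 'Title', 'Ada', 'Lovelace', 'Countess']

# Collapse every run of whitespace into a single ordinary space.
normalized = re.sub(r"\s+", " ", text)
# 'Name Title Ada Lovelace Countess'

print(tokens)
print(normalized)
```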

Online Resources & Citations

Regex 101
Regex 101 is a useful in-browser tool for developing and testing regular expressions.
Common Regular Expression Metacharacters
Adapted from the Library Carpentry "Intro to Data" course, this reference sheet supplies an overview of common regex metacharacters.
DH Boot Camp Slides: Data & Regular Expressions
The slide deck from our boot camp on data and regular expressions.
Regexper
A site that visualizes the logic of your regular expression. Its charts are useful for confirming that your code matches what you think it's going to match.
