ASCII text: American Standard Code for Information Interchange text format. An aging plaintext format. Standard ASCII is a 7-bit code that represents 128 characters. Various 8-bit "extended ASCII" encodings add up to 128 more (diacritics, certain mathematical symbols), but these extensions differ from one another and are not part of the ASCII standard itself. ASCII is an entrenched standard, but it encodes the basic Latin alphabet and not much else. New files should typically use a Unicode encoding such as UTF-8.
character: An individual letter, digit, punctuation mark, or other symbol. For example, a, b, c, 1, 8, @, ?, +, etc.
character set (regex): A special element of regular expressions that matches any of a set of given characters. You can use sets such as [A-Z] (all upper-case letters), [a-z] (all lower-case letters), or [ABC] (matches either A, B, or C). See below for a reference sheet listing common character sets and metacharacters.
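The three sets named above can be tried out with a short sketch. It assumes Python's built-in re module; the sample sentence is made up for illustration.

```python
import re

# Hypothetical sample text, chosen only for illustration.
sample = "Call number QA76 shelved in Room B."

# [A-Z] matches any single upper-case letter.
upper = re.findall(r"[A-Z]", sample)

# [a-z]+ matches runs of one or more lower-case letters.
lower_runs = re.findall(r"[a-z]+", sample)

# [ABC] matches exactly one character: A, B, or C.
abc = re.findall(r"[ABC]", sample)

print(upper)       # ['C', 'Q', 'A', 'R', 'B']
print(lower_runs)  # ['all', 'number', 'shelved', 'in', 'oom']
print(abc)         # ['C', 'A', 'B']
```

Note that a set matches one character at a time; the + after [a-z] is what turns single matches into whole runs.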
conceptual layer of data: In our model of computational data, the conceptual layer is the "top": it's what data looks like in an application that displays the information in a way that makes sense to users.
logical layer of data: The logical layer of data renders physical inscriptions as bits and bytes. It's one way of looking at the information that's physically inscribed on a support, and it's the way data is processed at a low level by software.
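One way to glimpse the logical layer is to ask a programming language for the bytes and bits behind a short piece of text. The following is a minimal sketch in Python; the two-letter string is an arbitrary example.

```python
# The conceptual layer: what a user sees.
text = "Hi"

# The logical layer: the same information as bytes.
raw = text.encode("ascii")

# Each byte as a number, then as eight bits.
byte_values = list(raw)
bits = [format(b, "08b") for b in raw]

print(byte_values)  # [72, 105]
print(bits)         # ['01001000', '01101001']
```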
metacharacter: A character that has a special meaning in the context of a regular expression. For example, \b is a metacharacter indicating a word boundary. See below for a fuller list of metacharacters.
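A small sketch can show the difference \b makes. It assumes Python's re module, and the sample sentence is invented for the purpose. It also shows that a backslash can work the other way, turning the metacharacter . into a literal full stop.

```python
import re

# Hypothetical sample text, chosen only for illustration.
sample = "The cat sat on the catalog."

# Without word boundaries, "cat" is also found inside "catalog".
loose = re.findall(r"cat", sample)

# With \b on both sides, only the standalone word matches.
strict = re.findall(r"\bcat\b", sample)

# An escaped dot matches a literal full stop rather than "any character".
literal_dots = re.findall(r"\.", sample)

print(loose)         # ['cat', 'cat']
print(strict)        # ['cat']
print(literal_dots)  # ['.']
```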
physical layer of data: Magnetic inscriptions on a disk, optical media "pits and lands," punched cards, etc.: the basic physical support on which data are recorded.
plaintext: Text without any special formatting or other non-textual features (bold, italics, links, images, etc.). Traditionally, "plaintext" refers to either ASCII or Unicode (UTF-8, UTF-16) text. It's preferable for most programming and data applications because:
regular expression: Essentially, regular expressions are a way of using a sequence of characters to describe a pattern that matches strings. They're sometimes abbreviated "regex" or "regexp." A familiar relative is the search wildcard, e.g. "librar*" to match "library" or "libraries" or "librarian" or "librarians" or "librarianship." (In regular expression syntax itself, * means "zero or more of the preceding element," so the same search would be written librar\w*.) Regular expressions are more powerful than simple wildcards, though; they can match types of characters, such as upper case, lower case, digits, whitespace, tabs, newlines, and others, as well as positions such as word boundaries (the points where a word begins or ends). They also allow you to specify the number of times something must appear to constitute a match: three upper case letters, twelve digits, at least one space, etc.
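The counting examples in this entry (three upper case letters, twelve digits, at least one space) can be sketched as follows. This assumes Python's re module; the sample string is made up, and the wildcard search "librar*" is written in regular expression syntax as librar\w*.

```python
import re

# Hypothetical sample text, chosen only for illustration.
sample = "ISBN ABC 978030640615   renewed by librarians at the library"

# Exactly three upper-case letters standing alone as a word.
three_upper = re.findall(r"\b[A-Z]{3}\b", sample)

# Exactly twelve digits standing alone as a word.
twelve_digits = re.findall(r"\b\d{12}\b", sample)

# "librar" followed by zero or more word characters.
librar_words = re.findall(r"\blibrar\w*", sample)

# At least one space, collapsed to a single space.
collapsed = re.sub(r" +", " ", sample)

print(three_upper)    # ['ABC']
print(twelve_digits)  # ['978030640615']
print(librar_words)   # ['librarians', 'library']
print(collapsed)
```

"ISBN" is not matched by the three-letter pattern because the word boundaries require the run of capitals to be exactly three letters long.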
string: Any sequence of characters, possibly including whitespace.
structured data: The key qualities of structured data are consistency and predictability. If plaintext data adhere to a known structure, it's easier to work with them in automated, generalized ways.
Unicode: Text encoded with the Unicode standard. UTF-8 (Unicode Transformation Format, 8-bit) is the most widely used encoding. Unicode has room for over 1 million code points (mostly characters) and can represent non-Latin writing systems. Its first 128 code points are identical to ASCII, and UTF-8 stores them as the same single bytes, so it's backwards-compatible in important ways. (You will typically want to use UTF-8 for creating new plaintext transcriptions/OCR.)
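The relationship between ASCII and UTF-8 can be checked with a short sketch. It uses only Python's built-in string methods; the two sample words are arbitrary examples.

```python
# Plain Latin text produces identical bytes in ASCII and UTF-8.
ascii_text = "library"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# A word with a diacritic (e with grave accent, code point U+00E8).
accented = "biblioth\u00e8que"
utf8 = accented.encode("utf-8")

# ASCII has no way to represent the accented character.
try:
    accented.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(same_bytes)     # True
print(len(accented))  # 12 characters
print(len(utf8))      # 13 bytes: the accented letter takes two
print(ascii_failed)   # True
```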
unstructured data: Data that lack a predictable or repeating structure, e.g. free-form text files.
whitespace: Empty space in a text, either horizontal or vertical. Unicode allows several types of horizontal space; tabs, carriage returns, and line feeds (the line breaks created with the Enter key) also count as whitespace.
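Different kinds of whitespace can be handled together, as in this minimal sketch. It assumes Python 3, where both str.split() and the \s metacharacter treat Unicode spaces such as the no-break space (U+00A0) as whitespace; the sample string is invented for illustration.

```python
import re

# Hypothetical sample: a tab, two spaces, a carriage return plus
# line feed, and a no-break space (U+00A0).
sample = "one\ttwo  three\r\nfour\u00a0five"

# Split on any run of whitespace, horizontal or vertical.
parts = sample.split()

# Replace every run of whitespace with a single ordinary space.
normalized = re.sub(r"\s+", " ", sample)

# Split on vertical whitespace (line breaks) only.
lines = sample.splitlines()

print(parts)       # ['one', 'two', 'three', 'four', 'five']
print(normalized)  # one two three four five
print(len(lines))  # 2
```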