Regular Expressions

Conrad Huang

April 14, 2017

Portions Copyright © 2005-06 Python Software Foundation.

Introduction

How to count the blank lines in a file?
- Most people consider a line with just spaces and tabs to be blank
- But examining characters one by one is tedious
- More complex patterns (like telephone numbers or email addresses) are hard to describe in code
Use regular expressions (REs) instead
- Represent patterns as strings
- Just like the * in the shell's *.txt
Warning: the notation is ugly
- Have to use what's on the keyboard, instead of inventing new symbols the way mathematicians do

A Simple Example

The simplest kind of RE matches a fixed string of characters
- Similar to the in operator

import re

dragons = [
['CTAGGTGTACTGATG',    'Antipodean Opaleye'],
['AAGATGCGTCCGTAT',    'Common Welsh Green'],
['AGTCGTGCTCGTTATATC', 'Hebridean Black'],
['ATGCGTCGTCGATTATCT', 'Hungarian Horntail'],
['CCGTTAGGGCTAAATGCT', 'Norwegian Ridgeback']
]

for (dna, name) in dragons:
    if re.search('ATGCGT', dna):
        print(name)

Common Welsh Green
Hungarian Horntail

This or That

Modify the regular expression a little

import re

dragons = [
['CTAGGTGTACTGATG',    'Antipodean Opaleye'],
['AAGATGCGTCCGTAT',    'Common Welsh Green'],
['AGTCGTGCTCGTTATATC', 'Hebridean Black'],
['ATGCGTCGTCGATTATCT', 'Hungarian Horntail'],
['CCGTTAGGGCTAAATGCT', 'Norwegian Ridgeback']
]

for (dna, name) in dragons:
    if re.search('ATGCGT|GCT', dna):
        print(name)

Common Welsh Green
Hebridean Black
Hungarian Horntail
Norwegian Ridgeback

The vertical bar | means “or”
- So this RE matches any string containing either "ATGCGT" or "GCT"

Precedence

What about matching either "ATA" or "ATC" (both of which code for isoleucine)?
- ATA|C will not work: it matches either "ATA" or "C"
- ATA|ATC will work, but it's a bit redundant

Solution: use parentheses, just as in math

import re

tests = [
['ATA',   True],
['xATCx', True],
['ATG',   False],
['AT',    False],
['ATAC',  True]
]

for (dna, expected) in tests:
    actual = re.search('AT(A|C)', dna) is not None
    assert actual == expected

Note that there's no output: the assert statement will crash the program if any of the tests fail

Escaping Special Characters

[Double Compilation of Regular Expressions]

How to match an actual "|", "(", or ")"?
Solution is to use \|, \(, or \) in the RE
- And of course \\ to match a backslash
But in order to put a backslash in a Python string, you have to escape it
- So the written form of the RE is "\\|", "\\(", "\\)", or "\\\\"
What you type in is being compiled twice:
- Once by Python to create a string
- Once by the regular expression library to create the RE
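
The two compilation steps can be checked directly. This is a minimal sketch using only the standard re module; the names bar and backslash are made up for illustration.

```python
import re

bar = "\\|"          # Python stores two characters: \ and |
backslash = "\\\\"   # Python stores two characters: \ and \

# First compilation: Python has already collapsed each doubled backslash.
assert len(bar) == 2
assert len(backslash) == 2

# Second compilation: the RE library reads \| as a literal vertical bar.
assert re.search(bar, "a|b")
assert not re.search(bar, "ab")

# Parentheses are escaped the same way.
assert re.search("\\(", "f(x)")
assert re.search("\\)", "f(x)")

# Four backslashes in the source match one backslash in the data.
assert re.search(backslash, "C:\\temp")
assert not re.search(backslash, "C:/temp")
```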

Raw Strings

To help keep things readable, Python supports raw strings
- Written as r'abc' or r"this\nand\nthat"
- Inside a raw string, a backslash is just a backslash
- So r'\n' is a string containing the two characters "\" and "n", not a newline
Raw strings are not automatically converted into REs
- But that is their most common use
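
A short sketch of the difference, assuming nothing beyond the standard re module (the name paren is made up for illustration):

```python
import re

# Inside a raw string, a backslash is just a backslash.
assert len(r'\n') == 2   # the characters \ and n
assert len('\n') == 1    # a single newline character

# The raw and escaped spellings build the same string.
assert r'\|' == '\\|'
assert r'\\' == '\\\\'

# A raw string is still an ordinary string; it only becomes an RE
# when it is handed to the re module.
paren = r'\('
assert re.search(paren, 'f(x)')
assert not re.search(paren, 'fx')
```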

Sequences

In the shell, "*" matches zero or more characters
In an RE, * is an operator that means, “match zero or more occurrences of a pattern”
- Comes after the pattern, not before

Sequences (cont.)

Example: match any strand of DNA in which "TTA" and "CTA" are separated by any number of "G"

tests = [
['TTACTA',    True],  # separated by zero G's
['TTAGCTA',   True],  # separated by one G
['TTAGGGCTA', True],  # separated by three G's
['TTAXCTA',   False], # an X in the way
['TTAGCGCTA', False], # an embedded C in the way
]

for (dna, expected) in tests:
    actual = re.search('TTAG*CTA', dna) is not None
    assert actual == expected

Note that the RE matches "TTACTA" because G* can match zero occurrences of "G"

Sequences (cont.)

+ matches one or more (i.e., won't match the empty string)

assert re.search('TTAG*CTA', 'TTACTA')
assert not re.search('TTAG+CTA', 'TTACTA')

Making Something Optional

The ? operator means “optional”
- i.e., zero or one occurrences, but no more

assert re.search('AC?T', 'AT')
assert re.search('AC?T', 'ACT')
assert not re.search('AC?T', 'ACCT')

Character Sets

Use [] to match sets of characters
- The expression [abcd] matches exactly one "a", "b", "c", or "d"
- Can be abbreviated as [a-d]
Often combined with *, +, or ?
- [aeiou]+ matches any non-empty sequence of vowels
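
A quick check of these claims, written in the same assert style as the earlier examples (the name vowel_run is made up for illustration):

```python
import re

# [abcd] and [a-d] describe the same set of characters.
for text in ['a', 'b', 'c', 'd']:
    assert re.search('[abcd]', text)
    assert re.search('[a-d]', text)
assert not re.search('[a-d]', 'xyz')

# Combined with +, a set matches a non-empty run of its members.
vowel_run = '[aeiou]+'
assert re.search(vowel_run, 'queue')
assert not re.search(vowel_run, 'rhythm')
```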

Character Sets (cont.)

Example: find lines containing numbers

import re

lines = [
    "Charles Darwin (1809-82)",
    "Darwin's principal works, The Origin of Species (1859)",
    "and The Descent of Man (1871) marked a new epoch in our",
    "understanding of our world and ourselves.  His ideas",
    "were shaped by the Beagle's voyage around the world in",
    "1831-36."
]

for line in lines:
    if re.search('[0-9]+', line):
        print(line)

Charles Darwin (1809-82)
Darwin's principal works, The Origin of Species (1859)
and The Descent of Man (1871) marked a new epoch in our
1831-36.

Try writing this without using regular expressions…

Abbreviations

Some character sets occur so often that they have abbreviations

Sequence	Equivalent	Explanation
`\d`	`[0-9]`	Digits
`\s`	`[ \t\n\r\f\v]`	Whitespace
`\w`	`[a-zA-Z0-9_]`	Word characters (i.e., those allowed in variable names)
Regular Expression Escapes in Python
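
A few spot checks of the abbreviations, as a sketch rather than a complete specification. Note that in Python 3, \d and \w applied to ordinary strings also accept non-ASCII digits and letters unless the re.ASCII flag is given, so the equivalents in the table describe the ASCII case.

```python
import re

tests = [
    [r'\d', '7',  True],
    [r'\d', 'x',  False],
    [r'\s', '\t', True],
    [r'\s', '_',  False],
    [r'\w', '_',  True],   # underscore counts as a word character
    [r'\w', '-',  False],
]

for (pattern, text, expected) in tests:
    actual = re.search(pattern, text) is not None
    assert actual == expected
```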

Special Cases

[^abc] means “anything except the characters in this set”
. means “any character except the end of line”
- Equivalent to [^\n]
\b matches the break between word and non-word characters
- Doesn't consume any actual characters
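
Each of these special cases can be checked with a couple of asserts. A minimal sketch (the name whole_word is made up for illustration):

```python
import re

# [^abc] matches one character that is not a, b, or c.
assert re.search('[^abc]', 'abcd')      # the d qualifies
assert not re.search('[^abc]', 'abc')   # nothing outside the set

# . matches any single character except a newline.
assert re.search('a.c', 'abc')
assert re.search('a.c', 'a c')
assert not re.search('a.c', 'a\nc')

# \b matches a position, not a character, so it can fence off a word.
whole_word = r'\bcat\b'
assert re.search(whole_word, 'the cat sat')
assert not re.search(whole_word, 'concatenate')
```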

Special Cases (cont.)

Example: find words that end in a vowel

Use split method to break on whitespace before applying RE

import re

words = '''Born in New York City in 1918, Richard Feynman earned a
bachelor's degree at MIT in 1939, and a doctorate from Princeton in
1942. After working on the Manhattan Project in Los Alamos during
World War II, he became a professor at CalTech in 1951.  Feynman won
the 1965 Nobel Prize in Physics for his work on quantum
electrodynamics, and served on the commission investigating the
Challenger disaster in 1986.'''.split()

end_in_vowel = set()
for w in words:
    if re.search(r'[aeiou]\b', w):
        end_in_vowel.add(w)
for w in end_in_vowel:
    print(w)

a
Prize
degree
became
doctorate
the
he

Anchoring

How to find blank lines?
- re.search(r'\s*', line) matches every line, including "start end", because \s* can match zero characters
Use anchors
- ^ matches the beginning of the string
- $ matches the end
- Neither consumes any characters
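
With both anchors in place, the question from the Introduction has a short answer. The sketch below assumes the lines are already in a list; count_blank_lines is a hypothetical helper, not part of the re module.

```python
import re

def count_blank_lines(lines):
    """Count the lines that contain nothing but whitespace."""
    count = 0
    for line in lines:
        # ^ and $ force \s* to account for the whole line.
        if re.search(r'^\s*$', line):
            count += 1
    return count

lines = ['first', '', '  \t ', 'start end', 'last']
assert count_blank_lines(lines) == 2
```

Without the anchors, the same pattern would count all five lines.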

Anchoring (cont.)

Examples:

Pattern	Text	Result
`b+`	`"abbc"`	Matches
`^b+`	`"abbc"`	Fails (string doesn't start with `b`)
`c$`	`"abbc"`	Matches (string ends with `c`)
`^a*$`	`"aabaa"`	Fails (something other than `"a"` between start and end of string)
Regular Expression Anchors in Python
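
The rows of the table can be turned into tests directly. The last row is an extra case added here for contrast; it is not in the table.

```python
import re

tests = [
    ['b+',   'abbc',  True],
    ['^b+',  'abbc',  False],  # string doesn't start with b
    ['c$',   'abbc',  True],   # string ends with c
    ['^a*$', 'aabaa', False],  # the b gets in the way
    ['^a*$', 'aaaa',  True],   # extra case: nothing but a's
]

for (pattern, text, expected) in tests:
    actual = re.search(pattern, text) is not None
    assert actual == expected
```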

Extracting Matches

Problem: want to find comments in a data file
- A comment starts with a "#", and extends to the end of the line

First try: If the RE matches, split on the "#"

import re

lines = '''Date: 2006-03-07
On duty: HP # 01:30 - 03:00
Observed: Common Welsh Green
On duty: RW #03:00-04:30
Observed: none
On duty: HG # 04:30-06:00
Observed: Hebridean Black
'''.split('\n')

for line in lines:
    if re.search('#', line):
        comment = line.split('#')[1]
        print(comment)

 01:30 - 03:00
03:00-04:30
 04:30-06:00

Output is inconsistent: some comments keep their leading space
Could fix this with split followed by strip, but that seems clumsy

Match Objects

Result of re.search is actually a match object that records what matched, and where

mo.group() returns the whole string that matched the RE
mo.start() and mo.end() are the indices of the match's location

import re

text = 'abbcb'
for pattern in ['b+', 'bc*', 'b+c+']:
    mo = re.search(pattern, text)
    print('%s / %s => "%s" (%d, %d)' % (pattern, text, mo.group(), mo.start(), mo.end()))

b+ / abbcb => "bb" (1, 3)
bc* / abbcb => "b" (1, 2)
b+c+ / abbcb => "bbc" (1, 4)

Match Groups

Every parenthesized subexpression in the RE is a group
- Group 0 is the entire match
- Text that matched the Nth set of parentheses (counting from left) is group N
- mo.group(3) is the text that matched the third subexpression, mo.start(3) is where it started
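
A small sketch of group numbering, using a date in the format that appears in the sample data file:

```python
import re

mo = re.search(r'(\d+)-(\d+)-(\d+)', 'Date: 2006-03-07')

assert mo.group(0) == '2006-03-07'   # group 0 is the entire match
assert mo.group(1) == '2006'         # first parenthesized subexpression
assert mo.group(2) == '03'
assert mo.group(3) == '07'

# start and end take a group number as well.
assert mo.start(3) == 14
assert mo.end(3) == 16
```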

Match Groups (cont.)

Extracting comments is now easy:

import re

lines = '''Date: 2006-03-07
On duty: HP # 01:30 - 03:00
Observed: Common Welsh Green
On duty: RW #03:00-04:30
Observed: none
On duty: HG # 04:30-06:00
Observed: Hebridean Black
'''.split('\n')

for line in lines:
    match = re.search(r'#\s*(.+)', line)
    if match:
        comment = match.group(1)
        print(comment)

01:30 - 03:00
03:00-04:30
04:30-06:00

Reversing Columns

REs are the power tools of text processing
- Can do things in one line that would otherwise take many lines of code

Example: reverse two-column data

import re

def reverse_columns(line):
    match = re.search(r'^\s*(\d+)\s+(\d+)\s*$', line)
    if not match:
        return line
    return match.group(2) + ' ' + match.group(1)

tests = [
    ['10 20',    'easy case'],
    [' 30  40 ', 'padding'],
    ['60 70 80', 'too many columns'],
    ['90 end',   'non-numeric']
]

for (fixture, title) in tests:
    actual = reverse_columns(fixture)
    print('%s: "%s" => "%s"' % (title, fixture, actual))

easy case: "10 20" => "20 10"
padding: " 30  40 " => "40 30"
too many columns: "60 70 80" => "60 70 80"
non-numeric: "90 end" => "90 end"

Compiling

[Regular Expressions as Finite State Machines]

The RE library compiles patterns into a more concise form for matching
- Each regular expression becomes a finite state machine
- Library follows the arcs in the FSM as it reads characters
- Drawing FSMs is a good way to debug REs

Compiling (cont.)

You can improve a program's performance by compiling the RE once, and re-using the compiled form
- Use re.compile(pattern) to get the compiled RE
- Its methods have the same names and behavior as the functions in the re module
- E.g., matcher.search(text) searches text for matches to the RE that was compiled to create matcher
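
A minimal sketch that reuses the isoleucine pattern from the Precedence example, compiled once and applied several times:

```python
import re

# Compile once...
matcher = re.compile('AT(A|C)')

# ...then use the compiled object's methods instead of the module's functions.
assert matcher.search('xATCx')
assert not matcher.search('ATG')

# The result is the same kind of match object that re.search returns.
assert matcher.search('ATAC').group() == 'ATA'
assert re.search('AT(A|C)', 'ATAC').group() == 'ATA'
```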

Finding Title Case Words

Example: find and print all Title Case words in a document

import re

# Put pattern outside 'find_all' so that it's only compiled once.
pattern = re.compile(r'\b([A-Z][a-z]*)\b(.*)')

def find_all(line):
    result = []
    match = pattern.search(line)
    while match:
        result.append(match.group(1))
        match = pattern.search(match.group(2))
    return result

lines = [
    'This has several Title Case words',
    'on Each Line (Some in parentheses).'
]
for line in lines:
    print(line)
    for word in find_all(line):
        print('\t' + word)

This has several Title Case words
	This
	Title
	Case
on Each Line (Some in parentheses).
	Each
	Line
	Some

Finding All Matches

Notice how the function gets all matches:
- Pattern captures what we want in group 1, and everything else on the line in group 2
- Each time there's a match, continue the search in the remainder captured in group 2

Using `findall`

Much easier to use the findall method

import re

lines = [
    'This has several Title Case words',
    'on Each Line (Some in parentheses).'
]
pattern = re.compile(r'\b([A-Z][a-z]*)\b')
for line in lines:
    print(line)
    for word in pattern.findall(line):
        print('\t' + word)

This has several Title Case words
	This
	Title
	Case
on Each Line (Some in parentheses).
	Each
	Line
	Some

Reference Material

Pattern	Matches	Doesn't Match	Explanation
`a*`	`""`, `"a"`, `"aa"`, …	`"A"`, `"b"`	`*` means “zero or more”; matching is case sensitive
`b+`	`"b"`, `"bb"`, …	`""`	`+` means “one or more”
`ab?c`	`"ac"`, `"abc"`	`"a"`, `"abbc"`	`?` means “optional” (zero or one)
`[abc]`	`"a"`, `"b"`, or `"c"`	`"ab"`, `"d"`	`[…]` means “one character from a set”
`[a-c]`	`"a"`, `"b"`, or `"c"`		Character ranges can be abbreviated
`[abc]*`	`""`, `"ac"`, `"baabcab"`, …		Operators can be combined: zero or more choices from `"a"`, `"b"`, or `"c"`
Regular Expression Operators
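
The table describes whole-string matches, so a check needs anchors: an unanchored re.search('a*', 'b') succeeds by matching zero characters. The sketch below wraps each pattern in ^(...)$; matches_whole is a hypothetical helper.

```python
import re

def matches_whole(pattern, text):
    """True if the pattern accounts for the entire string."""
    return re.search('^(' + pattern + ')$', text) is not None

# [pattern, strings that match, strings that don't]
tests = [
    ['a*',     ['', 'a', 'aa'],       ['A', 'b']],
    ['b+',     ['b', 'bb'],           ['']],
    ['ab?c',   ['ac', 'abc'],         ['a', 'abbc']],
    ['[abc]',  ['a', 'b', 'c'],       ['ab', 'd']],
    ['[a-c]',  ['a', 'b', 'c'],       []],
    ['[abc]*', ['', 'ac', 'baabcab'], []],
]

for (pattern, good, bad) in tests:
    for text in good:
        assert matches_whole(pattern, text)
    for text in bad:
        assert not matches_whole(pattern, text)
```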

Reference Material (cont.)

Method	Purpose	Example	Result
`split`	Split a string on a pattern.	`re.split('\\s*,\\s*', 'a, b ,c , d')`	`['a', 'b', 'c', 'd']`
`findall`	Find all matches for a pattern.	`re.findall('\\b[A-Z][a-z]*', 'Some words in Title Case.')`	`['Some', 'Title', 'Case']`
`sub`	Replace matches with new text.	`re.sub('\\d+', 'NUM', 'If 123 is 456')`	`"If NUM is NUM"`
Regular Expression Object Methods

But Wait, There's More

We've only scratched the surface
- Regular expressions have proved to be too useful to remain clean and elegant
For example, use pat{N} to match exactly N occurrences of a pattern
- More generally, pat{M,N} matches between M and N occurrences
Most important thing is to build up complex REs one step at a time
- Write something that matches part of what you're looking for
- Test it
- Add to it
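
A short sketch of the counted forms, built on the DNA pattern from the Sequences example (the name gap is made up for illustration):

```python
import re

# pat{N}: exactly N occurrences.
assert re.search('^A{3}$', 'AAA')
assert not re.search('^A{3}$', 'AA')
assert not re.search('^A{3}$', 'AAAA')

# pat{M,N}: between M and N occurrences.
gap = 'TTAG{1,3}CTA'
tests = [
    ['TTACTA',     False],  # zero G's: too few
    ['TTAGCTA',    True],   # one G
    ['TTAGGGCTA',  True],   # three G's
    ['TTAGGGGCTA', False],  # four G's: too many
]

for (dna, expected) in tests:
    actual = re.search(gap, dna) is not None
    assert actual == expected
```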

Summary

Regular expressions are available in almost every language
- As a library: C/C++, Java, …
- Built into the language: Perl, Ruby, …
- Syntax varies slightly, but the ideas are the same
For a broader tutorial, see [Wilson 2005]
- And if you're going to be doing serious work, check out [Good 2005] or [Friedl 2002]
Because regular expressions are very powerful, there is a tendency to try to use them for too many things. But remember a famous saying:
Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems. -- Jamie Zawinski, Netscape engineer

Exercises

By default, regular expression matches are greedy: the first term in the RE matches as much as it can, then the second part, and so on. As a result, if you apply the RE X(.*)X(.*) to the string "XaX and XbX", the first group will contain "aX and Xb", and the second group will be empty.

It's also possible to make REs match reluctantly, i.e., to have the parts match as little as possible, rather than as much. Find out how to do this, and then modify the RE in the previous paragraph so that the first group winds up containing "a", and the second group " and XbX".
What is the easiest way to write a case-insensitive regular expression? (Hint: read the documentation on compilation options.)

Exercises

What does the VERBOSE option do when compiling a regular expression? Use it to rewrite some of the REs in this lecture in a more readable way.
What does the DOTALL option do when compiling a regular expression? Use it to get rid of the call to string.split in the example that finds words ending in vowels.
