
REGEX

Module re

This Python module provides support for regular expression (RE) matching operations similar to those found in Perl.

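Regular expressions are search patterns used to locate a matching part of a string. A pattern may contain ordinary and special characters, and both the pattern and the string searched may be Unicode strings (str) or 8-bit strings (bytes), including ones that contain null bytes.

For example, any alphabetical sequence of characters can be used as a pattern, in which case it simply matches itself. However, such a simplistic regular expression is rarely more useful than an ordinary substring search.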

REGEX FUNCTIONS

The following are the most important regex functions in the re module:

  • match() : Match a regular expression pattern at the beginning of a string.
  • search() : Search a string anywhere for the presence of a pattern.
  • findall() : Find all occurrences of a pattern in a string and return them as a list.
  • finditer() : Find all occurrences of a pattern and return an iterator yielding a Match object for each one.
  • sub() : Find all occurrences of a pattern in a string and replace them with a substitute string.
  • split() : Split a string by the occurrences of a pattern.
  • compile() : Compile a pattern into a Pattern object whose methods mirror the functions above.

Compile Function

It is possible to create a pattern object against which to call regex functions.

  import re
  p = re.compile(r'\d+')
  string = "213 Lincon street, Floor 7, Flat 8"
  p.match(string)  ==> <re.Match object; span=(0, 3), match='213'>
  p.search(string) ==> <re.Match object; span=(0, 3), match='213'>
  p.findall(string) ==> ['213', '7', '8']
  [x.group() for x in p.finditer(string)] ==> ['213', '7', '8']
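
As a further sketch, finditer() is convenient when the position of each match matters as well as its text. The example below reuses the address string from above; the variable names (number, address, found) are illustrative only.

```python
import re

# Compile once, then reuse the Pattern object.
number = re.compile(r'\d+')
address = "213 Lincon street, Floor 7, Flat 8"

# finditer() yields one Match object per occurrence, from left to right.
found = [(m.group(), m.start(), m.end()) for m in number.finditer(address)]

for text, start, end in found:
    # Slicing the original string with the span gives back the matched text.
    print(text, start, end, address[start:end] == text)
```

Compiling is mostly worthwhile when the same pattern is applied many times, since the module-level functions accept the same pattern string directly.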

A regex pattern may span several lines if you write it as a triple-quoted string and compile it with the flag re.VERBOSE (or re.X). In verbose mode, whitespace inside the pattern is ignored and comments may be added after a hash sign (#):

  charref = re.compile(r"""
   &[#]                # Start of a numeric entity reference
   (
       0[0-7]+         # Octal form
     | [0-9]+          # Decimal form
     | x[0-9a-fA-F]+   # Hexadecimal form
   )
   ;                   # Trailing semicolon
   """, re.VERBOSE)

Match Function

Syntax : match(pattern, string, flags=0)

The re module has a number of functions, the simplest of which is match(). It tries to apply the pattern at the start of the string only, returning a Match object, or None if no match was found. The Match object has several methods, for example:

  • span() : returns the 2-tuple (match.start(), match.end())
  • group() : returns the whole match when called without arguments (or with 0). Given a group index or name, it returns the text of that subgroup; given several, it returns a tuple of the corresponding subgroups.
  • start(group=0) : returns the index of the start of the substring matched by group.
  • end(group=0) : returns the index of the end of the substring matched by group.

  string = "It rains cats and dogs"
  match = re.match('It', string)
  match ==> <re.Match object; span=(0, 2), match='It'>
  match.group() ==> 'It'
  match.end() ==> 2
  match.start() ==> 0
  match.span() ==> (0, 2)
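
Because match() returns None when the pattern is not found at the start of the string, calling group() on the result without checking raises AttributeError. A minimal sketch of the usual guard follows; the helper name describe is hypothetical.

```python
import re

string = "It rains cats and dogs"

at_start = re.match(r'It', string)        # the pattern occurs at index 0
not_at_start = re.match(r'cats', string)  # 'cats' occurs later, so match() fails

def describe(m):
    """Return the matched text and span, or a note when there is no match."""
    if m is None:
        return "no match"
    return f"{m.group()!r} at {m.span()}"

print(describe(at_start))      # 'It' at (0, 2)
print(describe(not_at_start))  # no match
```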

The power of match() comes from special regex characters: the pattern is anchored at the beginning of the string, but it can go on to describe and capture text in the rest of the string.

For example, to capture cats and dogs in the previous string, enclose the corresponding parts of the pattern in parentheses to form separate groups. The whole match is group 0, and each other group is given a number increasing from left to right.

  string = "It rains cats and dogs"
  match = re.match(r'.*\s(.*)\sand\s(.*)', string)
  match.group() ==> 'It rains cats and dogs'
  match.group(1) ==> 'cats'
  match.group(2) ==> 'dogs'

Search Function

The search() function searches for a match anywhere in the string, not only at the beginning. However, it returns only the first match even if there are several. To find all the matches, use the findall() function.

  import re
  string = "There are wild cats, domestics cats and pussy cats"
  match = re.search('cats', string)
  match.group() ==> 'cats'
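
To make the difference concrete, the sketch below contrasts search() with match() on the same string and uses finditer() to count every occurrence. The variable names are illustrative.

```python
import re

string = "There are wild cats, domestics cats and pussy cats"

first = re.search(r'cats', string)        # first occurrence, wherever it is
at_start = re.match(r'cats', string)      # None: the string starts with "There"
all_starts = [m.start() for m in re.finditer(r'cats', string)]

# search() reports the same position as str.find() for a literal pattern.
print(first.start() == string.find('cats'))
print(at_start)
print(len(all_starts))
```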

Findall Function

Syntax : findall(pattern, string, flags=0)

findall() returns a list of all non-overlapping matches in the string.

For example, to match every single decimal digit in an address:

  string = "213 Lincon street, Floor 7, Flat 8"
  re.findall(r'\d', string) ==> ['2', '1', '3', '7', '8']
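
The shape of the list returned by findall() depends on the groups in the pattern: with no group it holds the whole matches, and with several groups it holds one tuple per match. A small sketch on the same address (variable names are illustrative):

```python
import re

string = "213 Lincon street, Floor 7, Flat 8"

# No group: each item is the whole match. \d+ keeps multi-digit numbers together.
numbers = re.findall(r'\d+', string)
print(numbers)  # ['213', '7', '8']

# Two groups: each item is a tuple (group 1, group 2).
pairs = re.findall(r'(\w+) (\d+)', string)
print(pairs)    # [('Floor', '7'), ('Flat', '8')]
```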

Find and Replace

There is a dedicated function, sub(), that finds every sequence of characters matching a pattern and replaces it with the replacement given as an argument. A backreference such as \1 in the replacement string inserts the text captured by the corresponding group of the pattern.

Syntax : sub(pattern, repl, string, count=0, flags=0)

  • pattern: Search pattern, i.e. the pattern whose matches are to be replaced
  • repl: Replacement string
  • string: Original string
  • count: Maximum number of replacements to make (optional; the default 0 replaces all occurrences)

  import re
  string = "213 Mobile, Charger 7, Case 8"
  re.sub(r'(\d+)', r'\1$', string) ==> '213$ Mobile, Charger 7$, Case 8$'
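
As a sketch of the other options, the example below limits the number of replacements with count, passes a function as the replacement, and uses subn() to learn how many substitutions were made. The helper name double is hypothetical.

```python
import re

string = "213 Mobile, Charger 7, Case 8"

# count=1 replaces only the first occurrence.
first_only = re.sub(r'(\d+)', r'\1$', string, count=1)
print(first_only)  # 213$ Mobile, Charger 7, Case 8

def double(m):
    """Replacement function: receives a Match object, returns the new text."""
    return str(int(m.group()) * 2)

doubled = re.sub(r'\d+', double, string)
print(doubled)     # 426 Mobile, Charger 14, Case 16

# subn() returns the new string together with the number of replacements.
masked, n = re.subn(r'\d+', '#', string)
print(masked, n)   # # Mobile, Charger #, Case # 3
```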

SPLIT With Regex

Regular expressions allow slicing of a string in ingenious ways. With the split() function you may slice a string at whatever sequence of characters the pattern describes. You can also limit how many splits are made.

Syntax : split(pattern, string, maxsplit=0, flags=0)

split() divides the source string by the occurrences of the pattern, returning a list containing the resulting substrings.

If capturing parentheses are used in the pattern, then the text of all groups in the pattern is also returned as part of the resulting list.

If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.

  import re
  re.split(r'\s', "This is an example") ==> ['This', 'is', 'an', 'example']
  re.split('are', "words are words are words") ==> ['words ', ' words ', ' words']
  re.split(r'\W+', 'Words, words, words.') ==> ['Words', 'words', 'words', '']
  re.split(r'(\W+)', 'Words, words, words.') ==> ['Words', ', ', 'words', ', ', 'words', '.', '']
  re.split(r'\W+', 'Words, words, words.', maxsplit=1) ==> ['Words', 'words, words.']
  re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE) ==> ['0', '3', '9']

  import re
  text = """Ross McFluff: 834.345.1254 155 Elm Street
  Ronald Heathmore: 892.345.3428 436 Finley Avenue
  Frank Burger: 925.541.7625 662 South Dogwood Way
  Heather Albrecht: 548.326.4584 919 Park Place"""
  entries = re.split(r"\n+", text)
  [re.split(":? ", entry, maxsplit=3) for entry in entries]
  ==> the first entry is ['Ross', 'McFluff', '834.345.1254', '155 Elm Street']
  Use maxsplit=4 to also split the house number from the street name.
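
One common use of split() that str.split() cannot cover is splitting on several delimiters at once. A minimal sketch, with a made-up record string:

```python
import re

record = "apples, pears;plums ,  figs"

# Split on a comma or semicolon, swallowing any whitespace around it.
items = re.split(r'\s*[;,]\s*', record)
print(items)  # ['apples', 'pears', 'plums', 'figs']

# A delimiter at the very end leaves an empty string as the last item.
trailing = re.split(r',', "a,b,")
print(trailing)  # ['a', 'b', '']
```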

FLAGS

Some re module functions take flags as an optional argument. For example, you may need a case-insensitive search (IGNORECASE, or I), or you may want "^" and "$" to match at the start and end of every line rather than only of the whole string (MULTILINE, or M). Several flags can be combined with the bitwise OR operator |.

Character  Word        Action
A          ASCII       Make \w, \W, \b, \B, \d, \D match the corresponding ASCII character
                       categories rather than the whole Unicode categories (the default).
I          IGNORECASE  Perform case-insensitive matching.
L          LOCALE      Make \w, \W, \b, \B dependent on the current locale.
M          MULTILINE   "^" matches the beginning of lines (after a newline) as well as the string.
                       "$" matches the end of lines (before a newline) as well as the end of the string.
S          DOTALL      "." matches any character at all, including the newline.
X          VERBOSE     Ignore whitespace and comments for nicer looking RE's.
U          UNICODE     For compatibility only. Ignored for string patterns (it is the default),
                       and forbidden for bytes patterns.
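
The effect of the most common flags can be sketched on a short three-line string. The sample text and variable names below are illustrative.

```python
import re

text = "First line\nsecond LINE\nThird line"

plain = re.findall(r'line', text)               # ['line', 'line']
nocase = re.findall(r'line', text, re.I)        # ['line', 'LINE', 'line']

first_word = re.findall(r'^\w+', text)          # ['First']
each_first = re.findall(r'^\w+', text, re.M)    # ['First', 'second', 'Third']

no_dotall = re.search(r'First.*Third', text)    # None: "." stops at the newline
dotall = re.search(r'First.*Third', text, re.S) # spans all three lines

# Flags are combined with the bitwise OR operator.
combined = re.findall(r'^third', text, re.I | re.M)  # ['Third']

print(plain, nocase, first_word, each_first, combined)
```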

SPECIAL CHARACTERS

The special characters are:

Character           Role
"."                 Matches any character except a newline.
"^"                 Matches the start of the string.
"$"                 Matches the end of the string or just before the newline at
                    the end of the string.
"*"                 Matches 0 or more (greedy) repetitions of the preceding RE.
                    Greedy means that it will match as many repetitions as possible.
"+"                 Matches 1 or more (greedy) repetitions of the preceding RE.
"?"                 Matches 0 or 1 (greedy) of the preceding RE.
*?,+?,??            Nongreedy versions of the previous three special characters.
{m,n}               Matches from m to n repetitions of the preceding RE.
{m,n}?              Nongreedy version of the above.
"\"                 Either escapes special characters or signals a special sequence.
[]                  Indicates a set of characters.
                    A "^" as the first character indicates a complementing set.
"|"                 A|B, creates an RE that will match either A or B.
(...)               Matches the RE inside the parentheses.
                    The contents can be retrieved or matched later in the string.
(?aiLmsux)          Set the A, I, L, M, S, U, or X flag for the RE (see FLAGS above).
(?:...)             Non-capturing version of regular parentheses.
(?P<name>...)       The substring matched by the group is accessible by name.
(?P=name)           Matches the text matched earlier by the group named name.
(?#...)             A comment; ignored.
(?=...)             Matches if ... matches next, but doesn't consume the string.
(?!...)             Matches if ... doesn't match next.
(?<=...)            Matches if preceded by ... (must be fixed length).
(?<!...)            Matches if not preceded by ... (must be fixed length).
(?(id/name)yes|no)  Matches yes pattern if the group with id/name matched,
                    the (optional) no pattern otherwise.
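
A few of the constructs listed above can be tried out in isolation. The following is a small sketch on sample strings chosen for illustration, showing alternation, capturing versus non-capturing groups, an inline flag, and escaping a special character.

```python
import re

string = "It rains cats and dogs"

either = re.findall(r'cat|dog', string)            # ['cat', 'dog']

# With a capturing group, findall() returns the group, not the whole match.
captured = re.findall(r'(cat|dog)s', string)       # ['cat', 'dog']
# With a non-capturing group, findall() returns the whole match.
uncaptured = re.findall(r'(?:cat|dog)s', string)   # ['cats', 'dogs']

# (?i) at the start of the pattern switches on IGNORECASE.
inline_flag = re.findall(r'(?i)IT', string)        # ['It']

# An unescaped dot matches any character; "\." matches a literal dot only.
loose = re.findall(r'3.5', "3.5 and 3x5")          # ['3.5', '3x5']
strict = re.findall(r'3\.5', "3.5 and 3x5")        # ['3.5']

print(either, captured, uncaptured, inline_flag, loose, strict)
```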

Match any character

The dot . is a special regex character which stands for any single character except the newline character.

  import re
  string = "It rains cats and dogs"
  match = re.findall('.{3,4}s', string)
  match ==> ['rains', ' cats', ' dogs']

Repetition of Characters

The asterisk * matches the preceding character or expression zero or more times, as many times as possible. This is the greedy behaviour. To make it match as few times as possible instead, follow the asterisk with a question mark: *?

The plus sign + works the same way but is stricter, as it requires the preceding expression to occur at least once.

Another way of repeating the preceding expression is to use curly brackets with a minimum and maximum number of repetitions, {m,n}. So, the asterisk is equivalent to {0,} and + is equivalent to {1,}.

  import re
  string = "a aa aaaa aaaaaa"
  re.findall('.*', string) ==> ['a aa aaaa aaaaaa', '']
  re.findall('.+', string) ==> ['a aa aaaa aaaaaa']
  re.findall('.{0,}', string) ==> ['a aa aaaa aaaaaa', '']
  re.findall('.{1,}', string) ==> ['a aa aaaa aaaaaa']
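
The difference between greedy and non-greedy repetition is easiest to see side by side. A minimal sketch follows; the HTML-like sample string is made up for illustration.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: ".*" runs to the last ">" in the string.
greedy = re.findall(r'<.*>', html)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Non-greedy: ".*?" stops at the first ">" it can.
lazy = re.findall(r'<.*?>', html)
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']

string = "a aa aaaa aaaaaa"

# {2,4} takes up to four repetitions; {2,4}? settles for two.
bounded = re.findall(r'a{2,4}', string)
print(bounded)       # ['aa', 'aaaa', 'aaaa', 'aa']
lazy_bounded = re.findall(r'a{2,4}?', string)
print(lazy_bounded)  # ['aa', 'aa', 'aa', 'aa', 'aa', 'aa']
```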

Starts With

To match a character or pattern at the beginning of a string, use the caret ^.

In multiline text with newline characters, add the flag MULTILINE (or M) so that ^ also matches at the beginning of every line.

  text = """Ross McFluff: 834.345.1254 155 Elm Street
  Ronald Heathmore: 892.345.3428 436 Finley Avenue
  Frank Burger: 925.541.7625 662 South Dogwood Way
  Heather Albrecht: 548.326.4584 919 Park Place"""
  match = re.findall('^[R].*', text, re.M)
  match ==> ['Ross McFluff: 834.345.1254 155 Elm Street', 'Ronald Heathmore: 892.345.3428 436 Finley Avenue']

Pattern at End of String

To match a pattern at the end of a string, or just before the newline character at its end, use the dollar sign $. With the MULTILINE flag, $ also matches at the end of every line.

  text = """Ross McFluff: 834.345.1254 155 Elm Street
    Ronald Heathmore: 892.345.3428 436 Finley Avenue
    Frank Burger: 925.541.7625 662 South Dogwood Way
    Heather Albrecht: 548.326.4584 919 Park Place"""
  match = re.findall('.*Street$', text, re.M)
  match ==> ['Ross McFluff: 834.345.1254 155 Elm Street']
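
How the anchors behave with and without MULTILINE can be sketched on a short string that ends with a newline. The sample text is illustrative; \Z is the escape sequence that matches only at the very end of the string.

```python
import re

text = "one\ntwo\nthree\n"

# Without MULTILINE, ^ and $ refer to the whole string, so no single word fits.
whole = re.findall(r'^\w+$', text)
print(whole)      # []

# With MULTILINE, every line is anchored separately.
per_line = re.findall(r'^\w+$', text, re.M)
print(per_line)   # ['one', 'two', 'three']

# $ also matches just before a trailing newline; \Z does not.
dollar = re.findall(r'\w+$', text)
print(dollar)     # ['three']
strict_end = re.findall(r'\w+\Z', text)
print(strict_end) # []
```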

Set of Characters

You may enclose a set of characters inside square brackets. Any one of the characters in that set will be matched against the string. Thus, [a-z] means any lowercase alphabetical character, [A-Z] any uppercase alphabetical character, [0-9] any digit, and [a-zA-Z0-9 ] any alphanumeric character or a space.

You can use a partial set of characters and indicate a range by a hyphen. If you put ^ before the set, outside the brackets, the pattern matches only where a character of the set occurs at the start of the string (or of a line, with the MULTILINE flag). If you put ^ as the first character inside the brackets, it means the opposite: any character that is not in the set will match.

  text = """Ross McFluff: 834.345.1254 155 Elm Street
  Ronald Heathmore: 892.345.3428 436 Finley Avenue
  Frank Burger: 925.541.7625 662 South Dogwood Way
  Heather Albrecht: 548.326.4584 919 Park Place"""
  match = re.findall('^[^M-Z].*', text, re.M)
  match ==> ['Frank Burger: 925.541.7625 662 South Dogwood Way', 'Heather Albrecht: 548.326.4584 919 Park Place']
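
A few more sets, sketched on a shortened phone-book entry (the variable names are illustrative). Note that most special characters, such as the dot, lose their special meaning inside square brackets.

```python
import re

entry = "Ross McFluff: 834.345.1254"

# An uppercase letter followed by one or more lowercase letters.
capitalised = re.findall(r'[A-Z][a-z]+', entry)
print(capitalised)  # ['Ross', 'Mc', 'Fluff']

# Inside a set, "." is a literal dot.
phone = re.findall(r'[0-9.]+', entry)
print(phone)        # ['834.345.1254']

# A complemented set: runs of anything except digits, dots, colons and spaces.
names = re.findall(r'[^0-9.: ]+', entry)
print(names)        # ['Ross', 'McFluff']
```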

Grouping

A group is a subexpression of the regex search pattern which is contained within parentheses. Groups are numbered from left to right starting with 1, while group 0 is the whole text matched by the entire regex expression.

  • You can retrieve one or more of the matched groups with the match.group(n) method.
  • You can refer to a group by backreference, where \1 refers to group 1 and so on.
  • A captured group in a search pattern may be repeated by backreference in the same pattern.
      import re
      p = re.compile(r'\b(\w+)\s+\1\b')
      p.search('Paris in the the spring').group()  ==> 'the the'
  • Groups can be nested; they are numbered by counting their opening parentheses from left to right.
      import re
      string = "Ethnic groups in UK: Asians, Africans and Whites groups"
      match = re.search("(Asians), (.*) and (Whites)", string)
      match.group() ==> 'Asians, Africans and Whites'
      match.group(0) ==> 'Asians, Africans and Whites'
      match.group(1) ==> 'Asians'
      match.group(2) ==> 'Africans'
      match.group(3) ==> 'Whites'
  • Groups may be named with (?P<name>pattern), retrieved by name with match.group('name'), and backreferenced within the same pattern with (?P=name).
      import re
      string = "Today at 12:30 PM on Rakesh's Echo"
      regexp_1 = re.compile(r'(?P<day>\w+) at (?P<time>(\d+):(\d+) (\w+)) on (?P<place>\w+)')
      re_match = regexp_1.match(string)
      list(re_match.groups()) ==> ['Today', '12:30 PM', '12', '30', 'PM', 'Rakesh']
      re_match.group('day') ==> 'Today'
      re_match.group('time') ==> '12:30 PM'
      re_match.group('place') ==> 'Rakesh'

      import re
      p = re.compile(r'(?P<word>\b\w+\b)')
      m = p.search( '(((( Lots of punctuation )))' )
      m.group('word')  ==> 'Lots'
      m.group(1)       ==> 'Lots'
      m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
      m.groupdict()  ==> {'first_name': 'Malcolm', 'last_name': 'Reynolds'}
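
Two points from the list above, nested group numbering and named backreferences, can be sketched as follows. The pattern names (doubled, nested) are illustrative; \g<word> is the replacement-string syntax for referring to a named group.

```python
import re

# Named backreference: (?P=word) must repeat whatever (?P<word>...) captured.
doubled = re.compile(r'\b(?P<word>\w+)\s+(?P=word)\b')
hit = doubled.search('Paris in the the spring')
print(hit.group())        # the the
print(hit.group('word'))  # the

# In the replacement string, \g<word> inserts the named group.
fixed = doubled.sub(r'\g<word>', 'Paris in the the spring')
print(fixed)              # Paris in the spring

# Nested groups are numbered by their opening parenthesis, left to right.
nested = re.match(r'((\d+):(\d+)) (AM|PM)', '12:30 PM')
print(nested.groups())    # ('12:30', '12', '30', 'PM')
```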

Lookahead and Lookbehind

Lookahead and lookbehind are known collectively as lookarounds. They are assertions that some characters come before or after the target search pattern. They are zero-width: they check the surrounding text without consuming it, so the engine's position does not move and the asserted text is not part of the match. You may therefore add more than one lookaround at the same position. Lookarounds can be positive or negative assertions.

To use a lookahead, put the characters that must follow the pattern in parentheses after "?=", i.e. (?=...).

  import re
  p = re.compile(r'\d+(?= dollars)')
  p.search('It costs 123 dollars and 25 cents').group() ==> '123'

In the above example, the engine matches 123 and then asserts that the digits at that position are followed immediately by the word dollars, which is not included in the match.

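A lookahead may also be placed before the pattern. The engine first checks the assertion and then matches the pattern starting from the same position, so this is still a lookahead and not a lookbehind.

  import re
  p = re.compile(r'(?=\d+ cents)\d+')
  p.search('It costs 123 dollars and 25 cents').group() ==> '25'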

A true lookbehind is written "(?<=...)": the '<' indicates that the assertion applies to the text immediately before the current position.

  import re
  p = re.compile(r'(?<=USD)\d{3}')
  p.search('It costs USD123 dollars and 25 cents').group() ==> '123'

To make a negative assertion, i.e. that the pattern is not preceded or followed by the text given, use ! in place of the equals sign '=': (?!...) is a negative lookahead and (?<!...) is a negative lookbehind.

  \d{3}(?<!USD\d{3}) ==> matches three digits that are not preceded by USD

Python requires a lookbehind to match strings of a fixed length, so (?<=cat|dogs) will not work.
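
The negative assertions and the fixed-length restriction can be sketched as below. The sample sentence is made up, and the \b word boundaries are an addition of this sketch: without them the engine could backtrack and match only part of a number.

```python
import re

sentence = 'It costs $123 and 25 cents'

# Negative lookahead: whole numbers NOT followed by " cents".
not_cents = re.findall(r'\b\d+\b(?! cents)', sentence)
print(not_cents)   # ['123']

# Negative lookbehind: whole numbers NOT preceded by a dollar sign.
not_priced = re.findall(r'(?<!\$)\b\d+\b', sentence)
print(not_priced)  # ['25']

# Alternatives of different lengths are rejected inside a lookbehind.
try:
    re.compile(r'(?<=cat|dogs)x')
    variable_width_ok = True
except re.error:
    variable_width_ok = False
print(variable_width_ok)  # False
```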

Special Escape Characters

They consist of "\" and a character.

Sequence  Action
\number   Matches the contents of the group of the same number.
\A        Matches only at the start of the string.
\Z        Matches only at the end of the string.
\b        Matches the empty string, but only at the start or end of a word.
\B        Matches the empty string, but not at the start or end of a word.
\d        Matches any decimal digit; equivalent to the set [0-9]
\D        Matches any non-digit character; equivalent to [^\d]
\s        Matches any whitespace character; equivalent to [ \t\n\r\f\v]
\S        Matches any non-whitespace character; equivalent to [^\s]
\w        Matches any alphanumeric character; equivalent to [a-zA-Z0-9_]
\W        Matches the complement of \w.
\\        Matches a literal backslash.

The sets given for \d, \s and \w apply to bytes patterns and to string patterns using the ASCII flag; in other string patterns these sequences also match the corresponding Unicode characters.
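
Since these sequences all begin with a backslash, patterns are best written as raw strings (r'...'); otherwise Python itself may interpret the escape first. The sketch below shows \b, \B and \A on a made-up string, and what happens when the r prefix is forgotten.

```python
import re

text = "cat concat cats cat."

# \b: "cat" as a whole word only.
whole_words = re.findall(r'\bcat\b', text)
print(whole_words)  # ['cat', 'cat']

# \B: "cat" only where it does NOT start a word (inside "concat").
inside = re.findall(r'\Bcat', text)
print(inside)       # ['cat']

# \A: only at the very start of the string.
at_start = re.findall(r'\Acat', text)
print(at_start)     # ['cat']

# Without the r prefix, '\b' is a backspace character, so nothing matches.
non_raw = re.search('\bcat\b', text)
print(non_raw)      # None
```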
