5 minute read

Updated:

In this post, I compile a cheatsheet of the main regexes that I use in my projects.

Digital communication relies heavily on regular expressions to make it work. These are sequences of characters that specify a search pattern in the text. It is usually these types of patterns that are used by string-searching algorithms when they are attempting to “find” and/or “replace” strings or when they are attempting to validate input. Regular expression techniques are developed in theoretical computer science and formal language theory.

A regular expression (regex) is a sequence of characters that specifies a search pattern. Regexes are commonly used in text processing tasks, such as finding and replacing specific patterns of characters in a body of text.

Regexes can be used to search for patterns of characters in a string, or to match or replace strings based on specific patterns. They are often used in text editors, programming languages, and command-line utilities to perform these types of tasks.

Regexes are powerful because they allow you to define complex search patterns using a compact and concise syntax. For example, you can use a regex to search for all the email addresses in a document, or to find and replace all instances of a particular word in a piece of text.

There are many different flavors of regexes, with different syntax and capabilities. Some of the most commonly used regexes are based on the Perl programming language, and are known as “Perl-compatible regular expressions” or PCREs.

Regexes can be used for a wide range of text processing tasks, such as:

  • Searching for specific patterns of characters in a body of text
  • Validating that a string matches a specific pattern (e.g. to ensure that a password meets certain criteria)
  • Extracting specific substrings from a larger string (e.g. to extract all the email addresses from a document)
  • Finding and replacing strings based on specific patterns (e.g. to replace all instances of a particular word in a piece of text)

It is common to use regular expressions and other text processing utilities, for example sed and AWK, to search and replace in text processors, as well as in lexical analysis and in text processing. The majority of general-purpose programming languages support regex capabilities either natively or with the aid of libraries. Examples of such languages include Python, C, C++, Java, and JavaScript.

An example of a regular expression is to locate a word spelled two different ways in a text editor, the regular expression seriali[sz]e matches both “serialise” and “serialize”.

Table of Contents

Character Classes

All characters used in digital communication can be categorized the classes shown in the table below.

Character Class Same as Meaning
[[:alnum:]] [0-9A-Za-z] Letters and digits
[[:alpha:]] [A-Za-z] Letters
[[:ascii:]] [\x00-\x7F] ASCII codes 0-127
[[:blank:]] [\t ] Space or tab only
[[:cntrl:]] [\x00-\x1F\x7F] Control characters
[[:digit:]] [0-9] Decimal digits
[[:graph:]] [[:alnum:][:punct:]] Visible characters (not space)
[[:lower:]] [a-z] Lowercase letters
[[:print:]] [ -~] == [ [:graph:]] Visible characters
[[:punct:]] [!"#$%&’()*+,-./:;<=>?@[]^_`{\|}~] Visible punctuation characters
[[:space:]] [\t\n\v\f\r ] Whitespace
[[:upper:]] [A-Z] Uppercase letters
[[:word:]] [0-9A-Za-z_] Word characters
[[:xdigit:]] [0-9A-Fa-f] Hexadecimal digits
[[:<:]] [\b(?=\w)] Start of word
[[:>:]] [\b(?<=\w)] End of word

Python’s regex module

The regular expressions module can be imported using the command

1
import re

It contains the following functions to be used.

re.findall

Returns a list containing all matches:

1
2
3
4
>>> re.findall(r'\bs?pare?\b', 'par spar apparent spare part pare')
['par', 'spar', 'spare', 'pare']
>>> re.findall(r'\b0*[1-9]\d{2,}\b', '0501 035 154 12 26 98234')
['0501', '154', '98234']

re.finditer

Returns an iterable of match objects (one for each match):

1
2
3
>>> m_iter = re.finditer(r'[0-9]+', '45 349 651 593 4 204')
>>> [m[0] for m in m_iter if int(m[0]) < 350]
['45', '349', '4', '204']

re.search

Returns a Match object if there is a match anywhere in the string:

1
2
3
4
5
>>> sentence = 'This is a sample string'
>>> bool(re.search(r'this', sentence, flags=re.I))
True
>>> bool(re.search(r'xyz', sentence))
False

re.split

Returns a list where the string has been split at each match:

1
2
>>> re.split(r'\d+', 'Sample123string42with777numbers')
['Sample', 'string', 'with', 'numbers']

re.sub

Replaces one or many matches with a string:

1
2
3
4
5
>>> ip_lines = "catapults\nconcatenate\ncat"
>>> print(re.sub(r'^', r'* ', ip_lines, flags=re.M))
* catapults
* concatenate
* cat

Tip: You can also use string methods {: .notice–info} {: .text-justify}

re.compile

Compiles a regular expression pattern for later use:

1
2
3
4
5
6
7
>>> pet = re.compile(r'dog')
>>> type(pet)
<class '_sre.SRE_Pattern'>
>>> bool(pet.search('They bought a dog'))
True
>>> bool(pet.search('A cat crossed their path'))
False

re.escape

Flags

code (short) code (long) Description
re.I re.IGNORECASE Ignore case
re.M re.MULTILINE Multiline
re.L re.LOCALE Make \w, \b, \s locale dependent
re.S re.DOTALL Dot matches all (including newline)
re.U re.UNICODE Make \w, \b, \d, \s unicode dependent
re.X re.VERBOSE Readable style

Cookbook

Suppose we have two paragraphs as such

1
2
3
4
5
6
7
8
9
10
11
paragraph = """

Start: 
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor 
incididunt ut labore et dolore magna aliqua. Sodales ut eu sem integer vitae 
justo eget magna. 

Tincidunt praesent semper feugiat nibh sed pulvinar proin 
gravida. Praesent semper feugiat nibh sed. Mi proin sed libero enim sed faucibus 
turpis. Tortor pretium viverra suspendisse potenti nullam ac. end
"""

Select everything between the keywods start and end

1
2
3
4
5
6
7
8
9
>>> result = re.search(r"(?<=Start:)((.|\n)*)(?=end)", paragraph).group()
>>> print(result)
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor 
incididunt ut labore et dolore magna aliqua. Sodales ut eu sem integer vitae 
justo eget magna. 

Tincidunt praesent semper feugiat nibh sed pulvinar proin 
gravida. Praesent semper feugiat nibh sed. Mi proin sed libero enim sed faucibus 
turpis. Tortor pretium viverra suspendisse potenti nullam ac.

Select email addresses

Suppose we want to extract the emails contained in the following paragraph:

1
2
3
4
5
6
7
8
9
10
11
paragraph = """
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor 
incididunt ut labore et dolore magna aliqua. Sodales ut eu sem integer vitae 
justo eget magna. John Silva Doe <[email protected]> 
Josh Tree Done '[email protected]'

Jane Doe <[email protected]>

Malesuada fames ac turpis egestas integer eget. Cras semper auctor neque vitae 
tempus. Sed adipiscing diam donec adipiscing tristique risus nec. 
"""
1
2
3
>>> result = re.findall(r"<?(\S+@[\w.-]+\.[a-zA-Z]{2,4}\b)", paragraph)
>>> result
['[email protected]', '[email protected]', '[email protected]']

References:

  • [1] https://www.regexr.com
  • [2] https://quickref.me/regex
  • [3] https://www.regex101.com

Comments