---
title: Python Regular Expressions - Python Cheatsheet
description: A regular expression (shortened as regex) is a sequence of characters that specifies a search pattern in text and is used by string-searching algorithms.
---

<base-title :title="frontmatter.title" :description="frontmatter.description">
Regular Expressions
</base-title>

<base-disclaimer>
  <base-disclaimer-title>
    <a target="_blank" href="https://en.wikipedia.org/wiki/Regular_expression">Regular expressions</a>
  </base-disclaimer-title>
  <base-disclaimer-content>
    A regular expression (shortened as regex [...]) is a sequence of characters that specifies a search pattern in text. [...] used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation.
  </base-disclaimer-content>
</base-disclaimer>

1. Import the regex module with `import re`.
2. Create a Regex object with the `re.compile()` function. (Remember to use a raw string.)
3. Pass the string you want to search into the Regex object’s `search()` method. This returns a `Match` object, or `None` if the pattern is not found.
4. Call the Match object’s `group()` method to return a string of the actual matched text.

All the regex functions in Python are in the re module:

```python
>>> import re
```
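
The four steps listed at the top of this page can be sketched as a single helper. This is only an illustration: the `zip_regex` pattern and the `find_zip()` function are made-up names, and the `None` check is there because `search()` returns `None` when nothing matches.

```python
import re

# Step 2: compile the pattern (raw string, so backslashes reach the regex engine).
zip_regex = re.compile(r'\d{5}')


def find_zip(text):
    """Return the first run of five digits in text, or None if there is none."""
    mo = zip_regex.search(text)  # Step 3: a Match object, or None
    if mo is None:
        return None
    return mo.group()            # Step 4: the matched text


found = find_zip('Ship to 90210, please.')
missing = find_zip('No digits here.')

print(found)    # 90210
print(missing)  # None
```

Calling `group()` directly on the result of `search()` raises `AttributeError` when there is no match, so checking for `None` first is usually the safer habit.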

## Regex symbols

| Symbol                   | Matches                                                |
| ------------------------ | ------------------------------------------------------ |
| `?`                      | zero or one of the preceding group.                    |
| `*`                      | zero or more of the preceding group.                   |
| `+`                      | one or more of the preceding group.                    |
| `{n}`                    | exactly n of the preceding group.                      |
| `{n,}`                   | n or more of the preceding group.                      |
| `{,m}`                   | 0 to m of the preceding group.                         |
| `{n,m}`                  | at least n and at most m of the preceding group.       |
| `{n,m}?` or `*?` or `+?` | performs a non-greedy match of the preceding group.    |
| `^spam`                  | means the string must begin with spam.                 |
| `spam$`                  | means the string must end with spam.                   |
| `.`                      | any character, except newline characters.              |
| `\d`, `\w`, and `\s`     | a digit, word, or space character, respectively.       |
| `\D`, `\W`, and `\S`     | anything except a digit, word, or space, respectively. |
| `[abc]`                  | any character between the brackets (such as a, b, or c). |
| `[^abc]`                 | any character that isn’t between the brackets.         |
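
As a rough illustration of how a few of these symbols combine, the sketch below uses the shorthand classes `\d`, `\s`, and `\w` together with the `{n,}` and `{,m}` quantifiers. The variable names and sample strings are invented for the example.

```python
import re

# \d+ : one or more digits, \s : one whitespace character, \w+ : one or more word characters
xmas_regex = re.compile(r'\d+\s\w+')
gifts = xmas_regex.findall('12 drummers, 11 pipers, 10 lords, 9 ladies')
print(gifts)  # ['12 drummers', '11 pipers', '10 lords', '9 ladies']

# {2,} : two or more of the preceding item
two_plus_digits = re.compile(r'\d{2,}').findall('7 42 1999')
print(two_plus_digits)  # ['42', '1999']

# {,2} : zero to two of the preceding group; fullmatch() requires the whole string to match
at_most_two = re.compile(r'(Ha){,2}')
fits = at_most_two.fullmatch('HaHa') is not None
too_many = at_most_two.fullmatch('HaHaHa') is None
print(fits, too_many)  # True True
```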

## Matching regex objects

```python
>>> phone_num_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

>>> mo = phone_num_regex.search('My number is 415-555-4242.')

>>> print(f'Phone number found: {mo.group()}')
# Phone number found: 415-555-4242
```

## Grouping with parentheses

```python
>>> phone_num_regex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
>>> mo = phone_num_regex.search('My number is 415-555-4242.')

>>> mo.group(1)
# '415'

>>> mo.group(2)
# '555-4242'

>>> mo.group(0)
# '415-555-4242'

>>> mo.group()
# '415-555-4242'
```

To retrieve all the groups at once, use the `groups()` method:

```python
>>> mo.groups()
# ('415', '555-4242')

>>> area_code, main_number = mo.groups()

>>> print(area_code)
# 415

>>> print(main_number)
# 555-4242
```

## Multiple groups with Pipe

You can use the `|` character anywhere you want to match one of many expressions.

```python
>>> hero_regex = re.compile(r'Batman|Tina Fey')

>>> mo1 = hero_regex.search('Batman and Tina Fey.')
>>> mo1.group()
# 'Batman'

>>> mo2 = hero_regex.search('Tina Fey and Batman.')
>>> mo2.group()
# 'Tina Fey'
```

You can also use the pipe to match one of several patterns as part of your regex:

```python
>>> bat_regex = re.compile(r'Bat(man|mobile|copter|bat)')
>>> mo = bat_regex.search('Batmobile lost a wheel')

>>> mo.group()
# 'Batmobile'

>>> mo.group(1)
# 'mobile'
```

## Optional matching with the Question Mark

The `?` character flags the group that precedes it as an optional part of the pattern.

```python
>>> bat_regex = re.compile(r'Bat(wo)?man')

>>> mo1 = bat_regex.search('The Adventures of Batman')
>>> mo1.group()
# 'Batman'

>>> mo2 = bat_regex.search('The Adventures of Batwoman')
>>> mo2.group()
# 'Batwoman'
```
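
One detail worth knowing: when an optional group does not take part in the match, asking for it returns `None` rather than an empty string. A small sketch, reusing the same pattern (the variable names are illustrative):

```python
import re

bat_regex = re.compile(r'Bat(wo)?man')

# The optional group matched 'wo' here...
with_group = bat_regex.search('The Adventures of Batwoman').group(1)

# ...but did not participate in this match at all.
without_group = bat_regex.search('The Adventures of Batman').group(1)

print(with_group)     # wo
print(without_group)  # None
```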

## Matching zero or more with the Star

The `*` (star or asterisk) means “match zero or more”. The group that precedes the star can occur any number of times in the text.

```python
>>> bat_regex = re.compile(r'Bat(wo)*man')
>>> mo1 = bat_regex.search('The Adventures of Batman')
>>> mo1.group()
# 'Batman'

>>> mo2 = bat_regex.search('The Adventures of Batwoman')
>>> mo2.group()
# 'Batwoman'

>>> mo3 = bat_regex.search('The Adventures of Batwowowowoman')
>>> mo3.group()
# 'Batwowowowoman'
```

## Matching one or more with the Plus

The `+` (or plus) means “match one or more”. The group preceding a plus must appear at least once:

```python
>>> bat_regex = re.compile(r'Bat(wo)+man')

>>> mo1 = bat_regex.search('The Adventures of Batwoman')
>>> mo1.group()
# 'Batwoman'

>>> mo2 = bat_regex.search('The Adventures of Batwowowowoman')
>>> mo2.group()
# 'Batwowowowoman'

>>> mo3 = bat_regex.search('The Adventures of Batman')
>>> mo3 is None
# True
```

## Matching specific repetitions with Curly Brackets

If you have a group that you want to repeat a specific number of times, follow the group in your regex with a number in curly brackets:

```python
>>> ha_regex = re.compile(r'(Ha){3}')

>>> mo1 = ha_regex.search('HaHaHa')
>>> mo1.group()
# 'HaHaHa'

>>> mo2 = ha_regex.search('Ha')
>>> mo2 is None
# True
```

Instead of one number, you can specify a range with a minimum and a maximum in between the curly brackets. For example, the regex `(Ha){3,5}` will match 'HaHaHa', 'HaHaHaHa', and 'HaHaHaHaHa'.

```python
>>> ha_regex = re.compile(r'(Ha){2,3}')
>>> mo1 = ha_regex.search('HaHaHaHa')
>>> mo1.group()
# 'HaHaHa'
```

## Greedy and non-greedy matching

Python’s regular expressions are greedy by default: in ambiguous situations they will match the longest string possible. The non-greedy version of the curly brackets, which matches the shortest string possible, has the closing curly bracket followed by a question mark.

```python
>>> greedy_ha_regex = re.compile(r'(Ha){3,5}')

>>> mo1 = greedy_ha_regex.search('HaHaHaHaHa')
>>> mo1.group()
# 'HaHaHaHaHa'

>>> non_greedy_ha_regex = re.compile(r'(Ha){3,5}?')
>>> mo2 = non_greedy_ha_regex.search('HaHaHaHaHa')
>>> mo2.group()
# 'HaHaHa'
```

## The findall() method

The `findall()` method will return the strings of every match in the searched string.

```python
>>> phone_num_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # has no groups

>>> phone_num_regex.findall('Cell: 415-555-9999 Work: 212-555-0000')
# ['415-555-9999', '212-555-0000']
```
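
When the pattern does contain groups, `findall()` returns a list of tuples instead, with one string per group. The sketch below assumes a grouped variant of the phone number pattern; the group layout is chosen just for illustration.

```python
import re

# Same phone number pattern, but split into three groups.
grouped_phone_regex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')

matches = grouped_phone_regex.findall('Cell: 415-555-9999 Work: 212-555-0000')
print(matches)  # [('415', '555', '9999'), ('212', '555', '0000')]

# Each tuple can be unpacked like the result of groups().
for area_code, prefix, line_number in matches:
    print(area_code, prefix, line_number)
```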

## Making your own character classes

You can define your own character class using square brackets. For example, the character class `[aeiouAEIOU]` will match any vowel, both lowercase and uppercase.

```python
>>> vowel_regex = re.compile(r'[aeiouAEIOU]')
>>> vowel_regex.findall('Robocop eats baby food. BABY FOOD.')
# ['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']
```

You can also include ranges of letters or numbers by using a hyphen. For example, the character class `[a-zA-Z0-9]` will match all lowercase letters, uppercase letters, and numbers.
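
A quick sketch of a range-based character class in use. The sample text is made up; note that the underscore, hyphen, colon, and exclamation mark fall outside the ranges, so they split the text into separate matches.

```python
import re

# One or more letters or digits in a row.
alnum_regex = re.compile(r'[a-zA-Z0-9]+')

tokens = alnum_regex.findall('user_42 logged-in at 09:15!')
print(tokens)  # ['user', '42', 'logged', 'in', 'at', '09', '15']
```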

By placing a caret character (`^`) just after the character class’s opening bracket, you can make a negative character class that will match all the characters that are not in the character class:

```python
>>> consonant_regex = re.compile(r'[^aeiouAEIOU]')
>>> consonant_regex.findall('Robocop eats baby food. BABY FOOD.')
# ['R', 'b', 'c', 'p', ' ', 't', 's', ' ', 'b', 'b', 'y', ' ', 'f', 'd', '.',
#  ' ', 'B', 'B', 'Y', ' ', 'F', 'D', '.']
```

## The Caret and Dollar sign characters

- You can also use the caret symbol `^` at the start of a regex to indicate that a match must occur at the beginning of the searched text.

- Likewise, you can put a dollar sign `$` at the end of the regex to indicate the string must end with this regex pattern.

- And you can use the `^` and `$` together to indicate that the entire string must match the regex.

The `r'^Hello'` regular expression string matches strings that begin with 'Hello':

```python
>>> begins_with_hello = re.compile(r'^Hello')
>>> begins_with_hello.search('Hello world!')
# <re.Match object; span=(0, 5), match='Hello'>

>>> begins_with_hello.search('He said hello.') is None
# True
```

The `r'^\d+$'` regular expression string matches strings that consist entirely of one or more numeric characters from 0 to 9:

```python
>>> whole_string_is_num = re.compile(r'^\d+$')

>>> whole_string_is_num.search('1234567890')
# <re.Match object; span=(0, 10), match='1234567890'>

>>> whole_string_is_num.search('12345xyz67890') is None
# True

>>> whole_string_is_num.search('12 34567890') is None
# True
```
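
The dollar sign can also be used on its own. As a sketch, the pattern `r'\d$'` (the name `ends_with_number` is illustrative) only requires that the string end with a digit and places no constraint on what comes before it:

```python
import re

ends_with_number = re.compile(r'\d$')

# Only the final character has to be a digit.
last_digit = ends_with_number.search('Your number is 42').group()
print(last_digit)  # 2

# The trailing period means the string does not end with a digit.
no_match = ends_with_number.search('Your number is forty two.') is None
print(no_match)  # True
```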

## The Wildcard character

The `.` (or dot) character in a regular expression will match any character except for a newline:

```python
>>> at_regex = re.compile(r'.at')

>>> at_regex.findall('The cat in the hat sat on the flat mat.')
# ['cat', 'hat', 'sat', 'lat', 'mat']
```

## Matching everything with Dot-Star

```python
>>> name_regex = re.compile(r'First Name: (.*) Last Name: (.*)')

>>> mo = name_regex.search('First Name: Al Last Name: Sweigart')
>>> mo.group(1)
# 'Al'

>>> mo.group(2)
# 'Sweigart'
```

The `.*` uses greedy mode: It will always try to match as much text as possible. To match any and all text in a non-greedy fashion, use the dot, star, and question mark (`.*?`). The question mark tells Python to match in a non-greedy way:

```python
>>> non_greedy_regex = re.compile(r'<.*?>')
>>> mo = non_greedy_regex.search('<To serve man> for dinner.>')
>>> mo.group()
# '<To serve man>'

>>> greedy_regex = re.compile(r'<.*>')
>>> mo = greedy_regex.search('<To serve man> for dinner.>')
>>> mo.group()
# '<To serve man> for dinner.>'
```

## Matching newlines with the Dot character

The dot-star will match everything except a newline. By passing `re.DOTALL` as the second argument to `re.compile()`, you can make the dot character match all characters, including the newline character:

```python
>>> no_newline_regex = re.compile('.*')
>>> no_newline_regex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()
# 'Serve the public trust.'

>>> newline_regex = re.compile('.*', re.DOTALL)
>>> newline_regex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()
# 'Serve the public trust.\nProtect the innocent.\nUphold the law.'
```

## Case-Insensitive matching

To make your regex case-insensitive, you can pass `re.IGNORECASE` or `re.I` as a second argument to `re.compile()`:

```python
>>> robocop = re.compile(r'robocop', re.I)

>>> robocop.search('Robocop is part man, part machine, all cop.').group()
# 'Robocop'

>>> robocop.search('ROBOCOP protects the innocent.').group()
# 'ROBOCOP'

>>> robocop.search('Al, why does your programming book talk about robocop so much?').group()
# 'robocop'
```

## Substituting strings with the sub() method

The `sub()` method for Regex objects is passed two arguments:

1. The first argument is a string to replace any matches.
2. The second is the string to search, in which the matches are replaced.

The `sub()` method returns a string with the substitutions applied:

```python
>>> names_regex = re.compile(r'Agent \w+')

>>> names_regex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.')
# 'CENSORED gave the secret documents to CENSORED.'
```
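
The replacement string can also refer back to text captured by a group, using `\1`, `\2`, and so on. A minimal sketch, assuming we want to keep only the first letter of each agent's name (the pattern and variable names are illustrative):

```python
import re

# Group 1 captures the first letter of the name; \w* consumes the rest.
agent_names_regex = re.compile(r'Agent (\w)\w*')

# \1 in the replacement is filled in with whatever group 1 matched.
redacted = agent_names_regex.sub(
    r'\1****',
    'Agent Alice gave the secret documents to Agent Bob.',
)
print(redacted)  # A**** gave the secret documents to B****.
```

The replacement is written as a raw string so that `\1` reaches the regex engine as a backreference rather than being read as a string escape.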

## Managing complex Regexes

To make `re.compile()` ignore whitespace and comments inside the regular expression string, enable “verbose mode” by passing `re.VERBOSE` as the second argument to `re.compile()`.

Now instead of a hard-to-read regular expression like this:

```python
phone_regex = re.compile(r'((\d{3}|\(\d{3}\))?(\s|-|\.)?\d{3}(\s|-|\.)\d{4}(\s*(ext|x|ext\.)\s*\d{2,5})?)')
```

you can spread the regular expression over multiple lines with comments like this:

```python
phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?            # area code
    (\s|-|\.)?                    # separator
    \d{3}                         # first 3 digits
    (\s|-|\.)                     # separator
    \d{4}                         # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})? # extension
    )''', re.VERBOSE)
```
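
As a smaller sketch of the same idea, the two patterns below are written in compact and verbose form and should behave identically. The last pattern also shows that flags can be combined with the bitwise OR operator (`|`); all names and sample strings here are illustrative.

```python
import re

compact_regex = re.compile(r'\d{3}-\d{4}')
verbose_regex = re.compile(r'''
    \d{3}   # first 3 digits
    -       # separator
    \d{4}   # last 4 digits
    ''', re.VERBOSE)

text = 'Call 555-4242 now.'
compact_match = compact_regex.search(text).group()
verbose_match = verbose_regex.search(text).group()
print(compact_match, verbose_match)  # 555-4242 555-4242

# Verbose mode and case-insensitive matching at the same time.
robocop_regex = re.compile(r'''
    robo    # first half
    cop     # second half
    ''', re.VERBOSE | re.IGNORECASE)
shouted = robocop_regex.search('ROBOCOP protects the innocent.').group()
print(shouted)  # ROBOCOP
```

Because verbose mode ignores whitespace in the pattern, a literal space has to be written as `\s`, `[ ]`, or an escaped space.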