Regular Expressions in Python
Introduction
Regular Expressions (Regex) are patterns used to match character combinations in strings.
They are an essential tool for processing text and data across various domains. In web
development, regex is used to validate user inputs like email addresses or passwords. Data
scientists and analysts use regex to clean, transform, and extract meaningful patterns from
raw datasets. Similarly, regex plays a critical role in parsing log files, identifying errors in
large datasets, and extracting specific information from documents or web pages. Its
versatility makes it a foundational skill for software engineers, data professionals, and system
administrators alike.
Python provides the re module to work with regular expressions. This document explains
regex concepts with examples and outputs to help beginners understand and apply regex
effectively.
• Raw strings in Python (e.g., r"\d") treat backslashes literally, which keeps regex
patterns readable. Without the r prefix, each backslash must be doubled (e.g., "\\d").
• Example:
import re
pattern = r"\d"
string = "abc123"
result = re.search(pattern, string)
print(result.group()) # Output: 1
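To make the difference concrete, the short sketch below compares a raw and a non-raw version of a word-boundary pattern. The sample text and variable names are illustrative assumptions, not part of the original example.

```python
import re

text = "cat concat cat"  # hypothetical sample text

# In a normal string, "\b" is the backspace character, not a word boundary.
plain_pattern = "\bcat\b"
# In a raw string, the backslash reaches the regex engine unchanged.
raw_pattern = r"\bcat\b"

plain_matches = re.findall(plain_pattern, text)
raw_matches = re.findall(raw_pattern, text)

print(len(plain_pattern), len(raw_pattern))  # 5 7
print(plain_matches)  # []
print(raw_matches)    # ['cat', 'cat']
```

The non-raw pattern silently looks for backspace characters, which is why raw strings are the usual choice for regex patterns.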
1. re.match()
import re
result = re.match(r'\d+', '123abc')
if result:
    print(result.group())  # Output: 123
2. re.search()
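As a minimal sketch of re.search(), the example below scans a string for the first run of digits and contrasts the result with re.match(), which only tries the start of the string. The sample text is an assumption chosen for illustration.

```python
import re

text = "abc123def456"  # hypothetical sample text

# re.search scans the whole string and returns the first match, or None.
search_result = re.search(r"\d+", text)
# re.match only attempts a match at position 0, so it fails here.
match_result = re.match(r"\d+", text)

if search_result:
    print(search_result.group())  # 123
    print(search_result.span())   # (3, 6)
print(match_result)               # None
```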
3. re.findall()
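A minimal sketch of re.findall(), which returns every non-overlapping match as a list of strings. The sample sentence is an assumption made for illustration.

```python
import re

text = "Order 12 shipped, order 345 pending"  # hypothetical sample text

# Every run of digits is returned, in the order it appears.
numbers = re.findall(r"\d+", text)
print(numbers)  # ['12', '345']

# With one capturing group, findall returns the group contents
# rather than the whole match.
words_before_numbers = re.findall(r"(\w+) \d+", text)
print(words_before_numbers)  # ['Order', 'order']
```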
4. re.split()
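A minimal sketch of re.split(), which splits a string wherever the pattern matches. The sample text and the choice of separators are assumptions made for illustration.

```python
import re

text = "apple, banana;cherry  date"  # hypothetical sample text

# Split on any run of commas, semicolons, or whitespace.
parts = re.split(r"[,;\s]+", text)
print(parts)  # ['apple', 'banana', 'cherry', 'date']

# maxsplit limits how many splits are performed.
limited = re.split(r"[,;\s]+", text, maxsplit=1)
print(limited)  # ['apple', 'banana;cherry  date']
```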
5. re.sub()
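A minimal sketch of re.sub(), which replaces every match of a pattern with a replacement string. The phone-number text below is a made-up example.

```python
import re

text = "Call 555-1234 or 555-9876"  # hypothetical sample text

# Replace every match with a fixed string.
masked = re.sub(r"\d{3}-\d{4}", "XXX-XXXX", text)
print(masked)  # Call XXX-XXXX or XXX-XXXX

# count limits the number of replacements.
first_only = re.sub(r"\d{3}-\d{4}", "XXX-XXXX", text, count=1)
print(first_only)  # Call XXX-XXXX or 555-9876

# Backreferences such as \1 and \2 reuse captured groups in the replacement.
swapped = re.sub(r"(\d{3})-(\d{4})", r"\2-\1", text)
print(swapped)  # Call 1234-555 or 9876-555
```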
6. re.compile()
pattern = re.compile(r'\d+')
result = pattern.findall('123abc456')
print(result) # Output: ['123', '456']
Example without re.compile():
import re
strings = ['abc123', '456def', 'ghi789']
for s in strings:
    # Find all numbers in each string
    matches = re.findall(r'\d+', s)
    print(f"Numbers in '{s}': {matches}")
Output:
Numbers in 'abc123': ['123']
Numbers in '456def': ['456']
Numbers in 'ghi789': ['789']
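Here, re.findall receives the pattern string r'\d+' on every call. The re module caches
recently compiled patterns, so the pattern is not fully recompiled each time, but every
call still pays for a cache lookup.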
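Example with re.compile():
import re
pattern = re.compile(r'\d+')  # Compile the pattern once
strings = ['abc123', '456def', 'ghi789']
for s in strings:
    # Use the compiled pattern to find numbers
    matches = pattern.findall(s)
    print(f"Numbers in '{s}': {matches}")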
Output:
Numbers in 'abc123': ['123']
Numbers in '456def': ['456']
Numbers in 'ghi789': ['789']
Summary:
• re.compile is useful when you use the same pattern repeatedly.
• It saves time by compiling the pattern once and allows you to use the compiled object
for all regex operations.
Quantifiers
The "Matches / Fails to Match" column assumes the whole string must match the pattern (as with re.fullmatch).

Character/Pattern  Description                                                  Example    Matches / Fails to Match
*                  Matches zero or more repetitions of the preceding element.   "a*b"      Matches: "b", "ab", "aaab"; Fails: "cab", "c"
+                  Matches one or more repetitions of the preceding element.    "a+b"      Matches: "ab", "aaab"; Fails: "b", "c"
{n}                Matches exactly n repetitions.                               "a{3}"     Matches: "aaa"; Fails: "aa", "aaaa"
{n,}               Matches at least n repetitions.                              "a{2,}"    Matches: "aa", "aaa"; Fails: "a"
{n,m}              Matches between n and m repetitions.                         "a{1,3}"   Matches: "a", "aa", "aaa"; Fails: "aaaa"
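The quantifier table can be checked with a short script. This is a sketch that assumes whole-string matching via re.fullmatch; the helper name check_cases is hypothetical and only used for illustration.

```python
import re

# pattern -> (strings that should match, strings that should fail)
cases = {
    r"a*b": (["b", "ab", "aaab"], ["cab", "c"]),
    r"a+b": (["ab", "aaab"], ["b", "c"]),
    r"a{3}": (["aaa"], ["aa", "aaaa"]),
    r"a{2,}": (["aa", "aaa"], ["a"]),
    r"a{1,3}": (["a", "aa", "aaa"], ["aaaa"]),
}

def check_cases(cases):
    """Return True if every listed string behaves as the table says."""
    for pattern, (should_match, should_fail) in cases.items():
        for s in should_match:
            if re.fullmatch(pattern, s) is None:
                return False
        for s in should_fail:
            if re.fullmatch(pattern, s) is not None:
                return False
    return True

all_ok = check_cases(cases)
print(all_ok)  # True

# By contrast, re.search looks anywhere in the string and finds "ab" in "cab".
partial = re.search(r"a*b", "cab").group()
print(partial)  # ab
```

The choice of matching function matters: with re.search, several of the "Fails" entries would match a substring instead.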
Extracting Data
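One common way to extract data is to wrap the parts of interest in capturing groups and read them from the match object. The sketch below assumes a made-up log line format; the field names are illustrative only.

```python
import re

log_line = "2024-01-15 ERROR Disk usage at 91%"  # hypothetical log line

# Capture the date parts, the log level, and the remaining message.
m = re.search(r"(\d{4})-(\d{2})-(\d{2}) (\w+) (.*)", log_line)
if m:
    year, month, day, level, message = m.groups()
    print(year, month, day)  # 2024 01 15
    print(level)             # ERROR
    print(message)           # Disk usage at 91%
```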
Multi-line Matching
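A minimal sketch of multi-line matching with the re.MULTILINE flag, which makes ^ and $ match at the start and end of every line rather than only at the ends of the whole string. The sample text is an assumption.

```python
import re

text = "first line\nsecond line\nthird line"  # hypothetical sample text

# Without the flag, ^ only matches at the very start of the string.
without_flag = re.findall(r"^\w+", text)
print(without_flag)  # ['first']

# With re.MULTILINE, ^ also matches right after each newline.
with_flag = re.findall(r"^\w+", text, flags=re.MULTILINE)
print(with_flag)  # ['first', 'second', 'third']
```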
Case-Insensitive Matching
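A minimal sketch of case-insensitive matching with the re.IGNORECASE flag. The sample text is an assumption chosen for illustration.

```python
import re

text = "Python, PYTHON, and python"  # hypothetical sample text

# By default, matching is case-sensitive.
case_sensitive = re.findall(r"python", text)
print(case_sensitive)  # ['python']

# re.IGNORECASE (or re.I) ignores letter case.
case_insensitive = re.findall(r"python", text, flags=re.IGNORECASE)
print(case_insensitive)  # ['Python', 'PYTHON', 'python']
```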
Here are explanations and examples for several common regex components:
1. | (Either or):
Example:
• Explanation: The pattern matches either "falls" or "stays" in the input text.
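The alternation described above can be sketched as follows. The input sentence is an assumption, since the original sample text is not shown here.

```python
import re

text = "The rain in Spain falls mainly in the plain"  # assumed sample text

# "|" matches whichever alternative appears in the text.
matches = re.findall(r"falls|stays", text)
print(matches)  # ['falls']

# Neither alternative present: findall returns an empty list.
no_matches = re.findall(r"falls|stays", "The rain stops")
print(no_matches)  # []
```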
2. () (Capture and group):
Parentheses are used to group parts of a pattern and capture them as separate groups.
Example:
• Explanation:
o (rain|sun) captures "rain" or "sun".
o (falls|stays) captures "falls" or "stays".
o The result is a list of tuples containing the captured groups.
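A minimal sketch of the two capturing groups described above. The input sentence is an assumption made so that both groups have something to match.

```python
import re

text = "The rain falls and the sun stays"  # assumed sample text

# Two capturing groups: findall returns one tuple per match.
pairs = re.findall(r"(rain|sun) (falls|stays)", text)
print(pairs)  # [('rain', 'falls'), ('sun', 'stays')]
```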
3. [] (Set of characters):
Example:
• Explanation: The pattern [a-z] matches any lowercase letter from 'a' to 'z'. Each match
is returned as a separate element in the list.
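A minimal sketch of the character set [a-z] with re.findall. The input text is an assumption chosen to include uppercase letters and digits, which the set does not match.

```python
import re

text = "Hello World 42"  # assumed sample text

# Each lowercase letter is returned as its own list element.
letters = re.findall(r"[a-z]", text)
print(letters)  # ['e', 'l', 'l', 'o', 'o', 'r', 'l', 'd']

# Adding + groups consecutive lowercase letters into one match.
runs = re.findall(r"[a-z]+", text)
print(runs)  # ['ello', 'orld']
```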
4. \ (Special sequence):
Example:
• Explanation: The pattern \d matches any digit (0–9). Here, it finds all the digits in the
input text.
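A minimal sketch of the \d special sequence with re.findall. The input text is an assumption made for illustration.

```python
import re

text = "Room 42, floor 7"  # assumed sample text

# \d matches one digit at a time, so each digit is a separate element.
digits = re.findall(r"\d", text)
print(digits)  # ['4', '2', '7']
```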
The .group() method in Python regular expressions is used to extract the part of the string
that matches the pattern or a specific group within the match.
Syntax of .group()
match.group([group_number])
• group_number (optional):
o If not specified (i.e., .group()), it returns the entire match.
o group(0): Returns the entire match (same as .group()).
o group(n): Returns the text matched by the n-th capturing group (inside
parentheses).
• Explanation:
o The parentheses () create a capturing group for the digits (\d+).
o .group(1) returns the content of the first capturing group (the digits "123").
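The behaviour described above can be sketched as follows. The pattern and input string are assumptions chosen so that the first capturing group holds the digits "123".

```python
import re

# The parentheses create capturing group 1 around the digits.
match = re.search(r"abc(\d+)", "abc123def")  # assumed sample input

if match:
    print(match.group())   # abc123  (entire match)
    print(match.group(0))  # abc123  (same as .group())
    print(match.group(1))  # 123     (first capturing group)
```

Because re.search returns None when nothing matches, checking the result before calling .group() avoids an AttributeError.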
You can assign names to groups and access them with .group('name').
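A minimal sketch of named groups, using the (?P<name>...) syntax. The group names letters and digits and the input string are illustrative assumptions.

```python
import re

# (?P<name>...) gives a capturing group a name.
match = re.search(r"(?P<letters>[a-z]+)(?P<digits>\d+)", "abc123")

if match:
    print(match.group("letters"))  # abc
    print(match.group("digits"))   # 123
    # Named groups are still numbered, and groupdict() returns them all.
    print(match.group(2))          # 123
    print(match.groupdict())       # {'letters': 'abc', 'digits': '123'}
```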
Explanation:
Summary: