Capture groups in Regex are a tool that allows us to identify and extract specific parts of a text string.
To define a capture group in Regex, we simply need to enclose a portion of a regular expression in parentheses ( ).
This group allows us to capture the match of the delimited pattern. That is, in addition to matching parts of the text, we can store those matches and reuse them later in the expression or in the code.
Basic Syntax of Capture Groups
The basic use of capture groups involves delimiting patterns with parentheses. From there, we can extract the captured values for use in subsequent operations.
To create a capture group, we wrap the pattern in parentheses:
(pattern)
Where pattern is any sub-expression we want to capture. These groups are automatically numbered from left to right, in the order of their opening parentheses, starting from 1. The number 0 always refers to the complete match of the regular expression.
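As a minimal sketch of this numbering, the Python snippet below reads group 0 and the numbered groups from a single match. The pattern and the sample text "color=blue" are assumptions chosen only for illustration.

```python
import re

# Hypothetical pattern and input, used only to illustrate group numbering.
match = re.search(r"(\w+)=(\w+)", "color=blue")

whole = match.group(0)   # group 0: the complete match
first = match.group(1)   # group 1: first pair of parentheses
second = match.group(2)  # group 2: second pair of parentheses

print(whole)   # color=blue
print(first)   # color
print(second)  # blue
```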
Examples of Capture Groups
We’ll understand this better with some examples.
(\d{3})-(\d{2})-(\d{4})
This regular expression captures a simplified phone number format, where:
- (\d{3}) captures the first three digits.
- (\d{2}) captures the next two digits.
- (\d{4}) captures the last four digits.
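A short Python sketch of how these three groups might be extracted in code follows. The helper name split_number and the sample sentences are hypothetical, and the function returns None when the text contains no match.

```python
import re

PATTERN = re.compile(r"(\d{3})-(\d{2})-(\d{4})")

def split_number(text):
    """Return the three captured digit blocks, or None when nothing matches."""
    match = PATTERN.search(text)
    return match.groups() if match else None

# Hypothetical inputs, for illustration only.
parts = split_number("Call 555-12-3456 today")
missing = split_number("no digits here")

print(parts)    # ('555', '12', '3456')
print(missing)  # None
```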
As another example, suppose we have this pattern, which captures two words separated by whitespace:
(\w+)\s+(\w+)
The sequence (\w+) captures one or more word characters (letters, digits, or underscores), and \s+ matches one or more whitespace characters.
If we apply the expression to the following text:
Hello World
The resulting captures would be:
- Group 1: Hello
- Group 2: World
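The same result can be checked with a few lines of Python. This is only a sketch: the second input, with a tab and extra spaces, is an assumption added to show that \s+ accepts any run of whitespace between the two words.

```python
import re

pattern = re.compile(r"(\w+)\s+(\w+)")

single_space = pattern.search("Hello World")
mixed_space = pattern.search("Hello \t  World")  # hypothetical variant input

print(single_space.group(1), single_space.group(2))  # Hello World
print(mixed_space.groups())                          # ('Hello', 'World')
```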
Quantifiers Applied to Groups
One of the advantages of using groups is that we can apply quantifiers to the entire group instead of just to an individual character. This allows us to define the repetitions of complete sequences more precisely.
(\d{2}-){3}
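This pattern matches a sequence of two digits followed by a hyphen and requires that this sequence be repeated exactly three times. Note that when a quantifier is applied to a capture group, the group keeps only the text of its last repetition. In terms of the text it matches, the pattern is equivalent to: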
\d{2}-\d{2}-\d{2}-
But much cleaner and easier to read.
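The following Python sketch compares the two forms on a hypothetical input, "12-34-56-". It suggests that both match the same text, and that in Python's re module the quantified group ends up holding only its final repetition.

```python
import re

grouped = re.compile(r"(\d{2}-){3}")
expanded = re.compile(r"\d{2}-\d{2}-\d{2}-")

text = "12-34-56-"  # hypothetical sample input

m_grouped = grouped.fullmatch(text)
m_expanded = expanded.fullmatch(text)

# Both patterns accept exactly the same text.
same_text = m_grouped.group(0) == m_expanded.group(0)

# The repeated group remembers only its last iteration.
last_repetition = m_grouped.group(1)

print(same_text)        # True
print(last_repetition)  # 56-
```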
References to Capture Groups
One of the most useful features of capture groups is the ability to refer to them within the same expression or in subsequent operations. This can be done in various ways, depending on the context in which we are using the regular expression.
References in the Same Expression: Backreferences
Backreferences allow us to use a captured group later in the same expression. This is useful when we want to find parts of the text that repeat.
(\w+)\s+\1
In this expression, \1 refers to the first captured group, (\w+). This means that the pattern looks for a word followed by one or more whitespace characters and then the same word repeated immediately after.
Applied to this text:
hello hello world
The pattern matches the sequence "hello hello", because the word "hello" appears twice consecutively.
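A small Python sketch of this backreference is shown below. The second pattern, with \b word boundaries, is an assumption added here rather than part of the original expression. It illustrates one way to avoid matching a repeated fragment inside longer words, as happens with "this is".

```python
import re

repeated = re.compile(r"(\w+)\s+\1")
match = repeated.search("hello hello world")

matched_text = match.group(0)   # the whole repeated sequence
repeated_word = match.group(1)  # the word captured by group 1

print(matched_text)   # hello hello
print(repeated_word)  # hello

# Without word boundaries, the tail of "this" plus "is" also counts as a repeat.
loose_hit = repeated.search("this is")
print(loose_hit.group(0))  # is is

# Assumed stricter variant: \b anchors the match to whole words.
strict = re.compile(r"\b(\w+)\s+\1\b")
print(strict.search("this is"))  # None
```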
References in Replacements
When working with replacement functions in programming languages, capture groups allow us to access captured matches and use them to form new strings. In most languages, these matches are numbered sequentially.
For example, in JavaScript, we can use capture groups within the replace method:
let text = "2024-09-27";
let newText = text.replace(/(\d{4})-(\d{2})-(\d{2})/, "$3/$2/$1");
console.log(newText); // "27/09/2024"
Here, the pattern (\d{4})-(\d{2})-(\d{2}) captures the date format "2024-09-27", and the replace method reorders it in the format "day/month/year" using the references $1, $2, and $3.
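For comparison, here is a sketch of the same reordering in Python, where re.sub writes group references in the replacement string as \1, \2, and \3 instead of $1, $2, and $3. The variable names are illustrative.

```python
import re

text = "2024-09-27"

# In Python's re.sub, \N in the replacement refers to capture group N.
new_text = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)

print(new_text)  # 27/09/2024
```

The replacement is written as a raw string (r"...") so that the backslashes reach re.sub unchanged.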
Non-Capturing Groups
In some cases, we do not need to capture the match of a group, but simply use parentheses to group parts of the pattern. For these cases, we can use non-capturing groups, which are defined using the syntax (?:pattern).
Example of a Non-Capturing Group:
(?:\d{3})-\d{2}-\d{4}
Here, the first set of three digits (?:\d{3}) will not be captured but is still part of the expression.
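The difference can be observed in Python by comparing the groups reported for a capturing and a non-capturing version of the pattern. This is a minimal sketch, and the input "555-12-3456" is a hypothetical sample.

```python
import re

text = "555-12-3456"  # hypothetical sample input

capturing = re.search(r"(\d{3})-\d{2}-\d{4}", text)
non_capturing = re.search(r"(?:\d{3})-\d{2}-\d{4}", text)

# Both versions match the same text...
print(capturing.group(0))      # 555-12-3456
print(non_capturing.group(0))  # 555-12-3456

# ...but only the capturing version stores the first three digits.
print(capturing.groups())      # ('555',)
print(non_capturing.groups())  # ()
```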
Named Capture Groups
In more advanced regular expressions, it is possible to use named capture groups to improve readability and manage more complex patterns. Instead of referring to a group by its number, we can assign it a name and refer to it explicitly.
The syntax for named groups varies by language, but in many cases, it looks like this:
(?<groupName>pattern)
For example:
(?<firstName>\w+)\s(?<lastName>\w+)
In this case, we capture two words, assigning the name "firstName" to the first group and "lastName" to the second. In Python, where named groups are written (?P<name>pattern), we can access the captures by name:
import re
text = "Luis Perez"
pattern = r"(?P<firstName>\w+)\s(?P<lastName>\w+)"
match = re.search(pattern, text)
print(match.group("firstName")) # Luis
print(match.group("lastName")) # Perez