Capture Groups in Regex

Capture groups in Regex are a tool that allows us to identify and extract specific parts of a text string.

To define a capture group in Regex, we simply need to enclose a portion of a regular expression with parentheses ( ).

A group captures the text matched by the pattern inside the parentheses. That is, in addition to matching parts of the text, the engine stores those matches so that we can reuse them later in the same expression or in our code.

Basic Syntax of Capture Groups

A capture group is written by wrapping a pattern in parentheses. The captured values can then be extracted for use in later operations:

(pattern)

Where pattern is any sub-expression we want to capture. Groups are numbered automatically from left to right, in the order of their opening parentheses, starting from 1. The number 0 always refers to the complete match of the regular expression.
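The numbering can be checked with a short Python sketch. The pattern and sample text here are made up purely for illustration and are not part of the original examples.

```python
import re

# Hypothetical pattern and text, chosen only to show the numbering.
match = re.search(r"(\w+)@(\w+)", "user@example")

whole = match.group(0)   # the complete match
first = match.group(1)   # text captured by the first pair of parentheses
second = match.group(2)  # text captured by the second pair of parentheses

print(whole)   # user@example
print(first)   # user
print(second)  # example
```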

Examples of Capture Groups

We’ll understand this better with some examples.

(\d{3})-(\d{2})-(\d{4})

This regular expression matches a simplified phone number format (three digits, two digits, and four digits separated by hyphens), where:

(\d{3}) captures the first three digits.
(\d{2}) captures the next two digits.
(\d{4}) captures the last four digits.
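As a minimal sketch, this is how the three groups could be read in Python. The sample sentence and number are invented for the example.

```python
import re

pattern = re.compile(r"(\d{3})-(\d{2})-(\d{4})")

# Hypothetical input text containing a number in the 3-2-4 format.
match = pattern.search("Call 555-12-3456 today")

# groups() returns every captured group as a tuple, in order.
parts = match.groups()
print(parts)  # ('555', '12', '3456')
```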

As another example, suppose we have a pattern that captures two words separated by whitespace.

(\w+)\s+(\w+)

The sequence (\w+) captures one or more word characters (letters, digits, and the underscore), and \s+ matches one or more whitespace characters.

If we apply the expression to the following text:

Hello World

The resulting captures would be:

Group 1: Hello
Group 2: World
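The same pattern also tolerates extra whitespace and underscores. A small Python sketch, with input strings that are assumptions made for the example:

```python
import re

pattern = re.compile(r"(\w+)\s+(\w+)")

# Several spaces between the words are all consumed by \s+.
spaced = pattern.fullmatch("Hello   World").groups()

# \w also matches the underscore, and \s also matches a tab.
tabbed = pattern.fullmatch("snake_case\tname").groups()

print(spaced)  # ('Hello', 'World')
print(tabbed)  # ('snake_case', 'name')
```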

Quantifiers Applied to Groups

One of the advantages of using groups is that we can apply quantifiers to the entire group instead of just to an individual character. This allows us to define the repetitions of complete sequences more precisely.

(\d{2}-){3}

This pattern matches two digits followed by a hyphen and requires that this sequence be repeated exactly three times. In terms of the text it matches, it is equivalent to:

\d{2}-\d{2}-\d{2}-

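The grouped form is shorter and easier to read. Keep in mind that a repeated capture group usually retains only the text from its last repetition, not all three.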
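A quick Python check of both points: the two patterns match the same text, and the quantified group ends up holding only its final repetition. The input string is an assumption made for the example.

```python
import re

text = "12-34-56-"  # hypothetical input

grouped = re.fullmatch(r"(\d{2}-){3}", text)
expanded = re.fullmatch(r"\d{2}-\d{2}-\d{2}-", text)

# Both patterns accept exactly the same text.
same_match = grouped.group(0) == expanded.group(0)

# The quantified group keeps only what it captured on its last pass.
last_repetition = grouped.group(1)

print(same_match)       # True
print(last_repetition)  # 56-
```

If every repetition is needed, one option is to match the whole sequence first and then split it, or to use re.findall with the inner pattern.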

References to Capture Groups

One of the most useful features of capture groups is the ability to refer to them within the same expression or in subsequent operations. This can be done in various ways, depending on the context in which we are using the regular expression.

References in the Same Expression: Backreferences

Backreferences allow us to use a captured group later in the same expression. This is useful when we want to find parts of the text that repeat.

(\w+)\s+\1

In this expression, \1 refers to the text captured by the first group (\w+). This means that the pattern looks for a word, followed by one or more whitespace characters, followed by exactly the same word again.

Applied to this text:

hello hello world

The pattern matches the sequence "hello hello", because the word "hello" appears twice in a row. Group 1 contains "hello".
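In Python the backreference behaves the same way. The sketch below also shows a caveat: without word boundaries, the pattern can match inside longer words. The \b anchors and the "this is" input are additions for illustration, not part of the original pattern.

```python
import re

match = re.search(r"(\w+)\s+\1", "hello hello world")
print(match.group(0))  # hello hello
print(match.group(1))  # hello

# Caveat: with no word boundaries, the end of "this" plus "is" counts as a repeat.
partial = re.search(r"(\w+)\s+\1", "this is")
print(partial.group(0))  # is is

# Adding \b on both sides restricts the match to whole words.
strict = re.search(r"\b(\w+)\s+\1\b", "this is")
print(strict)  # None
```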

References in Replacements

When working with replacement functions in programming languages, capture groups allow us to access captured matches and use them to form new strings. In most languages, the captures are referenced by number in the replacement string, although the reference syntax differs between languages.

For example, in JavaScript, we can use capture groups within the replace function:

const text = "2024-09-27";
// $1 = year, $2 = month, $3 = day; the replacement string reorders them.
const newText = text.replace(/(\d{4})-(\d{2})-(\d{2})/, "$3/$2/$1");
console.log(newText); // "27/09/2024"

Here, the pattern (\d{4})-(\d{2})-(\d{2}) matches the date "2024-09-27" and captures the year, month, and day in groups 1, 2, and 3. The replace method then reorders them into the format "day/month/year" using the references $3, $2, and $1.
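For comparison, a minimal Python sketch of the same replacement. Python's re.sub refers to groups with \1, \2, \3 (or \g<1>) in the replacement string rather than $1, $2, $3.

```python
import re

text = "2024-09-27"

# A raw string keeps the backslashes in the replacement intact.
new_text = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)

print(new_text)  # 27/09/2024
```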

Non-Capturing Groups

In some cases, we do not need to capture the match of a group, but simply use parentheses to group parts of the pattern. For these cases, we can use non-capturing groups, which are defined using the syntax (?:pattern).

Example of a Non-Capturing Group:

(?:\d{3})-\d{2}-\d{4}

Here, the first three digits (?:\d{3}) must still be present for the pattern to match, but they are not stored and do not receive a group number.
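The effect on group numbering is easier to see when the pattern also contains capturing groups. The sketch below uses a variant of the pattern above, with parentheses added around the last two parts as an assumption for illustration.

```python
import re

text = "123-45-6789"  # hypothetical input

all_captured = re.search(r"(\d{3})-(\d{2})-(\d{4})", text)
first_skipped = re.search(r"(?:\d{3})-(\d{2})-(\d{4})", text)

# Both patterns match the same text...
print(all_captured.group(0) == first_skipped.group(0))  # True

# ...but the non-capturing group is left out and the numbering shifts.
print(all_captured.groups())   # ('123', '45', '6789')
print(first_skipped.groups())  # ('45', '6789')
```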

Named Capture Groups

In more advanced regular expressions, it is possible to use named capture groups to improve readability and manage more complex patterns. Instead of referring to a group by its number, we can assign it a name and refer to it explicitly.

The syntax for named groups varies by language, but in many engines (for example JavaScript, Java, and .NET) it looks like this:

(?<groupName>pattern)

For example:

(?<firstName>\w+)\s(?<lastName>\w+)

In this case, we capture two words, assigning the name "firstName" to the first group and "lastName" to the second. Python writes named groups with a slightly different syntax, (?P<groupName>pattern), and lets us access the captures by name:

import re

text = "Luis Perez"
pattern = r"(?P<firstName>\w+)\s(?P<lastName>\w+)"
match = re.search(pattern, text)
print(match.group("firstName"))  # Luis
print(match.group("lastName"))  # Perez
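Named groups can also be reused in replacements and read all at once. A short sketch, assuming the same pattern and text as above: in Python, \g<name> refers to a named group inside the replacement string, and groupdict() returns all named captures as a dictionary.

```python
import re

text = "Luis Perez"
pattern = r"(?P<firstName>\w+)\s(?P<lastName>\w+)"

# Reorder the two names using named references in the replacement.
swapped = re.sub(pattern, r"\g<lastName>, \g<firstName>", text)
print(swapped)  # Perez, Luis

# Collect every named capture in one dictionary.
names = re.search(pattern, text).groupdict()
print(names)  # {'firstName': 'Luis', 'lastName': 'Perez'}
```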