Build Pattern Expressions

Since R2020b

Patterns are a tool for searching and modifying text. Like regular expressions, a pattern defines rules for matching text. You can use patterns with text-searching functions such as contains, matches, and extract to specify which portions of text these functions act on. You build a pattern expression much as you would build a mathematical expression, using pattern functions, operators, and literal text. Because building pattern expressions is open-ended, patterns can become quite complicated. Building patterns in steps and using functions like maskedPattern and namedPattern can help organize complicated patterns.

Building Simple Patterns

The simplest pattern is built from a single pattern function. For example, lettersPattern matches any letter characters. There are many pattern functions for matching different types of characters and other features of text. A list of these functions can be found on the pattern reference page.

txt = "abc123def";
pat = lettersPattern;
extract(txt,pat)
ans = 2x1 string
    "abc"
    "def"

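As a brief sketch of how the same kind of pattern plugs into other text-searching functions, the following reuses the text above with contains and matches. The variable names hasDigits and allLetters are chosen only for this illustration. Note that matches succeeds only when the pattern matches the entire string.

```matlab
txt = "abc123def";

% contains is true if the pattern matches anywhere in the text
hasDigits = contains(txt,digitsPattern)     % logical 1

% matches is true only if the pattern matches the whole string;
% the digits in the middle prevent lettersPattern from doing so
allLetters = matches(txt,lettersPattern)    % logical 0
```
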
Patterns combine with other patterns and literal text by using the plus (+) operator. This operator appends patterns and text together in the order they are defined in the pattern expression. The combined pattern matches text only in that same order. In this example, "YYYY/MM/DD" is not a match because the pattern requires the four-letter group to come last.

txt = "Dates can be expressed as MM/DD/YYYY, DD/MM/YYYY, or YYYY/MM/DD";
pat = lettersPattern(2) + "/" + lettersPattern(2) + "/" + lettersPattern(4);
extract(txt,pat)
ans = 2x1 string
    "MM/DD/YYYY"
    "DD/MM/YYYY"

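To illustrate that the order of the operands matters, this small sketch applies the same two pattern functions in both orders to one piece of text. The text and the variable names digitsFirst and lettersFirst are made up for this illustration.

```matlab
txt = "123abc456";

% Digits followed by letters: only "123abc" has that order
digitsFirst = extract(txt,digitsPattern + lettersPattern)    % "123abc"

% Letters followed by digits: only "abc456" has that order
lettersFirst = extract(txt,lettersPattern + digitsPattern)   % "abc456"
```
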
Patterns combined with the or (|) operator specify that only one of the two patterns needs to match a section of text. If neither pattern is able to match, then the pattern expression fails to match.

txt = "123abc";
pat = lettersPattern|digitsPattern;
extract(txt,pat)
ans = 2x1 string
    "123"
    "abc"

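When + and | appear in the same expression, standard MATLAB operator precedence applies, so + is evaluated before |. The following sketch, using made-up text, groups the alternatives with parentheses so that the literal "v" applies to both of them.

```matlab
txt = "Tags: v1, vX, vY";

% "v" followed by either a run of digits or the literal "X".
% Without the parentheses, the expression would be read as
% ("v" + digitsPattern) | "X".
pat = "v" + (digitsPattern | "X");
found = extract(txt,pat)    % "v1" and "vX"; "vY" is not matched
```
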
Some pattern functions take patterns as their input and modify them in some way. For example, optionalPattern makes a specified pattern match if possible, but the pattern is not required for a successful match.

txt = ["123abc" "abc"];
pat = optionalPattern(digitsPattern) + lettersPattern;
extract(txt,pat)
ans = 1x2 string
    "123abc"    "abc"

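A further sketch of optionalPattern, under the assumption that the task is to pull signed or unsigned integers out of text. The sample text is hypothetical.

```matlab
txt = "5 -3 12";

% The minus sign is included when present but is not required
pat = optionalPattern("-") + digitsPattern;
numbers = extract(txt,pat)    % "5", "-3", "12"
```
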
Boundary Patterns

Boundary patterns are a special type of pattern that match no characters. Instead, they match the boundaries between a designated character type and other characters, or between that character type and the start or end of the text. For example, digitBoundary matches the boundaries between digit characters and nondigit characters and between digit characters and the start or end of the text. It does not match digit characters themselves. Boundary patterns are useful as delimiters for functions like split.

txt = "123abc";
pat = digitBoundary;
split(txt,pat)
ans = 3x1 string
    ""
    "123"
    "abc"

Boundary patterns are special among patterns because they can be negated using the not (~) operator. When negated in this way, a boundary pattern matches every position between characters, or at the start or end of the text, that the original boundary pattern does not match. For example, ~digitBoundary matches the boundary between:

  • characters that are both digits

  • characters that are both nondigits

  • a nondigit character and the start or end of a piece of text

Use replace to mark the locations matched by ~digitBoundary with a "|" character.

txt = "123abc";
pat = ~digitBoundary;
replace(txt,pat,"|")
ans = 
"1|2|3a|b|c|"

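For contrast, the same marking technique can be applied to the non-negated boundary. This sketch assumes the same text as above. Together, the two results mark every position in the text exactly once.

```matlab
txt = "123abc";

% Positions where a digit meets a nondigit or the start or end of the text
atBoundary = replace(txt,digitBoundary,"|")      % "|123|abc"

% All remaining positions
offBoundary = replace(txt,~digitBoundary,"|")    % "1|2|3a|b|c|"
```
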
Building Complicated Patterns in Steps

Sometimes a simple pattern is not sufficient to solve a problem, and a more complicated pattern is needed. As a pattern expression grows, it can become difficult to understand what it matches. One way to simplify building a complicated pattern is to build each part separately and then combine the parts into a single pattern expression.

For instance, email addresses use the form local_part@domain.TLD. Each of the three identifiers (local_part, domain, and TLD) must be a combination of digits, letters, and underscore characters. To build the full pattern, start by defining a pattern for the identifiers. Build a pattern that matches one letter or digit character or one underscore character.

identCharacters = alphanumericsPattern(1) | "_";

Now, use asManyOfPattern to match one or more consecutive instances of identCharacters.

identifier = asManyOfPattern(identCharacters,1);
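
Before combining the parts, it can help to check an intermediate pattern on its own. The following sketch tests identifier against a few hypothetical candidate strings. The second argument of asManyOfPattern sets the minimum number of repetitions, so an empty string is rejected.

```matlab
identCharacters = alphanumericsPattern(1) | "_";
identifier = asManyOfPattern(identCharacters,1);

% "abe.lincoln" contains a period, which is not an identifier character,
% and "" has fewer than the one required character
candidates = ["jane_doe" "abe.lincoln" ""];
isIdentifier = matches(candidates,identifier)    % 1 0 0
```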

Next, build a pattern that matches an email containing multiple identifiers.

emailPattern = identifier + "@" + identifier + "." + identifier;

Test the pattern by seeing how well it matches the following example emails.

exampleEmails = ["janedoe@mathworks.com" 
    "abe.lincoln@whitehouse.gov"
    "alberteinstein@physics.university.edu"];
matches(exampleEmails,emailPattern)
ans = 3x1 logical array

   1
   0
   0

The pattern fails to match two of the three example emails even though all the emails are valid. Both the local_part and domain can be made of a series of identifiers that are separated by periods. Use the identifier pattern to build a pattern that is capable of matching a series of identifiers. asManyOfPattern matches as many consecutive appearances of the specified pattern as possible, but if there are none, the rest of the pattern is still able to match successfully.

identifierSeries = asManyOfPattern(identifier + ".") + identifier;
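
As a quick check of this step, the sketch below applies identifierSeries to a few hypothetical strings. A series must consist of nonempty identifiers joined by single periods, so doubled or leading periods are rejected.

```matlab
identCharacters = alphanumericsPattern(1) | "_";
identifier = asManyOfPattern(identCharacters,1);
identifierSeries = asManyOfPattern(identifier + ".") + identifier;

candidates = ["university" "physics.university" "physics..edu" ".edu"];
isSeries = matches(candidates,identifierSeries)    % 1 1 0 0
```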

Use this pattern to build a new emailPattern that can match all of the example emails.

emailPattern = identifierSeries + "@" + identifierSeries + "." + identifier;
matches(exampleEmails,emailPattern)
ans = 3x1 logical array

   1
   1
   1
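
It is also worth confirming that the pattern rejects malformed input. This sketch rebuilds emailPattern from the steps above and tests it against hypothetical addresses that lack a top-level domain or a local part.

```matlab
identCharacters = alphanumericsPattern(1) | "_";
identifier = asManyOfPattern(identCharacters,1);
identifierSeries = asManyOfPattern(identifier + ".") + identifier;
emailPattern = identifierSeries + "@" + identifierSeries + "." + identifier;

% The second address has no ".TLD" part; the third has no local part
testEmails = ["jane.doe@mathworks.com" "jane@mathworks" "@mathworks.com"];
isEmail = matches(testEmails,emailPattern)    % 1 0 0
```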

Organizing Pattern Display

Complex patterns can sometimes be difficult to read and interpret, especially for people you share them with who are unfamiliar with the pattern's structure. For example, when displayed, emailPattern is long and difficult to read.

emailPattern
emailPattern = pattern
  Matching:

    asManyOfPattern(asManyOfPattern(alphanumericsPattern(1) | "_",1) + ".") + asManyOfPattern(alphanumericsPattern(1) | "_",1) + "@" + asManyOfPattern(asManyOfPattern(alphanumericsPattern(1) | "_",1) + ".") + asManyOfPattern(alphanumericsPattern(1) | "_",1) + "." + asManyOfPattern(alphanumericsPattern(1) | "_",1)

Part of the issue with the display is that there are many repetitions of the identifier pattern. If the exact details of this pattern are not important to users of the pattern, then the display of the identifier pattern can be concealed using maskedPattern. This function creates a new pattern where the display of identifier is masked and the variable name, "identifier", is displayed instead. Alternatively, you can specify a different name to be displayed. The details of patterns that are masked in this way remain accessible from the displayed pattern, either by clicking "Show all details" or, as the text display suggests, by using details.

identifier = maskedPattern(identifier);
identifierSeries = asManyOfPattern(identifier + ".") + identifier
identifierSeries = pattern
  Matching:

    asManyOfPattern(identifier + ".") + identifier

  Use details to show more information
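
The following sketch shows the alternative form, in which a display name is passed as the second argument to maskedPattern. The name "word" and the test strings are made up for this illustration. Masking affects only how the pattern is displayed, not what it matches.

```matlab
identCharacters = alphanumericsPattern(1) | "_";

% Mask the pattern and display it as "word"
word = maskedPattern(asManyOfPattern(identCharacters,1),"word");
pair = word + "-" + word;

% Matching behaves as if the pattern were not masked
isPair = matches(["a_1-b2" "a-"],pair)    % 1 0
```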

Patterns can be further organized using the namedPattern function. namedPattern designates a pattern as a named pattern that changes how the pattern is displayed when combined with other patterns. Email addresses have several important portions, local_part@domain.TLD, which each have their own matching rules. Create a named pattern for each section.

localPart = namedPattern(identifierSeries,"local_part");

Named patterns can be nested to further delineate parts of a pattern. To nest a named pattern, build a pattern using named patterns and then designate that pattern as a named pattern. For example, domain.TLD can be divided into the domain, subdomains, and the top-level domain (TLD). Create named patterns for each part of domain.TLD.

subdomain = namedPattern(identifierSeries,"subdomain");
domainName = namedPattern(identifier,"domainName");
tld = namedPattern(identifier,"TLD");

Nest the named patterns for the components of domain underneath a single named pattern domain.

domain = optionalPattern(subdomain + ".") + ...
            domainName + "." + ...
            tld;
domain = namedPattern(domain);

Combine the patterns into a single pattern, emailPattern. In the display of emailPattern, you can see each named pattern and what it matches, as well as information on any nested named patterns.

emailPattern = localPart + "@" + domain
emailPattern = pattern
  Matching:

    local_part + "@" + domain

  Using named patterns:

    local_part  : asManyOfPattern(identifier + ".") + identifier
    domain      : optionalPattern(subdomain + ".") + domainName + "." + TLD
      subdomain : asManyOfPattern(identifier + ".") + identifier
      domainName: identifier
      TLD       : identifier

  Use details to show more information
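
The same technique applies to smaller patterns. As a minimal sketch, the following builds a hypothetical date pattern from three named parts. The variable names, the pattern names, and the sample text are assumptions made for this illustration.

```matlab
% Name each part of an MM/DD/YYYY date
monthPat = namedPattern(digitsPattern(2),"month");
dayPat   = namedPattern(digitsPattern(2),"day");
yearPat  = namedPattern(digitsPattern(4),"year");
datePat  = monthPat + "/" + dayPat + "/" + yearPat;

% Naming changes the display, not the matching behavior
found = extract("Released on 09/17/2020.",datePat)    % "09/17/2020"
```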

You can access named patterns and nested named patterns by dot-indexing into a pattern. For example, you can access the nested named pattern subdomain by dot-indexing from emailPattern into domain and then dot-indexing again into subdomain.

emailPattern.domain.subdomain
ans = pattern
  Matching:

    asManyOfPattern(identifier + ".") + identifier

  Use details to show more information
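
A pattern retrieved by dot-indexing can be used like any other pattern. This sketch assumes a hypothetical date pattern with named parts and uses one part on its own to extract only the year.

```matlab
monthPat = namedPattern(digitsPattern(2),"month");
dayPat   = namedPattern(digitsPattern(2),"day");
yearPat  = namedPattern(digitsPattern(4),"year");
datePat  = monthPat + "/" + dayPat + "/" + yearPat;

% Dot-index into the combined pattern to reuse the "year" part;
% only a run of exactly four digits satisfies it
yearText = extract("Released on 09/17/2020.",datePat.year)    % "2020"
```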

Dot-assignment can be used to change named patterns without needing to rewrite the rest of the pattern expression.

emailPattern.domain = "mathworks.com"
emailPattern = pattern
  Matching:

    local_part + "@" + domain

  Using named patterns:

    local_part: asManyOfPattern(identifier + ".") + identifier
    domain    : "mathworks.com"

  Use details to show more information
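
As a further sketch of dot-assignment, the following narrows a hypothetical date pattern to a single year by replacing one named part with literal text. The pattern and test strings are assumptions made for this illustration.

```matlab
monthPat = namedPattern(digitsPattern(2),"month");
dayPat   = namedPattern(digitsPattern(2),"day");
yearPat  = namedPattern(digitsPattern(4),"year");
datePat  = monthPat + "/" + dayPat + "/" + yearPat;

% Replace the "year" part; the rest of the expression is unchanged
datePat.year = "2020";
isIn2020 = matches(["09/17/2020" "09/17/2021"],datePat)    % 1 0
```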

Copyright 2020 The MathWorks, Inc.
