Tokens in Regular Expressions

Introduction

Parentheses used in a regular expression not only group elements of that expression together, but also designate any matches found for that group as tokens. You can use tokens to match other parts of the same text. One advantage of using tokens is that they remember what they matched, so you can recall and reuse matched text in the process of searching or replacing.

Each token in the expression is assigned a number, starting from 1, going from left to right. To make a reference to a token later in the expression, refer to it using a backslash followed by the token number. For example, when referencing a token generated by the third set of parentheses in the expression, use \3.

As a simple example, if you wanted to search for identical sequential letters in a character array, you could capture the first letter as a token and then search for a matching character immediately afterwards. In the expression shown below, the (\S) phrase creates a token whenever regexp matches any nonwhitespace character in the character array. The second part of the expression, '\1', looks for a second instance of the same character immediately following the first.

poe = ['While I nodded, nearly napping, ' ...
       'suddenly there came a tapping,'];

[mat,tok,ext] = regexp(poe, '(\S)\1', 'match', ...
               'tokens', 'tokenExtents');
mat

mat =

  1×4 cell array

    {'dd'}    {'pp'}    {'dd'}    {'pp'}

The cell array tok contains cell arrays that each contain a token.

tok{:}

ans =

  1×1 cell array

    {'d'}


ans =

  1×1 cell array

    {'p'}


ans =

  1×1 cell array

    {'d'}


ans =

  1×1 cell array

    {'p'}

The cell array ext contains numeric arrays that each contain starting and ending indices for a token.

ext{:}

For another example, capture pairs of matching HTML tags (e.g., <a> and </a>) and the text between them. The expression used for this example is

expr = '<(\w+).*?>.*?</\1>';

The first part of the expression, '<(\w+)', matches an opening angle bracket (<) followed by one or more alphabetic, numeric, or underscore characters. The enclosing parentheses capture token characters following the opening angle bracket.

The second part of the expression, '.*?>.*?', matches the remainder of this HTML tag (characters up to the >), and any characters that may precede the next opening angle bracket.

The last part, '</\1>', matches all characters in the ending HTML tag. This tag is composed of the sequence </tag>, where tag is whatever characters were captured as a token.

hstr = '<!comment><a name="752507"></a><b>Default</b><br>';
expr = '<(\w+).*?>.*?</\1>';

[mat,tok] = regexp(hstr, expr, 'match', 'tokens');
mat{:}

ans =

    '<a name="752507"></a>'


ans =

    '<b>Default</b>'

tok{:}

ans =

  1×1 cell array

    {'a'}


ans =

  1×1 cell array

    {'b'}

Multiple Tokens

Here is an example of how tokens are assigned values. Suppose that you are going to search the following text:

andy ted bob jim andrew andy ted mark

You choose to search the above text with the following search pattern:

and(y|rew)|(t)e(d)

This pattern has three parenthetical expressions that generate tokens. When you finally perform the search, the following tokens are generated for each match.

Match	Token 1	Token 2
`andy`	`y`
`ted`	`t`	`d`
`andrew`	`rew`
`andy`	`y`
`ted`	`t`	`d`

Only the highest level parentheses are used. For example, if the search pattern and(y|rew) finds the text andrew, token 1 is assigned the value rew. However, if the search pattern (and(y|rew)) is used, token 1 is assigned the value andrew.

Unmatched Tokens

For those tokens specified in the regular expression that have no match in the text being evaluated, regexp and regexpi return an empty character vector ('') as the token output, and an extent that marks the position in the string where the token was expected.

The example shown here executes regexp on a character vector specifying the path returned from the MATLAB^® tempdir function. The regular expression expr includes six token specifiers, one for each piece of the path. The third specifier [a-z]+ has no match in the character vector because this part of the path, Profiles, begins with an uppercase letter:

chr = tempdir

chr =

    'C:\WINNT\Profiles\bpascal\LOCALS~1\Temp\'

expr = ['([A-Z]:)\\(WINNT)\\([a-z]+)?.*\\' ...
        '([a-z]+)\\([A-Z]+~\d)\\(Temp)\\'];

[tok, ext] = regexp(chr, expr, 'tokens', 'tokenExtents');

When a token is not found in the text, regexp returns an empty character vector ('') as the token and a numeric array with the token extent. The first number of the extent is the string index that marks where the token was expected, and the second number of the extent is equal to one less than the first.

In the case of this example, the empty token is the third specified in the expression, so the third token returned is empty:

tok{:}

ans =

  1×6 cell array

    {'C:'}    {'WINNT'}    {0×0 char}    {'bpascal'}    {'LOCALS~1'}    {'Temp'}

The third token extent returned in the variable ext has the starting index set to 10, which is where the nonmatching term, Profiles, begins in the path. The ending extent index is set to one less than the starting index, or 9:

ext{:}

Tokens in Replacement Text

When using tokens in replacement text, reference them using $1, $2, etc. instead of \1, \2, etc. This example captures two tokens and reverses their order. The first, $1, is 'Norma Jean' and the second, $2, is 'Baker'. Note that regexprep returns the modified text, not a vector of starting indices.

regexprep('Norma Jean Baker', '(\w+\s\w+)\s(\w+)', '$2, $1')

ans =

    'Baker, Norma Jean'

Named Capture

If you use a lot of tokens in your expressions, it may be helpful to assign them names rather than having to keep track of which token number is assigned to which token.

When referencing a named token within the expression, use the syntax \k<name> instead of the numeric \1, \2, etc.:

poe = ['While I nodded, nearly napping, ' ...
       'suddenly there came a tapping,'];

regexp(poe, '(?<anychar>.)\k<anychar>', 'match')

ans =

  1×4 cell array

    {'dd'}    {'pp'}    {'dd'}    {'pp'}

Named tokens can also be useful in labeling the output from the MATLAB regular expression functions. This is especially true when you are processing many pieces of text.

For example, parse different parts of street addresses from several character vectors. A short name is assigned to each token in the expression:

chr1 = '134 Main Street, Boulder, CO, 14923';
chr2 = '26 Walnut Road, Topeka, KA, 25384';
chr3 = '847 Industrial Drive, Elizabeth, NJ, 73548';

p1 = '(?<adrs>\d+\s\S+\s(Road|Street|Avenue|Drive))';
p2 = '(?<city>[A-Z][a-z]+)';
p3 = '(?<state>[A-Z]{2})';
p4 = '(?<zip>\d{5})';

expr = [p1 ', ' p2 ', ' p3 ', ' p4];

As the following results demonstrate, you can make your output easier to work with by using named tokens:

loc1 = regexp(chr1, expr, 'names')

loc1 = 

  struct with fields:

     adrs: '134 Main Street'
     city: 'Boulder'
    state: 'CO'
      zip: '14923'

loc2 = regexp(chr2, expr, 'names')

loc2 = 

  struct with fields:

     adrs: '26 Walnut Road'
     city: 'Topeka'
    state: 'KA'
      zip: '25384'

loc3 = regexp(chr3, expr, 'names')

loc3 = 

  struct with fields:

     adrs: '847 Industrial Drive'
     city: 'Elizabeth'
    state: 'NJ'
      zip: '73548'