NIRE

NIRE is a regular expression template library for C++.

Development of NIRE has ended. No further updates for new features or performance improvements are planned; only bug fixes may be made.

Features

NIRE is a fork of SRELL. The following features have been inherited from SRELL:

Features that are not inherited from SRELL are: 1) <regex> compatible interfaces, 2) arbitrary iterator support, etc.
NIRE's interfaces generally follow the specification of ECMAScript (JavaScript).

As far as I know, the oldest compiler that can compile NIRE is Visual C++ 2005 in Visual Studio 2005.

Download

How to use

Include nire.hpp.

//  Example 01:
#include <cstdio>
#include <string>
#include <iostream>
#include "nire.hpp"

int main()
{
    nire::RegExp re;    //  Regular expression object
    nire::Match m;      //  Object to receive matching results.

    re = "\\d+[^-\\d]+";    //  Compiles regular expressions.
    m = re.exec("1234-5678-90ab-cdef"); //  Executes a search.
    if (m)
    {
        std::printf("result: %s\n", m[0].str().c_str());    //  When using printf.
        std::cout << "result: " << m[0] << std::endl;       //  When using iostream.
    }
    return 0;
}
	

See also readme_en.txt included in the zip archive.

API

NIRE consists of the following four classes, which are all defined in namespace nire:

The first three template classes are instantiated by replacing the template parameter charT with any character type to be used for a search.
The following typedefs are provided for them, and using these typedefs is generally recommended:

Typedef list of three basic classes
Type regexpre_matchre_group Interpretation of string Note
char u8cRegExpMatchGroup UTF-8
wchar_t u16wRegExpwMatchwGroup UTF-16 Defined only when 0xFFFF <= WCHAR_MAX < 0x10FFFF.
u32wRegExp UTF-32 Defined only when WCHAR_MAX >= 0x10FFFF.
char8_t u8RegExpu8Matchu8Group UTF-8
char16_t u16RegExpu16Matchu16Group UTF-16
char32_t u32RegExpu32Matchu32Group UTF-32
char bRegExpMatchGroup Sequence of values in the range of 0 to UCHAR_MAX inclusive.

bRegExp is a specialisation for searching against non-Unicode sequences such as binary data. See the later section for details.

regexp<charT>

Creates an internal object from regular expressions, holds it, and executes a search with it. This class is the core of NIRE, and basically follows the specification of RegExp of ECMAScript (JavaScript).
The following members are defined:

Constructors and flags

There are three ways to pass a string of regular expressions as an argument to a constructor: 1) pass a pointer to a null terminated string, 2) pass a pointer and a length, or 3) pass a const reference to an instance of std::basic_string.
Furthermore, there are two ways to pass flags to each of them: a) pass a value of the nire::flag::type type, or b) pass a pointer to a null terminated string. Thus there are 3 x 2 = 6 ways to pass regular expressions and flags to the constructor.

regexp();
regexp(const regexp &r);    //  Copies.
regexp(regexp &&r) noexcept;    //  C++11 and later.

//  Pointer
regexp(const charT *const p, const flag::type f = flag::ECMAScript);
regexp(const charT *const p, const charT *const flags);

//  Pointer and length
regexp(const charT *const p, const std::size_t len, const flag::type f);
regexp(const charT *const p, const std::size_t len, const charT *const flags);

//  basic_string
regexp(const std::basic_string<charT> &str, const flag::type f = flag::ECMAScript);
regexp(const std::basic_string<charT> &str, const charT *const flags);
				

The following flags are available:

List of flags
Value of nire::flag::type type | String | Meaning
nire::flag::global | "g" | Global match. If specified, the next position of the matched substring is stored into the lastIndex member, and the next search will start at the index indicated by lastIndex in the subject.
nire::flag::icase | "i" | Case-insensitive matching.
nire::flag::multiline | "m" | If specified, ^ and $ match the beginning of a line and the end of a line, respectively.
nire::flag::dotall | "s" | If specified, . matches every character including newline characters.
nire::flag::sticky | "y" | If specified, search is executed only at the index indicated by the lastIndex member.
nire::flag::noflags, nire::flag::ECMAScript | "" | For indicating no flags explicitly.

When passing a value of nire::flag::type, you can specify multiple flags at once by combining them with | (bitwise OR).
When passing a string, you can specify multiple flags at once by concatenating their letters, as in "im".

assign()

Copies, moves (C++11 and later), or compiles regular expressions.
For compiling, there are six ways to pass a string of regular expressions and flags as arguments, like constructors.

regexp &assign(const regexp &r);
regexp &assign(regexp &&r) noexcept;    //  C++11 and later.

//  Pointer
regexp &assign(const charT *const p, const flag::type f = flag::ECMAScript);
regexp &assign(const charT *const p, const charT *const flags);

//  Pointer and length
regexp &assign(const charT *const p, std::size_t len, const flag::type f);
regexp &assign(const charT *const p, std::size_t len, const charT *const flags);

//  basic_string
regexp &assign(const std::basic_string<charT> &str, const flag::type f = flag::ECMAScript);
regexp &assign(const std::basic_string<charT> &str, const charT *const flags);
				
operator=()

Copies, moves (C++11 and later), or compiles regular expressions.
Unlike constructors and assign(), there are only two ways to pass an argument: 1) pass a pointer to a null terminated string, or 2) pass a const reference to std::basic_string.

regexp &operator=(const regexp &r);
regexp &operator=(regexp &&r) noexcept; //  C++11 and later.

//  Pointer
regexp &operator=(const charT *const p);

//  basic_string
regexp &operator=(const std::basic_string<charT> &str);
				
lastIndex

The index value in a subject at which the next search is to start. This value is used only if the instance has been created with the global or sticky flag set. If neither flag is set, the value of lastIndex is ignored and a search always starts at the beginning of the subject.

If the match succeeds, the position immediately after the matched substring (i.e., end of [begin, end)) is written into lastIndex. Otherwise, lastIndex is reset to 0.

Even if the subject string is changed, lastIndex is not reset automatically.

As this member is mutable, it can be written into even if it is in a const instance of regexp.

exec() const

Executes a search for a match with regular expressions in a subject string.
There are three ways to pass the subject: 1) pass a pointer to a null terminated string, 2) pass a pointer and a length, or 3) pass a const reference to std::basic_string.

There are two ways to receive results: a) pass a reference to an instance of re_match as an argument and let exec() write matching results into it, or b) let exec() write matching results into a new instance of re_match and return it.
In the former case, exec() returns a boolean value indicating whether any match has been found or not.

//  Pointer
bool exec(
    re_match<charT> &m,
    const charT *const str
) const;

re_match<charT> exec(
    const charT *const str
) const;

//  Pointer and length
bool exec(
    re_match<charT> &m,
    const charT *const str,
    const std::size_t len
) const;

re_match<charT> exec(
    const charT *const str,
    const std::size_t len
) const;

//  basic_string
bool exec(
    re_match<charT> &m,
    const std::basic_string<charT> &str
) const;

re_match<charT> exec(
    const std::basic_string<charT> &str
) const;
				

When the global or sticky flag is set, the search starts at the index indicated by lastIndex (i.e., the lastIndex-th element of a subject string). If a match is found, the next position of the matched substring (end of [begin,end)) is written into lastIndex. Otherwise, lastIndex is reset to 0.

When exec() is called in a loop, it is recommended to use any overload in which a reference to re_match is passed as an argument.
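
The lastIndex bookkeeping described above can be sketched without NIRE itself. The following is only an illustration under stated assumptions: std::regex (whose default grammar is also based on ECMAScript) stands in for the search engine, and GlobalSearcher, its members and find_all are hypothetical names that are not part of NIRE's API.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

//  Hypothetical helper (not part of NIRE): emulates the way a search with
//  the global flag advances lastIndex, using std::regex as a stand-in.
struct GlobalSearcher
{
    std::regex re;
    std::size_t lastIndex;

    explicit GlobalSearcher(const std::string &pattern)
        : re(pattern), lastIndex(0) {}

    //  Searches subject starting at lastIndex. On success, the end of the
    //  match is stored into lastIndex; on failure, lastIndex is reset to 0.
    bool exec(std::smatch &m, const std::string &subject)
    {
        if (lastIndex <= subject.size()
            && std::regex_search(
                subject.begin() + static_cast<std::string::difference_type>(lastIndex),
                subject.end(), m, re))
        {
            //  position() is relative to the iterator the search started at.
            lastIndex += static_cast<std::size_t>(m.position(0) + m.length(0));
            return true;
        }
        lastIndex = 0;
        return false;
    }
};

//  Collects every match by calling exec() repeatedly with one match object.
std::vector<std::string> find_all(const std::string &pattern, const std::string &subject)
{
    GlobalSearcher gs(pattern);
    std::vector<std::string> out;
    std::smatch m;

    while (gs.exec(m, subject))
    {
        out.push_back(m.str(0));
        if (m.length(0) == 0)
            ++gs.lastIndex; //  Avoids looping forever on an empty match.
    }
    return out;
}
```

With this sketch, find_all("\\d+", "1234-5678-90ab-cdef") collects "1234", "5678" and "90", after which the failed search resets lastIndex to 0.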

test() const

Behaves as exec(), except that matching results are not written into an instance of re_match. If a match is found, returns true. Otherwise returns false.

//  Pointer
bool test(
    const charT *const str
) const;

//  Pointer and length
bool test(
    const charT *const str,
    const std::size_t len
) const;

//  basic_string
bool test(
    const std::basic_string<charT> &str
) const;
				
replace() const

Behaves like String.prototype.replace(regexp-object, newSubStr|callback-function) of JavaScript.
Replaces a substring in a subject that matches regular expressions with another string. The resulting string is returned and the original string is left unchanged.

There are two ways to specify the string that replaces a matched substring: 1) pass a replacement string, or 2) pass a callback function.

When the global flag is set, all matched substrings in the subject are to be replaced. Otherwise, only the first match is to be replaced.

//  Replacement with a string
//  Pointer
std::basic_string<charT> replace(
    const charT *const str,
    const charT *const fmt
) const;

//  Pointer and length
std::basic_string<charT> replace(
    const charT *const str,
    const std::size_t len,
    const charT *const fmt_text,
    const std::size_t fmt_len
) const;

//  basic_string
std::basic_string<charT> replace(
    const std::basic_string<charT> &str,
    const std::basic_string<charT> &fmt
) const;

//  Replacement by callback function
typedef std::basic_string<charT> (*replace_function)(const re_match<charT> &);

//  Pointer
std::basic_string<charT> replace(
    const charT *const str,
    const replace_function repfunc
) const;

//  Pointer and length
std::basic_string<charT> replace(
    const charT *str,
    const std::size_t len,
    const replace_function repfunc
) const;

//  basic_string
std::basic_string<charT> replace(
    const std::basic_string<charT> &str,
    const replace_function repfunc
) const;
				

Replacement by specifying a string

The replacement string can contain the special symbols defined in the table in the ECMAScript specification, Runtime Semantics: GetSubstitution. Any character that does not exist in the table is copied as is to a new string to be returned.

Special symbols for replacement
Symbol | Inserted string as replacement
$$ | $ itself.
$& | The entire matched substring.
$` | The substring that precedes the matched substring.
$' | The substring that follows the matched substring.
$n, where n is one of 1 2 3 4 5 6 7 8 9 not followed by a digit | The substring captured by the nth pair of round brackets (1-based index) in the regular expressions. Replaced with an empty string if nothing is captured. Not replaced if the number n is greater than the number of capturing brackets in the regular expressions.
$nn, where nn is any value in the range 01 to 99 inclusive | The substring captured by the nnth pair of round brackets (1-based index) in the regular expressions. Replaced with an empty string if nothing is captured. Not replaced if the number nn is greater than the number of capturing brackets in the regular expressions.
$<NAME> |
  1. If the regular expressions contain no named group at all, no replacement happens.
  2. Otherwise, replaced with the substring captured by the pair of round brackets whose group name is NAME. If no capturing group named NAME exists, or nothing is captured by that group, replaced with an empty string.
//  Sample of replacement with a string.
#include <cstdio>
#include <string>
#include "nire.hpp"

int main()
{
    const nire::RegExp re("(\\d)(\\d)", "g");   //  Searches for two consecutive digits.
    const std::string rep = re.replace("abc0123456789def", "($2$1)");
        //  Exchanges the order of $1 and $2 and encloses them with a pair of brackets.

    std::printf("Result: %s\n", rep.c_str());
    return 0;
}
---- output ----
Result: abc(10)(32)(54)(76)(98)def
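
The symbols $$, $&, $` and $' can be tried out with a short sketch. NIRE is not used here: std::regex_replace, whose default format rules also follow ECMAScript, stands in for replace(), and replace_all and replace_first are hypothetical helper names. $<NAME> is left out because std::regex has no named groups, and NIRE's behaviour is assumed rather than verified to agree in these cases.

```cpp
#include <regex>
#include <string>

//  Replaces every match (comparable to a search with the global flag set).
std::string replace_all(const std::string &subject, const std::string &pattern,
    const std::string &fmt)
{
    return std::regex_replace(subject, std::regex(pattern), fmt);
}

//  Replaces only the first match (comparable to the global flag being unset).
std::string replace_first(const std::string &subject, const std::string &pattern,
    const std::string &fmt)
{
    return std::regex_replace(subject, std::regex(pattern), fmt,
        std::regex_constants::format_first_only);
}
```

For instance, replace_all("abc-def", "-", "[$`|$&|$']") yields "abc[abc|-|def]def", and replace_first("1 2 3", "\\d", "<$&>") yields "<1> 2 3".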
					

Replacement by callback function

When a match is found, the passed callback function is called with the argument that is a const reference to an internal instance of re_match<charT> into which matching results have been written.
The callback function has to return a new string of the std::basic_string<charT> type. Special symbols mentioned above do not apply in this case.

//  Sample of replacement by callback function.
#include <cstdio>
#include <string>
#include "nire.hpp"

std::string repfunc1(const nire::Match &m)
{
    return " [" + m[1].str() + ", " + m[2].str() + "] ";
}

int main()
{
    const nire::RegExp re("(\\d)(\\d)", "g");
    const std::string text("abc0123456789def");
    std::string rep;

    rep = re.replace(text, repfunc1);
    std::printf("Result1(C++98/C++03): %s\n", rep.c_str());

    //  Replacement by lambda of C++11 and later.
    rep = re.replace(text, [](const nire::Match &m) -> std::string {
        return " [" + m[1].str() + ", " + m[2].str() + "] ";
    });
    std::printf("Result2(C++11): %s\n", rep.c_str());

    //  C++14 and later.
    rep = re.replace(text, [](const auto &m) {
        return " [" + m[1].str() + ", " + m[2].str() + "] ";
    });
    std::printf("Result3(C++14): %s\n", rep.c_str());

    return 0;
}
---- output ----
Result1(C++98/C++03): abc [0, 1]  [2, 3]  [4, 5]  [6, 7]  [8, 9] def
Result2(C++11): abc [0, 1]  [2, 3]  [4, 5]  [6, 7]  [8, 9] def
Result3(C++14): abc [0, 1]  [2, 3]  [4, 5]  [6, 7]  [8, 9] def
					
split() const

Behaves like String.prototype.split(regexp-object, limit) of JavaScript.
Splits a string at the occurrences that match regular expressions, creates new instances of std::basic_string from the pieces, and pushes them into the container passed as out. If regular expressions contain capturing round brackets, the substrings captured by them are also pushed into the container. When a pair of round brackets does not capture anything, an empty string is pushed rather than being skipped.

Any container type can be used to receive results if it has at least clear(), size(), push_back() member functions. The returned value is the number of times push_back() was called.

//  Pointer
template <typename ContainerT>
std::size_t split(
    ContainerT &out,
    const charT *const str,
    const std::size_t limit = static_cast<std::size_t>(-1),
    const bool push_remainder = false
) const;

//  Pointer and length (length comes before pointer for overload resolution)
template <typename ContainerT>
std::size_t split(
    ContainerT &out,
    const std::size_t len,
    const charT *const str,
    const std::size_t limit = static_cast<std::size_t>(-1),
    const bool push_remainder = false
) const;

//  basic_string
template <typename ContainerT>
std::size_t split(
    ContainerT &out,
    const std::basic_string<charT> &str,
    const std::size_t limit = static_cast<std::size_t>(-1),
    const bool push_remainder = false
) const;
				

limit specifies the maximum number of times splitting is performed. For example, given that the specified pattern is /,/ and limit is 5, splitting is finished when the fifth ',' is found in the subject string.

Although a limit parameter like this is common in scripting languages, JavaScript's split() is odd in that, once the number of splits reaches limit, it discards the remainder that has not been scanned yet instead of pushing it into the returned container.
As I am personally unhappy with this behaviour, the push_remainder flag has been added to NIRE's split() as an extension. When push_remainder is true, the unscanned remainder is pushed as the final element, so the container holds at most limit + 1 elements. This is similar to the behaviour of Python's split().

//  Behaviours of limit and push_remainder.
#include <cstdio>
#include <string>
#include <vector>
#include "nire.hpp"

int main()
{
    const nire::RegExp re(":");
    const std::string text("0:12:345:6789:abcdef");
    std::vector<std::string> res;

    //  Splitting unlimitedly.
    re.split(res, text);
    for (std::size_t i = 0; i < res.size(); ++i)
        std::printf("%s%s", i == 0 ? "[" : ", ", res[i].c_str());
    std::puts("]");

    //  limit=2. Remainder is thrown away.
    re.split(res, text, 2);
    for (std::size_t i = 0; i < res.size(); ++i)
        std::printf("%s%s", i == 0 ? "[" : ", ", res[i].c_str());
    std::puts("]");

    //  limit=2 and push_remainder=true. Remainder is all pushed as the final element.
    re.split(res, text, 2, true);
    for (std::size_t i = 0; i < res.size(); ++i)
        std::printf("%s%s", i == 0 ? "[" : ", ", res[i].c_str());
    std::puts("]");

    return 0;
}
---- output ----
[0, 12, 345, 6789, abcdef]
[0, 12]
[0, 12, 345:6789:abcdef]
				
swap()

Swaps the contents of two instances. re1.swap(re2) and nire::swap(re1, re2) are defined.

Flag functions

The following member functions report which flags were set when regular expressions were passed to the pattern compiler: flags() returns them as a string, and each of the others returns a boolean value for a specific flag. They have been named after the property names in JavaScript.

List of flag functions
Name Return Value
std::basic_string<charT> flags(); Stringified flags that were passed to the pattern compiler.
bool global(); Whether nire::flag::global or "g" was set.
bool ignoreCase(); Whether nire::flag::icase or "i" was set.
bool multiline(); Whether nire::flag::multiline or "m" was set.
bool dotAll(); Whether nire::flag::dotall or "s" was set.
bool unicode(); Returns true always.
bool sticky(); Whether nire::flag::sticky or "y" was set.

re_match<charT>

Receives matching results. This class has been designed based on the return value of RegExp.exec() of JavaScript.

Cast to bool

An instance of re_match can be converted to a boolean value.

std::size_t length() const
std::size_t size() const

When the instance holds successful matching results, returns one plus the number of pairs of capturing brackets in the regular expressions. Otherwise (including the case where the instance has not received any matching results since it was created), returns 0. The "plus one" is there because the entire match is treated as being captured by an implicit 0th pair of brackets.

For getting the number of elements, .size() is a common member function name in C++ classes, whereas using the .length property is common in JavaScript. In NIRE both size() and length() are provided so that either one can be used.

const re_group<charT> &operator[](unsigned int n) const
const re_group<charT> &operator[](const charT *gname) const
const re_group<charT> &operator[](const std::basic_string<charT> &gname) const

Given that an instance of re_match is m, m[0] returns a pair of indices that represent [begin, end) of the entire match in a subject. m[n] where n >= 1 returns a pair of indices [begin, end) of a substring captured by the n-th (1-based index) pair of round brackets.
The argument n must be in the range from 0 to length() - 1 inclusive.

The overloads that take a pointer or a string as a parameter return a pair of indices that represent [begin, end) of the substring captured by the group named gname.

Because the return type, re_group<charT>, can be cast to std::basic_string<charT>, it is possible to receive the matched substring directly instead of index information. This type is explained in detail in a later section.

std::size_t index(unsigned int n = 0) const
std::size_t index(const charT *gname) const
std::size_t index(const std::basic_string<charT> &gname) const

Returns the 0-based index value of the first element of the entire match (when the argument is omitted or n is 0) or of a substring captured by the pair of the corresponding round brackets (otherwise).

std::size_t endIndex(unsigned int n = 0) const
std::size_t endIndex(const charT *gname) const
std::size_t endIndex(const std::basic_string<charT> &gname) const

Returns the 0-based index value of the element immediately following the entire match (when the argument is omitted or n is 0) or following a substring captured by the corresponding pair of round brackets (otherwise), i.e., the end of [begin, end).
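
The relation between size(), index() and endIndex() can be pictured with the closest standard-library counterparts. This is a sketch only: std::smatch stands in for re_match, with size(), position(n) and position(n) + length(n) playing the roles of size(), index(n) and endIndex(n). Span and locate are hypothetical names introduced for the illustration.

```cpp
#include <cstddef>
#include <regex>
#include <string>

//  Hypothetical summary type used only in this sketch.
struct Span
{
    std::size_t count;  //  1 + number of capturing groups, or 0 if no match.
    std::size_t begin;  //  0-based index of the first element of group n.
    std::size_t end;    //  0-based index just past the last element of group n.
};

//  Searches subject and reports where group n (0 = entire match) lies.
Span locate(const std::string &subject, const std::string &pattern, const std::size_t n)
{
    const std::regex re(pattern);
    std::smatch m;
    Span s = {0, 0, 0};

    if (std::regex_search(subject, m, re) && n < m.size())
    {
        s.count = m.size();
        s.begin = static_cast<std::size_t>(m.position(n));
        s.end = s.begin + static_cast<std::size_t>(m.length(n));
    }
    return s;
}
```

For the subject "date: 2020/05/17" and the pattern (\d+)/(\d+)/(\d+), the entire match occupies [6, 16), the second group occupies [11, 13), and the element count is 4.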

swap()

Swaps the contents of two instances. m1.swap(m2) and nire::swap(m1, m2) are defined.

re_group<charT>

This struct holds a pair of 0-based indices indicating the range [begin, end) of an entire match or substring set by any member function of NIRE. This type is trivially copyable (std::is_trivially_copyable), so it can safely be copied with memcpy.
As this class holds only indices, the original subject string must still exist when the contents of a matched or captured substring are actually retrieved.

This class has the following members:

Cast to bool

An instance of re_group can be converted to a boolean value indicating whether the instance has valid indices.

Cast to std::basic_string<charT>
std::basic_string<charT> str() const

Creates a new instance of std::basic_string<charT> that contains a matched or captured substring based on the internal indices and returns it.

std::size_t length() const
std::size_t size() const

Returns the length of the range held by the instance. If the instance does not have any valid range, 0 is returned.

Like re_match, re_group has both size() and length().

matched() const

Returns a boolean value indicating whether the instance has valid indices.

Comparison operators

For comparison between two instances of re_group<charT> or with an instance of std::basic_string<charT>, there are ==, !=, <, <=, >, >=.

re_error

When regular expressions are compiled or a search is executed, if any error occurs, nire::re_error is thrown.
nire::re_error is a derived class from std::runtime_error and has the following members:

//  nire::re_error's members.
const char *what() const;       //  Derived from std::runtime_error.
re_error::type code() const;    //  Returns an error code.
		

The list of error codes is as follows:

List of error codes
Value | Condition | Note
escape | Invalid escaped character. \ is followed by an inappropriate character or exists at the end of a sequence. | Thrown by the pattern compiler.
backref | Invalid backreference. No capturing round brackets corresponding to the backreference. | Thrown by the pattern compiler.
sqbrack | Mismatched [ and ]. | Thrown by the pattern compiler.
paren | Mismatched ( and ). | Thrown by the pattern compiler.
brace | Mismatched { and }. | Thrown by the pattern compiler.
badbrace | Invalid character contained in {}. | Thrown by the pattern compiler.
range | Invalid range in a character class, such as [b-a]. The character at the left side of - has a code point value greater than the one at the right side. | Thrown by the pattern compiler.
badrepeat | Invalid character preceding a quantifier * ? + {n,m}. | Thrown by the pattern compiler.
utf8 | Invalid UTF-8 sequence. | Thrown by the pattern compiler.
complexity | Complicated matching. | Thrown when matching is performed.

Additionally, std::bad_alloc is thrown when memory allocation fails.
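
The try/catch pattern this implies can be sketched as follows. NIRE is not used here: std::regex_error, which likewise derives from std::runtime_error and has a code() member, stands in for nire::re_error, and compiles is a hypothetical helper name. Which patterns are rejected, and with which code, may differ between the two libraries.

```cpp
#include <regex>
#include <string>

//  Returns true if the pattern compiles, false if the pattern compiler
//  rejects it by throwing.
bool compiles(const std::string &pattern)
{
    try
    {
        const std::regex re(pattern);   //  Compilation happens here.
        static_cast<void>(re);
        return true;
    }
    catch (const std::regex_error &)
    {
        //  The caught object's code() tells which kind of error occurred,
        //  in the same spirit as re_error::code().
        return false;
    }
}
```

A mismatched bracket such as "(abc" and an inverted range such as "[b-a]" are both rejected in this sketch.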

Features available in C++11 and later

Features introduced in C++11 and later that may be used by NIRE are as follows:

As of May 2020, NIRE determines whether these features are available by checking the following macros:

#ifdef __cpp_unicode_characters
  #ifndef NIRE_CPP11_CHAR1632_ENABLED
  #define NIRE_CPP11_CHAR1632_ENABLED   //  Do typedefs for char16_t, char32_t.
  #endif
#endif

#ifdef __cpp_rvalue_references
  #ifndef NIRE_CPP11_MOVE_ENABLED
  #define NIRE_CPP11_MOVE_ENABLED   //  Uses move semantics in constructors and assignments.
  #endif
#endif

#ifdef __cpp_char8_t
  #ifndef NIRE_CPP20_CHAR8_ENABLED
  #define NIRE_CPP20_CHAR8_ENABLED  //  Do typedef for char8_t.
  #endif
#endif
	

If your compiler does not define the __cpp_* macros even though the corresponding features are actually available, you can turn on the feature(s) you need by defining the corresponding NIRE_CPP* macro(s) shown above before including nire.hpp.

NIRE's regular expression syntax

NIRE supports the regular expressions defined in ECMAScript 2020 (ES11) Specification 21.2 RegExp (Regular Expression) Objects (The u flag is assumed to be always specified).

The full list is as follows:

List of Regular Expressions available in NIRE
Characters
.

Matches any character but LineTerminator (i.e., any code point but U+000A, U+000D, U+2028, and U+2029).
If the dotall option flag is passed to the pattern compiler, '.' matches every code point (i.e., equivalent to [\u{0}-\u{10ffff}]). It corresponds to //s in Perl 5.

\0

Matches NULL (\u0000).

\t

Matches Horizontal Tab (\u0009).

\n

Matches Line Feed (\u000a).

\v

Matches Vertical Tab (\u000b).

\f

Matches Form Feed (\u000c).

\r

Matches Carriage Return (\u000d).

\cX

Matches a control character corresponding to ((the code point value of X) & 0x1f) where X is one of [A-Za-z].
If \c is not followed by one of A-Z or a-z, then error_escape is thrown.
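
The masking rule can be written out as a few lines of plain C++. This is a sketch, not NIRE's code: control_char is a hypothetical name, an ASCII-compatible execution character set is assumed, and -1 is returned where the pattern compiler would throw.

```cpp
//  Returns the control character denoted by \cX, i.e. (code of X) & 0x1f.
//  Returns -1 when X is not one of A-Z or a-z (the case in which the
//  pattern compiler is described as throwing an error).
int control_char(const char x)
{
    const bool is_letter = (x >= 'A' && x <= 'Z') || (x >= 'a' && x <= 'z');
    return is_letter ? (x & 0x1f) : -1;
}
```

For example, both \cJ and \cj give 0x0a (Line Feed), and \cM gives 0x0d (Carriage Return).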

\\

Matches a backslash (\u005c) itself.

\xHH

Matches a character whose code unit value in UTF-16 is identical to the value represented by two hexadecimal digits HH.
If \x is not followed by two hexadecimal digits, error_escape is thrown.

Because code unit values 0x00-0xFF in UTF-16 represent U+0000-U+00FF respectively, HH in this expression virtually represents a code point.

\uHHHH

Matches a character whose Unicode code point is identical to the value represented by four hexadecimal digits HHHH.
If \u is not followed by four hexadecimal digits, error_escape is thrown.

When sequential \uHHHH escapes represent a valid surrogate pair in UTF-16, they are interpreted as a Unicode code point value. For example, /\uD842\uDF9F/ is interpreted as being equivalent to /\u{20B9F}/.
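
The arithmetic behind this equivalence is the standard UTF-16 decoding formula, shown here as a small sketch. The function names are hypothetical and the code is not taken from NIRE.

```cpp
//  True if hi is a high (leading) surrogate and lo is a low (trailing) one.
bool is_surrogate_pair(const unsigned long hi, const unsigned long lo)
{
    return hi >= 0xD800UL && hi <= 0xDBFFUL && lo >= 0xDC00UL && lo <= 0xDFFFUL;
}

//  Combines a valid surrogate pair into a single Unicode code point.
//  The caller is expected to check is_surrogate_pair() first.
unsigned long combine_surrogates(const unsigned long hi, const unsigned long lo)
{
    return 0x10000UL + ((hi - 0xD800UL) << 10) + (lo - 0xDC00UL);
}
```

Applied to 0xD842 and 0xDF9F, this gives 0x10000 + (0x42 << 10) + 0x39F = 0x20B9F.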

\u{H...}

Matches a character whose Unicode code point is identical to the value represented by one or more hexadecimal digits H....
If the inside of {} in \u{...} is not one or more hexadecimal digits, a value represented by the hexadecimal digits exceeds the max value of Unicode code points (0x10FFFF), or the closing curly bracket '}' does not exist, then error_escape is thrown.

\

When a \ is followed by one of ^ $ . * + ? ( ) [ ] { } | /, the sequence represents the following character itself. This removes the special meaning that the character normally has in regular expressions and makes the pattern compiler interpret it literally. (The reason why '/' is also included in the list is probably that a sequence of regular expressions is enclosed by // in ECMAScript.)
In a character class (described below), '-' also belongs to this group in addition to the fourteen characters above, and can be written as "\-".

Any character but
^$.*+?()[]{}|\/

Represents that character itself.

Alternatives
A|B

Matches a sequence of regular expressions A or B. An arbitrary number of '|' can be used to separate expressions, such as /abc|def|ghi?|jkl?/.
Each sequence of regular expressions separated by '|' is tried from left to right, and only the first sequence that succeeds in matching is adopted.
For example, when matching /abc|abcdef/ against "abcdef", the result is "abc".
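
The left-to-right preference can be checked with a short sketch. NIRE is not used here; std::regex, whose default grammar is also based on ECMAScript, stands in for it, and first_match is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

//  Returns the text of the first match, or an empty string if none is found.
std::string first_match(const std::string &pattern, const std::string &subject)
{
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(subject, m, re) ? m.str(0) : std::string();
}
```

Here first_match("abc|abcdef", "abcdef") gives "abc", whereas swapping the alternatives to "abcdef|abc" gives "abcdef".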

Character Class
[]

A character class. A set of characters:

  • [ABC] matches 'A', 'B', or 'C'.
  • [^DEF] matches any character but 'D', 'E', 'F'. When the first character in [] is '^', any character not included in [] is matched. I.e., '^' as the first character means negation.
  • [G^H] matches 'G', '^', or 'H'. '^' that is not the first character in [] is treated as an ordinary character.
  • [I-K] matches 'I', 'J', or 'K'. The sequence CH1-CH2 represents "any character in the range from the Unicode code point of CH1 to the code point of CH2 inclusive".
  • [-LM] matches '-', 'L', or 'M'. '-' that does not fall under the condition above is treated as an ordinary character.
  • [N-P-R] matches 'N', 'O', 'P', '-', or 'R'; does not match 'Q'. '-' following a range sequence represents '-' itself.
  • [S\-U] matches 'S', '-', or 'U'. '-' escaped by \ is treated as '-' itself ("\-" is available only in the character class).
  • [.|({] matches '.', '|', '(', or '{'. These characters lose their special meanings in [].
  • [] is the empty class. It does not match any code point. This expression always makes matching fail whenever it occurs.
  • [^] is the complementary set of the empty class. Thus it matches any code point. The same as [\0-\u{10FFFF}].

Although in Perl's regular expressions a ']' immediately after '[' is treated as a literal ']', there is no such special treatment in ECMAScript's RegExp. To include ']' in a character class, it must always be prefixed with '\' and written as "\]".

If regular expressions contain a mismatched '[' or ']', error_brack is thrown. If regular expressions contain an invalid character range such as [b-a], error_range is thrown.

Predefined Character Classes
\d

Equivalent to [0-9]. This expression can be used also in a character class, such as [\d!"#$%&'()].

\D

Equivalent to [^0-9]. This can be used in a character class, as well as \d.

\s

Equivalent to [ \t\n\v\f\r\u00a0\u1680\u2000-\u200a\u2028-\u2029\u202f\u205f\u3000\ufeff]. This can be used in a character class, too.

Note: Strictly speaking, this consists of the union of WhiteSpace and LineTerminator. If code points are added to category Zs in Unicode in the future, the number of code points that \s matches increases accordingly.

\S

Equivalent to [^ \t\n\v\f\r\u00a0\u1680\u2000-\u200a\u2028-\u2029\u202f\u205f\u3000\ufeff]. This can be used in a character class, too.

\w

Equivalent to [0-9A-Za-z_]. This can be used in a character class, too.

\W

Equivalent to [^0-9A-Za-z_]. This can be used in a character class, too.

\p{...}

Matches any character that has the Unicode property specified in "...". For example, \p{scx=Latin} matches every Latin character defined in Unicode. This expression can be used in a character class, too.

For details of what can be specified in "...", see the tables in the latest draft of the ECMAScript specification.

Note: As of 2020, both the ECMAScript and Unicode specifications are released annually. However, each new edition of ECMAScript is cut for publication before that year's version of the Unicode standard is released, so the list of values available in \p and \P in a published ECMAScript edition keeps depending on the previous version of the Unicode standard.
For this reason, for the values available in \p and \P, NIRE generally follows the latest draft of the ECMAScript specification, which references the latest version of the Unicode standard, rather than a formally published edition.

\P{...}

Matches any character that does not have the Unicode property specified in "...". This can be used in a character class, too.

Quantifiers
*
*?

Repeats matching the preceding expression 0 or more times. * tries to match as many as possible, whereas *? tries to match as few as possible.

If this appears without a preceding expression, error_badrepeat is thrown. This applies to the following five also.

+
+?

Repeats matching the preceding expression 1 or more time(s). + tries to match as many as possible, whereas +? tries to match as few as possible.

?
??

Repeats matching the preceding expression 0 or 1 time(s). ? tries to match as many as possible, whereas ?? tries to match as few as possible.

{n}

Repeats matching the preceding expression exactly n times.

If regular expressions contain a mismatched '{' or '}', error_brace is thrown. This applies to the following two also.

{n,}
{n,}?

Repeats matching the preceding expression at least n times. {n,} tries to match as many as possible, whereas {n,}? tries to match as few as possible.

{n,m}
{n,m}?

Repeats matching the preceding expression at least n times and at most m times. {n,m} tries to match as many as possible, whereas {n,m}? tries to match as few as possible.

If an invalid range in {} is specified like {3,2}, error_badbrace is thrown.
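
The difference between the greedy and non-greedy forms can be seen in a short sketch. NIRE is not used here; std::regex with its default ECMAScript-based grammar stands in for it, and matched_text is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

//  Returns the text matched first in subject, or an empty string if
//  nothing matches.
std::string matched_text(const std::string &pattern, const std::string &subject)
{
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(subject, m, re) ? m.str(0) : std::string();
}
```

Against "<a><b>", the greedy <.+> takes the whole string while <.+?> stops at "<a>". Against "aaaa", a{2,3} takes "aaa" while a{2,3}? takes "aa".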

Brackets and backreference
(...)

Groups regular expressions and captures the matched string. Each pair of capturing brackets is assigned a number starting from 1, in the order in which its opening round bracket '(' appears from left to right in the entire sequence of regular expressions. The substring matched by the regular expressions enclosed by the pair can be referenced by that number from elsewhere in the expressions.

If regular expressions contain a mismatched '(' or ')', error_paren is thrown.

When a pair of capturing round brackets itself is bound with a quantifier, or is inside another pair of brackets having a quantifier, the string captured by the pair is cleared whenever a repetition happens. So a captured string cannot be carried over to the next loop.

\N (N is a positive integer)

Backreference. When '\' is followed by a number that begins with 1-9, it is regarded as a backreference to the string captured by the (...) assigned the corresponding number, and matching is performed with that string. If a pair of brackets assigned number N does not exist in the entire sequence of regular expressions, error_backref is thrown.

For example, /(TO|to)..\1/ matches "TOMATO" or "tomato", but does not match "Tomato".

In RegExp of ECMAScript, capturing brackets are not required to appear prior to their corresponding backreference(s). So expressions such as /\1(abc)/ and /(abc\1)/ are valid and not treated as an error.

When a pair of brackets does not capture anything, it is treated as having captured the special undefined value. A backreference to undefined is equivalent to an empty string, so matching against it always succeeds.
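The /(TO|to)..\1/ example above can be tried with the following sketch. It is written against std::regex instead of NIRE so that it compiles with the standard library alone; the pattern string is the same one that would be given to nire::RegExp. The function name has_repeated_prefix is hypothetical. Only the basic case is shown, because implementations of std::regex are not guaranteed to agree with ECMAScript on backreferences to groups that captured nothing.

```cpp
#include <regex>
#include <string>

//  True when text contains a match for /(TO|to)..\1/.
//  NOTE: std::regex is a stand-in here, not NIRE's own interface.
bool has_repeated_prefix(const std::string& text)
{
    static const std::regex re("(TO|to)..\\1");
    return std::regex_search(text, re);
}
```

has_repeated_prefix returns true for "TOMATO" and "tomato", and false for "Tomato" and "TOMAto", since \1 must repeat exactly what the first pair of brackets captured.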

(?<NAME>...)

Identical to (...) except that a substring matched with the regular expressions inside a pair of brackets can be referenced by the name NAME as well as the number assigned to the pair of the brackets.

For example, in the case of /(?<year>\d+)\/(?<month>\d+)\/(?<day>\d+)/, the string captured by the first pair of parentheses can be referenced by either \1 or \k<year>.

\k<NAME>

Refers to the substring captured by the pair of brackets whose name is NAME. If the corresponding pair of brackets does not exist in the entire sequence of regular expressions, error_backref is thrown.

(?:...)

Grouping. Unlike (...), this does not capture anything but only does grouping. So no number for backreference is assigned.
For example, /tak(?:e|ing)/ matches "take" or "taking", but does not capture anything for backreference. Usually, this is somewhat faster than (...).
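As a rough check of the difference between (...) and (?:...), the sketch below counts capture groups and reads the first captured substring. It relies on std::regex (not NIRE) so that it is self-contained; capture_count and first_capture are names invented for this illustration.

```cpp
#include <regex>
#include <string>

//  Number of capturing bracket pairs in the compiled pattern.
//  NOTE: std::regex is a stand-in here, not NIRE's own interface.
unsigned capture_count(const std::string& pattern)
{
    return std::regex(pattern).mark_count();
}

//  Text captured by the first pair of capturing brackets,
//  or an empty string when there is no match or no such pair.
std::string first_capture(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern);
    std::smatch m;
    if (std::regex_search(text, m, re) && m.size() > 1)
        return m.str(1);
    return std::string();
}
```

With "tak(e|ing)" there is one capture group and first_capture yields "ing" for "taking"; with "tak(?:e|ing)" the count is 0 and nothing is captured, although the overall match is the same.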

Assertions
^

Matches at the beginning of the string. When the multiline option is specified, ^ also matches every position immediately after a LineTerminator.

$

Matches at the end of the string. When the multiline option is specified, $ also matches every position immediately before a LineTerminator.

\b

Out of a character class: matches a boundary between \w and \W.

Inside a character class: matches BS (backspace, \u0008).

\B

Out of a character class: matches any boundary where \b does not match.

Inside a character class: error_escape is thrown.
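The behaviour of \b and \B outside a character class can be sketched as follows. std::regex is used in place of NIRE so that the snippet needs only the standard library; first_offset is a hypothetical helper, and the [\b] form is left out because it is not exercised here.

```cpp
#include <regex>
#include <string>

//  Offset of the first match of pattern in text, or -1 when there is none.
//  NOTE: std::regex is a stand-in here, not NIRE's own interface.
long first_offset(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? static_cast<long>(m.position(0)) : -1L;
}
```

For the text "concat cat", "\\bcat\\b" is found at offset 7 (the standalone word), whereas "\\Bcat" is found at offset 3, inside "concat", where there is no boundary between \w and \W.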

(?=...)

A zero-width positive lookahead assertion. For example, /a(?=bc|def)/ matches "a" followed by "bc" or "def", but only "a" is counted as a match.

(?!...)

A zero-width negative lookahead assertion. For example, /a(?!bc|def)/ matches "a" followed by neither "bc" nor "def".
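The two lookahead assertions can be tried with the sketch below, which reports where the match starts and how long it is. It is based on std::regex rather than NIRE so that it is self-contained. Lookbehind is not covered, because the grammar of std::regex does not provide it. The name match_span is an assumption for this example.

```cpp
#include <regex>
#include <string>
#include <utility>

//  Returns (offset, length) of the first match, or (-1, 0) when there is none.
//  NOTE: std::regex is a stand-in here, not NIRE's own interface.
std::pair<long, long> match_span(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern);
    std::smatch m;
    if (!std::regex_search(text, m, re))
        return std::make_pair(-1L, 0L);
    return std::make_pair(static_cast<long>(m.position(0)),
                          static_cast<long>(m.length(0)));
}
```

match_span("a(?=bc|def)", "ab abc") gives (3, 1): the first "a" is rejected because "b " follows it, and the matched length is 1 because the text inside the assertion is not consumed. match_span("a(?!bc|def)", "abc adef ax") gives (9, 1).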

(?<=...)

A zero-width positive lookbehind assertion. For example, /(?<=bc|de)a/ matches "a" following "bc" or "de", but only "a" is counted as a match and "bc" or "de" is not.

(?<!...)

A zero-width negative lookbehind assertion. For example, /(?<!bc|de)a/ matches "a" preceded by neither "bc" nor "de".

Footnotes

Others

bRegExp

bRegExp is a specialisation for searching against non-Unicode sequences such as binary data. It interprets an input string as a sequence of values in the range from 0 to ((1 << CHAR_BIT) - 1) inclusive.

//  Test of bRegExp.
#include <cstdio>
#include <string>
#include "nire.hpp"

int main()
{
    //  "\xE3\x81\x82" is HIRAGANA LETTER A (U+3042)
    //  "\xC3\xA3" is 'ã' (U+00E3)
    const std::string text("\xe3\x81\x82""\xc3\xa3");
    nire::RegExp re("\\xe3");   //  Searches for a character whose code point is U+00E3.
    nire::bRegExp bre("\\xe3"); //  Searches for a byte whose value is 0xE3.
    nire::Match mr, mbr;

    mr = re.exec(text);
    mbr = bre.exec(text);
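    //  Casts keep the arguments consistent with %u whatever integer types are returned.
    std::printf("RegExp:%u+%u bRegExp:%u+%u\n",
        static_cast<unsigned>(mr.index()), static_cast<unsigned>(mr[0].length()),
        static_cast<unsigned>(mbr.index()), static_cast<unsigned>(mbr[0].length()));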

    return 0;
}
---- output ----
RegExp:3+2 bRegExp:0+1
			

re of the RegExp type searched for a UTF-8 sequence that represents U+00E3, whereas bre of the bRegExp type searched for an octet whose value is 0xE3.
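To see where the two results 3+2 and 0+1 come from, the following sketch reproduces the two interpretations by hand, without any regular expression engine. It is not how NIRE works internally. It only decodes UTF-8 (assuming well-formed input) to find a code point, and scans raw bytes to find an octet. All names in it are hypothetical.

```cpp
#include <cstddef>
#include <string>

struct Hit
{
    std::size_t index;   //  Offset in bytes from the beginning of the string.
    std::size_t length;  //  Length in bytes of what was found.
    bool found;
};

//  Length of a UTF-8 sequence judged from its first byte.
std::size_t sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if ((lead >> 5) == 0x06) return 2;
    if ((lead >> 4) == 0x0E) return 3;
    if ((lead >> 3) == 0x1E) return 4;
    return 1;   //  Malformed lead byte; this sketch simply steps over it.
}

//  Interprets text as UTF-8 and looks for the code point target.
Hit find_code_point(const std::string& text, unsigned long target)
{
    std::size_t i = 0;
    while (i < text.size())
    {
        const unsigned char lead = static_cast<unsigned char>(text[i]);
        const std::size_t len = sequence_length(lead);
        if (i + len > text.size())
            break;
        //  Payload bits of the lead byte, then 6 bits per continuation byte.
        unsigned long cp = (len == 1) ? lead : (lead & (0xFFu >> (len + 1)));
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(text[i + k]) & 0x3Fu);
        if (cp == target)
        {
            const Hit hit = { i, len, true };
            return hit;
        }
        i += len;
    }
    const Hit none = { 0, 0, false };
    return none;
}

//  Interprets text as a plain sequence of bytes and looks for the value target.
Hit find_byte(const std::string& text, unsigned char target)
{
    const std::size_t pos = text.find(static_cast<char>(target));
    const Hit hit = { pos == std::string::npos ? 0 : pos,
                      pos == std::string::npos ? std::size_t(0) : std::size_t(1),
                      pos != std::string::npos };
    return hit;
}
```

For the same text as above, find_code_point(text, 0xE3) reports index 3 and length 2 (the sequence C3 A3), while find_byte(text, 0xE3) reports index 0 and length 1 (the first byte of the UTF-8 sequence for U+3042).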

Measures against long time thinking

The backtracking algorithm used by the regular expression engine of ECMAScript (and also Perl on which it is based) can require exponential time to search when the given regular expressions include nested quantifiers, or consecutive expressions each having a quantifier whose character sets are not mutually exclusive. The following are well-known examples:

  • "aaaaaaaaaaaaaaaaaaaaaaaaaaaaa" =~ /(a*)*b/
  • "aaaaaaaaaaaaaaaaaaaaaaaaaaaaa" =~ /a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaa/

Unfortunately, no fundamental measure against this problem that can be applied in every situation has been found yet. So, to avoid holding control for a long time, NIRE throws re_error(re_error::type::complexity) when matching at one particular position fails repeatedly more than a certain number of times.

The default value of this limit is 16777216 (256 to the third power). It can be changed by assigning a new value to the backtrackingLimit member variable of an instance of the regexp type.
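The idea of counting failures and giving up can be illustrated with a toy backtracking matcher. This is only a sketch under stated assumptions, not NIRE's implementation: it handles nothing but patterns of the form a?a?...aa..., and the names ComplexityError, Token, make_pattern and LimitedMatcher are all made up for the example. ComplexityError plays the role of re_error(re_error::type::complexity), and the limit argument plays the role of backtrackingLimit.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

//  Hypothetical stand-in for re_error(re_error::type::complexity).
struct ComplexityError : std::runtime_error
{
    ComplexityError() : std::runtime_error("backtracking limit exceeded") {}
};

//  One pattern element: a literal character, optionally followed by '?'.
struct Token
{
    char ch;
    bool optional;
};

//  Builds n copies of "a?" followed by n copies of "a"; n = 2 gives /a?a?aa/.
std::vector<Token> make_pattern(std::size_t n)
{
    const Token opt = { 'a', true };
    const Token req = { 'a', false };
    std::vector<Token> p(n, opt);
    p.insert(p.end(), n, req);
    return p;
}

class LimitedMatcher
{
public:
    LimitedMatcher(const std::vector<Token>& pattern, std::size_t limit)
        : pattern_(pattern), limit_(limit), failures_(0) {}

    //  Tries to match the pattern at the beginning of text.
    bool match(const std::string& text)
    {
        failures_ = 0;
        return step(0, 0, text);
    }

private:
    bool step(std::size_t ti, std::size_t pos, const std::string& text)
    {
        if (ti == pattern_.size())
            return true;    //  Every token has been satisfied.
        const Token& t = pattern_[ti];
        //  Greedy: first try to consume a character, ...
        if (pos < text.size() && text[pos] == t.ch && step(ti + 1, pos + 1, text))
            return true;
        //  ... then, for an optional token, try skipping it.
        if (t.optional && step(ti + 1, pos, text))
            return true;
        //  This branch failed; give up once too many failures have accumulated.
        if (++failures_ > limit_)
            throw ComplexityError();
        return false;
    }

    std::vector<Token> pattern_;
    std::size_t limit_;
    std::size_t failures_;
};
```

With n = 3 and a limit of 1000, matching against "aaa" succeeds after a handful of failed branches. With n = 20 and the same limit, matching against twenty 'a's throws ComplexityError, because the only successful path (skipping every a?) is tried last, after about a million failing ones.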

External Links

RegExp of ECMAScript