NIRE is a regular expression template library for C++.
Development of NIRE has ended. No further updates for new features or performance improvements are planned; only bug fixes may be made.
NIRE is a fork of SRELL. The following features have been inherited from SRELL:
- '.' does not match half of a surrogate pair in UTF-16 strings, nor a single code unit in UTF-8 strings.
- Supplementary characters can be used in a character class such as [丈𠀋], and a range can also be specified in a character class, such as [\u{1b000}-\u{1b0ff}].

Features that are not inherited from SRELL are: 1) <regex> compatible interfaces, 2) arbitrary iterator support, etc.
NIRE's interfaces generally follow the specification of ECMAScript (JavaScript).
As far as I know, the oldest compiler that can compile NIRE is Visual C++ 2005 in Visual Studio 2005.
Include nire.hpp.

    // Example 01:
    #include <cstdio>
    #include <string>
    #include <iostream>
    #include "nire.hpp"

    int main()
    {
        nire::RegExp re;  // Regular expression object.
        nire::Match m;    // Object to receive matching results.

        re = "\\d+[^-\\d]+";                 // Compiles regular expressions.
        m = re.exec("1234-5678-90ab-cdef");  // Executes a search.
        if (m)
        {
            std::printf("result: %s\n", m[0].str().c_str());  // If use printf.
            std::cout << "result: " << m[0] << std::endl;     // If use iostream.
        }
        return 0;
    }

See also readme_en.txt included in the zip archive.
NIRE consists of the following four classes, which are all defined in namespace nire:
- regexp<charT> class: regular expression object. Searching and replacing are performed by the member functions of this class.
- re_match<charT> class: receives results of matching.
- re_group<charT> struct: holds a range of the entire matched substring or a substring captured by a pair of round brackets.
- re_error class: thrown when an error occurs.

The first three template classes are instantiated by replacing the template parameter charT with any character type to be used for a search.
The following typedefs are provided for them. Using these typedefs is generally recommended:
Type | regexp | re_match | re_group | Interpretation of string | Note |
---|---|---|---|---|---|
char | u8cRegExp | Match | Group | UTF-8 | |
wchar_t | u16wRegExp | wMatch | wGroup | UTF-16 | Defined only when 0xFFFF <= WCHAR_MAX < 0x10FFFF. |
wchar_t | u32wRegExp | wMatch | wGroup | UTF-32 | Defined only when WCHAR_MAX >= 0x10FFFF. |
char8_t | u8RegExp | u8Match | u8Group | UTF-8 | |
char16_t | u16RegExp | u16Match | u16Group | UTF-16 | |
char32_t | u32RegExp | u32Match | u32Group | UTF-32 | |
char | bRegExp | Match | Group | Sequence of values in the range of 0 to UCHAR_MAX inclusive. | |
bRegExp
is a specialisation for searching against non-Unicode sequences such as binary data. See the later section for details.
Creates an internal object from regular expressions, holds it, and executes a search with it. This class is the core of NIRE, and basically follows the specification of RegExp of ECMAScript (JavaScript).
The following members are defined:
There are three ways to pass a string of regular expressions as an argument to a constructor: 1) pass a pointer to a null-terminated string, 2) pass a pointer and a length, or 3) pass a const reference to an instance of std::basic_string.
Furthermore, there are two ways to pass flags to each of them: a) pass a value of the nire::flag::type type, or b) pass a pointer to a null-terminated string. Thus there are 3 x 2 = 6 ways to pass regular expressions and flags to the constructor.

    regexp();
    regexp(const regexp &r);      // Copies.
    regexp(regexp &&r) noexcept;  // C++11 and later.

    // Pointer
    regexp(const charT *const p, const flag::type f = flag::ECMAScript);
    regexp(const charT *const p, const charT *const flags);

    // Pointer and length
    regexp(const charT *const p, const std::size_t len, const flag::type f);
    regexp(const charT *const p, const std::size_t len, const charT *const flags);

    // basic_string
    regexp(const std::basic_string<charT> &str, const flag::type f = flag::ECMAScript);
    regexp(const std::basic_string<charT> &str, const charT *const flags);

The following flags are available:
Value of nire::flag::type type | String | Description |
---|---|---|
nire::flag::global | "g" | Global match. If specified, the next position of the matched substring is stored into the lastIndex member, and the next search will start at the index indicated by lastIndex in the subject. |
nire::flag::icase | "i" | Case-insensitive matching. |
nire::flag::multiline | "m" | If specified, ^ and $ match the beginning of a line and the end of a line, respectively. |
nire::flag::dotall | "s" | If specified, . matches every character including newline characters. |
nire::flag::sticky | "y" | If specified, search is executed only at the index indicated by the lastIndex member. |
nire::flag::noflags | "" | For indicating no flags explicitly. |
When passing a value of nire::flag::type, you can specify multiple flags at the same time by combining them with | (bitwise OR).
When passing a string, you can specify multiple flags at the same time by concatenating their letters, as in "im".
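The relationship between the two notations can be pictured with a small sketch. Everything in it is an assumption made for illustration: the `sketch` namespace, the bit values, and the `parse_flags` helper are hypothetical and are not taken from nire.hpp. It only shows that a flag string such as "im" and an OR-ed combination of flag values can describe the same set of flags.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

namespace sketch {

// Hypothetical bit values; the real nire::flag constants may differ.
enum flag : unsigned {
    noflags   = 0,
    global    = 1u << 0,  // "g"
    icase     = 1u << 1,  // "i"
    multiline = 1u << 2,  // "m"
    dotall    = 1u << 3,  // "s"
    sticky    = 1u << 4   // "y"
};

// Translates a flag string such as "im" into the equivalent OR-ed bitmask.
unsigned parse_flags(const std::string &text)
{
    unsigned result = noflags;
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        switch (text[i]) {
        case 'g': result |= global;    break;
        case 'i': result |= icase;     break;
        case 'm': result |= multiline; break;
        case 's': result |= dotall;    break;
        case 'y': result |= sticky;    break;
        default:  throw std::invalid_argument("unknown flag character");
        }
    }
    return result;
}

}  // namespace sketch
```

In this sketch, `sketch::parse_flags("im")` yields the same value as `sketch::icase | sketch::multiline`, and an empty string yields `sketch::noflags`.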
Copies, moves (C++11 and later), or compiles regular expressions.
For compiling, there are six ways to pass a string of regular expressions and flags as arguments, as with the constructors.

    regexp &assign(const regexp &r);
    regexp &assign(regexp &&r) noexcept;  // C++11 and later.

    // Pointer
    regexp &assign(const charT *const p, const flag::type f = flag::ECMAScript);
    regexp &assign(const charT *const p, const charT *const flags);

    // Pointer and length
    regexp &assign(const charT *const p, std::size_t len, const flag::type f);
    regexp &assign(const charT *const p, std::size_t len, const charT *const flags);

    // basic_string
    regexp &assign(const std::basic_string<charT> &str, const flag::type f = flag::ECMAScript);
    regexp &assign(const std::basic_string<charT> &str, const charT *const flags);

Copies, moves (C++11 and later), or compiles regular expressions.
Unlike constructors and assign()
, there are only two ways to pass an argument: 1) pass a pointer to a null terminated string, or 2) pass a const
reference to std::basic_string
.

    regexp &operator=(const regexp &r);
    regexp &operator=(regexp &&r) noexcept;  // C++11 and later.

    // Pointer
    regexp &operator=(const charT *const p);

    // basic_string
    regexp &operator=(const std::basic_string<charT> &str);

The index value in a subject at which the next search is to start. This value is used only if the instance has been created with the global or sticky flag set. If neither flag is set, the value of lastIndex is ignored and a search always starts at the beginning of the subject.
If the match succeeds, the next position of the matched substring (i.e., end of [begin, end)) is written into lastIndex. Otherwise lastIndex is reset to 0.
Even if the subject string is changed, lastIndex is not reset automatically.
As this member is mutable, it can be written into even if it is in a const instance of regexp.
Executes a search for a match with regular expressions in a subject string.
There are three ways to pass the subject: 1) pass a pointer to a null terminated string, 2) pass a pointer and a length, or 3) pass a const
reference to std::basic_string
.
To receive results, there are two ways: a) pass a reference to an instance of re_match as an argument and let exec() write matching results into it, or b) let exec() write matching results into a new instance of re_match and return it.
In the former case, exec()
returns a boolean value indicating whether any match has been found or not.

    // Pointer
    bool exec(re_match<charT> &m, const charT *const str) const;
    re_match<charT> exec(const charT *const str) const;

    // Pointer and length
    bool exec(re_match<charT> &m, const charT *const str, const std::size_t len) const;
    re_match<charT> exec(const charT *const str, const std::size_t len) const;

    // basic_string
    bool exec(re_match<charT> &m, const std::basic_string<charT> &str) const;
    re_match<charT> exec(const std::basic_string<charT> &str) const;

When the global
or sticky
flag is set, the search starts at the index indicated by lastIndex
(i.e., the lastIndex
-th element of a subject string). If a match is found, the next position of the matched substring (end of [begin,end)) is written into lastIndex
. Otherwise, lastIndex
is reset to 0
.
When exec()
is called in a loop, it is recommended to use any overload in which a reference to re_match
is passed as an argument.
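The lastIndex rules for a global search can be sketched with the standard library alone. The `GlobalSearcher` type and `collect_all` function below are hypothetical stand-ins built on std::regex, not NIRE code. They only mimic the documented behaviour: start at lastIndex, store the end of the match on success, reset to 0 on failure, and reuse one result object across the loop.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical stand-in for a regexp compiled with the global flag.
// Built on std::regex for illustration; this is not NIRE's implementation.
struct GlobalSearcher {
    std::regex pattern;
    std::size_t lastIndex;

    explicit GlobalSearcher(const std::string &p) : pattern(p), lastIndex(0) {}

    // Searches subject from lastIndex and updates lastIndex afterwards.
    bool exec(std::smatch &m, const std::string &subject)
    {
        if (lastIndex <= subject.size() &&
            std::regex_search(subject.begin() + static_cast<std::ptrdiff_t>(lastIndex),
                              subject.end(), m, pattern)) {
            // position() is relative to the iterator the search started at.
            lastIndex += static_cast<std::size_t>(m.position(0) + m.length(0));
            return true;
        }
        lastIndex = 0;  // No match: reset.
        return false;
    }
};

// Collects every match by calling exec() in a loop with one reused smatch.
// Patterns that can match an empty string are not handled by this sketch.
std::vector<std::string> collect_all(const std::string &pattern, const std::string &subject)
{
    GlobalSearcher re(pattern);
    std::smatch m;
    std::vector<std::string> out;
    while (re.exec(m, subject))
        out.push_back(m.str(0));
    return out;
}
```

For the subject "1234-5678-90ab-cdef" and the pattern `\d+`, the loop visits "1234", "5678", and "90", after which the failed search resets lastIndex to 0.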
Behaves as exec()
, except that matching results are not written into an instance of re_match
. If a match is found, returns true
. Otherwise returns false
.

    // Pointer
    bool test(const charT *const str) const;

    // Pointer and length
    bool test(const charT *const str, const std::size_t len) const;

    // basic_string
    bool test(const std::basic_string<charT> &str) const;

Behaves like String.prototype.replace(regexp-object, newSubStr|callback-function)
of JavaScript.
Replaces a substring in a subject that matches regular expressions with another string. The resulting string is returned and the original string is left unchanged.
There are two ways to specify the string that replaces a matched substring: 1) pass a replacement string, or 2) pass a callback function.
When the global
flag is set, all matched substrings in the subject are to be replaced. Otherwise, only the first match is to be replaced.

    // Replacement with a string

    // Pointer
    std::basic_string<charT> replace(const charT *const str, const charT *const fmt) const;

    // Pointer and length
    std::basic_string<charT> replace(const charT *const str, const std::size_t len, const charT *const fmt_text, const std::size_t fmt_len) const;

    // basic_string
    std::basic_string<charT> replace(const std::basic_string<charT> &str, const std::basic_string<charT> &fmt) const;

    // Replacement by callback function
    typedef std::basic_string<charT> (*replace_function)(const re_match<charT> &);

    // Pointer
    std::basic_string<charT> replace(const charT *const str, const replace_function repfunc) const;

    // Pointer and length
    std::basic_string<charT> replace(const charT *str, const std::size_t len, const replace_function repfunc) const;

    // basic_string
    std::basic_string<charT> replace(const std::basic_string<charT> &str, const replace_function repfunc) const;

The replacement string can contain the special symbols defined in the table in the ECMAScript specification, Runtime Semantics: GetSubstitution. Any character that does not exist in the table is copied as is to a new string to be returned.
Symbol | Inserted string as replacement |
---|---|
$$ | $ itself. |
$& | The entire matched substring. |
$` | The substring that precedes the matched substring. |
$' | The substring that follows the matched substring. |
$n where n is one of 1 2 3 4 5 6 7 8 9 not followed by a digit. | The substring captured by the n-th pair of round brackets (1-based index) in the regular expressions. Replaced with an empty string if nothing is captured. Not replaced if the number n is greater than the number of capturing brackets in the regular expressions. |
$nn where nn is any value in the range 01 to 99 inclusive | The substring captured by the nn-th pair of round brackets (1-based index) in the regular expressions. Replaced with an empty string if nothing is captured. Not replaced if the number nn is greater than the number of capturing brackets in the regular expressions. |
$<NAME> | |

    // Sample of replacement with a string.
    #include <cstdio>
    #include <string>
    #include "nire.hpp"

    int main()
    {
        const nire::RegExp re("(\\d)(\\d)", "g");  // Searches for two consecutive digits.
        // Exchanges the order of $1 and $2 and encloses them with a pair of brackets.
        const std::string rep = re.replace("abc0123456789def", "($2$1)");
        std::printf("Result: %s\n", rep.c_str());
        return 0;
    }

    ---- output ----
    Result: abc(10)(32)(54)(76)(98)def

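The other symbols in the table can be tried out without NIRE, because std::regex_replace also expands `$$`, `$&`, `` $` ``, `$'`, `$n`, and `$nn` under its default ECMAScript format rules. The sketch below assumes that correspondence only for those symbols; `$<NAME>` is not supported by std::regex, and `replace_all` is a hypothetical helper name. std::regex_replace replaces every match by default, which corresponds to the global flag here.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Replaces every match of pattern in subject, expanding $-symbols in fmt.
// Uses std::regex_replace with its default ECMAScript format rules, not NIRE.
std::string replace_all(const std::string &subject,
                        const std::string &pattern,
                        const std::string &fmt)
{
    return std::regex_replace(subject, std::regex(pattern), fmt);
}
```

For instance, `replace_all("a-b", "-", "[$`|$&|$']")` inserts the preceding text, the match, and the following text between the brackets, and `$$` in a format string produces a single `$`.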
When a match is found, the passed callback function is called with the argument that is a const
reference to an internal instance of re_match<charT>
into which matching results have been written.
The callback function has to return a new string of the std::basic_string<charT>
type. Special symbols mentioned above do not apply in this case.

    // Sample of replacement by callback function.
    #include <cstdio>
    #include <string>
    #include "nire.hpp"

    std::string repfunc1(const nire::Match &m)
    {
        return " [" + m[1].str() + ", " + m[2].str() + "] ";
    }

    int main()
    {
        const nire::RegExp re("(\\d)(\\d)", "g");
        const std::string text("abc0123456789def");
        std::string rep;

        rep = re.replace(text, repfunc1);
        std::printf("Result1(C++98/C++03): %s\n", rep.c_str());

        // Replacement by lambda of C++11 and later.
        rep = re.replace(text, [](const nire::Match &m) -> std::string {
            return " [" + m[1].str() + ", " + m[2].str() + "] ";
        });
        std::printf("Result2(C++11): %s\n", rep.c_str());

        // C++14 and later.
        rep = re.replace(text, [](const auto &m) {
            return " [" + m[1].str() + ", " + m[2].str() + "] ";
        });
        std::printf("Result3(C++14): %s\n", rep.c_str());
        return 0;
    }

    ---- output ----
    Result1(C++98/C++03): abc [0, 1] [2, 3] [4, 5] [6, 7] [8, 9] def
    Result2(C++11): abc [0, 1] [2, 3] [4, 5] [6, 7] [8, 9] def
    Result3(C++14): abc [0, 1] [2, 3] [4, 5] [6, 7] [8, 9] def

Behaves like String.prototype.split(regexp-object, limit)
of JavaScript.
Splits a string at the occurrences that match regular expressions into new instances of std::basic_string and pushes them into the container passed as out. If regular expressions contain capturing round brackets, the substrings captured by them are also pushed into the resulting container. When a pair of round brackets does not capture anything, an empty string is pushed instead of skipping pushing.
Any container type can be used to receive results if it has at least clear()
, size()
, push_back()
member functions. The returned value is the number of times push_back()
was called.

    // Pointer
    template <typename ContainerT>
    std::size_t split(
        ContainerT &out,
        const charT *const str,
        const std::size_t limit = static_cast<std::size_t>(-1),
        const bool push_remainder = false
    ) const;

    // Pointer and length (length comes before pointer for overload resolution)
    template <typename ContainerT>
    std::size_t split(
        ContainerT &out,
        const std::size_t len,
        const charT *const str,
        const std::size_t limit = static_cast<std::size_t>(-1),
        const bool push_remainder = false
    ) const;

    // basic_string
    template <typename ContainerT>
    std::size_t split(
        ContainerT &out,
        const std::basic_string<charT> &str,
        const std::size_t limit = static_cast<std::size_t>(-1),
        const bool push_remainder = false
    ) const;

limit
specifies the maximum number of times splitting is performed. For example, given that the specified pattern is /,/
and limit
is 5
, splitting is finished when the fifth ','
is found in the subject string.
Although specifying a maximum number like this limit is not rare in scripting languages, in the case of JavaScript, oddly, when the number of splits reaches limit, split() throws away the remaining portion that has not been scanned yet and does not push it to the container to be returned.
As I am personally unhappy with this behaviour, the push_remainder flag has been added to NIRE's split() as an extension.
When push_remainder
is true
, the returned container has limit + 1
elements at most. As a result, this behaviour is similar to Python's split()
.

    // Behaviours of limit and push_remainder.
    #include <cstdio>
    #include <string>
    #include <vector>
    #include "nire.hpp"

    int main()
    {
        const nire::RegExp re(":");
        const std::string text("0:12:345:6789:abcdef");
        std::vector<std::string> res;

        // Splitting unlimitedly.
        re.split(res, text);
        for (std::size_t i = 0; i < res.size(); ++i)
            std::printf("%s%s", i == 0 ? "[" : ", ", res[i].c_str());
        std::puts("]");

        // limit=2. Remainder is thrown away.
        re.split(res, text, 2);
        for (std::size_t i = 0; i < res.size(); ++i)
            std::printf("%s%s", i == 0 ? "[" : ", ", res[i].c_str());
        std::puts("]");

        // limit=2 and push_remainder=true. Remainder is all pushed as the final element.
        re.split(res, text, 2, true);
        for (std::size_t i = 0; i < res.size(); ++i)
            std::printf("%s%s", i == 0 ? "[" : ", ", res[i].c_str());
        std::puts("]");
        return 0;
    }

    ---- output ----
    [0, 12, 345, 6789, abcdef]
    [0, 12]
    [0, 12, 345:6789:abcdef]

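One way to read the limit and push_remainder rules is as the following standard-library sketch. It is an interpretation written against std::regex, not NIRE's implementation: `split_sketch` is a hypothetical name, it returns the container instead of filling an out parameter, and it ignores capturing brackets. It is only meant to reproduce the three outputs shown in the sample above.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Splits s at matches of sep. Stops after `limit` pieces; the unscanned
// remainder is kept only when push_remainder is true.
std::vector<std::string> split_sketch(
    const std::string &s,
    const std::regex &sep,
    const std::size_t limit = static_cast<std::size_t>(-1),
    const bool push_remainder = false)
{
    std::vector<std::string> out;
    std::string::const_iterator pos = s.begin();

    for (std::sregex_iterator it(s.begin(), s.end(), sep), end; it != end; ++it) {
        if (out.size() >= limit)
            break;                              // Limit reached: stop scanning.
        out.emplace_back(pos, (*it)[0].first);  // Text before this separator.
        pos = (*it)[0].second;                  // Continue after the separator.
    }
    // Below the limit the tail is a normal piece; at the limit it is kept
    // only on request.
    if (out.size() < limit || push_remainder)
        out.emplace_back(pos, s.end());
    return out;
}
```

With push_remainder set, the result has at most limit + 1 elements, which matches the description above.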
Swaps the contents of two instances.
re1.swap(re2)
and nire::swap(re1, re2)
are defined.
The following member functions report which flags were set when regular expressions were passed to compile. Except for flags(), each returns a boolean value for one specific flag. They have been named after the property names in JavaScript.
Name | Return Value |
---|---|
std::basic_string<charT> flags(); | Stringified flags that were passed to the pattern compiler. |
bool global(); | Whether nire::flag::global or "g" was set. |
bool ignoreCase(); | Whether nire::flag::icase or "i" was set. |
bool multiline(); | Whether nire::flag::multiline or "m" was set. |
bool dotAll(); | Whether nire::flag::dotall or "s" was set. |
bool unicode(); | Returns true always. |
bool sticky(); | Whether nire::flag::sticky or "y" was set. |
Receives matching results. This class has been designed based on the return value of RegExp.exec() of JavaScript.
An instance of re_match
can be converted to a boolean value.
When the instance has successful matching results, returns one plus the number of pairs of capturing brackets in the regular expressions. Otherwise (including the case that the instance has not received any matching results after it was created) returns 0
.
The reason for the "plus one" is that the entire match is treated as captured by the implicit 0th pair of brackets.
For getting the number of elements, .size()
is a common member function name in C++ classes, whereas using the .length
property is common in JavaScript. In NIRE both size()
and length()
are provided so that either one can be used.
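The same counting convention exists in the standard library, which makes it easy to check the arithmetic: std::smatch::size() is one plus the number of capturing groups after a successful search and 0 after a failed one. The helper below, `result_length`, is a hypothetical name and uses std::regex rather than NIRE.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Returns std::smatch::size() after one search:
// 1 + number of capturing groups on success, 0 on failure.
std::size_t result_length(const std::string &pattern, const std::string &subject)
{
    const std::regex re(pattern);
    std::smatch m;
    std::regex_search(subject, m, re);
    return m.size();
}
```

A pattern with two pairs of capturing brackets therefore gives 3 when it matches, and a pattern without any brackets gives 1.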
Given that an instance of re_match
is m
, m[0]
returns a pair of indices that represent [begin, end) of the entire match in a subject. m[n]
where n >= 1
returns a pair of indices [begin, end) of a substring captured by the n
-th (1-based index) pair of round brackets.
An argument n
must be any value in the range from 0
to length() - 1
inclusive.
Overloads that take a pointer or string as a parameter return a pair of indices that represent [begin, end) of the substring captured by the group named gname.
As the type of the return value, re_group<charT>, can be cast to std::basic_string<charT>, it is possible to receive the matched substring directly instead of the index information. The details of this type are explained in a later section.
Returns the 0-based index value of the first element of the entire match (when the argument is omitted or n
is 0
) or of a substring captured by the pair of the corresponding round brackets (otherwise).
Returns the 0-based index value of the next element of the entire match (when the argument is omitted or n
is 0
) or of a substring captured by the pair of the corresponding round brackets (otherwise).
Swaps the contents of two instances.
re1.swap(re2)
and nire::swap(re1, re2)
are defined.
This struct holds a pair of 0-based indices indicating the range [begin, end) of an entire match or substring set by any member function of NIRE. This type satisfies is_trivially_copyable, so it can safely be copied with memcpy.
As this class holds indices only, the original subject string must still exist when the contents of a matched or captured substring are retrieved.
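The consequence of storing indices rather than characters can be shown with a hypothetical struct. `IndexRange` below is not re_group: its layout and its `str(subject)` signature are assumptions made for this sketch (the examples in this document call `str()` without arguments). It illustrates why such a type can be trivially copyable and why the subject must outlive it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <type_traits>

// Hypothetical index pair for [begin, end); not NIRE's re_group.
struct IndexRange {
    std::size_t begin;
    std::size_t end;

    std::size_t length() const { return end - begin; }

    // The characters live in the subject, so the subject is needed here.
    std::string str(const std::string &subject) const
    {
        return subject.substr(begin, length());
    }
};

static_assert(std::is_trivially_copyable<IndexRange>::value,
              "a plain pair of indices can be copied byte by byte");

// Copies the range with memcpy, which is valid for trivially copyable types.
IndexRange copy_with_memcpy(const IndexRange &src)
{
    IndexRange dst;
    std::memcpy(&dst, &src, sizeof dst);
    return dst;
}
```

If the subject string were destroyed or modified after the search, the indices would still be copyable but would no longer describe the original text.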
This class has the following members:
An instance of re_group
can be converted to a boolean value indicating whether the instance has valid indices.
Creates a new instance of std::basic_string<charT>
that contains a matched or captured substring based on the internal indices and returns it.
Returns the length of the range held by the instance. If the instance does not have any valid range, 0
is returned.
Like re_match
, re_group
has both size()
and length()
.
Returns a boolean value indicating whether the instance has valid indices.
For comparison between two instances of re_group<charT>, or with an instance of std::basic_string<charT>, the operators ==, !=, <, <=, >, and >= are provided.
When regular expressions are compiled or a search is executed, if any error occurs, nire::re_error
is thrown.
nire::re_error
is a derived class from std::runtime_error
and has the following members:

    // nire::re_error's members.
    const char *what() const;      // Derived from std::runtime_error.
    re_error::type code() const;   // Returns an error code.

The list of error codes is as follows:
Value | Condition | Note |
---|---|---|
escape | Invalid escaped character. \ is followed by an inappropriate character or exists at the end of a sequence. | Thrown by the pattern compiler. |
backref | Invalid backreference. No capturing round brackets corresponding to the backreference. | Thrown by the pattern compiler. |
sqbrack | Mismatched [ and ]. | Thrown by the pattern compiler. |
paren | Mismatched ( and ). | Thrown by the pattern compiler. |
brace | Mismatched { and }. | Thrown by the pattern compiler. |
badbrace | Invalid character contained in {}. | Thrown by the pattern compiler. |
range | Invalid range in a character class, such as [b-a]. The character at the left side of - has a code point value greater than the one at the right side. | Thrown by the pattern compiler. |
badrepeat | Invalid character preceding a quantifier * ? + {n,m}. | Thrown by the pattern compiler. |
utf8 | Invalid UTF-8 sequence. | Thrown by the pattern compiler. |
complexity | Complicated matching. | Thrown when matching is performed. |
Additionally, std::bad_alloc
is thrown when memory allocation fails.
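The handling pattern is the usual try/catch around compilation. Since nire.hpp cannot be assumed here, the sketch below shows the same idea with std::regex and std::regex_error; with NIRE one would presumably catch nire::re_error and inspect code() in the same place. `try_compile` is a hypothetical helper.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns true if pattern compiles. On failure, stores the error code
// reported by the exception and returns false.
bool try_compile(const std::string &pattern, std::regex_constants::error_type &code)
{
    try {
        const std::regex re(pattern);
        (void)re;  // Only compilation is of interest here.
        return true;
    } catch (const std::regex_error &e) {
        code = e.code();
        return false;
    }
}
```

Patterns such as `[b-a]` (reversed range) or `(\d` (unclosed bracket) make the function return false, while `\d+` compiles normally.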
Features introduced in C++11 and later that may be used by NIRE are as follows:
- char16_t and char32_t types
- char8_t type

As of May 2020, NIRE determines whether these features are available by the following macros:

    #ifdef __cpp_unicode_characters
      #ifndef NIRE_CPP11_CHAR1632_ENABLED
      #define NIRE_CPP11_CHAR1632_ENABLED  // Do typedefs for char16_t, char32_t.
      #endif
    #endif

    #ifdef __cpp_rvalue_references
      #ifndef NIRE_CPP11_MOVE_ENABLED
      #define NIRE_CPP11_MOVE_ENABLED  // Uses move semantics in constructors and assignments.
      #endif
    #endif

    #ifdef __cpp_char8_t
      #ifndef NIRE_CPP20_CHAR8_ENABLED
      #define NIRE_CPP20_CHAR8_ENABLED  // Do typedef for char8_t.
      #endif
    #endif

If your compiler does not set the __cpp_* macros even though the corresponding features are actually available, you can turn on the feature(s) you need by defining the corresponding NIRE_CPP* macro(s) shown above before including NIRE.
NIRE supports the regular expressions defined in ECMAScript 2020 (ES11) Specification 21.2 RegExp (Regular Expression) Objects (The u
flag is assumed to be always specified).
The full list is as follows:
Characters | |
---|---|
. | Matches any character but LineTerminator (i.e., any code point but U+000A, U+000D, U+2028, and U+2029). |
\0 | Matches NULL (U+0000). |
\t | Matches Horizontal Tab (U+0009). |
\n | Matches Line Feed (U+000A). |
\v | Matches Vertical Tab (U+000B). |
\f | Matches Form Feed (U+000C). |
\r | Matches Carriage Return (U+000D). |
\cX | Matches a control character corresponding to (X & 0x1f), where X is an ASCII letter. |
\\ | Matches a backslash (U+005C). |
\xHH |
Matches a character whose code unit value in UTF-16 is identical to the value represented by two hexadecimal digits
Because code unit values |
\uHHHH |
Matches a character whose Unicode code point is identical to the value represented by four hexadecimal digits
When sequential |
\u{H...} |
Matches a character whose Unicode code point is identical to the value represented by one or more hexadecimal digits |
\ |
When a |
Any character but ^$.*+?()[]{}|\/ |
Represents that character itself. |
Alternatives | |
A|B |
Matches a sequence of regular expressions A or B. An arbitrary number of |
Character Class | |
[] |
A character class. A set of characters:
Although in Perl's regular expression,
If regular expressions contain a mismatched |
Predefined Character Classes | |
\d |
Equivalent to |
\D |
Equivalent to |
\s |
Equivalent to
Note: Strictly speaking, this consists of the union of WhiteSpace and LineTerminator. Whenever some code point(s) were to be added to category Zs in Unicode, the number of code points that |
\S |
Equivalent to |
\w |
Equivalent to |
\W |
Equivalent to |
\p{...} |
Matches any character that has the Unicode property specified in "
For the details about what can be specified in "
Note: As of 2020, both ECMAScript and Unicode specifications are released annually. However, because the time when the specification for a new version of ECMAScript is cut for publication is earlier than the time when a new version of the Unicode standard is released, the situation where the list of values that are available in |
\P{...} |
Matches any character that does not have the Unicode property specified in " |
Quantifiers | |
* *? |
Repeats matching the preceding expression 0 or more times.
If this appears without a preceding expression, |
+ +? |
Repeats matching the preceding expression 1 or more time(s). |
? ?? |
Repeats matching the preceding expression 0 or 1 time(s). |
{n} |
Repeats matching the preceding expression exactly
If regular expressions contain a mismatched |
{n,} {n,}? |
Repeats matching the preceding expression at least |
{n,m} {n,m}? |
Repeats matching the preceding expression
If an invalid range in |
Brackets and backreference | |
(...) |
Grouping of regular expressions and capturing a matched string. Every pair of capturing brackets is assigned with a number starting from 1 in the order that its left round bracket
If regular expressions contain a mismatched When a pair of capturing round brackets itself is bound with a quantifier or it is inside another pair of brackets having a quantifier, the captured string by the pair is cleared whenever a repetition happens. So, any captured string cannot be carried over to the next loop. |
\N (N is a positive integer) |
Backreference. When
For example,
In RegExp of ECMAScript, capturing brackets are not required to appear prior to its corresponding backreference(s). So expressions such as
When a pair of brackets does not capture anything, it is treated as having captured the special |
(?<NAME>...) |
Identical to
For example, in the case of |
\k<NAME> |
References to a substring captured by the pair of brackets whose name is |
(?:...) |
Grouping. Unlike |
Assertions | |
^ |
Matches at the beginning of the string.
When the |
$ |
Matches at the end of the string.
When the |
\b |
Out of a character class: matches a boundary between
Inside a character class: matches BEL ( |
\B |
Out of a character class: matches any boundary where
Inside a character class: |
(?=...) |
A zero-width positive lookahead assertion. For example, |
(?!...) |
A zero-width negative lookahead assertion. For example, |
(?<=...) |
A zero-width positive lookbehind assertion. For example, |
(?<!...) |
A zero-width negative lookbehind assertion. For example, |
- '\', or a backslash that is followed by a character which is not explained in the table above appears, error_escape is thrown.
- /\1\u0030/, or 2) write the digit character inside a character class, which consists of that character only, such as /\1[0]/. NIRE's pattern compiler translates both types of the expressions into the same internal representation.
- \ooo and \0ooo. See also the ECMAScript specification.
- /(?=\p{sc=Latin})\p{Ll}/ means any lower case letter of the Latin script). Similarly, the negative lookahead assertion can be used to do the subtraction (for example, /(?!\p{sc=Latin})\p{Ll}/ means any lower case letter that is not of the Latin script).

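The lookahead technique for intersecting and subtracting character sets can be tried with std::regex as well. Because std::regex has no `\p{...}`, the sketch below substitutes plain ASCII classes, which is an assumption made purely for illustration; `keep_matches` is a hypothetical helper that concatenates every match.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Concatenates every match of pattern found in s.
std::string keep_matches(const std::string &s, const std::string &pattern)
{
    const std::regex re(pattern);
    std::string out;
    for (std::sregex_iterator it(s.begin(), s.end(), re), end; it != end; ++it)
        out += it->str();
    return out;
}
```

With this helper, `(?=[a-z])[^aeiou]` keeps characters that are both lower-case letters and non-vowels (an intersection), and `(?![aeiou])[a-z]` keeps lower-case letters minus the vowels (a subtraction). Both select the same characters.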
bRegExp
is a specialisation for searching against non-Unicode sequences such as binary data. It interprets an input string as a sequence of values in the range from 0
to ((1 << CHAR_BIT) - 1)
inclusive.

    // Test of bRegExp.
    #include <cstdio>
    #include <string>
    #include "nire.hpp"

    int main()
    {
        // "\xE3\x81\x82" is HIRAGANA LETTER A (U+3042)
        // "\xC3\xA3" is 'ã' (U+00E3)
        const std::string text("\xe3\x81\x82""\xc3\xa3");
        nire::RegExp re("\\xe3");    // Searches for a character whose code point is U+00E3.
        nire::bRegExp bre("\\xe3");  // Searches for a byte whose value is 0xE3.
        nire::Match mr, mbr;

        mr = re.exec(text);
        mbr = bre.exec(text);
        std::printf("RegExp:%u+%u bRegExp:%u+%u\n", mr.index(), mr[0].length(), mbr.index(), mbr[0].length());
        return 0;
    }

    ---- output ----
    RegExp:3+2 bRegExp:0+1

re
of the RegExp
type searched for a UTF-8 sequence that represents U+00E3, whereas bre
of the bRegExp
type searched for an octet whose value is 0xE3
.
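The difference between the two interpretations comes down to UTF-8 encoding, which the following standard-library sketch spells out. The function names are hypothetical and nothing here is NIRE code: `find_code_point` searches for the UTF-8 sequence of a code point, while `find_byte` searches for a single byte value.

```cpp
#include <cassert>
#include <string>

// Encodes one Unicode code point (assumed valid, at most U+10FFFF) as UTF-8.
std::string encode_utf8(const char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Index of the first occurrence of the UTF-8 sequence for cp, or npos.
std::string::size_type find_code_point(const std::string &text, const char32_t cp)
{
    return text.find(encode_utf8(cp));
}

// Index of the first byte equal to value, or npos.
std::string::size_type find_byte(const std::string &text, const unsigned char value)
{
    return text.find(static_cast<char>(value));
}
```

For the text used in the test above, U+00E3 is encoded as the two bytes C3 A3 starting at index 3, whereas the byte 0xE3 already appears at index 0 as the first byte of U+3042. This agrees with the "3+2" and "0+1" in the output.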
The backtracking algorithm used by the regular expression engine of ECMAScript (and also Perl on which it is based) can require exponential time to search when given regular expressions include nested quantifiers or consecutive expressions having a quantifier each, and the character sets of which are not mutually exclusive. The following are well-known examples:
Unfortunately, no fundamental countermeasure against this problem that can be applied in every situation has been found yet. So, to avoid holding control for a long time, NIRE throws re_error(re_error::type::complexity) when matching at one particular position fails repeatedly more than a certain number of times.
The default value of this limit is 16777216 (256 to the third power). This value can be changed by assigning an arbitrary value to the backtrackingLimit member variable of an instance of the regexp type.
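The idea of a failure counter with a limit can be illustrated with a toy backtracking matcher. Everything below is an assumption made for the sketch: `ToyMatcher` is hypothetical, it hard-codes the single pattern `(a|aa)*b` matched against the whole input, and it counts every failed branch, which may differ from how NIRE counts failures internally.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Toy backtracking matcher for the fixed pattern (a|aa)*b (whole input).
// Throws std::runtime_error once more than `limit` branches have failed.
class ToyMatcher {
public:
    explicit ToyMatcher(const std::size_t limit) : limit_(limit), failures_(0) {}

    bool match(const std::string &s)
    {
        failures_ = 0;
        return step(s, 0);
    }

    std::size_t failures() const { return failures_; }

private:
    bool step(const std::string &s, const std::size_t pos)
    {
        // Alternative 1: consume "a" and continue.
        if (pos < s.size() && s[pos] == 'a' && step(s, pos + 1))
            return true;
        // Alternative 2: consume "aa" and continue.
        if (pos + 1 < s.size() && s[pos] == 'a' && s[pos + 1] == 'a' && step(s, pos + 2))
            return true;
        // End of the loop: the final character must be 'b'.
        if (pos + 1 == s.size() && s[pos] == 'b')
            return true;
        // This branch failed; give up entirely once the limit is exceeded.
        if (++failures_ > limit_)
            throw std::runtime_error("backtracking limit exceeded");
        return false;
    }

    std::size_t limit_;
    std::size_t failures_;
};
```

On an input of 40 'a' characters without a trailing 'b', the number of failing branches grows roughly like the Fibonacci sequence, so a matcher with a limit of, say, 100000 throws quickly instead of exploring every combination. An input such as "aab" matches without any failure.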