SRELL

SRELL (std::regex-like library) is a regular expression template library for C++.

Features

The same class design as std::regex

SRELL is a template library that has an ECMAScript (JavaScript) compatible regular expression engine wrapped into the same class design as std::regex (the standard regular expression template library since C++11).

Unicode-specific implementation

SRELL has native support for Unicode:

  • UTF-8, UTF-16, and UTF-32 strings can be handled without any additional configurations.
  • '.' matches neither one half of a surrogate pair in UTF-16 strings nor a single code unit of a multi-byte sequence in UTF-8 strings.
  • Supplementary Characters can be specified in a character class, such as [丈𠀋], and a range of them can also be specified, such as [\u{1b000}-\u{1b0ff}].
  • Case-insensitive matching handles characters appropriately even when one uppercase letter has two lowercase forms, such as Greek Σ (U+03C2 [ς] and U+03C3 [σ]), or when a character has a third case called "titlecase" besides uppercase and lowercase, such as DŽ (uppercase: DŽ, lowercase: dž, titlecase: Dž).
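
To illustrate what "not matching half of a surrogate pair" implies, the following sketch counts code points in a UTF-16 string by stepping over a lead/trail surrogate pair as a single unit. It is not SRELL's implementation; the helper name count_code_points_utf16 is hypothetical, and C++11's char16_t and std::u16string are assumed to be available.

```cpp
#include <cstddef>
#include <string>

//  Hypothetical helper, not part of SRELL: counts code points in a UTF-16
//  string, treating a lead surrogate followed by a trail surrogate as one.
std::size_t count_code_points_utf16(const std::u16string &s)
{
    std::size_t count = 0;

    for (std::size_t i = 0; i < s.size(); ++count)
    {
        const bool lead = s[i] >= 0xd800 && s[i] <= 0xdbff;
        const bool trail_follows = lead && i + 1 < s.size()
            && s[i + 1] >= 0xdc00 && s[i + 1] <= 0xdfff;

        i += trail_follows ? 2 : 1; //  Never stop between the two halves.
    }
    return count;
}
```

A Unicode-aware '.' advances through the input in the same steps, so a string such as u"a\U0002000Bb" (four code units) is seen as three characters.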

Because regex was proposed early for C++0x (now C++11), it depends little on C++11's new features. So SRELL should work even with pre-C++11 compilers, as long as they interpret C++ templates accurately. (The oldest compiler on which I have confirmed that SRELL can be used is Visual C++ 2005 in Visual Studio 2005.)

Download

How to use

Put srell*.hpp (the three files srell.hpp, srell_ucfdata.hpp, and srell_updata.hpp) in a directory in your include path and include srell.hpp.

//  Example 01:
#include <cstdio>
#include <string>
#include <iostream>
#include "srell.hpp"

int main()
{
    srell::regex e;     //  Regular expression object holder.
    srell::cmatch m;    //  Object which receives results.

    e = "\\d+[^-\\d]+"; //  Compile a regular expression string.
    if (srell::regex_search("1234-5678-90ab-cdef", m, e))
    {
        //  When using printf.
        const std::string s(m[0].first, m[0].second);
            //  The code above can be replaced with one of the following lines.
            //  const std::string s(m[0].str());
            //  const std::string s(m.str(0));
        std::printf("result: %s\n", s.c_str());

        //  When using iostream.
        std::cout << "result: " << m[0] << std::endl;
    }
    return 0;
}
	

As in this example, all classes and algorithms that belong to SRELL have been put within namespace "srell". Except for this point, the usage is basically identical to std::regex.

Please see also readme_en.txt included in the zip archive.

C++11 features

C++11's new features that std::regex may use are as follows:

As of April 2019, SRELL determines whether these features are available or not by using the following macros:

#if defined(__cpp_unicode_characters) && !defined(SRELL_CPP11_CHAR1632_ENABLED)
  #define SRELL_CPP11_CHAR1632_ENABLED  //  to do typedef for char16_t and char32_t.
#endif
#ifdef __cpp_initializer_lists
  #include <initializer_list>
  #ifndef SRELL_CPP11_INITIALIZER_LIST_ENABLED
  #define SRELL_CPP11_INITIALIZER_LIST_ENABLED  //  to make it possible to pass initializer_list as an argument.
  #endif
#endif
#ifdef __cpp_rvalue_references
  #ifndef SRELL_CPP11_MOVE_ENABLED
  #define SRELL_CPP11_MOVE_ENABLED  //  to enable move semantics in constructors and operator=().
  #endif
#endif
		

If your compiler does not set the __cpp_* macros even though the corresponding features are actually available, you can turn on the feature(s) you need by defining the SRELL_CPP11_* macro(s) above before including SRELL.

C++20 features

A new type devoted to UTF-8, char8_t, has been added to the working draft for C++20. It is already available in GCC 9 and later and Clang 7 and later when the -fchar8_t option is specified. char8_t is also available in Visual Studio 2019 version 16.1 and later when the /Zc:char8_t option is specified. (std::u8string seems to be defined in version 16.2 and later when /std:c++latest is specified.)

SRELL has support for char8_t since version 2.100.
As of September 2019, SRELL determines whether char8_t and std::u8string are available or not by using the following macros:

#ifdef __cpp_char8_t
  #ifdef __cpp_lib_char8_t
  #define SRELL_CPP20_CHAR8_ENABLED 2   //  both char8_t support and std::u8string support.
  #else
  #define SRELL_CPP20_CHAR8_ENABLED 1   //  only char8_t support.
  #endif
#endif
		

If your compiler does not set the __cpp_* macros even though the features are actually available, you can turn on the features by defining the SRELL_CPP20_CHAR8_ENABLED macro above before including SRELL.

Additional information

Although I develop and check SRELL mainly on VC++ and MinGW, according to some tests on Compiler Explorer, as of May 2019, at least the following compilers can also compile sample code that performs a regex search with srell::u16regex in SRELL 2.200 against UTF-16 strings, and generate assembly output:

Sample code that uses char or wchar_t can be compiled by x86-64 gcc 4.1.2, which seems to be the oldest of the compilers available in Compiler Explorer.

SRELL's regular expression syntax

SRELL 2 supports the regular expressions defined in ECMAScript 2019 (ES10) Specification 21.2 RegExp (Regular Expression) Objects.

The full list is as follows:

List of Regular Expressions available in SRELL
Characters
.

Matches any character but LineTerminator (i.e., any code point but U+000A, U+000D, U+2028, and U+2029. For LineTerminator see ECMAScript 2019 Specification 11.3).
If the dotall option flag is passed to the pattern compiler, '.' matches every code point (i.e., it is equivalent to [\u{0}-\u{10ffff}]). This corresponds to //s in Perl 5.

Note: The dotall option flag is available since SRELL 2.000, following the enhancement of RegExp in ES2018/ES9.0.

\0

Matches NULL (\u0000).

\t

Matches Horizontal Tab (\u0009).

\n

Matches Line Feed (\u000a).

\v

Matches Vertical Tab (\u000b).

\f

Matches Form Feed (\u000c).

\r

Matches Carriage Return (\u000d).

\cX

Matches a control character corresponding to ((the code point value of X) & 0x1f) where X is one of [A-Za-z].
If "\c" is not followed by one of A-Z or a-z, then error_escape is thrown.
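
The mapping from X to a control character can be written down directly. The following is a minimal sketch assuming an ASCII-compatible execution character set; the helper name control_escape_value is hypothetical and not part of SRELL, and the return value -1 merely stands in for the case where SRELL throws error_escape.

```cpp
//  Hypothetical helper, not part of SRELL: returns the code point that \cX
//  denotes, or -1 when X is not one of [A-Za-z].
int control_escape_value(const char x)
{
    const bool is_letter = (x >= 'A' && x <= 'Z') || (x >= 'a' && x <= 'z');

    return is_letter ? (x & 0x1f) : -1;
}
```

For instance, both \cJ and \cj give 0x0a (Line Feed), because 'J' (0x4a) and 'j' (0x6a) share the same low five bits.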

\\

Matches a backslash (\u005c) itself.

\xhh

Matches a Unicode code point represented by two hexadecimal digits hh.
If "\x" is not followed by two hexadecimal digits, error_escape is thrown.

\uhhhh

Matches a Unicode code point represented by four hexadecimal digits hhhh.
If "\u" is not followed by four hexadecimal digits, error_escape is thrown.

\u{h...}

Matches a Unicode code point represented by one or more hexadecimal digits h....
If the inside of {} in "\u{...}" is not one or more hexadecimal digits, a value represented by the hexadecimal digits exceeds the max value of Unicode code points, or the closing curly bracket '}' does not exist, then error_escape is thrown.

Note: This expression has been available since ECMAScript 6.0. In SRELL up to version 2.001, what could be specified as h... in \u{h...} was "one to six hexadecimal digits". This is because this feature was implemented based on the proposal document, and the change that was made to the text when the proposal was approved formally was overlooked.
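
The validity conditions for the inside of \u{...} can be sketched as follows. This is only an illustration of the rules stated above, not SRELL's parser; the helper name parse_braced_code_point is hypothetical, it receives the text between the braces, and -1 stands in for the cases where error_escape would be thrown.

```cpp
#include <string>

//  Hypothetical helper, not part of SRELL: parses the text between the
//  braces of \u{...}. Returns the code point, or -1 if there are no digits,
//  a non-hexadecimal character occurs, or the value exceeds U+10FFFF.
long parse_braced_code_point(const std::string &digits)
{
    if (digits.empty())
        return -1;

    long value = 0;

    for (std::string::size_type i = 0; i < digits.size(); ++i)
    {
        const char c = digits[i];
        int d;

        if (c >= '0' && c <= '9') d = c - '0';
        else if (c >= 'a' && c <= 'f') d = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') d = c - 'A' + 10;
        else return -1;

        value = value * 16 + d;
        if (value > 0x10ffffL)
            return -1;  //  Checked at every step, so value cannot overflow.
    }
    return value;
}
```

Because only the value is limited and not the number of digits, an input with leading zeros such as "0000001b000" is accepted.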

\

When a \ is followed by one of ^ $ . * + ? ( ) [ ] { } | /, the sequence represents the following character itself. This removes the special meaning of a character that usually has one in regular expressions and makes the pattern compiler interpret the character literally. (The reason why '/' is also included in the list is probably that a sequence of regular expressions is enclosed by // in ECMAScript.)
In the character class mentioned below, in addition to the fourteen characters above, '-' also becomes a member of this group and can be written as "\-".

Alternatives
A|B

Matches a sequence of regular expressions A or B. An arbitrary number of '|' can be used to separate expressions, such as /abc|def|ghi?|jkl?/.
The sequences separated by '|' are tried from left to right, and only the first sequence that succeeds in matching is adopted.
For example, when matching /abc|abcdef/ against "abcdef", the result is "abc".
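
The left-to-right rule can be observed with a few lines of code. The sketch below uses std::regex in its default ECMAScript mode so that it compiles without SRELL; given the shared class design, replacing std:: with srell:: and including srell.hpp is expected (as an assumption, not verified here) to give the same result. The helper name first_match is hypothetical.

```cpp
#include <regex>
#include <string>

//  Hypothetical helper: returns the substring matched by searching text for
//  pattern, or an empty string when nothing matches.
std::string first_match(const std::string &pattern, const std::string &text)
{
    const std::regex re(pattern);   //  ECMAScript grammar by default.
    std::smatch m;

    return std::regex_search(text, m, re) ? m.str(0) : std::string();
}
```

With this helper, first_match("abc|abcdef", "abcdef") yields "abc", whereas swapping the alternatives to "abcdef|abc" yields "abcdef".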

Character Class
[]

A character class. A set of characters:

  • [ABC] matches 'A', 'B', or 'C'.
  • [^DEF] matches any character but 'D', 'E', 'F'. When the first character in [] is '^', any character not included in [] is matched. I.e., '^' means negation.
  • [G^H] matches 'G', '^', or 'H'. '^' not being the first character in [] is treated as an ordinary character.
  • [I-K] matches 'I', 'J', or 'K'. The sequence CH1-CH2 represents "any character in the range from the Unicode code point of CH1 to the code point of CH2 inclusive".
  • [-LM] matches '-', 'L', or 'M'. '-' that does not fall under the condition above is treated as an ordinary character.
  • [N-P-R] matches 'N', 'O', 'P', '-', or 'R'. '-' following a range sequence represents '-' itself.
  • [.|({] matches '.', '|', '(', or '{'. These characters lose their special meanings in [].
  • [] is the empty class. It does not match any code point. This expression always makes matching fail whenever it occurs.
  • [^] is the complementary set of the empty class. Thus it matches any code point. The same as [\0-\u{10ffff}].

Although in Perl's regular expressions ']' immediately after '[' is treated as a literal ']', there is no such special treatment in ECMAScript's RegExp. To include ']' in a character class, it is always necessary to prefix it with '\' and write "\]".

If regular expressions contain a mismatched '[' or ']', error_brack is thrown. If regular expressions contain an invalid character range such as [b-a], error_range is thrown.

Predefined Character Classes
\d

Equivalent to [0-9]. This expression can also be used in a character class, such as [\d!"#$%&'()].

\D

Equivalent to [^0-9]. This can be used in a character class, as well as \d.

\s

Equivalent to [ \t\n\v\f\r\u00a0\u1680\u2000-\u200a\u2028-\u2029\u202f\u205f\u3000\ufeff]. This can be used in a character class, too.

Note: Strictly speaking, this consists of the union of WhiteSpace and LineTerminator. Whenever code points are added to category Zs in Unicode, the number of code points that \s matches increases accordingly.

\S

Equivalent to [^ \t\n\v\f\r\u00a0\u1680\u2000-\u200a\u2028-\u2029\u202f\u205f\u3000\ufeff]. This can be used in a character class, too.

\w

Equivalent to [0-9A-Za-z_]. This can be used in a character class, too.

\W

Equivalent to [^0-9A-Za-z_]. This can be used in a character class, too.

\p{...}

Matches any character that has the Unicode property specified in "...". For example, \p{scx=Latin} matches every Latin character defined in Unicode. This expression can be used in a character class, too.

For the details about what can be specified in "...", see the tables in the ECMAScript Specification's draft. (SRELL 2.100 and later also support, in advance, four new script names introduced in Unicode 12, which are anticipated to be available in ECMAScript 2020/ES11.)

Note: This expression is available since SRELL 2.000, following the enhancement of RegExp in ES2018/ES9.0.

\P{...}

Matches any character that does not have the Unicode property specified in "...". This can be used in a character class, too.

Note: Available since SRELL 2.000, following the enhancement of RegExp in ES2018/ES9.0.

Quantifiers
*
*?

Repeats matching the preceding expression 0 or more times. "*" tries to match as many as possible, whereas "*?" tries to match as few as possible.

If this appears without a preceding expression, error_badrepeat is thrown. This applies to the following five also.

+
+?

Repeats matching the preceding expression 1 or more time(s). "+" tries to match as many as possible, whereas "+?" tries to match as few as possible.

?
??

Repeats matching the preceding expression 0 or 1 time(s). "?" tries to match as many as possible, whereas "??" tries to match as few as possible.

{n}

Repeats matching the preceding expression exactly n times.

If regular expressions contain a mismatched '{' or '}', error_brace is thrown. This applies to the following two also.

{n,}
{n,}?

Repeats matching the preceding expression at least n times. "{n,}" tries to match as many as possible, whereas "{n,}?" tries to match as few as possible.

{n,m}
{n,m}?

Repeats matching the preceding expression at least n times and at most m times. "{n,m}" tries to match as many as possible, whereas "{n,m}?" tries to match as few as possible.

If an invalid range in {} is specified like {3,2}, error_badbrace is thrown.
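
The difference between the greedy and the non-greedy forms of the quantifiers above can be checked with a short program. The sketch uses std::regex (default ECMAScript mode) so that it is self-contained; the same patterns are expected to behave identically with srell::regex, which is an assumption based on the shared syntax. The helper name matched is hypothetical.

```cpp
#include <regex>
#include <string>

//  Hypothetical helper: returns what pattern matches first in text,
//  or an empty string when there is no match.
std::string matched(const std::string &pattern, const std::string &text)
{
    const std::regex re(pattern);
    std::smatch m;

    return std::regex_search(text, m, re) ? m.str(0) : std::string();
}
```

Against "aaaa", the pattern "a+" takes all four characters while "a+?" stops after one; likewise "a{2,3}" takes three characters and "a{2,3}?" only two.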

Brackets and backreference
(...)

Groups regular expressions and captures the matched string. Every pair of capturing brackets is assigned a number, starting from 1, in the order in which its left round bracket '(' appears from left to right in the entire sequence of regular expressions. The substring matched by the regular expressions between the pair can be referenced by that number from other positions in the expression.

If regular expressions contain a mismatched '(' or ')', error_paren is thrown.

When a pair of capturing round brackets itself has a quantifier, or is inside another pair of brackets that has a quantifier, the string captured by the pair is cleared whenever repetition happens. So a captured string cannot be carried over to the next loop.

\N (N is a positive integer)

Backreference. When '\' is followed by a number that begins with 1-9, it is regarded as a backreference to the string captured by the (...) assigned the corresponding number, and matching is done with that string. If a pair of brackets assigned the number N does not exist in the entire sequence of regular expressions, error_backref is thrown.

For example, /(TO|to)..\1/ matches "TOMATO" or "tomato", but does not match "Tomato".

In RegExp of ECMAScript, capturing brackets are not required to appear prior to their corresponding backreference(s). So expressions such as /\1(abc)/ and /(abc\1)/ are valid and not treated as an error.

When the corresponding brackets do not capture anything, a backreference is treated as having captured the special "undefined" value. This corresponds to an empty string and matching always succeeds.
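
The /(TO|to)..\1/ example above can be tried as follows. The sketch uses std::regex, whose ECMAScript mode also supports numbered backreferences, so that it compiles without SRELL; the helper name has_repeated_to is hypothetical.

```cpp
#include <regex>
#include <string>

//  Hypothetical helper: true if text contains "TO" or "to", two arbitrary
//  characters, and then the very same string that group 1 captured.
bool has_repeated_to(const std::string &text)
{
    static const std::regex re("(TO|to)..\\1");

    return std::regex_search(text, re);
}
```

The backreference repeats the captured string rather than the sub-pattern, which is why "Tomato" is rejected: "To" matches neither alternative, so there is no capture for \1 to repeat.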

(?<NAME>...)

Identical to (...) except that the substring matched by the regular expressions inside the pair of brackets can be referenced by the name NAME as well as by the number assigned to the pair.

For example, in the case of m|(?<year>\d+)/(?<month>\d+)/(?<day>\d+)|, the string captured by the first pair of parentheses can be referenced by either \1 or \k<year>.

Note: Available since SRELL 2.000, following the enhancement of RegExp in ES2018/ES9.0.

\k<NAME>

A reference to the substring captured by the pair of brackets named NAME. If the corresponding pair of brackets does not exist in the entire sequence of regular expressions, error_backref is thrown.

Note: Available since SRELL 2.000, following the enhancement of RegExp in ES2018/ES9.0.

(?:...)

Grouping. Unlike (...), this does not capture anything and only groups. So no number for backreference is assigned.
For example, /tak(?:e|ing)/ matches "take" or "taking", but does not capture anything for backreference. Usually, this is somewhat faster than (...).
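
That (?:...) is not assigned a number can be confirmed through mark_count(), which reports the number of capturing groups in a compiled pattern. The sketch uses std::regex so that it is self-contained; srell::basic_regex is assumed to offer the same member because it follows the same class design. The helper name capture_count is hypothetical.

```cpp
#include <regex>

//  Hypothetical helper: number of capturing groups in pattern.
unsigned capture_count(const char *const pattern)
{
    return std::regex(pattern).mark_count();
}
```

Here capture_count("tak(?:e|ing)") is 0 while capture_count("tak(e|ing)") is 1, although both patterns match "take" and "taking".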

Assertions
^

Matches at the beginning of the string. When the multiline option is specified, '^' also matches every position immediately after a LineTerminator.

$

Matches at the end of the string. When the multiline option is specified, '$' also matches every position immediately before a LineTerminator.

\b

Outside a character class: \b matches a boundary between \w and \W.

Inside a character class: \b matches Backspace (\u0008).

\B

Outside a character class: \B matches any boundary where \b does not match.

Inside a character class: error_escape is thrown.

(?=...)

A zero-width positive lookahead assertion. For example, /a(?=bc|def)/ matches "a" followed by "bc" or "def", but only "a" is counted as the matched string.

(?!...)

A zero-width negative lookahead assertion. For example, /a(?!bc|def)/ matches "a" followed by neither "bc" nor "def".

Incidentally, the expression /&(?!amp;|lt|gt|#)/ would be useful to find and escape bare '&'s when source code in which many '&'s are used is copied to an HTML file.
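
The '&' escaping idea can be sketched with regex_replace. The code below uses std::regex, which also supports lookahead assertions in its ECMAScript mode, so that it compiles without SRELL; the helper name escape_bare_ampersands is hypothetical.

```cpp
#include <regex>
#include <string>

//  Hypothetical helper: replaces every '&' that does not begin "&amp;",
//  "&lt", "&gt" or "&#" with "&amp;".
std::string escape_bare_ampersands(const std::string &src)
{
    static const std::regex bare_amp("&(?!amp;|lt|gt|#)");

    //  In the default (ECMAScript) format rules a lone '&' in the
    //  replacement text is literal, so "&amp;" is inserted as is.
    return std::regex_replace(src, bare_amp, "&amp;");
}
```

Since the lookahead is zero-width, only the '&' itself is replaced and the text after it is left untouched.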

(?<=...)

A zero-width positive lookbehind assertion. For example, /(?<=bc|de)a/ matches "a" following "bc" or "de", but only "a" is counted as the matched string; "bc" or "de" is not included.

Note: In SRELL 1 the number of characters that the regular expressions inside a lookbehind assertion match must be fixed, such as /(?<=abc|def)/ and /(?<=\d{2})/; otherwise error_lookbehind is thrown. This restriction does not exist in SRELL 2.

(?<!...)

A zero-width negative lookbehind assertion. For example, /(?<!bc|de)a/ matches "a" that follows neither "bc" nor "de".

Note: In SRELL 1 the number of characters that the regular expressions inside a lookbehind assertion match must be fixed; otherwise error_lookbehind is thrown. This restriction does not exist in SRELL 2.

Footnotes

Differences between std::regex and SRELL

SRELL extensions for Unicode support

For Unicode support, SRELL has the following extensions which do not exist in std::regex:

basic_regex

Typedef list of basic_regex
Input Type | For UTF-8 | For UTF-16 | For UTF-32 | Note
char       | u8cregex  | -          | -          | Defined also as u8regex, when the SRELL_CPP20_CHAR8_ENABLED macro is not defined.
wchar_t    | -         | u16wregex  | u32wregex  | u16wregex is defined when 0xffff <= WCHAR_MAX < 0x10ffff, whereas u32wregex is defined when WCHAR_MAX >= 0x10ffff.
char8_t    | u8regex   | -          | -          | Defined only when the SRELL_CPP20_CHAR8_ENABLED macro is defined.
char16_t   | -         | u16regex   | -          | Defined only when the SRELL_CPP11_CHAR1632_ENABLED macro is defined.
char32_t   | -         | -          | u32regex   | Defined only when the SRELL_CPP11_CHAR1632_ENABLED macro is defined.

//  For UTF-8 string.
#if defined(SRELL_CPP20_CHAR8_ENABLED)
    typedef basic_regex<char8_t, u8regex_traits<char8_t> > u8regex;
#endif
typedef basic_regex<char, u8regex_traits<char> > u8cregex;
#if !defined(SRELL_CPP20_CHAR8_ENABLED)
    typedef u8cregex u8regex;
#endif

#if defined(SRELL_CPP11_CHAR1632_ENABLED)
    //  For C++11-aware compilers.
    typedef basic_regex<char16_t, u16regex_traits<char16_t> > u16regex;
    typedef basic_regex<char32_t> u32regex;
#endif

//  Typedef u32wregex or u16wregex, depending on the size of wchar_t.
#if defined(WCHAR_MAX)
    #if WCHAR_MAX >= 0x10ffff
        typedef wregex u32wregex;
    #elif WCHAR_MAX >= 0xffff
        typedef basic_regex<wchar_t, u16regex_traits<wchar_t> > u16wregex;
    #endif
#endif
				

UTF-8 and UTF-16 are handled by u8regex_traits and u16regex_traits shown above, respectively. By using these classes it is possible, for example, to make a class that handles UTF-16 strings stored in an array of uint32_t, such as basic_regex<uint32_t, u16regex_traits<uint32_t> >.

match_results

Typedef list of match_results
Input Type | For UTF-8            | For UTF-16             | For UTF-32             | Note
char       | u8ccmatch, u8csmatch | -                      | -                      | Defined also as u8[cs]match, when the SRELL_CPP20_CHAR8_ENABLED macro is not defined.
wchar_t    | -                    | u16wcmatch, u16wsmatch | u32wcmatch, u32wsmatch | u16w[cs]match are defined when 0xffff <= WCHAR_MAX < 0x10ffff, whereas u32w[cs]match are defined when WCHAR_MAX >= 0x10ffff.
char8_t    | u8cmatch, u8smatch   | -                      | -                      | Defined only when the SRELL_CPP20_CHAR8_ENABLED macro is defined.
char16_t   | -                    | u16cmatch, u16smatch   | -                      | Defined only when the SRELL_CPP11_CHAR1632_ENABLED macro is defined.
char32_t   | -                    | -                      | u32cmatch, u32smatch   | Defined only when the SRELL_CPP11_CHAR1632_ENABLED macro is defined.

//  for UTF-8 string.
#if defined(SRELL_CPP20_CHAR8_ENABLED)
    typedef match_results<const char8_t *> u8cmatch;
#endif
#if defined(SRELL_CPP20_CHAR8_ENABLED) && SRELL_CPP20_CHAR8_ENABLED >= 2
    typedef match_results<std::u8string::const_iterator> u8smatch;
#endif
typedef cmatch u8ccmatch;
typedef smatch u8csmatch;
#if !defined(SRELL_CPP20_CHAR8_ENABLED)
    typedef u8ccmatch u8cmatch;
#endif
#if !defined(SRELL_CPP20_CHAR8_ENABLED) || SRELL_CPP20_CHAR8_ENABLED < 2
    typedef u8csmatch u8smatch;
#endif

#if defined(SRELL_CPP11_CHAR1632_ENABLED)
    //  For C++11-aware compilers.
    typedef match_results<const char16_t *> u16cmatch;
    typedef match_results<const char32_t *> u32cmatch;
    typedef match_results<std::u16string::const_iterator> u16smatch;
    typedef match_results<std::u32string::const_iterator> u32smatch;
#endif

//  Typedef u32w[cs]match or u16w[cs]match, depending on the size of wchar_t.
#if defined(WCHAR_MAX)
    #if WCHAR_MAX >= 0x10ffff
        typedef wcmatch u32wcmatch;
        typedef wsmatch u32wsmatch;
    #elif WCHAR_MAX >= 0xffff
        typedef wcmatch u16wcmatch;
        typedef wsmatch u16wsmatch;
    #endif
#endif
				

Additionally, since version 2.300 the match_results class has a lookbehind_limit member of type BidirectionalIterator. For details, see the section for the match_lblim_avail flag.

sub_match

Typedef list of sub_match
Input Type | For UTF-8                    | For UTF-16                     | For UTF-32                     | Note
char       | u8ccsub_match, u8cssub_match | -                              | -                              | Defined also as u8[cs]sub_match, when the SRELL_CPP20_CHAR8_ENABLED macro is not defined.
wchar_t    | -                            | u16wcsub_match, u16wssub_match | u32wcsub_match, u32wssub_match | u16w[cs]sub_match are defined when 0xffff <= WCHAR_MAX < 0x10ffff, whereas u32w[cs]sub_match are defined when WCHAR_MAX >= 0x10ffff.
char8_t    | u8csub_match, u8ssub_match   | -                              | -                              | Defined only when the SRELL_CPP20_CHAR8_ENABLED macro is defined.
char16_t   | -                            | u16csub_match, u16ssub_match   | -                              | Defined only when the SRELL_CPP11_CHAR1632_ENABLED macro is defined.
char32_t   | -                            | -                              | u32csub_match, u32ssub_match   | Defined only when the SRELL_CPP11_CHAR1632_ENABLED macro is defined.

//  For UTF-8 string.
#if defined(SRELL_CPP20_CHAR8_ENABLED)
    typedef sub_match<const char8_t *> u8csub_match;
#endif
#if defined(SRELL_CPP20_CHAR8_ENABLED) && SRELL_CPP20_CHAR8_ENABLED >= 2
    typedef sub_match<std::u8string::const_iterator> u8ssub_match;
#endif
typedef csub_match u8ccsub_match;
typedef ssub_match u8cssub_match;
#if !defined(SRELL_CPP20_CHAR8_ENABLED)
    typedef u8ccsub_match u8csub_match;
#endif
#if !defined(SRELL_CPP20_CHAR8_ENABLED) || SRELL_CPP20_CHAR8_ENABLED < 2
    typedef u8cssub_match u8ssub_match;
#endif

#if defined(SRELL_CPP11_CHAR1632_ENABLED)
    //  For C++11-aware compilers.
    typedef sub_match<const char16_t *> u16csub_match;
    typedef sub_match<const char32_t *> u32csub_match;
    typedef sub_match<std::u16string::const_iterator> u16ssub_match;
    typedef sub_match<std::u32string::const_iterator> u32ssub_match;
#endif

//  Typedef u32w[cs]sub_match or u16w[cs]sub_match, depending on the size of wchar_t.
#if defined(WCHAR_MAX)
    #if WCHAR_MAX >= 0x10ffff
        typedef wcsub_match u32wcsub_match;
        typedef wssub_match u32wssub_match;
    #elif WCHAR_MAX >= 0xffff
        typedef wcsub_match u16wcsub_match;
        typedef wssub_match u16wssub_match;
    #endif
#endif
				

The meanings of the prefixes are as follows:

  • u8: meaning changes depending on whether or not your compiler supports the char8_t type (detected by whether or not the SRELL_CPP20_CHAR8_ENABLED macro is defined):
    • If char8_t supported: handles an array of char8_t or an instance of std::u8string as a UTF-8 string.
    • If char8_t not supported: identical to the "u8c-" prefix.
  • u16: handles an array of char16_t or an instance of std::u16string as a UTF-16 string.
  • u32: handles an array of char32_t or an instance of std::u32string as a UTF-32 string.
  • u8c: handles an array of char or an instance of std::string as a UTF-8 string. (Introduced in version 2.100. Until SRELL 2.002, the "u8-" prefix was used for this type.)
  • u16w: handles an array of wchar_t or an instance of std::wstring as a UTF-16 string. (Defined only when WCHAR_MAX is equal to or more than 0xffff and less than 0x10ffff.)
  • u32w: handles an array of wchar_t or an instance of std::wstring as a UTF-32 string. (Defined only when WCHAR_MAX is equal to or more than 0x10ffff.)

Based on these rules, regex_iterator and regex_token_iterator also have similar types defined as above that have prefixes u(8c?|16w?|32w?).
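
Which of the "u16w-" and "u32w-" families exists on a given platform follows from WCHAR_MAX alone. The following sketch reproduces the same preprocessor test in isolation, so that the outcome can be checked without SRELL; the function name wchar_prefix is hypothetical and not part of the library.

```cpp
#include <cwchar>   //  WCHAR_MAX

//  Hypothetical helper, not part of SRELL: returns the prefix of the
//  wchar_t-based typedefs that would be defined on this platform, or an
//  empty string if wchar_t is too narrow for either family.
const char *wchar_prefix()
{
#if WCHAR_MAX >= 0x10ffff
    return "u32w";
#elif WCHAR_MAX >= 0xffff
    return "u16w";
#else
    return "";
#endif
}
```

On platforms where wchar_t is 16 bits wide this is expected to return "u16w", and where it is 32 bits wide, "u32w".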

Basic use is as follows:

srell::u8regex u8re(u8"UTF-8 Regular Expression");
srell::u8cmatch u8cm;
std::printf("%s\n", srell::regex_search(u8"UTF-8 target string", u8cm, u8re) ? "found!" : "not found...");

srell::u16regex u16re(u"UTF-16 Regular Expression");
srell::u16cmatch u16cm;
std::printf("%s\n", srell::regex_search(u"UTF-16 target string", u16cm, u16re) ? "found!" : "not found...");

srell::u32regex u32re(U"UTF-32 Regular Expression");
srell::u32cmatch u32cm;
std::printf("%s\n", srell::regex_search(U"UTF-32 target string", u32cm, u32re) ? "found!" : "not found...");

srell::u16wregex u16wre(L"UTF-16 Regular Expression");
srell::u16wcmatch u16wcm;
std::printf("%s\n", srell::regex_search(L"UTF-16 target string", u16wcm, u16wre) ? "found!" : "not found...");
    //  The three lines above and the three below are mutually exclusive.
    //  If WCHAR_MAX is less than 0x10ffff, the ones above are available;
    //  otherwise, the ones below are available.
srell::u32wregex u32wre(L"UTF-32 Regular Expression");
srell::u32wcmatch u32wcm;
std::printf("%s\n", srell::regex_search(L"UTF-32 target string", u32wcm, u32wre) ? "found!" : "not found...");
			

With compilers that predate C++11, only the "u8c-" and "u16w-" types are available if wchar_t is at least 16 bits and less than 21 bits wide, and only the "u8c-" and "u32w-" types are available if wchar_t is at least 21 bits wide.
However, even in such environments, the "u8c-", "u16-" and "u32-" types are available if code such as the following is placed before including SRELL:

typedef uint_least16_t char16_t;    //  typedef a type that can hold a 16-bit value.
typedef uint_least32_t char32_t;    //  typedef a type that can hold a 32-bit value.

namespace std
{
    typedef basic_string<char16_t> u16string;
    typedef basic_string<char32_t> u32string;
}

#define SRELL_CPP11_CHAR1632_ENABLED    //  to make them available manually.
			
SRELL extensions for the named capture feature

In SRELL 2, the following member functions have been added to the match_results class for the named capture feature:

difference_type length(const string_type &sub) const;
difference_type position(const string_type &sub) const;
string_type str(const string_type &sub) const;
const_reference operator[](const string_type &sub) const;
			

Basically, these can be used in the same way as the member functions having the same names in std::regex. The only difference is that these take the group name string as a parameter, instead of the group number corresponding to a pair of parentheses.

//  Example.
srell::regex e("-(?<digits>\\d+)-");
srell::cmatch m;

if (srell::regex_search("1234-5678-90ab-cdef", m, e))
{
    const std::string by_number(m.str(1));      //  access by paren's number. a feature of std::regex.
    const std::string by_name(m.str("digits")); //  access by paren's name. an extension of SRELL 2.

    std::printf("results: bynumber=%s byname=%s\n", by_number.c_str(), by_name.c_str());
}
//  results: bynumber=5678 byname=5678
			
SRELL extensions to syntax_option_type

The following flag option has been added to SRELL 2:

namespace regex_constants
{
    static const syntax_option_type dotall; //  specifies the singleline mode.
        //  See Flag option for singleline mode.
}
			

Flag option for singleline mode

If the dotall flag option of the regex_constants::syntax_option_type or basic_regex::flag_type type is passed to the regular expression compiler, the behaviour of '.' is changed. This flag option corresponds to //s in Perl 5.

SRELL extensions to match_flag_type

The following flag option is available:

namespace regex_constants
{
    //  Since version 2.300:
    static const match_flag_type match_lblim_avail;
}
			

match_lblim_avail (flag option for specifying the limit of the lookbehind assertions)

If the match_lblim_avail flag option is set, then when a lookbehind assertion is performed, the lookbehind_limit member of the match_results instance passed to an algorithm function is treated as the limit up to which the algorithm function may look behind.

const char text[] = "0123456789abcdefghijklmnopqrstuvwxyz";
const char* const begin = text;
const char* const end = text + std::strlen(text);
const char* const first = text + 10;    //  Sets the position of 'a'.
const srell::regex re("(?<=^\\d+).");
srell::cmatch match;

match.lookbehind_limit = begin;

std::printf("matched %d\n", srell::regex_search(first, end, match, re));
    //  Does not match as lookbehind is performed only in the range [first, end).

std::printf("matched %d\n", srell::regex_search(first, end, match, re, srell::regex_constants::match_lblim_avail));
    //  Matches because regex_search is allowed to look behind as far back as match.lookbehind_limit.
    //  I.e., when match_lblim_avail is specified, the search is performed against
    //  the sequence [match.lookbehind_limit, end), but begins at first.
				

As shown in the example above, when match_lblim_avail is specified, ^ matches at match.lookbehind_limit instead of at first.

* Because this is not an elegant way to pass the lookbehind limit, it is planned to be replaced with a more appropriate one once the outcome of P1844 is known.

Regular expression engines and flags

std::regex has six regular expression engines, including an ECMAScript-based one, whereas SRELL has a single engine based on ECMAScript. Because of this difference, the following flag options defined in std::regex are ignored in SRELL even if specified:

syntax_option_type (and flag_type of basic_regex)

  • All but icase and multiline

match_flag_type

  • match_any
  • format_sed

The comparison between the ECMAScript mode of std::regex and SRELL is as follows:

  • std::regex's ECMAScript mode consists of the expressions defined in the ECMAScript specification third edition
    - Unicode dependent matters (such as meaning of \s)
    + locale dependent matters
    + [:class name:], [.class name.], [=class name=] expressions.

  • SRELL 2 consists of the expressions defined in the ECMAScript 2018 specification.
  • SRELL 1 consists of the expressions defined in the ECMAScript 2017 (ES8) specification
    + lookbehind assertions.

Although both are based on the same ECMAScript's regular expression definition, neither std::regex nor SRELL is a superset of each other.

Measures against long time thinking

The backtracking algorithm used by the regular expression engine of ECMAScript (and also of Perl, on which it is based) can require exponential time when the regular expression includes nested quantifiers, or consecutive quantified expressions whose character sets are not mutually exclusive. The following are well-known examples:

  • "aaaaaaaaaaaaaaaaaaaaaaaaaaaaa" =~ /(a*)*b/
  • "aaaaaaaaaaaaaaaaaaaaaaaaaaaaa" =~ /a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaa/

Unfortunately, no fundamental countermeasure that can be applied in every situation has been found yet. So, to avoid holding control for a long time, SRELL throws regex_error(regex_constants::error_complexity) when matching from a particular position fails more than a certain number of times.

The default value of this limit is 16777216 (256 to the third power). It can be changed by assigning an arbitrary value to the limit_counter member variable of the basic_regex instance passed to regex_search() or regex_match().
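
The idea of giving up after too many failures can be illustrated with a toy backtracking matcher. This is only a sketch of the general technique and makes no claim about how SRELL counts internally: the class toy_matcher is hypothetical, it handles only the pattern family from the second example above (n optional 'a's followed by n mandatory 'a's), and it throws std::runtime_error where SRELL would throw regex_error.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

//  Toy matcher, not SRELL's engine: matches n optional 'a's followed by
//  n mandatory 'a's against a whole string, counting failed attempts.
class toy_matcher
{
public:
    toy_matcher(const std::size_t n, const std::size_t limit)
        : n_(n), limit_(limit), failures_(0) {}

    //  Returns true if the whole text matches; throws std::runtime_error
    //  once the number of failed attempts exceeds the limit.
    bool match(const std::string &text)
    {
        failures_ = 0;
        return step(text, 0, 0);
    }

private:
    bool step(const std::string &t, const std::size_t pos, const std::size_t item)
    {
        if (item == n_ * 2)
            return pos == t.size();

        //  Greedy: first try to consume one 'a' with this item.
        if (pos < t.size() && t[pos] == 'a' && step(t, pos + 1, item + 1))
            return true;

        //  The first n items are optional, so they may also match nothing.
        if (item < n_ && step(t, pos, item + 1))
            return true;

        if (++failures_ > limit_)
            throw std::runtime_error("complexity limit exceeded");
        return false;
    }

    std::size_t n_, limit_, failures_;
};
```

With n = 3 the match against "aaa" is found after a handful of failures, but with n = 25 and a limit of 1000 the matcher gives up by throwing long before the exponential search would finish.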

Breaking changes

u8-prefix and u8c-prefix

Until version 2.002, the "u8-" prefix meant "This class handles a sequence of the char type as a UTF-8 string". However, since C++20 will support the char8_t type for UTF-8, it is expected that the C++ standard library will begin to use the "u8-" prefix for classes and algorithms that handle char8_t.
Thus, to avoid inconsistency with the naming convention of the standard library, SRELL has begun to use the "u8c-" prefix instead of "u8-" to mean that "This class handles a sequence of char as a UTF-8 string" since version 2.100.

List of classes whose prefix has been changed from u8- to u8c-

  • basic_regex: u8cregex
  • match_results: u8ccmatch, u8csmatch
  • sub_match: u8ccsub_match, u8cssub_match
  • regex_iterator: u8ccregex_iterator, u8csregex_iterator
  • regex_token_iterator: u8ccregex_token_iterator, u8csregex_token_iterator

The freed "u8-" prefix is now associated with the char8_t type. But if your compiler does not support char8_t, for backwards compatibility, class names having the "u8-" prefix are defined as aliases (i.e., typedef) of the corresponding classes that have the "u8c-" prefix in their names, respectively.

u8- and u8c-
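Prefix | SRELL up to 2.002                   | SRELL 2.100 and later (char8_t not supported by your compiler) | SRELL 2.100 and later (char8_t supported by your compiler)
u8-    | handles a sequence of char as UTF-8 | handles a sequence of char as UTF-8                            | handles a sequence of char8_t as UTF-8
u8c-   | (Prefix not used)                   | handles a sequence of char as UTF-8                            | handles a sequence of char as UTF-8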

Lookbehind

SRELL version 1.xxx supported the regular expressions defined in ECMAScript 2017 (ES8) Specification 21.2 RegExp (Regular Expression) Objects and fixed-length lookbehind assertions.

For the following reason, SRELL 1 behaved differently from SRELL 2 when a lookbehind assertion was used:

TC39, which maintains the ECMAScript standard, adopted variable-length lookbehind assertions for its RegExp, instead of the fixed-length ones that are supported by many script languages, such as Perl 5, Python, etc. At first glance, the former may seem to be just a superset of the latter, but the two in fact return different results in some cases:

"abcd" =~ /(?<=(.){2})./
//  Fixed-length lookbehind: $& = "c", $1 = "b".
//  As the automaton runs from left to right even in a lookbehind assertion,
//  "b" just before "c" is the last one that $1 captured.

//  Variable-length lookbehind: $& = "c", $1 = "a".
//  As the automaton runs from right to left in a lookbehind assertion,
//  "a", two positions before "c", is the last one that $1 captured.
		

While SRELL 1 supported fixed-length lookbehind assertions as an extension, SRELL 2 supports variable-length lookbehind assertions, following the enhancement of RegExp. This resulted in a breaking change between SRELL 1 and SRELL 2.

External Links