SRELL (std::regex-like library) is a Unicode-aware regular expression template library for C++.
SRELL wraps an ECMAScript (JavaScript) compatible regular expression engine in the same class design as std::regex. It can be used in the same way as std::regex (or boost::regex, on which std::regex is based) and needs no installation because it is a header-only template library.
SRELL has native support for Unicode:
- '.' does not match a half of a surrogate pair in UTF-16 strings, or an individual code unit in UTF-8 strings.
- Supplementary characters can be written in a character class, such as [丈𠀋], and a range can also be specified in a character class, such as [\u{1b000}-\u{1b0ff}].

SRELL has been tuned so that it does not slow down remarkably when a case-insensitive (icase) search is performed.
As std::regex was proposed early for C++0x (now C++11), it depends little on C++11's new features. So SRELL should work even with pre-C++11 compilers, as long as they handle C++ templates correctly. (The oldest compiler on which I have confirmed that SRELL can be used is Visual C++ 2005 in Visual Studio 2005.)
No preparation is required. Place srell*.hpp (the three files srell.hpp, srell_ucfdata2.hpp, and srell_updata.hpp) somewhere in your include path and include srell.hpp.
If you have used <regex>, you already know how to use SRELL generally.
```cpp
// Example 01:
#include <cstdio>
#include <string>
#include <iostream>
#include "srell.hpp"

int main()
{
    srell::regex e;   // Regular expression object holder.
    srell::cmatch m;  // Object which receives results.

    e = "\\d+[^-\\d]+";  // Compile a regular expression string.

    if (srell::regex_search("1234-5678-90ab-cdef", m, e))
    {
        // If use printf.
        const std::string s(m[0].first, m[0].second);
        // The code above can be replaced with one of the following lines.
        // const std::string s(m[0].str());
        // const std::string s(m.str(0));
        std::printf("result: %s\n", s.c_str());

        // If use iostream.
        std::cout << "result: " << m[0] << std::endl;
    }
    return 0;
}
```
As in this example, all classes and algorithms that belong to SRELL have been put within namespace "srell". Except for this point, the usage is basically identical to std::regex.
Please see also readme_en.txt included in the zip archive.
New features introduced in C++11 and later that SRELL may use are as follows:
- char16_t and char32_t types
- initializer_list
- char8_t type

As of June 2020, SRELL determines whether these features are available by using the following macros:
```cpp
#ifdef __cpp_unicode_characters
  #ifndef SRELL_CPP11_CHAR1632_ENABLED
  #define SRELL_CPP11_CHAR1632_ENABLED  // Do typedef for char16_t and char32_t.
  #endif
#endif

#ifdef __cpp_initializer_lists
  #include <initializer_list>
  #ifndef SRELL_CPP11_INITIALIZER_LIST_ENABLED
  #define SRELL_CPP11_INITIALIZER_LIST_ENABLED  // Make it possible to pass initializer_list as an argument.
  #endif
#endif

#ifdef __cpp_rvalue_references
  #ifndef SRELL_CPP11_MOVE_ENABLED
  #define SRELL_CPP11_MOVE_ENABLED  // Enable move semantics in constructors and operator=().
  #endif
#endif

#ifdef __cpp_char8_t
  #ifdef __cpp_lib_char8_t
  #define SRELL_CPP20_CHAR8_ENABLED 2  // Both char8_t support and std::u8string support.
  #else
  #define SRELL_CPP20_CHAR8_ENABLED 1  // Only char8_t support.
  #endif
#endif
```
If your compiler does not set the __cpp_* macros even though the corresponding features are actually available, you can turn on the feature(s) you need by defining the SRELL_CPP_* macro(s) above before including SRELL.
Although I develop and check SRELL mainly on VC++ and MinGW, according to some tests on Compiler Explorer as of May 2019, at least the following compilers can also compile sample code that performs a regex search with srell::u16regex in SRELL 2.200 against UTF-16 strings, and generate assembly output:
Sample code that uses char or wchar_t can be compiled by x86-64 gcc 4.1.2, which seems to be the oldest compiler available in Compiler Explorer.
The expressions defined in ECMAScript 2021 Specification 22.2 RegExp (Regular Expression) Objects are available.
By default, the u flag is assumed to be always set.
Since version 4.000, when the unicodesets flag is specified, SRELL behaves equivalently to the v flag mode, which is expected to be added to a future version of ECMAScript. For the details of the v mode, see the proposal page.
The detailed list of supported expressions is as follows:
Characters | |
---|---|
. |
Matches any character but LineTerminator (i.e., any code point but U+000A, U+000D, U+2028, and U+2029).
Note: The |
\0 |
Matches NULL ( |
\t |
Matches Horizontal Tab ( |
\n |
Matches Line Feed ( |
\v |
Matches Vertical Tab ( |
\f |
Matches Form Feed ( |
\r |
Matches Carriage Return ( |
\cX |
Matches a control character corresponding to ( |
\\ |
Matches a backslash ( |
\xHH |
Matches a character whose code unit value in UTF-16 is identical to the value represented by two hexadecimal digits
Because code unit values |
\uHHHH |
Matches a character whose Unicode code point is identical to the value represented by four hexadecimal digits
SRELL 2.500-: When sequential |
\u{H...} |
Matches a character whose Unicode code point is identical to the value represented by one or more hexadecimal digits
Note: This expression has been available since ECMAScript 6.0. In SRELL up to version 2.001, what could be specified as |
\ |
When a
Note: In the |
Any character but ^$.*+?()[]{}|\/ |
Represents that character itself. |
Alternatives | |
A|B |
Matches a sequence of regular expressions A or B. An arbitrary number of |
Character Class | |
[] |
A character class. A set of characters:
Examples when case insensitive search is performed (when the
Although in Perl's regular expression,
If regular expressions contain a mismatched |
In the v mode (when the
Per level of
In the v mode, the eight characters
Moreover, the following 18 double punctuators are reserved in the v mode for future use. They cannot be written in []. If written, SRELL throws
|
|
Predefined Character Classes | |
\d |
Equivalent to |
\D |
Equivalent to |
\s |
Equivalent to
Note: Strictly speaking, this consists of the union of WhiteSpace and LineTerminator. Whenever some code point(s) were to be added to category Zs in Unicode, the number of code points that |
\S |
Equivalent to |
\w |
Equivalent to |
\W |
Equivalent to |
\p{...} |
Matches any character that has the Unicode property specified in "
For the details about what can be specified in "
In the v mode, properties of strings (Unicode properties that match sequences of characters) are also supported. They can be used also in the character class, except negated character classes (
Note: This expression is available since SRELL 2.000, following the enhancement of RegExp in ES2018/ES9.0. |
\P{...} |
Matches any character that does not have the Unicode property specified in "
Unlike
Note: Available since SRELL 2.000, following the enhancement of RegExp in ES2018/ES9.0. |
Quantifiers | |
* *? |
Repeats matching the preceding expression 0 or more times.
If this appears without a preceding expression, |
+ +? |
Repeats matching the preceding expression 1 or more time(s). |
? ?? |
Repeats matching the preceding expression 0 or 1 time(s). |
{n} |
Repeats matching the preceding expression exactly
If regular expressions contain a mismatched |
{n,} {n,}? |
Repeats matching the preceding expression at least |
{n,m} {n,m}? |
Repeats matching the preceding expression
If an invalid range in {} is specified like |
Brackets and backreference | |
(...) |
Grouping of regular expressions and capturing a matched string. Every pair of capturing brackets is assigned with a number starting from 1 in the order that its left roundbracket
If regular expressions contain a mismatched When a pair of capturing roundbrackets itself is bound with a quantifier or it is inside another pair of brackets having a quantifier, the captured string by the pair is cleared whenever a repetition happens. So, any captured string cannot be carried over to the next loop. |
\N (N is a positive integer) |
Backreference. When
For example,
In RegExp of ECMAScript, capturing brackets are not required to appear prior to its corresponding backreference(s). So expressions such as
When a pair of brackets does not capture anything, it is treated as having captured the special |
(?<NAME>...) |
Identical to
For example, in the case of Note: Available since SRELL 2.000, following the enhancement of RegExp in ES2018/ES9.0. |
\k<NAME> |
References to a substring captured by the pair of brackets named Note: Available since SRELL 2.000, following the enhancement of RegExp in ES2018/ES9.0. |
(?:...) |
Grouping. Unlike |
Assertions | |
^ |
Matches at the beginning of the string.
When the |
$ |
Matches at the end of the string.
When the |
\b |
Out of a character class: matches a boundary between
Inside a character class: matches BEL ( |
\B |
Out of a character class: matches any boundary where
Inside a character class: |
(?=...) |
A zero-width positive lookahead assertion. For example, |
(?!...) |
A zero-width negative lookahead assertion. For example,
Incidentally, expression |
(?<=...) |
A zero-width positive lookbehind assertion. For example,
Note: In SRELL 1, the number of characters matched with regular expressions inside a lookbehind assertion must be a fixed-length, such as |
(?<!...) |
A zero-width negative lookbehind assertion. For example,
Note: In SRELL 1 the number of characters matched with regular expressions inside a lookbehind assertion must be a fixed-length; otherwise |
'\', or a combination of a backslash and a character which is not explained in the table above appears, error_escape is thrown. Up to version 2.300, SRELL interpreted the latter as "representing the following character itself", but since version 2.301 it handles it as an error in accordance with the ECMAScript specification.
/\1\u0030/, or 2) write the digit character inside a character class, which consists of that character only, such as /\1[0]/. SRELL's pattern compiler translates both types of the expressions into the same internal representation.
\ooo and \0ooo. See also the ECMAScript Specification.
[^] (the complementary set of the empty set) can be used for that purpose.
/(?=\p{sc=Latin})\p{Ll}/ means any lower case letter of the Latin script). Similarly, the negative lookahead assertion can be used to do the subtraction (for example, /(?!\p{sc=Latin})\p{Ll}/ means any lower case letter that is not of the Latin script).
For Unicode support, SRELL has the following typedefs and extensions which do not exist in <regex>:

Prefix and interpretation of string | Specialised with ... | basic_regex | match_results | sub_match | Note |
---|---|---|---|---|---|
u8- | char8_t or char | u8regex | u8cmatch, u8smatch | u8csub_match, u8ssub_match | char8_t is used only when char8_t is supported (detected by checking whether the __cpp_char8_t or SRELL_CPP20_CHAR8_ENABLED macro is defined). Otherwise, these types are just aliases of u8c- types shown below. |
u16- | char16_t | u16regex | u16cmatch, u16smatch | u16csub_match, u16ssub_match | Defined only when char16_t and char32_t are supported (detected by checking whether the __cpp_unicode_characters or SRELL_CPP11_CHAR1632_ENABLED macro is defined). |
u32- | char32_t | u32regex | u32cmatch, u32smatch | u32csub_match, u32ssub_match | Same condition as u16-. |
u8c- | char | u8cregex | u8ccmatch, u8csmatch | u8ccsub_match, u8cssub_match | |
u16w- | wchar_t | u16wregex | u16wcmatch, u16wsmatch | u16wcsub_match, u16wssub_match | Defined only when 0xFFFF <= WCHAR_MAX < 0x10FFFF. |
u32w- | wchar_t | u32wregex | u32wcmatch, u32wsmatch | u32wcsub_match, u32wssub_match | Defined only when WCHAR_MAX >= 0x10FFFF. |
u1632w- | wchar_t | u1632wregex | u1632wcmatch, u1632wsmatch | u1632wcsub_match, u1632wssub_match | When 0xFFFF <= WCHAR_MAX < 0x10FFFF, identical to u16w-. When WCHAR_MAX >= 0x10FFFF, identical to u32w-. Unlike u16w- and u32w-, these u1632w- types are always available on condition that WCHAR_MAX is equal to or greater than 0xFFFF. |
The meaning of each prefix is as follows:
- u8-: The behaviour depends on whether the compiler supports the char8_t type. It is detected by checking whether the __cpp_char8_t or SRELL_CPP20_CHAR8_ENABLED macro is defined or not:
  - char8_t supported: handles an array of char8_t and an instance of std::u8string as a UTF-8 string.
  - char8_t not supported: identical to the "u8c-" prefix. Defined as mere aliases of "u8c-" types shown below.

  Suitable for UTF-8 string literals (u8"...").
- u16-: Handles an array of char16_t and an instance of std::u16string as a UTF-16 string. Suitable for UTF-16 string literals (u"...").
- u32-: Handles an array of char32_t and an instance of std::u32string as a UTF-32 string. Suitable for UTF-32 string literals (U"...").
- u8c-: Handles an array of char and an instance of std::string as a UTF-8 string. (Introduced in SRELL version 2.100. Until version 2.002, the "u8-" prefix was used for this kind of type.)
- u16w-: Handles an array of wchar_t and an instance of std::wstring as a UTF-16 string. (Defined only when WCHAR_MAX is equal to or more than 0xFFFF and less than 0x10FFFF.)
- u32w-: Handles an array of wchar_t and an instance of std::wstring as a UTF-32 string. (Defined only when WCHAR_MAX is equal to or more than 0x10FFFF.)
- u1632w-: When WCHAR_MAX is equal to or more than 0xFFFF and less than 0x10FFFF, identical to u16w- above. When WCHAR_MAX is equal to or more than 0x10FFFF, identical to u32w- above. Types of this prefix are available in SRELL version 2.930 and later.
Although omitted from the table above, regex_iterator and regex_token_iterator types that have the u(8c?|16w?|32w?|1632w) prefixes are also defined based on the same rules.
Basic use of Unicode support versions is as follows:
```cpp
srell::u8regex u8re(u8"UTF-8 Regular Expression");
srell::u8cmatch u8cm;  // -smatch instead of -cmatch if target string is of basic_string type. And so on.
std::printf("%s\n", srell::regex_search(u8"UTF-8 target string", u8cm, u8re) ? "found!" : "not found...");

srell::u16regex u16re(u"UTF-16 Regular Expression");
srell::u16cmatch u16cm;
std::printf("%s\n", srell::regex_search(u"UTF-16 target string", u16cm, u16re) ? "found!" : "not found...");

srell::u32regex u32re(U"UTF-32 Regular Expression");
srell::u32cmatch u32cm;
std::printf("%s\n", srell::regex_search(U"UTF-32 target string", u32cm, u32re) ? "found!" : "not found...");

srell::u1632wregex u1632wre(L"UTF-16 or UTF-32 Regular Expression");
srell::u1632wcmatch u1632wcm;
std::printf("%s\n", srell::regex_search(L"UTF-16 or UTF-32 target string", u1632wcm, u1632wre) ? "found!" : "not found...");

srell::u16wregex u16wre(L"UTF-16 Regular Expression");
srell::u16wcmatch u16wcm;
std::printf("%s\n", srell::regex_search(L"UTF-16 target string", u16wcm, u16wre) ? "found!" : "not found...");
// The three lines above and the ones below are mutually exclusive.
// If wchar_t is less than 21-bit, the ones above are available;
// if equal to or more than, the ones below are available.
srell::u32wregex u32wre(L"UTF-32 Regular Expression");
srell::u32wcmatch u32wcm;
std::printf("%s\n", srell::regex_search(L"UTF-32 target string", u32wcm, u32wre) ? "found!" : "not found...");
```
With compilers prior to C++11, only "u8c-" and "u16w-" types are available if wchar_t is at least 16 bits and less than 21 bits wide, and only "u8c-" and "u32w-" types are available if wchar_t is at least 21 bits wide.
However, even in such environments, "u8c-", "u16-" and "u32-" types are available if code such as the following is put before including SRELL:
```cpp
typedef uint_least16_t char16_t;  // Do typedef for a type that can have a 16-bit value.
typedef uint_least32_t char32_t;  // Do typedef for a type that can have a 32-bit value.

namespace std
{
    typedef basic_string<char16_t> u16string;
    typedef basic_string<char32_t> u32string;
}

#define SRELL_CPP11_CHAR1632_ENABLED  // Make them available manually.
```
Incidentally, handling of UTF-8 or UTF-16 is performed by u8regex_traits or u16regex_traits passed to basic_regex as a template argument. By using these classes it is possible, for example, to make a class that handles UTF-16 strings stored in an array of uint32_t, such as basic_regex<uint32_t, u16regex_traits<uint32_t> >.
In SRELL 2.000 and later, the following member functions have been added to the match_results class for the named capture feature:

```cpp
difference_type length(const string_type &sub) const;
difference_type position(const string_type &sub) const;
string_type str(const string_type &sub) const;
const_reference operator[](const string_type &sub) const;

// The following ones are available in SRELL 2.650 and later.
difference_type length(const char_type *sub) const;
difference_type position(const char_type *sub) const;
string_type str(const char_type *sub) const;
const_reference operator[](const char_type *sub) const;
```

Basically, these can be used in the same way as the member functions having the same names in <regex>. The only difference is that these take the group name string as a parameter, instead of the group number corresponding to a pair of parentheses.

```cpp
// Example.
srell::regex e("-(?<digits>\\d+)-");
srell::cmatch m;

if (srell::regex_search("1234-5678-90ab-cdef", m, e))
{
    const std::string by_number(m.str(1));       // access by paren's number. a feature of std::regex.
    const std::string by_name(m.str("digits"));  // access by paren's name. an extension of SRELL.

    std::printf("results: bynumber=%s byname=%s\n", by_number.c_str(), by_name.c_str());
}
// results: bynumber=5678 byname=5678
```
The following flag options have been added to SRELL:

```cpp
namespace regex_constants
{
    static const syntax_option_type dotall;  // (Since SRELL 2.000)
        // Single-line mode. If specified, the behaviour of '.' is changed.
        // This flag option corresponds to //s of ECMAScript and Perl 5.

    static const syntax_option_type unicodesets;  // (Since SRELL 4.000)
        // For using v mode.
}
```

Like the other values of the syntax_option_type type, these values are also defined in basic_regex.
The following error type values have been added to SRELL:

```cpp
namespace regex_constants
{
    static const error_type error_utf8;  // (Since SRELL 2.630)
        // Invalid UTF-8 sequence is found in a regular expression passed to basic_regex.

    static const error_type error_property;  // (Since SRELL 3.010)
        // Unknown or unsupported name or value is specified in \p{...} or \P{...}.

    static const error_type error_noescape;  // (Since SRELL 4.000; v mode only)
        // ( ) [ ] { } / - \ | needs to be escaped by using \ in the character class.

    static const error_type error_operator;  // (Since SRELL 4.000; v mode only)
        // Operation error in the character class. Reserved double punctuators are
        // found, or different operations are used at the same level of [].

    static const error_type error_complement;  // (Since SRELL 4.000; v mode only)
        // Complement of strings cannot be used. \P{POSName}, [^\p{POSName}],
        // or [^\q{strings}] where POSName is a name of property-of-strings is found.
}
```
In SRELL 2.600 and later, there are overloaded functions that take three BidirectionalIterator parameters:

```cpp
template <class BidirectionalIterator, class Allocator, class charT, class traits>
bool regex_search(
    BidirectionalIterator first,
    BidirectionalIterator last,
    BidirectionalIterator lookbehind_limit,
    match_results<BidirectionalIterator, Allocator> &m,
    const basic_regex<charT, traits> &e,
    const regex_constants::match_flag_type flags = regex_constants::match_default);

template <class BidirectionalIterator, class charT, class traits>
bool regex_search(
    BidirectionalIterator first,
    BidirectionalIterator last,
    BidirectionalIterator lookbehind_limit,
    const basic_regex<charT, traits> &e,
    const regex_constants::match_flag_type flags = regex_constants::match_default);
```

The third iterator, lookbehind_limit, specifies how far back regex_search() may read the sequence when a lookbehind assertion is performed.

```cpp
const char text[] = "0123456789abcdefghijklmnopqrstuvwxyz";
const char* const begin = text;
const char* const end = text + std::strlen(text);
const char* const first = text + 10;  // Sets the position of 'a'.
const srell::regex re("(?<=^\\d+).");
srell::cmatch match;

std::printf("matched %d\n", srell::regex_search(first, end, match, re));
// Does not match as lookbehind is performed only in the range [first, end).

std::printf("matched %d\n", srell::regex_search(first, end, begin, match, re));
// Matches because regex_search is allowed to lookbehind until begin.
// I.e., in a three-iterators version, searching against the sequence
// [begin, end), begins at first in the sequence.
```
As the example above shows, in the three-iterator versions ^ matches at begin (the third iterator) instead of first (the first iterator).
When a three-iterator version is called, the position() member of match_results returns the distance from the position passed as the third iterator, while prefix().first of match_results is set to the position passed as the first iterator.
* With the introduction of these three-iterator overloads, the mechanism used in SRELL 2.300~2.500 has been removed.
<regex> has six regular expression engines, including an ECMAScript-based one, whereas SRELL has a single engine that is compatible with ECMAScript's RegExp. Because of this difference, the following flag options defined in <regex> are ignored in SRELL even if specified:
- syntax_option_type (and flag_type of basic_regex): nosubs, optimize, collate, basic, extended, awk, grep, egrep (all but icase and multiline)
- match_flag_type: match_any, format_sed
The comparison between the ECMAScript mode of <regex> and SRELL is as follows:
<regex>'s ECMAScript mode consists of the expressions defined in the ECMAScript specification, third edition,
- (MINUS) Unicode dependent matters (such as the meaning of \s)
+ (PLUS) locale dependent matters
+ (PLUS) [:class name:], [.class name.], [=class name=] expressions.

Although both are based on the same ECMAScript regular expression definition, neither <regex> nor SRELL is a superset of the other.
Types other than u8regex, u8cregex, u16regex, and u16wregex treat an input string as a sequence of Unicode values. For example, when CHAR_BIT is 8, srell::regex (typedef of srell::basic_regex<char>) interprets 0x00-0xFF in an input string as U+0000-U+00FF, respectively.
Because the characters represented by U+0000-U+00FF in Unicode are identical to ISO-8859-1, it can be assumed that srell::regex supports ISO-8859-1.
srell::regex can also be used to find a specific pattern of bytes in binary data.
This also applies to srell::wregex (typedef of srell::basic_regex<wchar_t>). It interprets an input as a sequence of Unicode values in the range 0x00-WCHAR_MAX.
The suitable type to use with the W functions of WinAPI is srell::u16wregex or srell::u1632wregex, which support UTF-16, not srell::wregex, which in effect supports only UCS-2.
The implementations of the following functions of SRELL have been simplified to avoid redundant overheads:
- basic_regex::assign(): In <regex>, when an exception is thrown (when compiling a regular expression string fails), *this remains unchanged (cf. 11 in [re.regex.assign]), whereas *this is cleared in SRELL. This is because when compiling begins, SRELL does not keep the old contents anywhere.
- match_results::operator[](size_type n): While <regex> guarantees safety even when n >= match_results::size() (i.e., out-of-range access) (cf. 8 in [re.results.acc]), SRELL does not. Guaranteeing safety needs an additional dummy member of the sub_match type only for the purpose of preparing for out-of-range access.

The backtracking algorithm used by the regular expression engine of ECMAScript (and also Perl, on which it is based) can require exponential time to search when a regular expression includes nested quantifiers, or consecutive expressions each having a quantifier, whose character sets are not mutually exclusive. The following are well-known examples:
Unfortunately, no fundamental measure against this problem that can be applied in every situation has been found yet. So, to avoid holding control for a long time, SRELL throws regex_error(regex_constants::error_complexity) when matching from a particular position fails repeatedly more than a certain number of times.
The default value of this limit is 16777216 (256 to the third power). It can be changed by setting an arbitrary value to the limit_counter member variable of an instance of the basic_regex type passed to an algorithm function, such as regex_search() and regex_match().
SRELL version 2.300~2.500 had the following extensions:
- match_results: BidirectionalIterator lookbehind_limit (For specifying the limit of reading backwards for lookbehind)
- match_flag_type: match_lblim_avail (Flag option for telling an algorithm function that match_results.lookbehind_limit is available and should be taken into account)

If the match_lblim_avail flag option is set, when a lookbehind assertion is performed, the lookbehind_limit member of an instance of the match_results type passed to an algorithm function is treated as "the limit of a sequence until where the algorithm function can lookbehind".

```cpp
const char text[] = "0123456789abcdefghijklmnopqrstuvwxyz";
const char* const begin = text;
const char* const end = text + std::strlen(text);
const char* const first = text + 10;  // Sets the position of 'a'.
const srell::regex re("(?<=^\\d+).");
srell::cmatch match;

match.lookbehind_limit = begin;

std::printf("matched %d\n", srell::regex_search(first, end, match, re));
// Does not match as lookbehind is performed only in the range [first, end).

std::printf("matched %d\n", srell::regex_search(first, end, match, re, srell::regex_constants::match_lblim_avail));
// Matches because regex_search is allowed to lookbehind until match.lookbehind_limit.
// I.e., when match_lblim_avail specified, searching against the sequence
// [match.lookbehind_limit, end), begins at first in the sequence.
```

As shown in the example above, when match_lblim_avail is specified, ^ matches at match.lookbehind_limit instead of first.
In SRELL 2.600 and later, the limit position until where regex_search can lookbehind can be specified as an argument passed to the function. Thus, the mechanism mentioned above became obsolete and was removed.
It has become clear that the C++ committee no longer has any intention to improve or enhance <regex>. Thus, I now think that adding SRELL's own overloads of regex_search() is not likely to conflict with any future enhancements to <regex> in the C++ standard.
Until version 2.002, the "u8-" prefix was used to mean that "This class handles a sequence of the char type as a UTF-8 string". However, considering that C++20 supports the char8_t type for UTF-8, it is expected that the C++ standard library will begin to use the "u8-" prefix for classes and algorithms that handle char8_t.
Thus, to avoid inconsistency with the naming convention of the standard library, since version 2.100 SRELL has used the "u8c-" prefix instead of "u8-" to mean that "This class handles a sequence of char as a UTF-8 string".
- basic_regex: u8cregex
- match_results: u8ccmatch, u8csmatch
- sub_match: u8ccsub_match, u8cssub_match
- regex_iterator: u8ccregex_iterator, u8csregex_iterator
- regex_token_iterator: u8ccregex_token_iterator, u8csregex_token_iterator
The freed "u8-" prefix is now associated with the char8_t type. But if your compiler does not support char8_t, for backwards compatibility, class names having the "u8-" prefix are defined as aliases (i.e., typedef) of the corresponding classes that have the "u8c-" prefix in their names, respectively.

Prefix | SRELL -2.002 | SRELL 2.100- (when char8_t not supported by compiler) | SRELL 2.100- (when char8_t supported by compiler) |
---|---|---|---|
u8- | handles a sequence of char as UTF-8 | handles a sequence of char as UTF-8 | handles a sequence of char8_t as UTF-8 |
u8c- | (Prefix did not exist) | handles a sequence of char as UTF-8 | handles a sequence of char as UTF-8 |
SRELL version 1.xxx supported the regular expressions defined in ECMAScript 2017 (ES8) Specification 21.2 RegExp (Regular Expression) Objects plus fixed-length lookbehind assertions.
For the following reason, SRELL version 1.nnn behaved differently from SRELL version 2.000 and later when a lookbehind assertion was used.
TC39, which maintains the ECMAScript standard, adopted variable-length lookbehind assertions for its RegExp, instead of the fixed-length ones that are supported by many script languages, such as Perl 5, Python, etc. At a glance, the former may seem to be just a superset of the latter, but the two in fact return different results in some cases:

```
"abcd" =~ /(?<=(.){2})./
// In the case of fixed-length lookbehind: $& = "c", $1 = "b".
// As the automaton runs from left to right even in a lookbehind assertion,
// "b" just before "c" is the last one that $1 captured.
// In the case of variable-length lookbehind: $& = "c", $1 = "a".
// As the automaton runs from right to left in a lookbehind assertion,
// "a" being the second next of "c" is the last one that $1 captured.
```

While SRELL 1 supported fixed-width lookbehind assertions as an extension, SRELL 2.000 and later support variable-length lookbehind assertions, following the enhancement of RegExp in JavaScript. Thus, there was a breaking change between SRELL 1.401 and SRELL 2.000.
\Z was excluded from the proposal at the 2021-12 meeting.
(?imsx-imsx:subexpression)) only. The unbounded form ((?imsx-imsx)) was excluded from the proposal at the 2021-12 meeting.