This article summarizes regular expression syntax and usage in C++. Most of the content comes from Bo Qian’s modern C++ tutorial series: https://www.youtube.com/playlist?list=PL5jc9xFGsL8FWtnZBeTqZBbniyw0uHyaH
If you are new to C++, I highly recommend following his channel; his tutorials are well organized and easy to follow. If you just want a quick pass through the content, scroll down. Let’s start.
Regular expression: a sequence of characters that defines a search pattern. Usually, such patterns are used by string searching algorithms for “find” or “find and replace” operations on strings, or for input validation. (Wikipedia)
Modern C++ supports six regular expression grammars:
– ECMAScript (default)
– basic
– extended
– awk
– grep
– egrep
The default grammar in C++ is ECMAScript, and this article focuses on it exclusively.
1. Regular Expression Syntax
To use regular expressions in C++, first include the regex header:
#include <regex>
You can then define a regular expression as follows:
regex e("abc");
To select a different grammar, pass a flag when constructing the regular expression:
regex e("abc", regex_constants::grep); // switch the grammar from ECMAScript to grep
The following code gives an overview of regular expression syntax in C++. Copy it into your IDE, uncomment one regex definition at a time, and check the result yourself.
#include <iostream>
#include <regex>
#include <string>
using namespace std;

// Basic grammar
int main() {
    string str;
    // Read one whitespace-delimited word per iteration; stop at end of input
    while (cin >> str) {
        // Select a grammar other than the default ECMAScript, here grep
        //regex e("^abc.", regex_constants::grep);

        // Matches exactly "abc"
        //regex e("abc");

        // Matches "abc", ignoring case
        //regex e("abc", regex_constants::icase);

        // '.' matches any character except a newline
        //regex e("abc.");

        // '?' matches zero or one occurrence of the preceding character
        // e.g. "ab" and "abc" are matches
        //regex e("abc?");

        // '*' matches zero or more occurrences of the preceding character
        //regex e("abc*");

        // '+' matches one or more occurrences of the preceding character
        //regex e("abc+");

        // [...] matches any one character listed inside the square brackets
        // e.g. "abcd", "abc", "abcccddd", "abdcdc" are matches
        //regex e("ab[cd]*");

        // [^...] matches any one character not listed inside the square brackets
        //regex e("ab[^cd]*");

        // {n} matches exactly n occurrences of the preceding element
        // e.g. "abccc", "abcdd", "abcdc" are matches
        // e.g. "abcdcdcd", "abdddccc", "ab" are not matches
        //regex e("ab[cd]{3}");

        // {n,} matches n or more occurrences of the preceding element
        //regex e("ab[cd]{3,}");

        // {n,m} matches between n and m occurrences (here 3, 4 or 5)
        //regex e("ab[cd]{3,5}");

        // '|' means OR
        // e.g. "abc", "def", "deg" are matches
        // "abcd", "defg" are not matches
        //regex e("abc|de[fg]");

        // Matches a literal ']'; the backslash itself must be escaped
        // inside a C++ string literal
        //regex e("\\]");

        // () defines a group, and \1 refers back to what group 1 matched
        // e.g. "abcdeeeeabc" is a match
        //regex e("(abc)de+\\1");

        // (de+) is group 2: if it matched "deee", then \2 must also
        // be "deee", e.g. "abcdeeedeeeab" is a match
        //regex e("(ab)*c(de+)\\2\\1");

        // A simplified pattern for .com e-mail addresses
        // [[:w:]] is a word character: letter, digit, or underscore
        //regex e("[[:w:]]+@[[:w:]]+\\.com");

        // '^' requires "abc" to be at the beginning of the string
        //regex e("^abc", regex_constants::icase);

        // '$' requires "abc." to be at the end of the string
        regex e("abc.$", regex_constants::icase);

        /*
        Character classes (used inside brackets, e.g. [[:d:]]):
        [:s:] – a white space character;
        [:w:] – a word character;
        [:d:] – a decimal digit character;
        [:upper:] – an uppercase character;
        [:lower:] – a lowercase character;
        [:alnum:] – an alpha-numerical character;
        [:alpha:] – an alphabetic character;
        [:blank:] – a blank character;
        [:punct:] – a punctuation mark character
        */

        // regex_match: does the whole of str match e?
        bool match = regex_match(str, e);
        // regex_search: does any substring of str match e?
        bool isFound = regex_search(str, e);
        cout << (match ? "Matched" : "Not matched") << endl;
        cout << (isFound ? "Found" : "Not found") << endl << endl;
    }
    return 0;
}
2. Sub-match in Regular Expression
We can store the result of a match in an smatch object and then query the sub-matches for the specific groups defined in the regular expression. The general syntax and meaning of smatch are given below:
std::match_results<> – stores the details of a match
smatch – the match_results specialization for std::string
smatch m;
m[0].str() – The entire match (same as m.str() and m.str(0))
m[1].str() – The substring that matches the first group (same as m.str(1))
m[2].str() – The substring that matches the second group (same as m.str(2))
m.prefix() – Everything before the first matched character
m.suffix() – Everything after the last matched character
The following code gives an overview of how to use smatch:
#include <iostream>
#include <regex>
#include <string>
using namespace std;

int main() {
    string str;
    // Read one whitespace-delimited word per iteration; stop at end of input
    while (cin >> str) {
        smatch m; // typedef for std::match_results<string::const_iterator>
        // Two groups extract the user name and the domain name.
        // The dot is escaped as "\\." so that it matches a literal '.'
        regex e("([[:w:]]+)@([[:w:]]+)\\.com");
        // Search str for e and save the results to m
        regex_search(str, m, e);
        // m.size() is 0 when nothing matched, otherwise 1 + number of groups
        cout << "match size: " << m.size() << endl;
        for (size_t i = 0; i < m.size(); ++i) {
            // Print each sub-match
            cout << "m[" << i << "]: str() = " << m[i].str() << endl;
        }
        // prefix is everything before the first matched character
        cout << "m.prefix().str(): " << m.prefix().str() << endl;
        // suffix is everything after the last matched character
        cout << "m.suffix().str(): " << m.suffix().str() << endl;
    }
    return 0;
}
Here is a screenshot of the result if we run the above code:
3. Iterators in Regular Expression
Suppose we have defined a string like the following:
string str = "zhangxm01@gmail.com; zhan@163.com; zhan_j@yahoo.com";
and we want to extract every user name and domain name in it. The code in section 2 won’t work here, because regex_search stops at the first match and so only extracts the first user name and domain name.
In this case, we need a regular expression iterator or a token iterator. The following code shows how to use a regular expression iterator:
#include <iostream>
#include <regex>
#include <string>
using namespace std;

// Walk over every match with a regular expression iterator
int main() {
    string str = "zhangxm01@gmail.com; zhan@163.com; zhan_j@yahoo.com";
    // The dot is escaped as "\\." so that it matches a literal '.'
    regex e("([[:w:]]+)@([[:w:]]+)\\.com");
    // The iterator points to the first match in str
    sregex_iterator pos(str.cbegin(), str.cend(), e);
    // A default-constructed iterator is the past-the-end iterator
    sregex_iterator end;
    for (; pos != end; ++pos) {
        cout << "Matched: " << pos->str(0) << endl;
        cout << "User Name: " << pos->str(1) << endl;
        cout << "Domain: " << pos->str(2) << endl;
        cout << endl;
    }
    return 0;
}
Try it yourself! A sample screenshot of the result is provided below:
An alternative is to use a token iterator:
#include <iostream>
#include <regex>
#include <string>
using namespace std;

// regex_token_iterator: points to a single sub-match
int main() {
    string str = "zhangxm01@gmail.com; zhan@163.com; zhan_j@yahoo.com";
    // The dot is escaped as "\\." so that it matches a literal '.'
    regex e("([[:w:]]+)@([[:w:]]+)\\.com");
    // Token iterator; by default it yields sub-match 0, the entire match
    sregex_token_iterator pos(str.cbegin(), str.cend(), e);
    // A default-constructed iterator is the past-the-end iterator
    sregex_token_iterator end;
    for (; pos != end; ++pos) {
        // str() takes no argument here
        cout << "Matched: " << pos->str() << endl;
        cout << endl;
    }
    return 0;
}
The difference between the two iterators is what they point to. A regular expression iterator points to a full match result (a match_results object), so pos->str(n) can return any sub-match n. A token iterator points to a single sub-match (a sub_match object), which is why pos->str() takes no argument. By default, that sub-match is the entire match (index 0).
4. Regular Expression Replace
We can use regex_replace to replace every match of a regular expression with a formatted string, in which $1, $2, and so on refer to the captured groups.
Try the following code yourself:
#include <iostream>
#include <regex>
#include <string>
using namespace std;

// regex_replace
int main() {
    string str = "zhangxm01@gmail.com; zhan@163.com; zhan_j@yahoo.com";
    // The dot is escaped as "\\." so that it matches a literal '.'
    regex e("([[:w:]]+)@([[:w:]]+)\\.com");
    // $1 and $2 stand for group 1 and group 2 of the regex
    cout << regex_replace(str, e, "$1 is on $2") << endl;
    // Flags change the behavior of regex_replace:
    // format_no_copy drops the text that did not match,
    // format_first_only replaces only the first match
    cout << regex_replace(str, e, "$1 is on $2",
                          regex_constants::format_no_copy | regex_constants::format_first_only);
    cout << endl;
    return 0;
}
Thank you for reading. Have a nice day!