Regular Expression in C++

This article summarizes the regular expression syntax and usage in C++. Most of the content of this article is from Bo Qian’s modern C++ tutorial series: https://www.youtube.com/playlist?list=PL5jc9xFGsL8FWtnZBeTqZBbniyw0uHyaH

If you are new to C++, I personally highly recommend following his channel. His tutorial is well organized and easy to follow. If you just want to quickly go through the content, just scroll down. Let’s start.

Regular expression: a sequence of characters that define a search pattern. Usually, such patterns are used by string searching algorithms for “find” or “find and replace” operations on strings, or for input validation. (Wikipedia)

Since C++11, the standard library supports six regular expression grammars:

– ECMAScript (default)
– basic
– extended
– awk
– grep
– egrep

The default regular expression grammar in C++ is ECMAScript, and this article focuses on it.

1. Regular Expression Syntax

To use regular expressions in C++, first include the regex header file:

#include <regex>

Then you can easily define your own regular expression using the following syntax:

regex e("abc");

If you want to use a different regular expression grammar, add a flag when you define the regular expression:

regex e("abc", regex_constants::grep); // Now the grammar is grep instead of ECMAScript.

The following code provides an overview of how to use regular expressions in C++. You can copy and paste the code into your IDE, uncomment one regex definition at a time, and check the result yourself.

#include <iostream>
#include <regex>
#include <string>
using namespace std;

// Basic grammar
int main() {
    string str;
    // Read one whitespace-delimited word at a time until the input ends
    while (cin >> str) {
        // ECMAScript is the default grammar; a flag selects another one (here: grep)
        //regex e("^abc.", regex_constants::grep);

        // matches exactly "abc"
        //regex e("abc");

        // matches "abc", ignoring case
        //regex e("abc", regex_constants::icase);

        // '.' matches any single character except a newline
        //regex e("abc.");

        // '?' matches 0 or 1 of the preceding character
        // e.g. "ab" and "abc" are matches
        //regex e("abc?");

        // '*' matches 0 or more of the preceding character
        //regex e("abc*");

        // '+' matches 1 or more of the preceding character
        //regex e("abc+");

        // […] matches any one character listed inside the square brackets
        // e.g. "abcd", "abc", "abcccddd", "abdcdc" are matches
        //regex e("ab[cd]*");

        // [^…] matches any one character not listed inside the square brackets
        //regex e("ab[^cd]*");

        // {n} matches exactly n of the preceding element
        // e.g. "abccc", "abcdd", "abcdc" are matches
        // e.g. "abcdcdcd", "abdddccc", "ab" are not matches
        //regex e("ab[cd]{3}");

        // {n,} matches n or more of the preceding element
        //regex e("ab[cd]{3,}");

        // {n,m} matches between n and m of the preceding element (here 3, 4 or 5)
        //regex e("ab[cd]{3,5}");

        // '|' means OR
        // e.g. "abc", "def", "deg" are matches
        // "abcd", "defg" are not matches
        //regex e("abc|de[fg]");

        // matches a literal ']' (the backslash must be doubled in a C++ string literal)
        //regex e("\\]");

        // () defines a group; \1 refers back to the text matched by group 1
        // e.g. "abcdeeeeabc" is a match
        //regex e("(abc)de+\\1");

        // (de+) is group 2: if it matched "deee", then \2 must also be "deee"
        //regex e("(ab)*c(de+)\\2\\1");

        // matches simple e-mail addresses that end in ".com"
        // [[:w:]] is a word character: letter, digit, or underscore
        //regex e("[[:w:]]+@[[:w:]]+\\.com");

        // '^' anchors "abc" to the beginning of the string
        //regex e("^abc", regex_constants::icase);

        // '$' anchors "abc." to the end of the string
        regex e("abc.$", regex_constants::icase);

        /*
        Character classes (used inside a bracket expression, e.g. [[:d:]]):
        [:s:] – a white space character;
        [:w:] – a word character;
        [:d:] – a decimal digit character;
        [:upper:] – an uppercase character;
        [:lower:] – a lowercase character;
        [:alnum:] – an alpha-numerical character;
        [:alpha:] – an alphabetic character;
        [:blank:] – a blank character;
        [:punct:] – a punctuation mark character
        */

        // regex_match: does the whole of str match e?
        bool match = regex_match(str, e);
        // regex_search: does any substring of str match e?
        bool isFound = regex_search(str, e);
        cout << (match ? "Matched" : "Not matched") << endl;
        cout << (isFound ? "Found" : "Not found") << endl << endl;
    }
    return 0;
}



2. Sub-match in Regular Expression

We can store the match result in an smatch object and use it to get the details of each group defined in the regular expression. The general syntax and the meaning of smatch are provided below:

std::match_results<> – stores the details of a match
smatch – std::match_results for matches in a std::string

smatch m;
m[0].str() – The entire match (same as m.str(), m.str(0))
m[1].str() – The substring that matches the first group (same as m.str(1))
m[2].str() – The substring that matches the second group
m.prefix() – Everything before the first matched character
m.suffix() – Everything after the last matched character

The following code provides an overview of how to use smatch:

#include <iostream>
#include <regex>
#include <string>
using namespace std;

int main() {
    string str;
    // Read one whitespace-delimited word at a time until the input ends
    while (cin >> str) {
        smatch m;  // typedef for std::match_results<string::const_iterator>
        // Two groups extract the user name and the domain name
        regex e("([[:w:]]+)@([[:w:]]+)\\.com");
        // Search str for a match of e and save the details in m
        bool found = regex_search(str, m, e);
        cout << "match size: " << m.size() << endl;
        for (size_t i = 0; i < m.size(); ++i) {
            // Print each sub-match; m[0] is the entire match
            cout << "m[" << i << "]: str() = " << m[i].str() << endl;
        }
        if (found) {
            // prefix is everything before the first matched character
            cout << "m.prefix().str(): " << m.prefix().str() << endl;
            // suffix is everything after the last matched character
            cout << "m.suffix().str(): " << m.suffix().str() << endl;
        }
    }
    return 0;
}


[Screenshot: result of running the above code]


3. Iterators in Regular Expression

If we have defined a string like the following:

string str = "zhangxm01@gmail.com; zhan@163.com; zhan_j@yahoo.com";

and we want to extract every e-mail address and domain name in this string, what should we do? Unfortunately, the code in section two won’t work, because regex_search stops at the first match and so only extracts the first e-mail address and domain name.

In this case, we need to use a regex iterator or a regex token iterator. The following code shows how to use a regex iterator:

#include <iostream>
#include <regex>
#include <string>
using namespace std;

// We can use a regular expression iterator here!
int main() {
    string str = "zhangxm01@gmail.com; zhan@163.com; zhan_j@yahoo.com";
    regex e("([[:w:]]+)@([[:w:]]+)\\.com");
    // regex iterator: visits every match of e in str
    sregex_iterator pos(str.cbegin(), str.cend(), e);
    // the default constructor defines the past-the-end iterator
    sregex_iterator end;
    for (; pos != end; ++pos) {
        cout << "Matched: " << pos->str(0) << endl;
        cout << "User Name: " << pos->str(1) << endl;
        cout << "Domain: " << pos->str(2) << endl;
        cout << endl;
    }
    return 0;
}


Try it yourself!

[Screenshot: sample result of running the above code]

An alternative way to do this is to use a token iterator:

#include <iostream>
#include <regex>
#include <string>
using namespace std;

// regex_token_iterator: points to a sub-match
int main() {
    string str = "zhangxm01@gmail.com; zhan@163.com; zhan_j@yahoo.com";
    regex e("([[:w:]]+)@([[:w:]]+)\\.com");
    // regex token iterator: by default it yields the entire match
    sregex_token_iterator pos(str.cbegin(), str.cend(), e);
    // the default constructor defines the past-the-end iterator
    sregex_token_iterator end;
    for (; pos != end; ++pos) {
        // pos points to a single sub-match, so str() takes no argument
        cout << "Matched: " << pos->str() << endl;
        cout << endl;
    }
    return 0;
}


The difference between the regex iterator and the token iterator is what they point to. A regex iterator points to a complete match result (an smatch), so pos->str(n) can return any of its sub-matches. A token iterator points to a single sub-match, which is why pos->str() takes no argument. By default that sub-match is the entire match (index 0).


4. Regular Expression Replace

We can use regex_replace to replace every match of a regular expression with a format string, and the format string can refer to the groups of the match.

Try the following code yourself:

#include <iostream>
#include <regex>
#include <string>
using namespace std;

// regex_replace
int main() {
    string str = "zhangxm01@gmail.com; zhan@163.com; zhan_j@yahoo.com";
    regex e("([[:w:]]+)@([[:w:]]+)\\.com");
    // $1 and $2 represent group 1 and group 2 in the regex
    cout << regex_replace(str, e, "$1 is on $2") << endl;
    // We can add more flags in the regex_replace function:
    // format_no_copy drops the text that is not part of a match,
    // format_first_only replaces only the first match
    cout << regex_replace(str, e, "$1 is on $2",
                          regex_constants::format_no_copy | regex_constants::format_first_only);
    cout << endl;
    return 0;
}



Thank you for reading. Have a nice day!
