Table of Contents

With C++17, you can now use more sophisticated algorithms for pattern searches! You’ll have more control and a promising performance boost for many use cases. This article shows primary usage and runs a benchmark comparing the new techniques.

May 2022 Updates: added notes about C++20 and constexpr algorithms, updated the benchmark and compared against std::ranges::search and custom strchr versions.

Intro  

The naive approach of finding a pattern in a string is O(nm) (where n is the length of the whole string, m is the length of the pattern). There are much better alternatives. For instance, Boyer-Moore with the linear complexity.

The algorithm is, for example, used in grep - see this reference - why GNU grep is fast,

I’m not an expert in describing algorithms, so here’s an excellent introduction to Boyer-Moore:

C++17 updated std::search algorithm in two (exclusive) ways:

  • you can now use the execution policy to run the default version of the algorithm but in a parallel way.
  • you can provide a Searcher object that handles the search.

For now, as of C++20, we have three searchers, defined in the <functional> header:

  • default_searcher (delegates the search operation to the pre-C++17 standard library’s std::search)
  • boyer_moore_searcher
  • boyer_moore_horspool_searcher

Preprocessing  

Both of the algorithms, Boyer Moore and Boyer Moore Horspool, use some knowledge about the pattern string to skip fruitless comparisons. In order to be “smarter”, each algorithm does a preprocessing that analyses the input pattern. The complexity of the preprocessing usually depends on the size of the alphabet of the string.

Horspool is a simplified version of Boyer-Moore (with only a bad character rule) and uses smaller internal tables. The average complexity is linear, but the worst-case might be O(mn).

In Boost  

You might be familiar with the searching algorithms if you use boost libraries. In version 1.50 (2012, June) there was a new set of algorithms added: see boost Version 1.50.0.

In the library, there are three searchers objects:

The Series  

This article is part of my series about C++17 Library Utilities. Here’s the list of the other topics that I’ll cover:

Resources about C++17 STL:

How To Use Searchers  

C++17 provides a new overload for std::search:

template<class ForwardIterator, class Searcher>
ForwardIterator search( ForwardIterator first, ForwardIterator last,
                        const Searcher& searcher );

Each searcher usually takes two input iterators - the begin and end of a pattern, and then a binary predicate - usually, it’s an equality operator. They might also use other parameters - for example, a hashing function.

Here’s a basic example:

#include <algorithm>
#include <iostream>
#include <functional> // searchers
#include <iomanip>    // quoted

int main() {
    std::string str = "Hello Super World";
    std::string needle = "Super";
    std::cout << "looking for " << std::quoted(needle) 
              << " in " << std::quoted(str) << '\n';
    auto it = search(str.begin(), str.end(),
                    std::boyer_moore_searcher(needle.begin(), needle.end()));

    if (it != str.end())
        std::cout << "found at pos " << std::distance(str.begin(), it) << '\n';
    else
        std::cout << "...not found\n";
}

Play @Compiler Explorer.

You can also prepare a searcher object and then use it several times when looking for the same pattern. That allows us to preprocess the pattern once. See an example in the comment by JFT below.

Using Other Containers  

The important fact about std::search is that it’s a generic algorithm. And you can use it not only for strings!

Here’s a sample code for searching a pattern of numbers in a vector of integers.

std::vector<int> testVector(1000000);
std::iota(testVector.begin(), testVector.end(), 0);
std::vector vecNeedle(testVector.end() - 1000, testVector.end());

auto it = std::search(testVector.begin(), testVector.end(),
        std::boyer_moore_horspool_searcher(
                vecNeedle.begin(), vecNeedle.end()));

if (it == testVector.end())
        std::cout << "The pattern " << needle << " not found\n";

C++20 updates:  

In C++20, most standard algorithms can be used at compile-time - constexpr. This partially works for searchers. As of C++20, only the default_searcher is marked as constexpr, so you can use this functionality in a limited form:

See below:

#include <algorithm>
#include <iostream>
#include <functional> // searchers

constexpr bool IsPresent(std::string_view pattern, std::string_view str) {
    // only default_searcher is constexpr in cpp20
    auto it = std::search(str.begin(), str.end(),
                    std::default_searcher(pattern.begin(), pattern.end()));
    return it != str.end();
}

int main() {
    static_assert(IsPresent("hello", "super hello world") == true);
    static_assert(IsPresent("HELLO", "super hello world") == false);
}

Play @Compiler Explorer.

Additionally, C++20 also brings std::ranges::search algorithm. However, it’s not compatible with searchers from C++17, so you can only use a default searcher in that version. See the benchmark with an example below.

A Benchmark  

Let’s try to measure if searchers give any performance.

I wrote a test app that shows some nice performance boost for the new algorithms for this task.

Source code: github.com/fenbf/articles/cpp17/searchers/searchers.cpp

How the test works:

  • the app loads a file, like a book sample - 500kb of text,
  • the whole file content is stored in one std::string,
  • patterns are selected - N letters of the input string, you might select the front, middle, or the end of the string, the benchmark takes ITER/10 different patterns, by shifting them by one letter
  • the app uses several algorithms and runs each search ITER times.

The command line:

searchers.exe filename iterations pattern_len pos

pos: 
0   - from the start of the string, 
1   - from the middle,
> 1 - from the end

Let’s review some of the algorithms in the benchmark:

The std::string::find version:

RunAndMeasure("string::find", [&]() {
    for (size_t i = 0; i < ITERS; ++i)
    {
        std::size_t found = testString.find(needles[i % PATTERNS]);
        if (found == std::string::npos)
            std::cout << "The string " << needles[i % PATTERNS] << " not found\n";
    }
    return 0;
});

The boyer_moore_horspool version:

RunAndMeasure("boyer_moore_horspool_searcher", [&]() {
    for (size_t i = 0; i < ITERS; ++i)
    {
        auto it = std::search(testString.begin(), testString.end(),
            std::boyer_moore_horspool_searcher(
                needles[i % PATTERNS].begin(), needles[i % PATTERNS].end()));
        if (it == testString.end())
            std::cout << "The string " << needles[i % PATTERNS] << " not found\n";
    }
    return 0;
});

The C++20 ranges version:

RunAndMeasure("std::ranges::search", [&]() {
    for (size_t i = 0; i < ITERS; ++i)
    {
        auto res = std::ranges::search(testString, needles[i % PATTERNS]);
        if (res.empty())
            std::cout << "The string " << needles[i % PATTERNS] << " not found\n";
    }
    return 0;
});

There’s also one version based on strchr/memchr function suggested by Gregory Pakos; see his gist with the code @Github.

The Results  

Here are the results (i7 8700, Win 10, MSVC 2022, Release 64 bit)

Pattern at the end

The pattern is composed of 10000 letters from the end of the input text.

.\searchers.exe ..\..\..\..\GutenbergBooks\largest.txt 1000 10000 2
string length: 547412
test iterations: 1000
needle from the end
patterns count: 100
patterns len: 10000
5 first patterns, 30 letters max:
ject Gutenberg-tm trademark.
ect Gutenberg-tm trademark.  C
ct Gutenberg-tm trademark.  Co
t Gutenberg-tm trademark.  Con
 Gutenberg-tm trademark.  Cont
string::find: 393.926 ms
strchr_find: 270.201 ms
std::ranges::search: 1706.21 ms
default searcher: 756.361 ms
boyer_moore_searcher init only: 29.7993 ms
boyer_moore_searcher: 56.3499 ms
boyer_moore_horspool_searcher init only: 5.3273 ms
boyer_moore_horspool_searcher: 29.3569 ms

Please notice that the pattern is shifted:

5 first patterns, 30 letters max:
ject Gutenberg-tm trademark.
ect Gutenberg-tm trademark.  C
ct Gutenberg-tm trademark.  Co
t Gutenberg-tm trademark.  Con
 Gutenberg-tm trademark.  Cont

This hopefully makes it harder for the CPU to cache data, and thus the benchmark might be more realistic.

Here’s the graph from that benchmark run:

Pattern in the center

The pattern is now the 1000 letters in the center of the input string:

PS .\searchers.exe ..\..\..\..\GutenbergBooks\largest.txt 1000 1000 1
string length: 547412
test iterations: 1000
needle from the center...
patterns count: 100
patterns len: 1000
5 first patterns, 30 letters max:
and D.W. Briggs. Brother
Randa
nd D.W. Briggs. Brother
Randal
d D.W. Briggs. Brother
Randall
 D.W. Briggs. Brother
Randall
D.W. Briggs. Brother
Randall o
string::find: 181.393 ms
strchr_find: 138.059 ms
std::ranges::search: 852.053 ms
default searcher: 386.184 ms
boyer_moore_searcher init only: 3.8253 ms
boyer_moore_searcher: 26.3352 ms
boyer_moore_horspool_searcher init only: 0.895 ms
boyer_moore_horspool_searcher: 25.9875 ms

And the chart:

Compiler Explorer Version

The version for Compiler Explorer, it uses GCC 12.1, and -O2: https://godbolt.org/z/6z3voE6EM

string length: 11621
test iterations: 5000
needle in 1/4 of the input string from the end...
patterns count: 500
patterns len: 3155
5 first patterns, 30 letters max: 
odio morbi quis commodo odio. 
dio morbi quis commodo odio. F
io morbi quis commodo odio. Fe
o morbi quis commodo odio. Feu
 morbi quis commodo odio. Feug
string::find: 53.3118 ms
strchr_find: 50.1767 ms
std::ranges::search: 170.277 ms
default searcher: 90.7336 ms
boyer_moore_searcher init only: 161.1 ms
boyer_moore_searcher: 237.46 ms
boyer_moore_horspool_searcher init only: 42.8164 ms
boyer_moore_horspool_searcher: 282.665 ms

This time the ranges version is not as slow as in the MSVC version, and the version with searchers seems to be slower.

Quick Bench

Quick Bench: https://quick-bench.com/q/k8S-i72re2G2phZLolIERVTiZJo

Summary  

Followup post here: Preprocessing Phase for C++17’s Searchers

The article just briefly shows new capabilities that you get in C++17, and it also updated on smaller updates in C++20. While the new algorithms offer a potential boost, sometimes an optimized version of std::string::find might still be a good alternative. As always, it’s good to measure and adjust the technique to your specific environment and problem domain.

Back to you

  • Have you used new string searchers? Or do you prefer to use string::find?
  • What are your use cases?

Share your feedback in the comments below the article.