Partners: KDAB Whole Tomato Software CppDepend

20 August 2018

Speeding up Pattern Searches with Boyer-Moore Algorithm from C++17

C++17 searchers

With C++17 you can now use more sophisticated algorithms for pattern searches! Now, you’ll have more control and a promising performance boost for many use cases.

See what the options are.

Intro

The naive approach of finding a pattern in a string is O(nm) (where n is the length of the whole string, m is the length of the pattern). There are much better alternatives. For instance Boyer-Moore with the linear complexity.

The algorithm is for example used in grep - see this reference - why GNU grep is fast,

I’m not an expert in describing algorithms, so here’s an excellent introduction of Boyer-Moore:

C++17 updated std::search algorithm in two ways:

  • you can now use execution policy to run the default version of the algorithm but in a parallel way.
  • you can provide a Searcher object that handles the search.

For now we have three searchers:

  • default_searcher
  • boyer_moore_searcher
  • boyer_moore_horspool_searcher

Preprocessing

Both of the algorithms Boyer Moore and Boyer Moore Horspool use some knowledge about the pattern string so that they can skip fruitless comparisons. In order to be “smarter”, each algorithm does a preprocessing that analyses the input pattern. The complexity of the preprocessing usually depends on the size of the alphabet of the string.

Horspool is a simplified version of Boyer-Moore (with only bad character rule) and uses smaller internal tables. The average complexity is linear, but the worst case might be O(mn).

In Boost

You might be familiar with the searching algorithms if you use boost libraries. In the version 1.50 (2012, June) there was a new set of algorithms added: see boost Version 1.50.0.

In the library there are three searchers objects:

The Series

This article is part of my series about C++17 Library Utilities. Here’s the list of the other topics that I’ll cover:

Resources about C++17 STL:

How To Use

C++17 provides a new overload for std::search:

template<class ForwardIterator, class Searcher>
ForwardIterator search( ForwardIterator first, ForwardIterator last,
                        const Searcher& searcher );

Each searcher usually takes two input iterators - begin and end of a pattern, and then a binary predicate - usually it’s an equality operator. They might also use other parameters - for example, a hashing function.

All in all you can use it like:

std::string testString = "Hello Super World";
std::string needle = "Super";
auto it = search(testString.begin(), testString.end(),
                    boyer_moore_searcher(needle.begin(), needle.end()));

if (it == testString.end())
    cout << "The string " << needle << " not found\n";

Some Basic Tests

I wrote some basic tests that show some nice performance boost for the new algorithms in MSVC.

Source code: github.com/fenbf/articles/cpp17/searchers/searchers.cpp

How the test works:

  • the app loads a file, like a book sample - 500kb of text.
  • the whole file content is stored in one input string.
  • a pattern is selected - N last letters of the input string.
  • the app uses several algorithms and runs each search ITER times.

for example:

the std::string::find version:

RunAndMeasure("string::find", [&]() {
    for (size_t i = 0; i < ITERS; ++i)
    {
        std::size_t found = testString.find(needle);
        if (found == std::string::npos)
            std::cout << "The string " << needle << " not found\n";
    }
});

the boyer_moore_horspool version:

RunAndMeasure("boyer_moore_horspool_searcher", [&]() {
    for (size_t i = 0; i < ITERS; ++i)
    {
        auto it = std::search(testString.begin(), testString.end(),
            std::boyer_moore_horspool_searcher(
                needle.begin(), needle.end()));
        if (it == testString.end())
            std::cout << "The string " << needle << " not found\n";
    }
});

Here are the results (i7 4720HQ, Win 10, MSVC 2017 15.8, Release 64 bit)

The pattern is composed of 10000 letters from the end of the input text.

.\searchers.exe ..\..\SampleBooks\book-test.txt 1000 10000
string length: 547412
test iterations: 1000
pattern length: 10000
string::find: 693.449 ms
default searcher: 1102.25 ms
boyer_moore_searcher: 133.558 ms
boyer_moore_horspool_searcher: 37.0234 ms

The patter is now the last 100 letters of the input text:

.\searchers.exe ..\..\SampleBooks\book-test.txt 1000 200
string length: 547412
test iterations: 1000
pattern length: 200
string::find: 158.612 ms
default searcher: 467.518 ms
boyer_moore_searcher: 58.8752 ms
boyer_moore_horspool_searcher: 56.7017 ms

The sample results need some more investigation. For example for short patterns, the string::find version is usually faster. Also, horspool algorithm was faster than the boyer_moore option in most cases.

More performance tests are worth a separate blog post, with much better result analysis. So I’ll leave this for your experiments now.

Using Other Containers

The important fact about std::search is that it’s a generic algorithm! So you can use it not only for strings!

Here’s a sample code for searching a pattern of numbers in a vector of integers.

std::vector<int> testVector(1000000);
std::iota(testVector.begin(), testVector.end(), 0);
std::vector vecNeedle(testVector.end() - 1000, testVector.end());

auto it = std::search(testVector.begin(), testVector.end(),
        std::boyer_moore_horspool_searcher(
                vecNeedle.begin(), vecNeedle.end()));

if (it == testVector.end())
        std::cout << "The pattern " << needle << " not found\n";

Summary

Followup post here: Preprocessing Phase for C++17's Searchers

The article just briefly shows new capabilities that you get in C++17. It’s important to know that the new algorithms will not always be faster than std::string::find (for strings).

Have you used new string searchers? What are your use cases?
I showed some results from MSVC, can you run the code on GCC or Clang?

Get my free ebook about C++17!

More than 50 pages about the new Language Standard.

C++17 in detail, by Bartlomiej Filipek

© 2017, Bartlomiej Filipek, Blogger platform
Any opinions expressed herein are in no way representative of those of my employers.
This site contains ads or referral links, which provide me with a commission. Thank you for your understanding.