Preprocessing Phase for C++17's Searchers

Table of Contents

Searchers from C++17 are a new way to perform efficient pattern lookups. The new standard offers three searchers: default_searcher, boyer_moore_searcher and boyer_moore_horspool_searcher. The last two implements algorithms that require some additional preprocessing for the input pattern. Is there a chance to separate preprocessing time from the search time?

Short Reminder

In my last article I’ve introduced searchers that were added into C++17.

Quoting the standard:

Searchers provides function object types for operations that search for a sequence [pat_first, pat_last) in another sequence [first, last) that is provided to the object’s function call operator. The first sequence (the pattern to be searched for) is provided to the object’s constructor, and the second (the sequence to be searched) is provided to the function call operator. Each specialization of a class template specified in this subclause shall meet the CopyConstructible and CopyAssignable requirements.

template<class ForwardIterator, class Searcher>
ForwardIterator search( ForwardIterator first, ForwardIterator last,
                        const Searcher& searcher );

For now we have three searchers:

default_searcher
boyer_moore_searcher
boyer_moore_horspool_searcher

Last time, however, I didn’t correctly summarise what a searcher is. This is because it’s not immediately clear - if you just look at std::search reference.

The basic idea is that each Searcher wraps the pattern you’d like to search. That means also doing some necessary preprocessing. Later - inside std::search - each searcher exposes operator()(first, last) - a way to look for a pattern into the [first, last) range.

Additional, since searcher is copyable and assignable - you can pass it around in your application.

Since a searcher is a separate object, we might do a little experiment and measure how much time it takes… let’s see.

The Series

This article is part of my series about C++17 Library Utilities. Here’s the list of the other topics that I’ll cover:

Resources about C++17 STL:

C++17 In Detail by Bartek!
C++17 - The Complete Guide by Nicolai Josuttis
C++ Fundamentals Including C++ 17 by Kate Gregory
Practical C++14 and C++17 Features - by Giovanni Dicanio
C++17 STL Cookbook by Jacek Galowicz

Demo Application

Source code: github.com/fenbf/articles/cpp17/searchers/searchers.cpp

How the test works:

the app loads a file, like a book sample - 500kb of text.
the whole file content is stored in one input string.
a pattern is selected
- you can look for a string
- or for N characters from the input string (from start, centre or the end)
the app uses several algorithms and runs each search ITER times.

Command line:

searcher.exe file iterations N|string Pos
file - text file to load
iterations - the number of iterations
N|string - number of letters or a given string
Pos - optional parameter when N is specified:
    0 - the start of the input string
    1 - the centre of the input string
    > 1 - end of the input string

For example:

.\searchers.exe ..\..\SampleBooks\book-test.txt 1000 "the town"

The above command will look for “the town” string in the input file “book-test.txt” and will perform 1000 iterations.

Another command:

.\searchers.exe ..\..\SampleBooks\book-test.txt 1000 10 1

This will look for 10 characters from the centre (pos=1).

Here’s the code for the boyer_moore_horspool version:

Searcher Preprocessing

In the first version of the demo application I used code:

RunAndMeasure("boyer_moore_horspool_searcher", [&]() {
    for (size_t i = 0; i < ITERS; ++i)
    {
        auto it = std::search(testString.begin(), testString.end(),
            std::boyer_moore_horspool_searcher(
                needle.begin(), needle.end()));
        if (it == testString.end())
            std::cout << "The string " << needle << " not found\n";
    }
});

The above code measured the whole search. But now we can split it, or extract the preprocessing phase.

For example:

RunAndMeasure("boyer_moore_searcher init only", [&]() {
    for (size_t i = 0; i < ITERS; ++i)
    {
        std::boyer_moore_searcher b(needle.begin(), needle.end());
        DoNotOptimizeAway(&b);
    }
    return 0;
});

All data structures must be initialised in the constructor of the searcher objects. Later only operator() is used to perform the search.

Some Performance Results

Here’s what I got from running a few tests.

.\searchers.exe ..\..\SampleBooks\book-test.txt 1000 1000 1
string length: 547412
test iterations: 1000
needle from the center...
pattern length: 1000
string::find: 207.235 ms
default searcher: 336.469 ms
boyer_moore_searcher init only: 4.65379 ms
boyer_moore_searcher: 33.383 ms
boyer_moore_horspool_searcher init only: 0.926099 ms
boyer_moore_horspool_searcher: 31.652 ms

When searching for 1000 letters from the centre of the input string, both of the new algorithms were faster than the default searcher and string::find. boyer_moore uses more time to perform the initialisation than boyer_moore_horspool (it creates two lookup tables, rather than one, so it will use more space and preprocessing). But it looks like boyer_moore search time is a bit faster: 33ms - 4.6ms vs 31.6 - 0.92ms.

The cost of preprocessing in boyer_moore is more visible if you make the pattern even larger:

.\searchers.exe ..\..\SampleBooks\book-test.txt 1000 10000 1
string length: 547412
test iterations: 1000
needle from the center...
pattern length: 10000
string::find: 154.501 ms
default searcher: 291.107 ms
boyer_moore_searcher init only: 104.912 ms
boyer_moore_searcher: 126.098 ms
boyer_moore_horspool_searcher init only: 6.35085 ms
boyer_moore_horspool_searcher: 25.0702 ms

104ms vs 6ms!

How about more realistic searchers and patterns. It’s probably quite rare to look for 1000 letters…

.\searchers.exe ..\..\SampleBooks\book-test.txt 1000 "the town"
string length: 547412
test iterations: 1000
needle is a string...
pattern length: 8
string::find: 32.6093 ms
default searcher: 57.8666 ms
boyer_moore_searcher init only: 0.423179 ms
boyer_moore_searcher: 22.0527 ms
boyer_moore_horspool_searcher init only: 0.288173 ms
boyer_moore_horspool_searcher: 21.9978 ms

When looking for “the town” (appears in line 711 out of 9469 line). The preprocessing seems to be super fast, and the new algorithms could beat the string::find version.

If the string is longer and positioned in near the end of the file:

.\searchers.exe ..\..\SampleBooks\book-test.txt 1000 "This Web site
 includes information about Project"
string length: 547412
test iterations: 1000
needle is a string...
pattern length: 48
string::find: 60.324 ms
default searcher: 408.87 ms
boyer_moore_searcher init only: 0.670692 ms
boyer_moore_searcher: 125.899 ms
boyer_moore_horspool_searcher init only: 0.326464 ms
boyer_moore_horspool_searcher: 127.477 ms

Here, when looking for “This Web site includes information about Project” - which is located at the end of the file (a single occurrence) Boyer-Moore algorithms are 2x slower than string::find.

As usual, I encourage you to perform your own tests.

Summary

In this post, I wanted to stress that each searcher can perform some initialisation in its constructor. What’s more, searchers can be initialised once and then passed around in the application - might be useful when you search for the same pattern over and over.