Parallel STL And Filesystem: Files Word Count Example

Table of Contents

Last week you might have read about a few examples of parallel algorithms. Today I have one more application that combines the ideas from the previous post.

We’ll use parallel algorithms and the standard filesystem to count words in all text files in a given directory.

The Case

In my previous post, there were two examples: one with iterating over a directory and counting the files sizes and the next one about counting words in a string. What would happen if we joined those two samples?

We can also play with execution policies and test if std::execution::par gives a performance advantage over the sequential version.

The General Idea

The application does the following:

Gets the input parameters from the command line: directory parallel:1:0 (printsizes)
It will find all TXT files in a directory (recursively)
Then it will work on the selected files and count words in each file.
The sum of all words will be presented at the end and optionally (if the third command line argument is passed) the list of paths and their corresponding word count will be shown.
The parallel argument is used to determine if the app will use sequential execution policy or parallel.
The app will also print some timings for the steps.

The pseudo code:

params = ReadInputFromCommandLine();
files = FindFiles(params);
wordsCounts = CountWords(files, params)
Present(wordsCounts);

Please notice that while each step might use parallelism to execute their internal tasks, there are “sync points” between the major steps. In my initial implementation, FindFiles must finish before CountWords can start. Such approach might not be the best but was easier to start with.

Gathering All Text Files

The sequential version is relatively simple:

std::vector<std::filesystem::path> paths;

std::filesystem::recursive_directory_iterator dirpos{ root };

std::copy_if(begin(dirpos), end(dirpos), 
    std::back_inserter(paths), 
    [](const std::filesystem::path& p) {
    if (std::filesystem::is_regular_file(p) && p.has_extension())
    {
        auto ext = p.extension();
        return ext == std::string(".txt");
    }

    return false;
});

The above code iterates over the directory and then appends a path when it checks that it is a text file.

For the parallel version I had one obstacle:

In MSVC (VS 2017 15.7.4), std::copy_if has no parallel implementation for such directory iterator (copy_if only supports random access iterators), so I had to write my custom version.

std::vector<std::filesystem::path> paths;
std::vector<std::filesystem::path> output;

std::filesystem::recursive_directory_iterator dirpos{ root };

std::copy(begin(dirpos), end(dirpos), std::back_inserter(paths));

std::mutex mut; // we need some blocking mechanism for the output...

std::for_each(pol, std::begin(paths), std::end(paths), 
    [&output, &mut](const std::filesystem::path& p) {
    if (std::filesystem::is_regular_file(p) && p.has_extension())
    {
        auto ext = p.extension();
        if (ext == std::string(".txt"))
        {
            std::unique_lock<std::mutex> lock(mut);
            output.push_back(p);
        }
    }
});

I’m using a two-step approach: firstly I collect all the paths and then I’m filtering out the entries that are not TXT files.

The code uses a mutex in the case when it pushes one more element to the output vector. This is probably not the best approach from the performance perspective.

Counting Words

When we have all the paths, then we can iterate over them and count words in each file.

To hold the results I’m using a separate vector std::vector<FileAndWordCount> filesWithWordCount

The core code:

allFilesWordCount = std::transform_reduce(
    pol, // policy: par, seq or par_unseq...
    filesWithWordCount.begin(), filesWithWordCount.end(),  
    std::uintmax_t{ 0 },         // start value        
    std::plus<>(),                // acumulate
    [](FileAndWordCount& p) {
        const auto str = GetFileContents(p.path);
        p.wordCount = CountWords(str, std::execution::par);
        return p.wordCount;
    }
);

Each task might be run in parallel and the code reads all the text from a file into one string and then runs CountWords on the given string. It uses the same algorithm as from the last post.

Warning: it might be another point for the refactoring. Why not use std::vector<FileAndWordCount> from the beginning and not waste time for transforming vector<path> into std::vector<FileAndWordCount>.

Performance Results

While I know that the code is not written in the optimal way I still see some perf boost compared to the sequential version.

One invocation over small files (10…15kb each).

.\FileWordCount.exe .\temp\ 0
Using SEQ Policy
filtering only TXT files sequential: 0.633585 ms
number of files: 60
computing the sizes: 6.82179 ms
word count of 60 TXT files: 52872

.\FileWordCount.exe .\temp\ 1
Using PAR Policy
gathering all the paths: 0.247118 ms
number of all files: 68
filtering only TXT files: 0.37423 ms
number of files: 60
computing the sizes: 1.50521 ms
word count of 60 TXT files: 52872

For 68 files (60 that are TXT) I got 1.5ms for PAR and 6,8ms for SEQ version!

And another test - reading 40 books from Gutenberg Project:

.\FileWordCount.exe ..\GutenbergBooks\ 0
Using SEQ Policy
filtering only TXT files sequential: 0.361597 ms
number of files: 40
computing the sizes: 29.1121 ms
word count of 40 TXT files: 1726386

.\FileWordCount.exe ..\GutenbergBooks\ 1
Using PAR Policy
gathering all the paths: 0.097899 ms
number of all files: 40
filtering only TXT files: 0.302384 ms
number of files: 40
computing the sizes: 17.3274 ms
word count of 40 TXT files: 1726386

This time the whole directory contains around 10MB of text files.

And I got 17ms for the PAR version and 29ms for SEQ.

Your results might be different! I’m using a Quad Core i7 Laptop with SSD.

Summary

With the ease of use of Parallel STL and Filesystem I could quite easily assemble an application that does the word counting task efficiently. As you see, I didn’t spend a lot of time to polish the code and the design, yet for small utilities that might be good enough. And what’s more: all code comes from only STL without any third party code!

You can find all the code in my repo:

github/fenbf/ParSTLTests

And the file with this example is:

FileWordCount.cpp

I’m curious what’s your ideas for the same use case? How would you improve the code?

There are several points where we could improve the code:

Find an optimal way to count words in a file: load its contents at once as string (not suitable for larger files), or read chunks at a time.
For example Instead of collecting paths and filtering them and then starting the whole process I could work on those files in parallel (without any sync point).
Compare it with OS version like WinApi for reading files and distributing tasks.
Error handling

I’m happy to see your ideas and modifications!