Sponsors: KDAB and Whole Tomato Software

03 July 2017

Randomizing lines from a file, a tool in VC++ 2017

Writing a utility tool in VC++ 2017

Is C++ well suited for writing fast small utilities/tools?

Let’s see:

For my recent giveaway I needed a tool that would take an input file - CSV with data and then draw a few winners from all of the entries. To make things more complicated each entry might have a different weight.

Read more for the full description, my solution and experiments.

The Problem

In short:

I have all of the entries to the giveaway (in a file), I need to pick (randomly) 3 winners.

More details:

All of the entries can be exported into CSV. The file has the following structure:

Name Email Entries Actions Date
ABC ABC@abc.com 10 5 2017 May
XYZ XYZ@xyz.com 1 1 2017 June

The field Entries is actually a weight of a user. 1 is the default. If you see weight like 10 it means that the user is 10x likely to win than a user with entries = 1.

How to randomize such structure and pick the winners?

For a start we can load all of the lines (skip the header), then randomize/shuffle the entries and just look at first N (in our case 3) top entries.

The problem is the weight parameter.

In our case, It might be relatively easy as the weight is not a double or float… it’s just an integer value. What we can do is to duplicate the entries as many times as the eight param suggests.

For example:

If there’s a weight of 2 we need to have 2 copies of a given entry.

Then we can shuffle such structure and then users with weights > 1 should have bigger chance to win.

The Solution

I don’t know about you, but I usually didn’t memorize code to load/process a file… but I remembered where could I get some basic parts for my project.

Some time ago there was a post from Meeting C++: Randomizing a CSV File with Standard C++.

Sounds like a good start for me… right?

I didn’t hesitate to take some parts of it and I started my project.

The repo: fenbf/RandomMachine.

As you might expect, it’s not a huge project! I’m not using any super advanced data structures, class hierarchies or complex systems. The tool should work and should be relatively fast to write.

What’s the overall structure?

Let’s have a look at main:

int main(int argc, const char *argv[])
{
    try
    {
        const auto inputParams = ReadInputParams(argc, argv);

        const auto vecLineEntries = ReadAllLines(inputParams);

        auto vecIdWithWeights = BuildFromEntries(vecLineEntries);

        ShuffleWithMT19937(begin(vecIdWithWeights), end(vecIdWithWeights));

        const auto vecWinners = DrawWinners(vecLineEntries,
            vecIdWithWeights, 
            inputParams.numElementsToPick);

        ShowWinners(vecWinners, vecLineEntries, inputParams);
    }
    catch (...)
    {

    }

    return 0;
}

Core parts:

  • It’s a command line tool, there are several parameters:
    • inputFileName
    • UsersToPick
    • LinesToSkip
    • Column ID of the weight param
    • Column separator
    • I know that I should be using Boost.ProgramOptions, but this time I wrote my own simple parsing.
  • ReadAllLines will open a file and parse it. It produces a vector of all entries. Each entry has a string - with the line text and also a weight param (by default it’s one).
  • Based on entries we build an additional index vector. Here each line entry is expanded/duplicated based on the weight param. But it’s only id, not the full copy.
    • For example if there are two entries, with weights of 2 and 3, we’ll get something like {0, 0, 1, 1, 1 }.
  • The index vector is then shuffled
  • After shuffling we can just look at the top N entries. But we need to take care of drawing only unique winners. Due to the weight it might happen that we pick the same user twice… or more. So if that happens we just look at another entry.
    • As noted in one of the comments: shuffling is probably not optimal solution. A better way would be to just pick one element at random, then mark it and then draw another one (until I reach number of winners). It doesn’t matter when number of lines/winners are relatively small (like up to 100), but when number of winners (elements to pick) is small and number of lines is larger then that’s much better choice.
  • After we draw from the collection, we just have to present it.
  • Simplified error handling - using exceptions - is added.
  • I tired to be const correct and used const whenever possible: Please declare your variables as const.

Interesting Parts

In terms of details let’s see how the final function of drawing is built:

vector<LineEntry> 
DrawWinners(const vector<LineEntry>& vecInputLines, 
            const vector<int>& vecWeightedIndices, 
            unsigned int numElementsToPick)
{
    unsigned int winnersFound = 0;
    std::set<int> setWinners;
    std::vector<LineEntry> outVec;

    for (unsigned int i = 0; 
        winnersFound < numElementsToPick && i < vecWeightedIndices.size(); 
        ++i)
    {
        const auto index = vecWeightedIndices[i];
        const auto &entry = vecInputLines[index];

        if (setWinners.find(index) == setWinners.end())
        {
            setWinners.insert(index);

            outVec.push_back(entry);

            winnersFound++;
        }
    }

    return outVec;
}

So the above code is responsible for drawing top N entries using randomized index vector. The shuffling part is done before the call to the function. The only little complication is to avoid duplicates of winners. I am using a separate set to mark if a entry is already a winner or not.

Then we just need to output the selected lines.

What are other interesting parts in terms of C++ and Visual Studio?

Modern C++

What’s used from modern C++?

  • auto wherever possible
  • non static data member initialization
  • uniform initialization
  • random: std::random_shuffle is deprecated in C++14 - Meeting C++, and since I got that randomizing code from Meeting C++ it already used mt19937. The only thing I did was to wrap shuffling into a simple template function:
template <typename ItRandom> 
void ShuffleWithMT19937(ItRandom itFirst, ItRandom itLast)
{
    std::random_device rd;
    std::mt19937 g(rd());
    std::shuffle(itFirst, itLast, g);
}
  • string_view - have a look at this great post: string_view | Growing up. I was able to use a few string_views across the whole code. But I need to get used to it.
    • So far I’ve noticed that there’s a problem with numeric conversions directly from a string_view. So I needed to make a copy of the string first and then do the conversion (like calling std::atoi).
  • For each loops.
  • move semantics, returning by value, not by output ref/ptr parameter (with also a chance of using Copy Elision).

Visual Studio 2017

With Visual Studio 2017 it’s really easy to write such code. The whole IDE works just better, faster. A lot of tool - even basic refactoring - are there.

For more about VS you can read:

I was happy to see that writing unit tests for native code is as simple as for managed languages. The native unit testing framework makes life so much easier. It just works!

Unit testing native code, Visual Studio 2017

Todo / Experiments

Want to know the best about such pet projects?

You can experiment with it!

How about adding Modules?

In Visual Studio 2017 there’s already early module support. See here Using C++ Modules in Visual Studio 2017 | Visual C++ Team Blog. There’s std.core that brings the Standard Library, so that should work with my sample.

What’s more to add?

I definitely need to add more unit tests, as currently maybe like 50% of code is covered. And not all edge cases are included. Native unit testing framework is really super simple to use.

Soon VC++ 2017.3 will be released (there already preview), and we should get the following big features from C++17:

  • structured bindings
  • guaranteed copy elision
  • constexpr if-statements
  • Selection statements with initializers

For more look here: C++17 Features In Visual Studio 2017 Version 15.3 Preview

It would be great to use structured bindings and selection statements with initializer, just to see how they work in such simple code.

Possibly, if I try hard, I could even came up with an example for constexpr-if.

Any suggestions how to improve my amazing project? :)

Summary

I hope you already know that C++ is also good for writing small utilities.

Maybe such project would be simpler or smaller in Python or C#? I don’t know… but I don’t expect to see a huge difference. I didn’t use explicit memory management, only standard containers, basic exception handling. So the whole app should be quite safe and won’t leak.

Do you write simple tools in C++ or use some different language?

Any suggestions how I could refactor improve the code?

Maybe you have a better solution?

Interested in new blog posts and bonus content? Sign up for my newsletter.

© 2017, Bartlomiej Filipek, Blogger platform
Any opinions expressed herein are in no way representative of those of my employers.
This site contains ads or referral links, which provide me with a commission. Thank you for your understanding.