Copy-On-Write Base Class Template

This tip presents a really simple Copy-On-Write (COW) class template, meant to be used as a base class, to reduce the effort of implementing a COW class. Copy-On-Write, as its name implies, is a technique that duplicates (copies) a shared resource upon write; the only exception is when there is just one existing copy that is not shared among multiple instances, in which case no duplication is required. Qt 5 and earlier versions make extensive use of COW in their framework because they predate C++11. COW can boost performance when used judiciously. Our helper class, COWBase, has four methods: construct(), to be called in the derived class constructor; clone_if_needed(), to be called in derived-class setters and modifying functions before any other code; ptr(), which returns the raw internal pointer to T; and use_count(), which returns the number of instances sharing the object.

#include <memory> // for std::shared_ptr and std::make_shared

template<typename T>
class COWBase
{
protected:
    // call this function in the derived class constructor
    void construct();

    // call this function in the derived class's setters before other code
    void clone_if_needed();
    
    // functions to get the internal raw pointer
    const T* ptr() const;
    T* ptr();
    
    // returns the use count of the shared_ptr
    long use_count() const;
private:
    std::shared_ptr<T> m_ptr;
};

construct() calls make_shared() to instantiate the shared_ptr.

// call this function in derived constructor
void construct() {
    m_ptr = std::make_shared<T>();
}

clone_if_needed() checks whether the shared_ptr is shared by more than one instance before making a new copy.

// call this function in derived's setter before other code
void clone_if_needed() {
    if (m_ptr.use_count() > 1) {
        std::shared_ptr<T> old = m_ptr;
        construct();
        *m_ptr = *old; // copy the old contents to new ptr.
    }
}

ptr() returns the raw pointer to T, in const and non-const versions.

// function to get the internal raw ptr
const T* ptr() const {
    return m_ptr.get();
}
T* ptr() {
    return m_ptr.get();
}

use_count() returns the number of shared_ptr instances sharing the object.

// returns count of the shared_ptr instance
long use_count() const {
    return m_ptr.use_count();
}

Example of the Inherited Class

We’ll implement a mock TextBox for our example. TextBoxImpl holds the data for TextBox and has two members: a title and a color.

struct TextBoxImpl
{
    std::string m_Title;
    int m_Color;
};

TextBox inherits privately from COWBase. For those not acquainted with private inheritance: the public inheritance most developers are used to models an is-a relationship, while private inheritance models an implemented-in-terms-of relationship. A privately inheriting class can still access the public and protected members of its base class as usual; the difference is that users of the derived class can neither call the base class methods nor access its data members, hence the word private. If the reader is keen to read more about private inheritance, see the excellent resource on this topic: C++ Tutorial Private Inheritance. Back to TextBox: its functions include two constructors, two getters with two corresponding setters, and a Display() method to send the contents to the console.

#include "COWBase.h"

class TextBox : private COWBase<TextBoxImpl>
{
public:
    TextBox();
    TextBox(const std::string& title, int color);
	
    int GetColor() const;
    const std::string& GetTitle() const;

    void SetColor(int color);
    void SetTitle(const std::string& title);
	
    void Display();
};

The default constructor just calls COWBase's construct(), while the non-default constructor calls construct() and then initializes TextBoxImpl's members.

TextBox() {
    construct();
}
TextBox(const std::string& title, int color) {
    construct();
    ptr()->m_Title = title;
    ptr()->m_Color = color;
}

For the getters, it is business as usual: the members are accessed through ptr().

int GetColor() const 
{ return ptr()->m_Color; }

const std::string& GetTitle() const 
{ return ptr()->m_Title; }

Every setter or modifying function must call clone_if_needed() before anything else.

void SetColor(int color) {
    clone_if_needed();
    ptr()->m_Color = color;
}
void SetTitle(const std::string& title) {
    clone_if_needed();
    ptr()->m_Title = title;
}

Display() prints use_count(), the title and the color.

void Display() {
    std::cout << "use_count:" << use_count() 
              << ", Title: " << ptr()->m_Title 
              << ", Color: " << ptr()->m_Color 
              << "\n";
}

Usage of the Inherited Class

We create a TextBox object named a and copy it into b.

TextBox a("Hello", 1);
a.Display();

TextBox b = a;
b.Display();

After a is copied into b, use_count() increases to 2.

use_count:1, Title: Hello, Color: 1
use_count:2, Title: Hello, Color: 1

After b changes its color to 2, its use_count() drops to 1. After b then changes its title to "world", use_count() is still 1 because b already has its own unshared copy.

b.SetColor(2);
std::cout << "\nAfter setting color on b\n";
b.Display();

b.SetTitle("world");
std::cout << "\nAfter setting title on b\n";
b.Display();

The expected output is as follows.

After setting color on b
use_count:1, Title: Hello, Color: 2

After setting title on b
use_count:1, Title: world, Color: 2

I hope this tutorial has convinced you how much this COWBase class can help when writing a COW class in C++. The code in this post is hosted at Github. Happy writing your COW class!

Optimization Turns Out to Be Pessimization!

Benchmark is hosted at Github

I have been doing this optimization in my hobby OpenGL projects for years. My only sin is that I never measured its performance; I just assumed the version with fewer operations must be the faster one. Whenever a programmer wants to take a fraction of a maximum integer value, he/she casts the numerator and denominator to float, does the division, multiplies the result by the max value, and then casts the result back to an integer. See below.

// With float conversion
int value = (int)(((float)numerator / denominator) * max_value);

Being the smart ass I am, I take the product of the max value and the numerator before dividing by the denominator. That way, I can get away without casting to float and back. But with this method, care must be taken that the product of the max value and the numerator never exceeds the maximum limit of the integer!

// Without float conversion
int value = max_value * numerator / denominator;

To my dismay, I discovered today that the version with float conversion is the faster one. The culprit could be that integer division is several times slower than float division. See Floating Point and Integer Arithmetic Benchmark.

It turns out that the float-conversion version is, in most cases, even faster than integer division by a constant literal divisor.

// Without float conversion with constant divisor
int value = max_value * numerator / 1000;

The benchmark results are as follows.

C# 7, .NET Framework 4.6.1
===========================
                             With float conversion timing: 1827ms
                          Without float conversion timing: 2326ms
    Without float conversion with constant divisor timing: 1813ms

G++ 7.4.0, -O3
===========================
                             With float conversion timing:  298ms
                          Without float conversion timing: 2067ms
    Without float conversion with constant divisor timing:  869ms

Clang++ 6.0.0, -O3
===========================
                             With float conversion timing:  273ms
                          Without float conversion timing: 1169ms
    Without float conversion with constant divisor timing:  470ms

VC++, update 16.4, /Ox
===========================
                             With float conversion timing:  951ms
                          Without float conversion timing: 2087ms
    Without float conversion with constant divisor timing: 1022ms

The lesson here is to always measure, measure and measure!

As a point of interest, the Windows SDK provides MulDiv(), which closely mirrors the underlying assembly instructions: it multiplies two 32-bit values and then divides the 64-bit intermediate result by a third 32-bit value.

int MulDiv(
  int nNumber,
  int nNumerator,
  int nDenominator
);
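
For example, a rough drop-in for the integer-only expression above (Windows only): because the intermediate product is kept in 64 bits, the overflow caveat mentioned earlier goes away. Note that MulDiv() rounds to the nearest integer rather than truncating, so results can differ by one from the expressions above.

#include <Windows.h>

// 64-bit intermediate product, so max_value * numerator cannot overflow
int value = MulDiv(max_value, numerator, denominator);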

C++17: codecvt_utf8 is deprecated

Last week, I got this compilation error after upgrading my Visual C++ project to C++17; the project uses codecvt_utf8 to convert back and forth between UTF-8 strings and wchar_t strings.

error C4996: ‘std::codecvt_utf8<wchar_t,1114111,0>’: warning STL4017: std::wbuffer_convert, std::wstring_convert, and the <codecvt> header (containing std::codecvt_mode, std::codecvt_utf8, std::codecvt_utf16, and std::codecvt_utf8_utf16) are deprecated in C++17. (The std::codecvt class template is NOT deprecated.) The C++ Standard doesn’t provide equivalent non-deprecated functionality; consider using MultiByteToWideChar() and WideCharToMultiByte() from <Windows.h> instead. You can define _SILENCE_CXX17_CODECVT_HEADER_DEPRECATION_WARNING or _SILENCE_ALL_CXX17_DEPRECATION_WARNINGS to acknowledge that you have received this warning.

The workaround, as stated in the error message, is to define _SILENCE_CXX17_CODECVT_HEADER_DEPRECATION_WARNING or _SILENCE_ALL_CXX17_DEPRECATION_WARNINGS. The offending code in question is below.

#include <locale>
#include <codecvt>

std::string utf8_encode(const std::wstring& source)
{
	return std::wstring_convert<std::codecvt_utf8<wchar_t>>().to_bytes(source);
}

std::wstring utf8_decode(const std::string& source)
{
    return std::wstring_convert<std::codecvt_utf8<wchar_t>>().from_bytes(source);
}

Or, if you can live with platform-specific code or are coding exclusively for the Windows platform, you can use MultiByteToWideChar() and WideCharToMultiByte() as suggested in this StackOverflow post.

// Convert a wide Unicode string to an UTF8 string
std::string utf8_encode(const std::wstring &wstr)
{
    if (wstr.empty()) return std::string();
    int size_needed = WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), NULL, 0, NULL, NULL);
    std::string strTo(size_needed, 0);
    WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), &strTo[0], size_needed, NULL, NULL);
    return strTo;
}

// Convert an UTF8 string to a wide Unicode String
std::wstring utf8_decode(const std::string &str)
{
    if (str.empty()) return std::wstring();
    int size_needed = MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), NULL, 0);
    std::wstring wstrTo(size_needed, 0);
    MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), &wstrTo[0], size_needed);
    return wstrTo;
}

For portable code, you can use the UTF8-CPP library by Nemanja Trifunovic. Note: we have to check the _MSC_VER macro to detect the Microsoft compiler (and hence Windows), because on Windows wchar_t is UTF-16 while on other platforms such as Linux and macOS wchar_t is UTF-32! To get this code example to compile successfully, an old version of UTF8-CPP has to be used instead of the latest version listed above.

// Convert a wide Unicode string to an UTF8 string
std::string utf8_encode(const std::wstring &wstr)
{
    std::string utf8line;

    if (wstr.empty())
        return utf8line;

#ifdef _MSC_VER
    utf8::utf16to8(wstr.begin(), wstr.end(), std::back_inserter(utf8line));
#else
    utf8::utf32to8(wstr.begin(), wstr.end(), std::back_inserter(utf8line));
#endif
    return utf8line;
}

// Convert an UTF8 string to a wide Unicode String
std::wstring utf8_decode(const std::string &str)
{
    std::wstring wide_line;

    if (str.empty())
        return wide_line;

#ifdef _MSC_VER
    utf8::utf8to16(str.begin(), str.end(), std::back_inserter(wide_line));
#else
    utf8::utf8to32(str.begin(), str.end(), std::back_inserter(wide_line));
#endif
    return wide_line;
}

Real Case of Premature Micro-Optimization

The benchmark is hosted at Github

I discovered a real case of premature micro-optimization when you don't measure. You know it is pretty bad when you read premature and micro in the same sentence. On page 44 of the Optimizing C++ ebook by Agner Fog, it is written ...

In some cases it is possible to replace a poorly predictable branch by a table lookup. For example:

float a;
bool b;
a = b ? 1.5f : 2.6f;

The ?: operator here is a branch. If it is poorly predictable then replace it by a table lookup:

float a;
bool b = 0;
const float lookup[2] = {2.6f, 1.5f};
a = lookup[b];

If a bool is used as an array index then it is important to make sure it is initialized or comes from a reliable source so that it can have no other values than 0 or 1. In some cases the compiler can automatically replace a branch by a conditional move.

I was trying to implement this lookup-table optimization (to replace the ternary operator) on a floating-point value; the code is compiled with G++ and run on Linux. I also ran the integer benchmark and tried other compilers such as Visual C++ 2019 and Clang, as well as Visual C# 7, to see their differences. Below is the C++ benchmark code. The lookup array is declared as a static local variable inside the function.

int64_t IntTernaryOp(bool value)
{
    return value ? 3 : 4;
}

int64_t IntArrayOp(bool value)
{
    static const int64_t arr[2] = { 3, 4 };
    return arr[value];
}

double FloatTernaryOp(bool value)
{
    return value ? 3.0f : 4.0f;
}

double FloatArrayOp(bool value)
{
    static const double arr[2] = { 3.0f, 4.0f };
    return arr[value];
}

int main()
{
    const size_t MAX_LOOP = 1000000000;
    
    int64_t sum = 0;
    double sum_f = 0;

    timer stopwatch;

    stopwatch.start("IntTernaryOp");
    sum = 0;
    for (size_t k = 0; k < MAX_LOOP; ++k)
    {
        sum += IntTernaryOp(k % 2);
    }
    stopwatch.stop(sum);

    stopwatch.start("IntArrayOp");
    sum = 0;
    for (size_t k = 0; k < MAX_LOOP; ++k)
    {
        sum += IntArrayOp(k % 2);
    }
    stopwatch.stop(sum);

    stopwatch.start("FloatTernaryOp");
    sum_f = 0;
    for (size_t k = 0; k < MAX_LOOP; ++k)
    {
        sum_f += FloatTernaryOp(k % 2);
    }
    stopwatch.stop(sum_f);

    stopwatch.start("FloatArrayOp");
    sum_f = 0;
    for (size_t k = 0; k < MAX_LOOP; ++k)
    {
        sum_f += FloatArrayOp(k % 2);
    }
    stopwatch.stop(sum_f);

    return 0;
}

In the Visual C# code, a static local variable is not permitted, so the lookup array is declared as a static member variable of the class. C# also does not allow a bool to be used as an array index (there is no implicit conversion to an integer), so I use an integer parameter instead.

static Int64[] arrInt64 = new Int64[] { 3, 4 };
static Double[] arrDouble = new Double[] { 3.0, 4.0 };

static Int64 IntTernaryOp(Int64 value)
{
    return (value==1) ? 3 : 4;
}

static Int64 IntArrayOp(Int64 value)
{
    return arrInt64[value];
}

static double FloatTernaryOp(Int64 value)
{
    return (value == 1) ? 3.0f : 4.0f;
}

static double FloatArrayOp(Int64 value)
{
    return arrDouble[value];
}

static void Main(string[] args)
{
    const int MAX_LOOP = 1000000000;

    Int64 sum = 0;
    double sum_f = 0.0;

    Stopwatch stopWatch = new Stopwatch();
    stopWatch.Start();
    sum = 0;
    for (Int64 k = 0; k < MAX_LOOP; ++k)
    {
        sum += IntTernaryOp(k % 2);
    }
    stopWatch.Stop();
    DisplayElapseTime("IntTernaryOp", stopWatch.Elapsed, sum);

    stopWatch = new Stopwatch();
    stopWatch.Start();
    sum = 0;
    for (Int64 k = 0; k < MAX_LOOP; ++k)
    {
        sum += IntArrayOp(k % 2);
    }
    stopWatch.Stop();
    DisplayElapseTime("IntArrayOp", stopWatch.Elapsed, sum);

    stopWatch = new Stopwatch();
    stopWatch.Start();
    sum_f = 0;
    for (Int64 k = 0; k < MAX_LOOP; ++k)
    {
        sum_f += FloatTernaryOp(k % 2);
    }
    stopWatch.Stop();
    DisplayElapseTime("FloatTernaryOp", stopWatch.Elapsed, sum_f);

    stopWatch = new Stopwatch();
    stopWatch.Start();
    sum_f = 0;
    for (Int64 k = 0; k < MAX_LOOP; ++k)
    {
        sum_f += FloatArrayOp(k % 2);
    }
    stopWatch.Stop();
    DisplayElapseTime("FloatArrayOp", stopWatch.Elapsed, sum_f);
}

These are the benchmark results on a computer with an Intel i7-8700 at 3.2GHz and 16GB RAM.

VC++ /Ox results
       IntTernaryOp:  562ms, result:3500000000
         IntArrayOp:  523ms, result:3500000000
     FloatTernaryOp: 3972ms, result:3.5e+09
       FloatArrayOp: 1030ms, result:3.5e+09
G++ 7.4.0 -O3 results
       IntTernaryOp:  306ms, result:3500000000
         IntArrayOp:  519ms, result:3500000000
     FloatTernaryOp: 1030ms, result:3.5e+09
       FloatArrayOp: 1030ms, result:3.5e+09
Clang++ 6.0.0 -O3 results
       IntTernaryOp:  585ms, result:3500000000
         IntArrayOp:  523ms, result:3500000000
     FloatTernaryOp: 1030ms, result:3.5e+09
       FloatArrayOp: 1030ms, result:3.5e+09
C# 7 Release Mode, .NET Framework 4.7.2
       IntTernaryOp: 1311ms, result:3500000000
         IntArrayOp: 1038ms, result:3500000000
     FloatTernaryOp: 2448ms, result:3500000000
       FloatArrayOp: 1036ms, result:3500000000

For the integer benchmark, the lookup table performs worse than the ternary operator with G++, while with Visual C++, Visual C# and Clang++ there is an improvement. As it turns out, for the floating-point benchmark I was actually after (with G++), an improvement is only seen with Visual C++ and C#, while G++ and Clang++ show the same timing for both the ternary and lookup-table code. I did check the assembly at the Godbolt Compiler Explorer: none of the ternary operators is converted to a conditional move as suggested by Agner Fog's book, so I guess the short branch is handled cheaply (well predicted) by the CPU, which is why it makes little difference in the G++ and Clang++ floating-point tests. The moral of the story is to never take a book's word at face value and to always measure to confirm your hypothesis. In my case, I forgo this lookup micro-optimization.

Intel Should Implement Conditional SIMD

The auto-vectorizer can be a great help in vectorizing simple loops with SIMD instructions.

int a[COUNT];
int b[COUNT];
int c[COUNT];

for(int i=0; i<COUNT; ++i)
{
    a[i] = b[i] + c[i];
}

But a simple conditional check is enough to break the auto-vectorizer.

for(int i=0; i<COUNT; ++i)
{
    // a conditional check breaks the auto-vectorizer
    if(b[i] > X && b[i] < Y)
        a[i] = b[i] + c[i];
}

I hereby propose that Intel add a new set of conditional SIMD instructions, whereby every SIMD instruction accepts an additional operand: a vector of booleans. Only where the boolean is true does the instruction carry out its operation for that lane. The original SIMD instruction set would be retained, so as not to break existing applications, but those calls would automatically be redirected to the conditional SIMD forms with a vector of booleans implicitly all set to true. This is to save silicon real estate on the chip.
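
Until something like that exists across the board, the conditional loop above can still be vectorized by hand with a compare-and-blend idiom: every lane computes the sum, and a mask decides whether the new value or the old value of a[i] is kept. Below is a rough SSE4.1 sketch of that idea (the function name is mine, and it assumes int elements with COUNT a multiple of 4); it is an illustration of the workaround, not the proposed instruction set.

#include <smmintrin.h> // SSE4.1 for _mm_blendv_epi8

void conditional_add_sse(int* a, const int* b, const int* c, int count, int X, int Y)
{
    const __m128i vx = _mm_set1_epi32(X);
    const __m128i vy = _mm_set1_epi32(Y);

    for (int i = 0; i < count; i += 4) // assumes count is a multiple of 4
    {
        __m128i vb = _mm_loadu_si128((const __m128i*)&b[i]);
        __m128i vc = _mm_loadu_si128((const __m128i*)&c[i]);
        __m128i va = _mm_loadu_si128((const __m128i*)&a[i]);

        // lane is all-ones where (b[i] > X && b[i] < Y), all-zeros otherwise
        __m128i mask = _mm_and_si128(_mm_cmpgt_epi32(vb, vx),
                                     _mm_cmplt_epi32(vb, vy));

        __m128i sum = _mm_add_epi32(vb, vc);

        // keep the old a[i] where the mask is false, take b[i] + c[i] where true
        __m128i result = _mm_blendv_epi8(va, sum, mask);
        _mm_storeu_si128((__m128i*)&a[i], result);
    }
}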

C++: Quick and Dirty Log Extraction and Replay

New Text Stream library is hosted at Github.

Log Extraction

Hackers can stay infiltrated and undetected for up to 200 days! Nowadays, hackers do not even bother to cover their tracks because most logs are not checked for unusual activities. This tip-trick aims to help change all that with the New Text Stream library. Unfortunately, I gave it the New Text name, which has stuck until now. Talk about bad naming! The New Text Stream library has write-read symmetry: the write and read operations are exact mirrors of each other. See the pseudo-code below. See the New Text Stream tutorial for more information.

// Writing
ostm << name << age;

// Reading
istm >> name >> age;

In this tip-trick, we are just going to focus on the input stream. For it to work, each piece of log information must not be broken across multiple lines, because the library only parses one line at a time. The timestamp at the beginning of each log line must also be removed first, because timestamps differ on every line and can never be matched. In the example below, the log records a new user registration activity, and UserName and DoB are extracted. match() has a second process parameter that defaults to true; set it to false if you are not sure whether the input line is going to match successfully, so as not to waste time processing (tokenizing) it. The matching is not foolproof: for example, if ", DoB=" happens to appear inside the UserName you are trying to extract, the second extraction (DoB) will fail. The principle behind it is simple: for the first input ({0}), the input stream tries to extract whatever text lies between "NEW_USER UserName=" and ", DoB=", and the last string extraction ({1}) is whatever text comes after ", DoB=". This method cannot extract a variable-length array; to work around that, extract the array as a single string and split the string on its delimiter.

try
{
    new_text::ifstream is;

    std::string log_line = "2020-01-16 18:23:56.020 NEW_USER UserName=Kelly, DoB=1999-01-02";
    log_line = log_line.substr(24); // get rid of the log timestamp
    is.str(log_line);
    if (is.match("NEW_USER UserName={0}, DoB={1}"))
    {
        std::string name = "";
        std::string date = "";
        is >> name >> date;
        std::cout << name << ":" << date << std::endl;
    }
}
catch (std::exception & e)
{
    std::cout << "Exception thrown:" << e.what() << std::endl;
}

The output is as follows.

Kelly:1999-01-02

Next, to extract the date into year, month and day with ease, you can use the matching code below. :t signals the library to trim the string afterwards, and :02 tells New Text Stream that the number always has 2 digits and may come with a leading zero, which is trimmed away. Read the New Text Stream tutorial for more information.

try
{
    new_text::ifstream is;

    std::string log_line = "2020-01-16 18:23:56.020 NEW_USER UserName=Kelly, DoB=1999-01-02";
    log_line = log_line.substr(24); // get rid of the log timestamp
    is.str(log_line);
    if (is.match("NEW_USER UserName={0}, DoB={1:t}-{2:02}-{3:02}")) // Parse date in YYYY-MM-DD
    {
        std::string name = "";
        int year = 0, month = 0, day = 0;
        is >> name >> year >> month >> day;
        std::cout << name << ":" << year << "-" << month << "-" << day << std::endl;
    }
}
catch (std::exception & e)
{
    std::cout << "Exception thrown:" << e.what() << std::endl;
}

The output is as follows.

Kelly:1999-1-2

Fuzzing

Fuzzing is a methodology for finding runtime problems or vulnerabilities by attempting to crash your application with unpredictable input on its attack surface. There are two types of fuzzing: black-box and white-box. Black-box fuzzing is truly random and the fuzzer is not aware of the data type being fuzzed, whereas white-box fuzzing knows the data type; for example, if it knows the field is a date, it tries to feed in an invalid date with a 13th month or a 30th of February. The number one frustration with fuzzing is that most of the time we do not have access to the random input that caused the crash, rendering the developer helpless to troubleshoot. We can log every random input and try to replay the commands with the input extracted using New Text Stream. Remember to clear all logs after a successful fuzzing run so that they do not accumulate.

Log Replay

When the application crashes at a customer site, we may not have the necessary information to reproduce the crash. If we log the commands and their input, we can replay them, provided the input is neither huge nor made of complex composite data types. We do not have to log every function call: say you have an engine with 10,000 functions and only 100 of them are publicly accessible, then you only need to implement logging for those 100, not all the functions. Log replay is much harder to implement for a GUI application that involves mouse clicking and keyboard typing. However, if the GUI has an underlying engine, and that engine can be compiled and linked to run standalone from the GUI, log replay can still work on the standalone engine to debug the issue. A minimal replay loop might look like the sketch below.
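
This sketch is purely illustrative: it assumes a hypothetical log file, a fixed 24-character timestamp prefix as in the examples above, and a made-up RegisterUser() function standing in for whatever engine call originally produced the log line. Only the New Text Stream calls already shown above (str(), match(), operator>>) are used, and the New Text Stream header include is omitted.

#include <fstream>
#include <iostream>
#include <string>

void replay_log(const char* path) // e.g. "commands.log" (hypothetical file name)
{
    std::ifstream file(path);
    std::string log_line;

    while (std::getline(file, log_line))
    {
        if (log_line.size() <= 24)
            continue; // too short to hold the timestamp prefix and a command

        new_text::ifstream is;
        is.str(log_line.substr(24)); // strip the fixed-width timestamp, as in the examples above
        if (is.match("NEW_USER UserName={0}, DoB={1}"))
        {
            std::string name, dob;
            is >> name >> dob;
            // RegisterUser(name, dob); // hypothetical engine call being replayed
            std::cout << "replaying NEW_USER " << name << ", " << dob << "\n";
        }
        // ... match and dispatch the other logged commands here
    }
}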


C++17: Benchmark of Singular Min/Max and Iterator Versions

The benchmark code is hosted at GitHub.

In this tip, we’ll benchmark hand-written comparison loops against the singular std::min()/std::max() and the iterator versions std::min_element()/std::max_element(). Lastly, we benchmark the combined operation std::minmax_element().

This is the benchmark code.

const size_t MAX_LOOP = (argc == 2) ? atoi(argv[1]) : 1000;

using int_type = uint64_t;
std::vector<int_type> vec(1000000);
std::iota(vec.begin(), vec.end(), 1);
std::random_device rd;
std::mt19937 g(rd());

std::shuffle(vec.begin(), vec.end(), g);
timer stopwatch;

stopwatch.start("manual min");
int_type min = vec[0];
for (size_t k = 0; k < MAX_LOOP; ++k)
{
    min = vec[0];
    for (auto n : vec)
    {
        if (n < min)
            min = n;
    }
}
stopwatch.stop(min);

stopwatch.start("std min");
min = vec[0];
for (size_t k = 0; k < MAX_LOOP; ++k)
{
    min = vec[0];
    for (auto n : vec)
        min = std::min(n, min);
}
stopwatch.stop(min);

stopwatch.start("std min_element");
min = vec[0];
for (size_t k = 0; k < MAX_LOOP; ++k)
{
    min = vec[0];
    min = * std::min_element(vec.cbegin(), vec.cend());
}
stopwatch.stop(min);

std::cout << std::endl;

stopwatch.start("manual max");
int_type max = vec[0];
for (size_t k = 0; k < MAX_LOOP; ++k)
{
    max = vec[0];
    for (auto n : vec)
    {
        if (n > max)
            max = n;
    }
}
stopwatch.stop(max);

stopwatch.start("std max");
max = vec[0];
for (size_t k = 0; k < MAX_LOOP; ++k)
{
    max = vec[0];
    for (auto n : vec)
        max = std::max(n, max);
}
stopwatch.stop(max);

stopwatch.start("std max_element");
max = vec[0];
for (size_t k = 0; k < MAX_LOOP; ++k)
{
    max = vec[0];
    max = *std::max_element(vec.cbegin(), vec.cend());
}
stopwatch.stop(max);

std::cout << std::endl;

stopwatch.start("manual min max");
for (size_t k = 0; k < MAX_LOOP; ++k)
{
    min = vec[0];
    max = vec[0];
    for (auto n : vec)
    {
        if (n < min)
            min = n;
        if (n > max)
            max = n;
    }
}
stopwatch.stop(min, max);

stopwatch.start("std min & max");
for (size_t k = 0; k < MAX_LOOP; ++k)
{
    min = vec[0];
    max = vec[0];
    for (auto n : vec)
    {
        min = std::min(n, min);
        max = std::max(n, max);
    }
}
stopwatch.stop(min, max);

stopwatch.start("std minmax_element");
std::pair<std::vector<int_type>::const_iterator, std::vector<int_type>::const_iterator> minmax;
for (size_t k = 0; k < MAX_LOOP; ++k)
{
    minmax = std::minmax_element(vec.cbegin(), vec.cend());
}
stopwatch.stop(*minmax.first, *minmax.second);

The benchmark is done on an Intel i7-8700 CPU at 3.2GHz. The Clang++ and G++ benchmarks are built and run on the Windows Subsystem for Linux (WSL, Ubuntu 18.04) on Windows 10 update 19.10. The timing is based on 1 billion iterations.

VC++ update 16.4, /Ox

From the Visual C++ results, we can see that the iterator versions are consistently slower than the hand-written loops and min()/max(). In other words, the iterator versions are not worth the trouble here.

         manual min:  592ms, result:1
            std min:  630ms, result:1
    std min_element: 1889ms, result:1

         manual max:  832ms, result:1000000
            std max:  585ms, result:1000000
    std max_element: 1891ms, result:1000000

     manual min max:  816ms, result:1,1000000
      std min & max:  762ms, result:1,1000000
 std minmax_element: 2143ms, result:1,1000000

Clang++ 6.0.0, -O3

From the Clang++ results, the iterator versions min_element() and minmax_element() perform better than the singular versions.

         manual min:  521ms, result:1
            std min:  419ms, result:1
    std min_element:  386ms, result:1

         manual max:  416ms, result:1000000
            std max:  445ms, result:1000000
    std max_element:  428ms, result:1000000

     manual min max:  701ms, result:1,1000000
      std min & max:  969ms, result:1,1000000
 std minmax_element:  514ms, result:1,1000000

G++ 7.4.0, -O3

For G++, we can just stick to the plain operators for min/max since the results come in neck and neck. From the results, minmax_element() is to be avoided since it performs worse than both the combined manual loop and the min()/max() loop.

         manual min:  801ms, result:1
            std min:  810ms, result:1
    std min_element:  808ms, result:1

         manual max:  566ms, result:1000000
            std max:  552ms, result:1000000
    std max_element:  555ms, result:1000000

     manual min max:  799ms, result:1,1000000
      std min & max:  796ms, result:1,1000000
 std minmax_element: 2052ms, result:1,1000000

C#: Never Test For NaN With Equality Test

Let's write a simple program to test this. To get a NaN, we compute 0/0 and store it in num. First of all, the divisor has to be a floating-point type, because a DivideByZeroException would be thrown for an integer or decimal zero divisor. Note: a non-zero number divided by floating-point zero gives infinity. In our program, we compare num to NaN, then compare num to itself, and then compare 2.0 to NaN. Lastly, we check whether num is NaN with Double.IsNaN().

double num = 0.0 / 0.0; // Result of 0 divided by 0 is a NaN

Console.WriteLine("num == NaN is {0}", (num == double.NaN));
Console.WriteLine("num == num is {0}", (num == num));
Console.WriteLine("2.0 == NaN is {0}", (2.0 == double.NaN));
Console.WriteLine("double.IsNaN(num) is {0}", double.IsNaN(num));

This is the output. All three NaN comparisons return False, while the IsNaN() check on num returns True. This is exactly the same behaviour as in C++. The advice for this tip: never test for NaN with an equality test; use IsNaN() instead.

num == NaN is False
num == num is False
2.0 == NaN is False
double.IsNaN(num) is True
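
For completeness, here is a rough C++ equivalent of the same check. A variable is used for the zero divisor so the division happens at run time, and std::isnan plays the role of Double.IsNaN().

#include <cmath>
#include <iostream>

int main()
{
    double zero = 0.0;
    double num = zero / zero;   // 0.0 / 0.0 yields a quiet NaN at run time

    std::cout << std::boolalpha;
    std::cout << "num == num is " << (num == num) << "\n";         // false: NaN never compares equal
    std::cout << "2.0 == num is " << (2.0 == num) << "\n";         // false
    std::cout << "std::isnan(num) is " << std::isnan(num) << "\n"; // true
    return 0;
}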

If the reader is interested to know more about the floating-point format, read my Succinct Guide to Floating Point Format For C++ and C# Programmers.

Shh…Your interviewer didn’t have the answers to his technical questions!

Advice on suitable types of technical questions to ask in an interview


Write code to sum up from 1 to 100

One has to be especially careful when the question is deceptively simple, for instance, summing up all the numbers from 1 to 100. A simple summing loop wouldn't suffice to satisfy your interviewer; he is looking for an answer with O(1) time complexity. Of course, I gave the same answer as the then eight-year-old Carl Friedrich Gauss, the math child prodigy.

1 + 2 + ... + n = n(n + 1) / 2, which for n = 100 gives 100 x 101 / 2 = 5050

The story has it that Carl Gauss's teacher was trying to take a rest from teaching, so he gave his class the problem of adding all the integers from 1 to 100. But Gauss spotted the pattern that pairing the first and last numbers always gives the same sum, which let him come up with a formula to calculate the total.

  1 + 100 = 101
  2 +  99 = 101
  3 +  98 = 101
  .
  .
  .
 48 +  53 = 101
 49 +  52 = 101
 50 +  51 = 101

 50 pairs x 101 = 5050

As it turns out, the Singapore education system teaches elementary school pupils this Gauss trick.

Let me ask the reader a question: does it mean every Singaporean, including me, who has been through the Singapore education system is as talented as Gauss? The bigger question is whether the interviewer came up with the answer independently or read it somewhere on the web. It wouldn't be fair to ask a question that you yourself are not capable of answering.

Surely, knowing the answer because you just learnt it from someone else does not legitimately make you an expert. Yet a not-so-small number of interviewers actually believe that if you give the Gauss answer, or use exclusive OR to swap two integers, your programming skills must be amazing. This is absurd to an extraordinary degree!

Swap 2 integers without a temporary variable

A common interview question where job candidates are asked to swap 2 integers without a third variable.

X := X XOR Y
Y := Y XOR X
X := X XOR Y

If you do the swap via a temporary variable and examine the assembly instructions at the Godbolt Compiler Explorer, you will notice that the compiler does not optimize the code into XOR swapping, because for XOR swapping to work there has to be a check that the two variables are not the same object (the trick zeroes the value when both operands alias the same location), and that check makes the execution slower than the plain temporary-variable swap. And XOR swapping is strictly for integers and nothing else: it cannot be used to swap pointers. It is not unreasonable to expect the questioner to research the usefulness of the question and its answer thoroughly beforehand.

Write code to perform addition without using the plus operator

Even an electronics graduate like me would be hard-pressed to recall the full-adder circuit and convert that circuit into code on the spot to answer this question. This question and the Lee algorithm are Google interview questions from the electronics domain.
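
For reference, the answer such a question is usually fishing for is the carry-propagation loop below, a software rendition of the full adder. This is just a sketch; unsigned is used so the shifts and wrap-around behaviour are well defined.

// Add two numbers using only bitwise operations:
// XOR produces the sum without carries, AND finds the carry bits,
// and the loop repeats until no carry is left to propagate.
unsigned add_without_plus(unsigned a, unsigned b)
{
    while (b != 0)
    {
        unsigned carry = a & b; // bit positions that generate a carry
        a = a ^ b;              // partial sum, carries ignored
        b = carry << 1;         // carries shift up to the next bit
    }
    return a;
}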

Only micromouse team members are well acquainted with the Lee algorithm, because micromouse robots use it for maze solving. The Lee algorithm is a very simple algorithm where the robot travels from a higher-value cell to a lower-value cell. Each cell's value is determined by taking the minimum value of its 4 neighbouring cells and incrementing it by one. The Lee algorithm is the simplest algorithm I have known in my life, but even it has intricacies only an implementer would know. In my honest opinion, narrow electronics-related questions should be avoided unless the organisation is within that industry.

To digress a little bit, for those readers who are interested in Lee algorithm, read my article!
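
For a flavour of how simple the wave expansion is, here is a rough sketch of the flood-fill step on a grid (the function name, the value conventions and the starting cell are all made up for illustration: -1 marks a wall, 0 marks an unvisited cell, and every reachable cell ends up holding its distance from the goal plus one).

#include <queue>
#include <utility>
#include <vector>

// After the fill, a robot can walk from any cell to the goal by always
// stepping to a neighbour holding a smaller value.
void lee_flood_fill(std::vector<std::vector<int>>& grid, int goal_row, int goal_col)
{
    const int rows = (int)grid.size();
    const int cols = (int)grid[0].size();
    const int dr[4] = { -1, 1, 0, 0 };
    const int dc[4] = { 0, 0, -1, 1 };

    std::queue<std::pair<int, int>> frontier;
    grid[goal_row][goal_col] = 1;
    frontier.push({ goal_row, goal_col });

    while (!frontier.empty())
    {
        auto [r, c] = frontier.front();
        frontier.pop();

        for (int k = 0; k < 4; ++k) // visit the 4 neighbours
        {
            int nr = r + dr[k], nc = c + dc[k];
            if (nr < 0 || nr >= rows || nc < 0 || nc >= cols)
                continue;           // outside the maze
            if (grid[nr][nc] != 0)
                continue;           // wall (-1) or already numbered
            grid[nr][nc] = grid[r][c] + 1;
            frontier.push({ nr, nc });
        }
    }
}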

Embedded system interview question: Bubble sort

If a job candidate is applying for an embedded system position in Singapore, more likely than not he will be asked to write a bubble sort function without pseudo code or instructions! I was admittedly stumped by this question. Why bubble sort, the slowest kind? Why not insertion sort or a more efficient algorithm like merge sort? Sometimes this is the only question on the test. I asked my tester how many people managed to answer it correctly. He said almost all, matter-of-factly. Obviously, every job applicant in the embedded sector memorises bubble sort by heart, just like some programmers memorise the XOR swap for interviews.

Admittedly, I was not familiar with what an embedded system programming job and its responsibilities entail; bubble sort could be the only feasible sort algorithm on a bare-metal system where every piece of memory has to be pre-allocated and the number of stored items to be sorted remains small. So it could be a legitimate question for an embedded programmer.

Many hiring managers graduated from school decades ago and have not kept abreast of the latest developments in software development. In my opinion, a coding test should be restricted to testing problem-solving skills, not rote memorization of information that can be easily googled, or syntax errors that can be caught by compilers.

What is a good example of a short question to ask?

After giving so many bad examples of questions to ask, what does a good one look like? My opinion is to ask a question with a short answer that tests problem-solving skills on the spot. Avoid questions found in interview books, as the candidate could have read the answers there, unless he gives a different but correct answer.

This is my example of a good question: write a custom add function that detects an overflow and throws OverflowException upon detecting it, without using the checked keyword.

uint add(uint a, uint b)
{
	const uint max_uint = 0xFFFFFFFF;
	if(b > (max_uint - a))
	    throw new OverflowException();
	return a + b;
}

This detection can be achieved in assembly code by doing the add operation and then checking the Carry Flag (for unsigned arithmetic) or the Overflow Flag (for signed arithmetic) in the FLAGS register afterwards.
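
On GCC and Clang, that flag check is exposed directly through a compiler builtin, so an equivalent C++ version can be sketched without writing assembly. This uses __builtin_add_overflow, which is compiler-specific rather than standard C++, and the function name is mine.

#include <stdexcept>

// Typically compiles down to an add followed by a branch on the carry flag.
unsigned add_checked(unsigned a, unsigned b)
{
    unsigned result;
    if (__builtin_add_overflow(a, b, &result)) // true if the addition wrapped around
        throw std::overflow_error("unsigned addition overflowed");
    return result;
}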

Another approach is to ask the candidate whether he knows a concept and have him explain it. Don't be surprised: many programmers with years of experience under their belt don't understand deadlocks or don't know what protected accessibility means.

Recap

Before the article ends, let us recap the dos and don'ts from the interview advice above.

  • Avoid asking questions so difficult that the technical interviewer is incapable of deriving the answer himself, and for which the chance of the candidate already knowing the answer, just like the interviewer, is high.
  • Research the question and its answer thoroughly before the interview.
  • Avoid narrow electronics-related questions unless the organisation is within that industry.
  • The question should test the problem-solving skills of the candidate.
  • The answer should not involve rote memorization of easily googled information, or syntax errors that can be caught by compilers.
  • Avoid questions found in interview books, as the candidate could have read the answers there.
  • Ask the candidate whether he knows a concept and have him explain it, to see how well he understands it.

No Code is the Fastest Code

Optimization of finding the point with the shortest distance w.r.t. a point of interest

The benchmark source code is hosted at Github.

Introduction

When it comes to optimization, the majority of developers turn to parallelism, but CPU cores aren't cheap. We could look at eliminating some operations to increase the per-thread performance first. That doesn't mean accuracy has to be sacrificed. With all other things being equal, the code with fewer operations will be more performant, hence the tip title: no code is the fastest code, since code that doesn't exist never needs to be executed.

Shortest Distance

We’ll use the task of finding the shortest distance as an example. The formula for the length of a line segment is used to find the distance between 2 points, as shown below.

distance = sqrt((x2 - x1)^2 + (y2 - y1)^2)

We have to compare distances and keep the shorter one. There is a useful property relating two non-negative numbers and their square roots, shown below: if the square root of D1 is less than the square root of D2, then D1 is also less than D2. We will make use of that property to forgo the square root calculation inside the loop.

sqrt(D1) < sqrt(D2)  <=>  D1 < D2,   for D1, D2 >= 0

Below is the C# code using the length-of-line formula, meaning it calls Math.Sqrt(), to find the nearest point with respect to dest. The C++ code is identical except that it calls std::sqrt() instead of Math.Sqrt(), so it is not shown here to avoid repetition.

double shortest = 10000000.0;
int shortest_index = 0;

for (int j = 0; j < list.Count; ++j)
{
    Point pt = list[j];
    double x = (pt.x - dest.x);
    double y = (pt.y - dest.y);
    x *= x;
    y *= y;
    double distance = Math.Sqrt(x + y); // calling sqrt
    if (distance < shortest)
    {
        shortest = distance;
        shortest_index = j;
    }
}

This is the C# code that finds the shortest distance without using Math.Sqrt() in the tight loop; Math.Sqrt() is still used once, outside the loop, to compute the actual distance after the shortest squared distance is found.

double shortest2 = 10000000.0;
int shortest_index2 = 0;

for (int j = 0; j < list.Count; ++j)
{
    Point pt = list[j];
    double x = (pt.x - dest.x);
    double y = (pt.y - dest.y);
    x *= x;
    y *= y;
    double distance = x + y; // no calling sqrt
    if (distance < shortest2)
    {
        shortest2 = distance;
        shortest_index2 = j;
    }
}
shortest2 = Math.Sqrt(shortest2);

Benchmark Results

The benchmark is done on a computer with these specs: Intel(R) Core i7-8700 @ 3.2GHz CPU with 16GB RAM. The executable is built in Release x64 mode. Only one thread is utilized. The outer loop is run 1 million times, while the inner loop iterates over the List (or vector in C++) of one thousand elements.

VC++ 2019 16.4 update, /Ox
============================
   With sqrt timing: 4855ms
Without sqrt timing: 1264ms

In Visual C++, we get a 74% speedup over the sqrt version. The G++ and Clang++ benchmarks are built and run in the Windows Subsystem for Linux (Ubuntu 18.04). It does seem that G++ and Clang++ have a more optimized sqrt() implementation than VC++'s: both compilers generate faster code than VC++, despite the WSL overhead of running on Windows.

Clang++ 6.0.0, -O3
============================
   With sqrt timing: 1830ms
Without sqrt timing: 1047ms

With Clang++, we get a 42% speedup.

G++ 7.4.0, -O3
============================
   With sqrt timing: 2388ms
Without sqrt timing: 1211ms

With G++, we get almost a 50% speedup, but its timing isn't as good as Clang++'s.

C# 7, .Net Framework 4.6.1
============================
   With sqrt timing: 1938ms
Without sqrt timing: 1840ms

The result for C# 7 is surprising: only about 100 milliseconds are shaved off the total time. So don't bother with this optimization in C#, since the speed improvement is negligible (only 5% better).

History

  • Initial Release: 2019-01-09