# Real Case of Premature Micro-Optimization

The benchmark is hosted at Github

I discovered a real case of premature micro-optimization when you don’t measure. You know it is pretty bad when you read premature and micro on the same sentence. On the page 44 of Optimizing C++ ebook by Agner Fog, it is written …

In some cases it is possible to replace a poorly predictable branch by a table lookup. For example:

```float a;
bool b;
a = b ? 1.5f : 2.6f;```

The ?: operator here is a branch. If it is poorly predictable then replace it by a table lookup:

```float a;
bool b = 0;
const float lookup[2] = {2.6f, 1.5f};
a = lookup[b];```

If a bool is used as an array index then it is important to make sure it is initialized or comes from a reliable source so that it can have no other values than 0 or 1. In some cases the compiler can automatically replace a branch by a conditional move.

I was trying to implement this lookup table optimization (to replace ternary operator) on a floating-point value which the code is compiled with G++ and ran on Linux. I also ran the integer benchmark and on other compilers such like Visual C++ 2019 and Clang and also the Visual C# 7 to see their differences. Below is the C++ benchmark code. The lookup array is declared as a static local variable inside the function.

```int64_t IntTernaryOp(bool value)
{
return value ? 3 : 4;
}

int64_t IntArrayOp(bool value)
{
static const int64_t arr[2] = { 3, 4 };
return arr[value];
}

double FloatTernaryOp(bool value)
{
return value ? 3.0f : 4.0f;
}

double FloatArrayOp(bool value)
{
static const double arr[2] = { 3.0f, 4.0f };
return arr[value];
}

int main()
{
const size_t MAX_LOOP = 1000000000;

int64_t sum = 0;
double sum_f = 0;

timer stopwatch;

stopwatch.start("IntTernaryOp");
sum = 0;
for (size_t k = 0; k < MAX_LOOP; ++k)
{
sum += IntTernaryOp(k % 2);
}
stopwatch.stop(sum);

stopwatch.start("IntArrayOp");
sum = 0;
for (size_t k = 0; k < MAX_LOOP; ++k)
{
sum += IntArrayOp(k % 2);
}
stopwatch.stop(sum);

stopwatch.start("FloatTernaryOp");
sum_f = 0;
for (size_t k = 0; k < MAX_LOOP; ++k)
{
sum_f += FloatTernaryOp(k % 2);
}
stopwatch.stop(sum_f);

stopwatch.start("FloatArrayOp");
sum_f = 0;
for (size_t k = 0; k < MAX_LOOP; ++k)
{
sum_f += FloatArrayOp(k % 2);
}
stopwatch.stop(sum_f);

return 0;
}```

In Visual C# code, static local variable is not permitted so the lookup array is declared as a static member variable inside the class. C# does not allow casting integer bool to integer (to be used as array index), so I use a integer instead.

```static Int64[] arrInt64 = new Int64[] { 3, 4 };
static Double[] arrDouble = new Double[] { 3.0, 4.0 };

static Int64 IntTernaryOp(Int64 value)
{
return (value==1) ? 3 : 4;
}

static Int64 IntArrayOp(Int64 value)
{
return arrInt64[value];
}

static double FloatTernaryOp(Int64 value)
{
return (value == 1) ? 3.0f : 4.0f;
}

static double FloatArrayOp(Int64 value)
{
return arrDouble[value];
}

static void Main(string[] args)
{
const int MAX_LOOP = 1000000000;

Int64 sum = 0;
double sum_f = 0.0;

Stopwatch stopWatch = new Stopwatch();
stopWatch.Start();
sum = 0;
for (Int64 k = 0; k < MAX_LOOP; ++k)
{
sum += IntTernaryOp(k % 2);
}
stopWatch.Stop();
DisplayElapseTime("IntTernaryOp", stopWatch.Elapsed, sum);

stopWatch = new Stopwatch();
stopWatch.Start();
sum = 0;
for (Int64 k = 0; k < MAX_LOOP; ++k)
{
sum += IntArrayOp(k % 2);
}
stopWatch.Stop();
DisplayElapseTime("IntArrayOp", stopWatch.Elapsed, sum);

stopWatch = new Stopwatch();
stopWatch.Start();
sum_f = 0;
for (Int64 k = 0; k < MAX_LOOP; ++k)
{
sum_f += FloatTernaryOp(k % 2);
}
stopWatch.Stop();
DisplayElapseTime("FloatTernaryOp", stopWatch.Elapsed, sum_f);

stopWatch = new Stopwatch();
stopWatch.Start();
sum_f = 0;
for (Int64 k = 0; k < MAX_LOOP; ++k)
{
sum_f += FloatArrayOp(k % 2);
}
stopWatch.Stop();
DisplayElapseTime("FloatArrayOp", stopWatch.Elapsed, sum_f);
}```

This is the benchmark results on a computer with Intel i7-8700 at 3.2Ghz with 16GB RAM.

```VC++ /Ox results
IntTernaryOp:  562ms, result:3500000000
IntArrayOp:  523ms, result:3500000000
FloatTernaryOp: 3972ms, result:3.5e+09
FloatArrayOp: 1030ms, result:3.5e+09
G++ 7.4.0 -O3 results
IntTernaryOp:  306ms, result:3500000000
IntArrayOp:  519ms, result:3500000000
FloatTernaryOp: 1030ms, result:3.5e+09
FloatArrayOp: 1030ms, result:3.5e+09
Clang++ 6.0.0 -O# results
IntTernaryOp:  585ms, result:3500000000
IntArrayOp:  523ms, result:3500000000
FloatTernaryOp: 1030ms, result:3.5e+09
FloatArrayOp: 1030ms, result:3.5e+09
C# 7 Release Mode, .NET Framework 4.7.2
IntTernaryOp: 1311ms, result:3500000000
IntArrayOp: 1038ms, result:3500000000
FloatTernaryOp: 2448ms, result:3500000000
FloatArrayOp: 1036ms, result:3500000000```

For the integer benchmark, the lookup table got worse performance than lookup table with G++, while in Visual C++/C# and Clang++, there is improvement. As it turns out in the floating benchmark I am looking for (in G++), improvement is only seen Visual C++/C# while G++ and Clang++ retained the same timing for both ternary and lookup table code. I did check the assembly code at the Godbolt Compiler Explorer, none of the ternary operator is converted to use conditional move as suggested by Agner Fog’s book so I guess the short branch is still with the CPU cache, therefore it makes little difference in G++ and Clang++ floating-point test. The morale of the story is never take a book’s word at its value and always measure to confirm your hypothesis. In my case, I forgo this lookup micro-optimization.