The benchmark is hosted at Github
I discovered a real case of premature micro-optimization when you don’t measure. You know it is pretty bad when you read premature and micro on the same sentence. On the page 44 of Optimizing C++ ebook by Agner Fog, it is written …
In some cases it is possible to replace a poorly predictable branch by a table lookup. For example:
float a; bool b; a = b ? 1.5f : 2.6f;
The ?: operator here is a branch. If it is poorly predictable then replace it by a table lookup:
float a; bool b = 0; const float lookup[2] = {2.6f, 1.5f}; a = lookup[b];
If a bool is used as an array index then it is important to make sure it is initialized or comes from a reliable source so that it can have no other values than 0 or 1. In some cases the compiler can automatically replace a branch by a conditional move.
I was trying to implement this lookup table optimization (to replace ternary operator) on a floating-point value which the code is compiled with G++ and ran on Linux. I also ran the integer benchmark and on other compilers such like Visual C++ 2019 and Clang and also the Visual C# 7 to see their differences. Below is the C++ benchmark code. The lookup array is declared as a static local variable inside the function.
int64_t IntTernaryOp(bool value) { return value ? 3 : 4; } int64_t IntArrayOp(bool value) { static const int64_t arr[2] = { 3, 4 }; return arr[value]; } double FloatTernaryOp(bool value) { return value ? 3.0f : 4.0f; } double FloatArrayOp(bool value) { static const double arr[2] = { 3.0f, 4.0f }; return arr[value]; } int main() { const size_t MAX_LOOP = 1000000000; int64_t sum = 0; double sum_f = 0; timer stopwatch; stopwatch.start("IntTernaryOp"); sum = 0; for (size_t k = 0; k < MAX_LOOP; ++k) { sum += IntTernaryOp(k % 2); } stopwatch.stop(sum); stopwatch.start("IntArrayOp"); sum = 0; for (size_t k = 0; k < MAX_LOOP; ++k) { sum += IntArrayOp(k % 2); } stopwatch.stop(sum); stopwatch.start("FloatTernaryOp"); sum_f = 0; for (size_t k = 0; k < MAX_LOOP; ++k) { sum_f += FloatTernaryOp(k % 2); } stopwatch.stop(sum_f); stopwatch.start("FloatArrayOp"); sum_f = 0; for (size_t k = 0; k < MAX_LOOP; ++k) { sum_f += FloatArrayOp(k % 2); } stopwatch.stop(sum_f); return 0; }
In Visual C# code, static local variable is not permitted so the lookup array is declared as a static member variable inside the class. C# does not allow casting integer bool to integer (to be used as array index), so I use a integer instead.
static Int64[] arrInt64 = new Int64[] { 3, 4 }; static Double[] arrDouble = new Double[] { 3.0, 4.0 }; static Int64 IntTernaryOp(Int64 value) { return (value==1) ? 3 : 4; } static Int64 IntArrayOp(Int64 value) { return arrInt64[value]; } static double FloatTernaryOp(Int64 value) { return (value == 1) ? 3.0f : 4.0f; } static double FloatArrayOp(Int64 value) { return arrDouble[value]; } static void Main(string[] args) { const int MAX_LOOP = 1000000000; Int64 sum = 0; double sum_f = 0.0; Stopwatch stopWatch = new Stopwatch(); stopWatch.Start(); sum = 0; for (Int64 k = 0; k < MAX_LOOP; ++k) { sum += IntTernaryOp(k % 2); } stopWatch.Stop(); DisplayElapseTime("IntTernaryOp", stopWatch.Elapsed, sum); stopWatch = new Stopwatch(); stopWatch.Start(); sum = 0; for (Int64 k = 0; k < MAX_LOOP; ++k) { sum += IntArrayOp(k % 2); } stopWatch.Stop(); DisplayElapseTime("IntArrayOp", stopWatch.Elapsed, sum); stopWatch = new Stopwatch(); stopWatch.Start(); sum_f = 0; for (Int64 k = 0; k < MAX_LOOP; ++k) { sum_f += FloatTernaryOp(k % 2); } stopWatch.Stop(); DisplayElapseTime("FloatTernaryOp", stopWatch.Elapsed, sum_f); stopWatch = new Stopwatch(); stopWatch.Start(); sum_f = 0; for (Int64 k = 0; k < MAX_LOOP; ++k) { sum_f += FloatArrayOp(k % 2); } stopWatch.Stop(); DisplayElapseTime("FloatArrayOp", stopWatch.Elapsed, sum_f); }
This is the benchmark results on a computer with Intel i7-8700 at 3.2Ghz with 16GB RAM.
VC++ /Ox results IntTernaryOp: 562ms, result:3500000000 IntArrayOp: 523ms, result:3500000000 FloatTernaryOp: 3972ms, result:3.5e+09 FloatArrayOp: 1030ms, result:3.5e+09 G++ 7.4.0 -O3 results IntTernaryOp: 306ms, result:3500000000 IntArrayOp: 519ms, result:3500000000 FloatTernaryOp: 1030ms, result:3.5e+09 FloatArrayOp: 1030ms, result:3.5e+09 Clang++ 6.0.0 -O# results IntTernaryOp: 585ms, result:3500000000 IntArrayOp: 523ms, result:3500000000 FloatTernaryOp: 1030ms, result:3.5e+09 FloatArrayOp: 1030ms, result:3.5e+09 C# 7 Release Mode, .NET Framework 4.7.2 IntTernaryOp: 1311ms, result:3500000000 IntArrayOp: 1038ms, result:3500000000 FloatTernaryOp: 2448ms, result:3500000000 FloatArrayOp: 1036ms, result:3500000000
For the integer benchmark, the lookup table got worse performance than lookup table with G++, while in Visual C++/C# and Clang++, there is improvement. As it turns out in the floating benchmark I am looking for (in G++), improvement is only seen Visual C++/C# while G++ and Clang++ retained the same timing for both ternary and lookup table code. I did check the assembly code at the Godbolt Compiler Explorer, none of the ternary operator is converted to use conditional move as suggested by Agner Fog’s book so I guess the short branch is still with the CPU cache, therefore it makes little difference in G++ and Clang++ floating-point test. The morale of the story is never take a book’s word at its value and always measure to confirm your hypothesis. In my case, I forgo this lookup micro-optimization.
Leave a Reply