Monday, February 6, 2012

C++ and beyond - The power of abstractions


Modern C++
I've been primarily a C++ programmer, even though I have dabbled with Assembler, C, Object Pascal, Java, C#, and Javascript (in that order, chronologically). Over the years, I learned to adapt my style as much as possible from the "C with classes" style to "C++ with STL" to "Modern C++".
There's a lot of awesome material online from Andrei Alexandrescu, Herb Sutter, Bjarne Stroustrup and Scott Meyers, that mostly popular since the C++0x standard came into the light and after looking at those, my understanding of the C++ toolbox improved quite a bit.

One of the oft repeated tenets in the modern C++ community is that well written generic code using type-rich templates will produce faster code than C style if used properly. I always took this on faith, since the people who claim this (like the above 4 gentlemen) have forgotten more C++ than I ever learned.

Now that I started work on BABEL, I decided to test this out and see with my own eyes, if modern C++ with abstract,type-rich idioms can beat plain old C style code, while remaining extremely high level and readable.

A numerical sequence class
I often deal with problems that involve the use of intervals or ranges... I wanted to create an abstraction that lets me deal with these in a simple manner.

Thus I want a type that holds a range like [begin, end) and define operator overloads and other member functions for slicing, disjunction and so on. One useful feature would be to have it support an iterator interface, so that it plugs into STL numeric algorithms easily.

Here's the basic public interface I ended up with :
template<typename t> struct TNumSeq
{
public:
    friend inline bool operator<(const TNumSeq &range, const T &val);
    friend inline bool operator>(const TNumSeq &range, const T &val);
    friend inline TNumSeq operator&(const TNumSeq &seq1, const TNumSeq &seq2);
    friend inline TNumSeq operator&(const TNumSeq &seq, T val);
    friend inline TNumSeq operator&(T val, const TNumSeq &seq);

    inline TNumSeq();
    inline TNumSeq(const T &b, const T &e);
    inline explicit TNumSeq(const T &b);
    inline TNumSeq &operator=(const TNumSeq &that);
    inline TNumSeq& operator&=(const TNumSeq &that);
    inline TNumSeq operator&(const TNumSeq &that);

    inline bool has(const T& val) const;
    inline operator bool() const;

    class const_iterator : public std::iterator<std::forward_iterator_tag, T>
    {
    public:
        const_iterator(const TNumSeq &seq, const T &inc, const T &val);
        inline T operator*() const;
        inline const_iterator operator++(int);
        inline const_iterator &operator++();
        inline bool operator==(const const_iterator &that) const;
        inline bool operator!=(const const_iterator &that) const;
    };

    const_iterator begin(T inc = 1) const;
    const_iterator end(T inc = 1) const;
};

It's quite simple, you can define a numeric range, use the < and > operators to check if a value or another range is outside the range on either side, you can use & to get the part of a range that overlaps with another.

Empty ranges work fine, implicit bool conversion(Danger! Will Robinson) checks for emptiness. The has() member gives you membership test. And there is a forward const_iterator that lets this class masquerade as if the sequence were a real set of stored integers. The begin() and end() functions take a stepping increment for iteration by any "jump".

So to test this out... I tested the following code :

int i = 0, j = 1 << 30, sum = 0;
for(int n = i; n < j; n ++)
{
    sum += n;
}

versus

int i = 0, j = 1 << 30;
TNumSeq<int> s(i, j);
int sum = std::accumulate(s.begin(), s.end(), 0);

Ignore the fact that the default int will overflow the summation, for the range 0 to 2^30, that's immaterial for this test. One (and certainly, yours truly) would imagine that by no stretch of imagination, the generated code for the second example would be faster that the bare bones loop in the first - Look at all that is happening in the accumulate() call :
  • Two method invocations to get the iterators, which are at least 12 byte objects, have a constructor, and passed by value to accumulate, thus copying them twice each (apparently).
  • A loop inside, which calls the comparison,increment and dereference operators on the iterator object, both of whose implementations contain member access.
Seems like an awfully complicated code path for a very trivial task - but when I timed this I found that the two versions run at the same speed. Go figure!
Here is the exact assembly code generated by the compiler for the two cases :

009A18E0  add         ecx,eax
009A18E2  inc         eax
009A18E3  cmp         eax,40000000h
009A18E8  jl          testNumSeq+60h (9A18E0h)
versus
009A1BE0  add         ecx,eax
009A1BE2  add         eax,esi
009A1BE4  cmp         eax,40000000h
009A1BE9  jne         testNumSeq+360h (9A1BE0h)

And for increment by an arbitrary number :

00BA18E4  add         ecx,eax
00BA18E6  add         eax,2
00BA18E9  cmp         eax,40000000h
00BA18EE  jl          testNumSeq+64h (0BA18E4h)
versus
00BA1BF4  add         ecx,eax
00BA1BF6  add         eax,edx
00BA1BF8  cmp         eax,40000000h
00BA1BFD  jne         testNumSeq+374h (0BA1BF4h)                                                     

No difference except the fact that the for loop uses a constant as the increment, whereas the TNumSeq uses a variable on a register as the increment. How the hell does this happen? It seems like magic, the way that the compiler took away all that 50 odd lines of code in my class and came up with the succinct loop above. The answer is that it is due to being able to express intention more clearly to the compiler. There are many little things that happen, I will detail a few...
  • RVO and move semantics - The iterator pass by value copying is elided by RVO (return value optimization). With C++11, the move semantics feature takes care of most cases of this at the language level, and whenever an object is moved, it actually reuses the storage and state, rather than copy to a new place and destroy the old one. That means, you can treat any old object, even if it is large as if it were like an int and pass it around by value with impunity, knowing that the language/compiler is taking care of efficiency concerns. No overhead induced or change to existing code!
  • const correctness - since the parent sequence of the iterator is stored within the iterator as a const reference, and the iterator dereference operator is const, The compiler knows that the state of the iterator cannot change durin operator* and that the sequence cannot change during the that as well as operator++. This means it can keep those values in registers.
  • template inlining - templates are not code, they are actually code generators with zero overhead. All the above code becomes a single chunk of some meta-C++ which, unlike the C style code, has information alongside about what the code means. The compiler is able to use this extra information to generate the best possible machine code. Trust me, it's way more smarter than you or me, when it comes to this. As an assembly programmer I often observe compiler generated machine code to see if there is any scope of improvement and 9 times out of 10, there is none, unless I radically change the algorithm or use vectorization instructions...
So, in conclusion, we can see that writing idiomatic C++ code in a modern way sacrifices nothing, in performance while still providing high level abstractions. This is the future! To quote Sutter from GoingNative2012 "Performance has always been the king, since the entire history of programming, except the last decade,but now it's 'The Return of the king'

More on this as i progress with BABEL

No comments:

Post a Comment