Monday, February 6, 2012

C++ and beyond - The power of abstractions


Modern C++
I've been primarily a C++ programmer, even though I have dabbled with Assembler, C, Object Pascal, Java, C#, and Javascript (in that order, chronologically). Over the years, I learned to adapt my style as much as possible from the "C with classes" style to "C++ with STL" to "Modern C++".
There's a lot of awesome material online from Andrei Alexandrescu, Herb Sutter, Bjarne Stroustrup and Scott Meyers, that mostly popular since the C++0x standard came into the light and after looking at those, my understanding of the C++ toolbox improved quite a bit.

One of the oft repeated tenets in the modern C++ community is that well written generic code using type-rich templates will produce faster code than C style if used properly. I always took this on faith, since the people who claim this (like the above 4 gentlemen) have forgotten more C++ than I ever learned.

Now that I started work on BABEL, I decided to test this out and see with my own eyes, if modern C++ with abstract,type-rich idioms can beat plain old C style code, while remaining extremely high level and readable.

A numerical sequence class
I often deal with problems that involve the use of intervals or ranges... I wanted to create an abstraction that lets me deal with these in a simple manner.

Thus I want a type that holds a range like [begin, end) and define operator overloads and other member functions for slicing, disjunction and so on. One useful feature would be to have it support an iterator interface, so that it plugs into STL numeric algorithms easily.

Here's the basic public interface I ended up with :
template<typename t> struct TNumSeq
{
public:
    friend inline bool operator<(const TNumSeq &range, const T &val);
    friend inline bool operator>(const TNumSeq &range, const T &val);
    friend inline TNumSeq operator&(const TNumSeq &seq1, const TNumSeq &seq2);
    friend inline TNumSeq operator&(const TNumSeq &seq, T val);
    friend inline TNumSeq operator&(T val, const TNumSeq &seq);

    inline TNumSeq();
    inline TNumSeq(const T &b, const T &e);
    inline explicit TNumSeq(const T &b);
    inline TNumSeq &operator=(const TNumSeq &that);
    inline TNumSeq& operator&=(const TNumSeq &that);
    inline TNumSeq operator&(const TNumSeq &that);

    inline bool has(const T& val) const;
    inline operator bool() const;

    class const_iterator : public std::iterator<std::forward_iterator_tag, T>
    {
    public:
        const_iterator(const TNumSeq &seq, const T &inc, const T &val);
        inline T operator*() const;
        inline const_iterator operator++(int);
        inline const_iterator &operator++();
        inline bool operator==(const const_iterator &that) const;
        inline bool operator!=(const const_iterator &that) const;
    };

    const_iterator begin(T inc = 1) const;
    const_iterator end(T inc = 1) const;
};

It's quite simple, you can define a numeric range, use the < and > operators to check if a value or another range is outside the range on either side, you can use & to get the part of a range that overlaps with another.

Empty ranges work fine, implicit bool conversion(Danger! Will Robinson) checks for emptiness. The has() member gives you membership test. And there is a forward const_iterator that lets this class masquerade as if the sequence were a real set of stored integers. The begin() and end() functions take a stepping increment for iteration by any "jump".

So to test this out... I tested the following code :

int i = 0, j = 1 << 30, sum = 0;
for(int n = i; n < j; n ++)
{
    sum += n;
}

versus

int i = 0, j = 1 << 30;
TNumSeq<int> s(i, j);
int sum = std::accumulate(s.begin(), s.end(), 0);

Ignore the fact that the default int will overflow the summation, for the range 0 to 2^30, that's immaterial for this test. One (and certainly, yours truly) would imagine that by no stretch of imagination, the generated code for the second example would be faster that the bare bones loop in the first - Look at all that is happening in the accumulate() call :
  • Two method invocations to get the iterators, which are at least 12 byte objects, have a constructor, and passed by value to accumulate, thus copying them twice each (apparently).
  • A loop inside, which calls the comparison,increment and dereference operators on the iterator object, both of whose implementations contain member access.
Seems like an awfully complicated code path for a very trivial task - but when I timed this I found that the two versions run at the same speed. Go figure!
Here is the exact assembly code generated by the compiler for the two cases :

009A18E0  add         ecx,eax
009A18E2  inc         eax
009A18E3  cmp         eax,40000000h
009A18E8  jl          testNumSeq+60h (9A18E0h)
versus
009A1BE0  add         ecx,eax
009A1BE2  add         eax,esi
009A1BE4  cmp         eax,40000000h
009A1BE9  jne         testNumSeq+360h (9A1BE0h)

And for increment by an arbitrary number :

00BA18E4  add         ecx,eax
00BA18E6  add         eax,2
00BA18E9  cmp         eax,40000000h
00BA18EE  jl          testNumSeq+64h (0BA18E4h)
versus
00BA1BF4  add         ecx,eax
00BA1BF6  add         eax,edx
00BA1BF8  cmp         eax,40000000h
00BA1BFD  jne         testNumSeq+374h (0BA1BF4h)                                                     

No difference except the fact that the for loop uses a constant as the increment, whereas the TNumSeq uses a variable on a register as the increment. How the hell does this happen? It seems like magic, the way that the compiler took away all that 50 odd lines of code in my class and came up with the succinct loop above. The answer is that it is due to being able to express intention more clearly to the compiler. There are many little things that happen, I will detail a few...
  • RVO and move semantics - The iterator pass by value copying is elided by RVO (return value optimization). With C++11, the move semantics feature takes care of most cases of this at the language level, and whenever an object is moved, it actually reuses the storage and state, rather than copy to a new place and destroy the old one. That means, you can treat any old object, even if it is large as if it were like an int and pass it around by value with impunity, knowing that the language/compiler is taking care of efficiency concerns. No overhead induced or change to existing code!
  • const correctness - since the parent sequence of the iterator is stored within the iterator as a const reference, and the iterator dereference operator is const, The compiler knows that the state of the iterator cannot change durin operator* and that the sequence cannot change during the that as well as operator++. This means it can keep those values in registers.
  • template inlining - templates are not code, they are actually code generators with zero overhead. All the above code becomes a single chunk of some meta-C++ which, unlike the C style code, has information alongside about what the code means. The compiler is able to use this extra information to generate the best possible machine code. Trust me, it's way more smarter than you or me, when it comes to this. As an assembly programmer I often observe compiler generated machine code to see if there is any scope of improvement and 9 times out of 10, there is none, unless I radically change the algorithm or use vectorization instructions...
So, in conclusion, we can see that writing idiomatic C++ code in a modern way sacrifices nothing, in performance while still providing high level abstractions. This is the future! To quote Sutter from GoingNative2012 "Performance has always been the king, since the entire history of programming, except the last decade,but now it's 'The Return of the king'

More on this as i progress with BABEL

Sunday, February 5, 2012

The library of babel

If you haven't heard of that library, I recommend that you go and read the short story by that name ASAP, pronto, on the double, etc. etc.

This post however is one of a series, where I will document the creation of my own utility library in C++.

What?

One of the biggest and most important steps in creating software, whether a product or a framework is to name it properly, such that it makes some sense. A recursive acronym as the name is even better.

Since this library will contain everything (in terms of code that I ever needed in more than one project), I call it babel - To expand the recursive acronym - Babel Attempts Being Everything Literally
I don't intend anyone except me to use it, it's my little toolbox to do things the way I want them done.

It will be a heterogeneous collection of classes and modules which will include stuff extending the STL, wrappers and helpers for OpenGL, CUDA, COM, Directshow and DES.
Stuff for Data compression, Image and text handling, machine learning and so on.

Whatever I feel I want to include! Anything and Everything.

Why?

Why one more library? Especially since there is Loki and Boost and QT and what not, written by far more smarter programmers?

For several reasons :
  • Writing a high quality library in C++ is a great exercise to improve coding and abstraction skills. Especially since we have C++11 now and the game is changing fast.
  • If an existing powerful library like boost does not do exactly what I wish, it is probably beyond my capability to modify it to do so. And much less fun studying and meddling with existing code, than writing it from scratch.
  • There are some things --some trivial and some not-- that BABEL is planned to implement that no one anywhere else seems to have implemented. BABEL will be like my own personal workshop - the set of things it will do is probably not present in any one single library. And piecing together parts from different libraries in a program seems very inelegant sometimes.

How?

Essentially with MS Visual Studio 2010, using as many C++0x/11 features that I can use to make the code modern.
The NVCC CUDA compiler also figures in the mix.

My process for developing features, is to look at a very common task that I often code in my projects, and make an abstraction for that.
Then I look at how to generalize that and match it to the other parts that already exist.
I keep repeating this process, sometimes rewriting and extending older code to match the newer stuff better.
Once in a while I refactor, and try to see how each class or module could utilize the others.

When?

I started building parts of this about a year ago, and there are numerous fragments of useful code that I have in diverse locations, from several years back, which I will slowly assimilate into this.

The idea is after a point, there is such a good toolkit that my projects become a snap to write and I have more time to extend the library.

Where?

I'll eventually put up a git repository on github, hopefully at least parts of it will be of interest to others.
I will post about the design and implementation of various parts of it on this blog, as I make progress...

Saturday, February 4, 2012

Big end failure

Recently I did a project - which involved debugging a mixed .NET/C++ application that was failing on a 64 bit build. Now this was a complicated non-trivial app, that used the Microsoft DirectShow multimedia framework, and one of the reasons that I was hired, was because I have pretty decent experience with this.

My first line of attack was obviously to check the directshow stuff and while the code wasn't really to my taste, It seemed to work in 32 bit just fine. The 64 bit build just refused to work. I immediately jumped to the conclusion that there was something to do with the interaction of a 64 bit process and 32 bit filters. I checked whether a DShow filter works out of the box after rebuilding with 64 bit, and it did. So I had nothing.

Then I had no choice but to run both versions in separate debugger instances simultaneously, stepping to the point where something differed. The debugging was made quite annoying by the fact that for some reason, the 32 bit app would throw an access violation everytime I single stepped, perhaps the environment was broken(I was doing all this over a remote connection), perhaps attaching to a .NET app calling a C++ dll was messing up the debugger. The only way I could step was to add new breakpoints and resume the execution each time! It finally led to a decryption function, and there was a lot of bit twiddling going on and some typedef'd types. Once again I thought I'd found the issue - I made sure every instance of int in that code was replaced with _int32 so that the code would not be wrong in assuming every int was 32 bits. But the problem still remained.

Once again I did the side-by-side debugging and painstakingly expanded the macros in the crytpo code to compare intermediate values of expressions, and suddenly the light shone through - A macro that returned the last byte of a DWORD gave different results on x86 and x64. In other words, the code was using the big-endian version of things - Looking through a few headers, it became obvious. The code comes from a very old C program, written in the 90s : Though it checked for various architectures, it somehow assumed that anything not x86 was a big-endian platform (which may have been true for the platforms it had been intended for) - However this made it also assume that x64 was a big-endian, and so we had our mess-up.
So in the end it was just a line or two of code actually changed, but who would have "thunk it" eh?

PS : The title of this post is a pun!