Monday, July 23, 2012

Tales from the crypt - Cross linked partition tables

Back in 1999 I happened to own a DX4-100 (that's a clock-tripled 486 processor) based system. I ran a dual boot of Windows 98 and Red Hat Linux 6.1 on a small 2 GB hard drive.

The system had 32 MB of RAM, and I used about 100 MB of page file for Windows on a separate partition. A huge annoyance was that I also needed to allocate precious space for Linux to swap on. I could have made Linux use the same file as swap space, but some article claimed it was better to use a whole partition. Using that partition as swap for Linux, however, would make Windows no longer recognize it as a file system.

So I decided to do something quite dangerous, at the risk of destroying all my data. I decided to create a cross linked partition table.

Using Peter Norton's Diskedit (wasn't it just amazing how it brought up a graphical mouse pointer while still appearing to be in text mode?) I added a new partition entry and pointed it at a range that lived within the page file of the Windows installation.

After booting into Linux, I used dd to back up a sector from that new partition (just in case), and then overwrote it with zeros. Reading the page file on the Windows swap partition showed the same block of zeros I had just written.

I added mkswap and swapon commands to the init scripts, and things worked perfectly on Linux.
Windows ran perfectly too, oblivious to the fact that its page file was doing double duty.

A dangerous and crazy trick, but it did the job for a long time, until I got a 4.3 GB drive.

Monday, February 6, 2012

C++ and beyond - The power of abstractions

Modern C++
I've been primarily a C++ programmer, though I have dabbled in Assembler, C, Object Pascal, Java, C#, and JavaScript (in that order, chronologically). Over the years, I learned to adapt my style as much as possible from "C with classes" to "C++ with STL" to "Modern C++".
There's a lot of awesome material online from Andrei Alexandrescu, Herb Sutter, Bjarne Stroustrup and Scott Meyers, most of it popularized since the C++0x standard came to light, and after going through it, my understanding of the C++ toolbox improved quite a bit.

One of the oft-repeated tenets in the modern C++ community is that well-written generic code using type-rich templates will, used properly, produce faster code than C-style code. I always took this on faith, since the people who claim it (like the four gentlemen above) have forgotten more C++ than I ever learned.

Now that I've started work on BABEL, I decided to test this out and see with my own eyes whether modern C++ with abstract, type-rich idioms can beat plain old C-style code while remaining extremely high level and readable.

A numerical sequence class
I often deal with problems that involve the use of intervals or ranges... I wanted to create an abstraction that lets me deal with these in a simple manner.

Thus I want a type that holds a range like [begin, end) and defines operator overloads and other member functions for slicing, intersection and so on. One useful feature would be an iterator interface, so that it plugs into the STL numeric algorithms easily.

Here's the basic public interface I ended up with:
template<typename T> struct TNumSeq
{
    friend inline bool operator<(const TNumSeq &range, const T &val);
    friend inline bool operator>(const TNumSeq &range, const T &val);
    friend inline TNumSeq operator&(const TNumSeq &seq1, const TNumSeq &seq2);
    friend inline TNumSeq operator&(const TNumSeq &seq, T val);
    friend inline TNumSeq operator&(T val, const TNumSeq &seq);

    inline TNumSeq();
    inline TNumSeq(const T &b, const T &e);
    inline explicit TNumSeq(const T &b);
    inline TNumSeq &operator=(const TNumSeq &that);
    inline TNumSeq &operator&=(const TNumSeq &that);
    inline TNumSeq operator&(const TNumSeq &that);

    inline bool has(const T &val) const;
    inline operator bool() const;

    class const_iterator : public std::iterator<std::forward_iterator_tag, T>
    {
        const_iterator(const TNumSeq &seq, const T &inc, const T &val);

    public:
        inline T operator*() const;
        inline const_iterator operator++(int);
        inline const_iterator &operator++();
        inline bool operator==(const const_iterator &that) const;
        inline bool operator!=(const const_iterator &that) const;
    };

    const_iterator begin(T inc = 1) const;
    const_iterator end(T inc = 1) const;
};

It's quite simple: you can define a numeric range, use the < and > operators to check whether a value or another range falls outside it on either side, and use & to get the part of a range that overlaps another.

Empty ranges work fine; the implicit bool conversion (Danger, Will Robinson!) checks for emptiness. The has() member gives you a membership test. And there is a forward const_iterator that lets this class masquerade as if the sequence were a real set of stored integers. The begin() and end() functions take a stepping increment, so you can iterate by any "jump".

So to test this out, I timed the following two pieces of code:

// The plain C-style loop:
int i = 0, j = 1 << 30, sum = 0;
for(int n = i; n < j; n++)
    sum += n;

// The TNumSeq version:
int i = 0, j = 1 << 30;
TNumSeq<int> s(i, j);
int sum = std::accumulate(s.begin(), s.end(), 0);

Ignore the fact that the default int will overflow the summation for the range 0 to 2^30; that's immaterial for this test. One (and certainly yours truly) would imagine that by no stretch of the imagination could the generated code for the second example be as fast as the bare-bones loop in the first. Look at all that is happening in the accumulate() call:
  • Two method invocations to get the iterators, which are at least 12-byte objects, have a constructor, and are passed by value to accumulate, thus (apparently) copying each of them twice.
  • A loop inside, which calls the comparison, increment and dereference operators on the iterator object, whose implementations all contain member accesses.
Seems like an awfully complicated code path for a very trivial task. But when I timed this, I found that the two versions run at the same speed. Go figure!
Here is the exact assembly code generated by the compiler for the two cases :

; The bare C-style loop:
009A18E0  add         ecx,eax
009A18E2  inc         eax
009A18E3  cmp         eax,40000000h
009A18E8  jl          testNumSeq+60h (9A18E0h)

; The std::accumulate version:
009A1BE0  add         ecx,eax
009A1BE2  add         eax,esi
009A1BE4  cmp         eax,40000000h
009A1BE9  jne         testNumSeq+360h (9A1BE0h)

And for increment by an arbitrary number :

; The bare C-style loop:
00BA18E4  add         ecx,eax
00BA18E6  add         eax,2
00BA18E9  cmp         eax,40000000h
00BA18EE  jl          testNumSeq+64h (0BA18E4h)

; The std::accumulate version:
00BA1BF4  add         ecx,eax
00BA1BF6  add         eax,edx
00BA1BF8  cmp         eax,40000000h
00BA1BFD  jne         testNumSeq+374h (0BA1BF4h)

No difference, except that the for loop uses a constant as the increment, whereas TNumSeq uses a variable held in a register. How on earth does this happen? It seems like magic, the way the compiler took those 50-odd lines of code in my class and boiled them down to the succinct loop above. The answer is that the abstractions express intention more clearly to the compiler. Many little things contribute; I will detail a few...
  • RVO and move semantics - The iterator pass-by-value copying is elided by RVO (return value optimization). With C++11, move semantics handle most such cases at the language level: whenever an object is moved, its storage and state are reused rather than copied to a new place while the old object is destroyed. That means you can treat any object, even a large one, as if it were an int and pass it around by value with impunity, knowing that the language and compiler take care of the efficiency concerns. No overhead, and no change to existing code!
  • const correctness - Since the parent sequence is stored within the iterator as a const reference, and the iterator's dereference operator is const, the compiler knows that the state of the iterator cannot change during operator* and that the sequence cannot change during either operator* or operator++. This means it can keep those values in registers.
  • Template inlining - Templates are not code; they are code generators with zero overhead. All the above becomes a single chunk of code that, unlike the C-style version, carries information about what it means. The compiler uses this extra information to generate the best possible machine code. Trust me, it's far smarter than you or me when it comes to this. As an assembly programmer I often inspect compiler-generated machine code to see if there is any room for improvement, and 9 times out of 10 there is none, unless I radically change the algorithm or use vectorization instructions...
So, in conclusion, writing idiomatic C++ in a modern way sacrifices nothing in performance while still providing high-level abstractions. This is the future! To quote Sutter from GoingNative 2012: "Performance has always been king, through the entire history of programming, except in the last decade. But now it's 'The Return of the King'."

More on this as I progress with BABEL.

Sunday, February 5, 2012

The library of babel

If you haven't heard of that library, I recommend that you go and read the short story by that name ASAP, pronto, on the double, etc. etc.

This post however is one of a series, where I will document the creation of my own utility library in C++.


One of the biggest and most important steps in creating software, whether a product or a framework, is to name it properly, such that it makes some sense. A recursive acronym as the name is even better.

Since this library will contain everything (in terms of code that I've ever needed in more than one project), I call it BABEL. To expand the recursive acronym: Babel Attempts Being Everything Literally.
I don't intend anyone except me to use it, it's my little toolbox to do things the way I want them done.

It will be a heterogeneous collection of classes and modules: stuff extending the STL; wrappers and helpers for OpenGL, CUDA, COM, DirectShow and DES; stuff for data compression, image and text handling, machine learning and so on.

Whatever I feel I want to include! Anything and Everything.


Why one more library, especially when there are Loki and Boost and Qt and what not, written by far smarter programmers?

For several reasons :
  • Writing a high quality library in C++ is a great exercise to improve coding and abstraction skills. Especially since we have C++11 now and the game is changing fast.
  • If an existing powerful library like Boost does not do exactly what I wish, it is probably beyond my capability to modify it to do so. And it's much less fun studying and meddling with existing code than writing it from scratch.
  • There are some things, some trivial and some not, that BABEL is planned to implement that no one else seems to have implemented anywhere. BABEL will be like my own personal workshop: the set of things it will do is probably not present in any single library, and piecing together parts from different libraries in one program can feel very inelegant.


Essentially, I'm building it with MS Visual Studio 2010, using as many C++0x/11 features as I can to keep the code modern.
The NVCC CUDA compiler also figures in the mix.

My process for developing features is to look at a very common task that I often code in my projects, and make an abstraction for it.
Then I look at how to generalize that and match it to the other parts that already exist.
I keep repeating this process, sometimes rewriting and extending older code to match the newer stuff better.
Once in a while I refactor, and try to see how each class or module could utilize the others.


I started building parts of this about a year ago, and there are numerous fragments of useful code that I have in diverse locations, from several years back, which I will slowly assimilate into this.

The idea is after a point, there is such a good toolkit that my projects become a snap to write and I have more time to extend the library.


I'll eventually put up a git repository on github, hopefully at least parts of it will be of interest to others.
I will post about the design and implementation of various parts of it on this blog, as I make progress...

Saturday, February 4, 2012

Big end failure

Recently I did a project which involved debugging a mixed .NET/C++ application that was failing in its 64-bit build. This was a complicated, non-trivial app that used the Microsoft DirectShow multimedia framework, and one of the reasons I was hired was that I have pretty decent experience with it.

My first line of attack was obviously to check the DirectShow stuff, and while the code wasn't really to my taste, it seemed to work just fine in 32-bit. The 64-bit build simply refused to work. I immediately jumped to the conclusion that it had something to do with the interaction of a 64-bit process and 32-bit filters. I checked whether a DirectShow filter works out of the box after rebuilding for 64-bit, and it did. So I had nothing.

Then I had no choice but to run both versions in separate debugger instances simultaneously, stepping to the point where something differed. The debugging was made quite annoying by the fact that, for some reason, the 32-bit app would throw an access violation every time I single-stepped. Perhaps the environment was broken (I was doing all this over a remote connection), or perhaps attaching to a .NET app calling a C++ DLL was messing up the debugger. The only way I could step was to add new breakpoints and resume execution each time! It finally led to a decryption function with a lot of bit twiddling and some typedef'd types. Once again I thought I'd found the issue: I made sure every instance of int in that code was replaced with __int32, so that the code would not be wrong in assuming every int was 32 bits. But the problem remained.

Once again I did the side-by-side debugging and painstakingly expanded the macros in the crypto code to compare intermediate values of expressions, and suddenly the light shone through: a macro that returned the last byte of a DWORD gave different results on x86 and x64. In other words, the code was using the big-endian version of things. Looking through a few headers, it became obvious. The code came from a very old C program written in the 90s: though it checked for various architectures, it somehow assumed that anything not x86 was a big-endian platform (which may have been true for the platforms it had been intended for). This made it also assume that x64 was big-endian, and so we had our mess-up.
So in the end only a line or two of code actually changed, but who would have "thunk it", eh?

PS : The title of this post is a pun!

Monday, December 12, 2011

The third circle

In my last post, I had mentioned circle drawing algorithms and how I'd rediscovered one after a couple of decades. "Why rasterize circles in this era of GPUs and 3d accelerators?", the gentle reader may ask...
Simply because :
  • It's always fun to re-visit fundamentals.
  • One may wish to merely iterate a point over a circular path and a GPU can't help you there.
  • One may have never heard or thought of this method, so it's fun!
Let's look at the ways you can draw circles on a raster array.

The first method
The math here is the simplest, which is why it made the most sense to me when I was 9 years old.
A circle has the equation x^2 + y^2 = r^2, so if you iterate x over the range 0 to r, you can solve for y at each x.
Drawing lines between these points sequentially gives you an arc that's a quarter of a circle. One can then mirror these points about the origin, both vertically and horizontally, to get the other three quarters of the pi(e).

Code follows...
procedure TForm1.DrawCircle1(cx, cy, r: Integer);
var
  x, y, rSquared : Integer;
  oldy, oldx : Integer;
begin
  PaintBox1.Canvas.MoveTo(cx, cy + r);
  rSquared := r * r;

  oldx := 0;
  oldy := r;

  for x := 0 to r do
    with PaintBox1.Canvas do
    begin
      y := Round(Sqrt(rSquared - (x * x)));

      MoveTo(cx - oldx, cy - oldy);
      LineTo(cx - x, cy - y);

      MoveTo(cx - oldx, cy + oldy);
      LineTo(cx - x, cy + y);

      MoveTo(cx + oldx, cy - oldy);
      LineTo(cx + x, cy - y);

      MoveTo(cx + oldx, cy + oldy);
      LineTo(cx + x, cy + y);

      oldx := x;
      oldy := y;
    end;
end;

Why Delphi?
Because it is based on Object Pascal, the simplest type-safe, statically compiled language that I know of. It has all the low-level power of C and most of the convenient features of OO, with decent type-safe generics more akin to C++ templates than those of Java or C#, plus a superb IDE and RAD designer. It takes very little time to create fully working native GUI applications with it, and they are very easily modified.

The language is verbose, and sometimes it requires a lot more code to do things than C++ would, but it is very readable. Besides, Wirth is my all-time favorite computer science chap, so I've a soft spot for Pascal-based languages.

So in the code, we iterate x across the range 0..r because we can compute one quarter of the circle, and draw it mirrored across the X and Y axes.

Actually we can even compute just 1/8th of the circle and generate the other 7 sections by using all combinations of mirroring and swapping X and Y values.

Note that since we iterate over X linearly, the exact shape drawn is not a perfect circle: at the points where Y changes by a lot for a change of 1 in X, the segments show up as visibly straight lines. Building the circle from four or eight mirrored pieces does much better; in the eight-piece case each octant spans only 45 degrees, so the maximum |dy/dx| within it is 1.

Method 2
I know, all you academically minded ones were dying to read this one.
x = r cos θ and y = r sin θ

Not much to say about this. When I first learned of sines and cosines, I thought, "Ah, OK, it's just a look-up table of magic values for the ratios of x and y to the hypotenuse at every angle." And for the most part, it is...
Inside your FPU is a lookup table and some interpolation magic to give you sin() and cos().

procedure TForm1.DrawCircle2(cx, cy, r: Integer);
var
  angle, angleInc, angleMax : Extended;
  x, y : Extended;
begin
  PaintBox1.Canvas.MoveTo(cx + r, cy);

  angleInc := 1 / r;
  angle := 0;
  angleMax := (2 * 3.14159265) + angleInc;

  while angle <= angleMax do
  begin
    x := r * Cos(angle);
    y := r * Sin(angle);
    PaintBox1.Canvas.LineTo(Round(x) + cx, Round(y) + cy);
    angle := angle + angleInc;
  end;
end;

We iterate over the 360 degrees, but we cannot choose an arbitrary increment: too small an increment and we redraw the same pixel; too big and our circle becomes a polygon.

Since the perimeter of a circle is 2 * PI * r, and we want at most (and at least) that many pixels drawn, the angle we increment by is (2 * PI) / (2 * PI * r), which is 1/r.

This means the target pixel drawn at each iteration is one pixel distance away from the last one. This will certainly result in some overdraw, but it is a reasonable value to use. Any more and we will get gaps where the small line segments that form our circle are parallel to the X and Y axes.

Method 3
This is a surprisingly ingenious way to avoid trigonometric functions in the loop. Back in the old days, such operations were really heavy, and I dare say this method would have been the fastest. In my test code it just about beats the second method, but the margin is too small to be of much significance. The reason is that in this day and age, sin() and cos() are not so expensive after all: they come down to single hardware instructions on current processors, costing only a handful of cycles.

procedure TForm1.DrawCircle3(cx, cy, r: Integer);
var
  angleInc : Extended;
  i : Integer;
  x, y, sina, sinb, cosa, cosb, sinab, cosab : Extended;
begin
  PaintBox1.Canvas.MoveTo(cx + r, cy);

  angleInc := 360 / r;
  sinb := Sin(angleInc / (180.0 / 3.14159));
  cosb := Cos(angleInc / (180.0 / 3.14159));

  sina := 0;
  cosa := 1;

  i := Round(360 / angleInc);
  while i > 0 do
  begin
    sinab := sina * cosb + cosa * sinb;
    cosab := cosa * cosb - sina * sinb;
    x := r * cosab;
    y := r * sinab;
    sina := sinab;
    cosa := cosab;
    PaintBox1.Canvas.LineTo(Round(x) + cx, Round(y) + cy);
    Dec(i);
  end;
end;

Once again we use the sine and cosine relations while iterating over all the angles; however, a simple trigonometric identity lets us avoid calling sin() and cos() more than once:
sin(a + b) = sin(a) * cos(b) + cos(a) * sin(b)
cos(a + b) = cos(a) * cos(b) - sin(a) * sin(b)
If A is our initial angle, namely 0, and B is the increment, we can get sin(A + B) and cos(A + B) quite simply. After that, A becomes the newly incremented angle and we can repeat the process.

We ended up replacing the sin() and cos() calls with four multiplications, an addition and a subtraction.

In a statically compiled or JIT-compiled language, sin() and cos() go down as single FPU instructions, so this doesn't make much difference in speed. However, in an interpreted language, a function call will likely occur, and it might end up being hundreds of times more expensive than the arithmetic.

It might even be possible to write this code using integer arithmetic alone, which you can't do with sin() and cos() - This may be useful in some embedded systems.

There is something quite similar (or is it exactly the same?) in HAKMEM items 149 to 152, but that doesn't give a universal circle drawing algorithm.

One thing to bear in mind is that errors can accumulate in this method, since it is incremental in nature. But since we're working with integer coordinate output, I doubt you would accumulate enough error to plot an incorrect pixel.

So that was the mystery of the third circle!

Saturday, December 10, 2011

Perspective correct texture mapping

Recently I worked on an interesting project, which involved creating rudimentary 3d rendering on a 2d HTML5 canvas using javascript.

Why not WebGL or CSS 3d transforms?

Because this was meant to run on some smartphones that did not support those in their browsers. The client required a way to display an image rotated in 3d space, with perspective projection. Now, while the canvas object supports 2d transformations like rotation, scaling and shearing, it is not possible to do perspective projection with these alone, so a from-scratch implementation was needed.

Considering that it had to be coded in javascript, the implementation had to be as fast as possible so that it could be used for real time animation effects.

How it's done

The first thing to do was to create a representation of a 3d co-ordinate, that could be rotated, scaled and translated in any dimension.

Now, I'm not a big fan of plugging formulas into code directly, so I usually rework the math from scratch every time. I opted not to use the so-called simple approach of homogeneous transformation matrices, since (at least to me) the real meaning of what happens to a 3d point when transformed gets obscured in the matrix. Instead I chose a more direct approach: when we consider a 3d rotation, we see that it can be represented by rotations about two axes.

I opted to use the intuitive representation of yaw and pitch - Imagine a sphere like the earth with a point on it - Yaw makes the point orbit along its latitude, while pitch moves it along its longitude.

After some scribbled diagrams, a little sine here and a little cosine there, I got the rotation math right. Translation and scaling are very simply addition and multiplication of the co ordinate values by some factor.

While coding this, I recalled one long-forgotten trick. About two decades ago, I read a book titled Electron Graphics and Sound, which first taught me the fundamentals of computer graphics.

Now, the Electron microcomputer and its more advanced cousin, the BBC Micro, were 8-bit machines from the early 80s that ran at 1 MHz with 32KB of RAM. In the book, the author detailed three methods of drawing circles, each faster than the previous.

The first two were quite simple :
  • Solve Y for every X, given R and the Pythagoras theorem.
  • Use X = R cos(theta) and Y = R sin(theta)
Both these approaches were very slow, because square-roots and trigonometric operations were very expensive on these machines.

The third method calculated a sine and cosine only once outside the loop, and had nothing but arithmetic inside the loop. It was blazing fast. I never fully understood how that worked back then, but knew that there was something to it. For years afterwards I tried to remember what the code was meant to do, but I could not, and I didn't have the book anymore.

Now suddenly, while I was implementing the rotation code for this 3d stuff, it dawned on me, and I rediscovered the third algorithm finally. I'll detail that in the next post.

So now that I had the 3d co-ordinates representation worked out, I created an abstraction for an image floating around in 3d space, with each of its corners represented as a 3d point. The next step is...

2d Projection

We need to map those image edge points onto the 2d screen, correcting for perspective. This is quite a simple operation...

In most 3d scenes represented on a 2d view, there is a vanishing point and as things get further away, they get closer to the vanishing point. At infinite distance, all 3d points map onto the vanishing point. So it's a simple matter of moving the X and Y co-ordinates towards the vanishing point based on the Z.

You need to choose some projection parameters: the view angle/scale, how many pixels represent one unit of space along the Z axis, and where along the Z axis the observer's eye is located. If we call the unit zunit, the scale factor zscale and the eye position zeye, then: the object appears zscale times smaller for every zunit away from zeye...

That sounds very German when spoken out loud but it's English! Now we end up with a trapezoid that represents how the image edges would look in 3d space, when drawn on the 2d screen. Next we move on to...

Texture mapping

We need to map the image texture from its original rectangular shape onto the projected trapezoid. The naive way to do this would be :

  • Loop over every image co-ordinate
  • Transform the point according to the current pitch/yaw/scale
  • Project the point onto 2d
  • Draw that pixel with the image pixel

This would be embarrassingly slow, and moreover it would also be wrong! First of all, doing several trigonometric ops per pixel would defenestrate performance completely; secondly, this assumes that every single source pixel on the image rectangle maps to a single pixel on the destination, which is wrong.

An image pixel can actually map to any number of pixels on the trapezoid, since a region of the trapezoid can be bigger or smaller than the corresponding region on the image. Thus, we need to use the well established technique of doing mappings in reverse - the logic is as follows :
  • We need to draw every pixel on the destination trapezoid
  • We need to draw each pixel once and only once!
  • Thus we should iterate over the destination pixels, find the corresponding source point on the image, and then paint that on
OK, that deals with one of the problems above, but what about performance? First of all, how do we iterate over all the pixels in the quad? The fast way to do this is always scan-line by scan-line, which ensures that memory access is as linear as possible when we touch the pixel data. So the first step is to find the vertical span of the trapezoid, which is nothing but the minimum and maximum Y of the four corners. Then we iterate over the Y span and figure out the X span at each Y.

Now that we have a Y and a range of X to draw on the destination, we need to find which point on the image maps here.

Say we are at point P on the red scan-line: we are a fraction PR/RS along the line RS; R is a fraction AR/AB along line AB, and S is a fraction AS/AC along line AC. We can find where P would end up on the source image using the same relations, interpolating linearly and using the image vertices for A, B and C. This ought to be the pixel that's drawn at P on the trapezoid...

Elementary... Or so I thought!

Except that things are not so linear. The results looked good on photographic images, but when I used a grid image as a test, the output looked extremely warped when the perspective was strong. There was some curvature induced on horizontal lines, like this:

Looks somewhat like relativistic contraction effects! But we're supposed to be stationary! This approach was super-fast: it was down to about 13ms for rendering an 800x600 image on Chrome.

There were only about 8 arithmetic operations in the inner loop. The warping is quite undetectable on photographic images, but still, it's wrong!

The right way

So I decided to rework the math to solve the problem properly: for each destination pixel, find the true fractions f1 and f2 (the perspective-correct analogues of the linear fractions above) that locate the corresponding point.

Once we had f1 and f2, we could look up the same point on the image rectangle easily. I started out on paper, wrote the equations for f1 and f2, and filled several sheets, getting nowhere.

I started again, this time using a text editor, so I could avoid mistakes when substituting expressions. I also wrote a C program to assert the validity of each step of the derivation as I progressed, using actual values, and this helped me catch some errors I had made. This seems like a good way to do algebra, since later on you can simply copy and paste the code into your program!

I won't bore you with the algebra, since it's quite involved... but finally I ended up with a quadratic equation, that could be solved for f1, and f2 could be derived easily from f1.

Now, I knew this was going to be quite slow! Solving a quadratic for every pixel is heavy, since there is a square root involved; besides, the coefficients were complex expressions involving many of the input coordinates and distances, and more variables accessed in the inner loop always means more code-sloth. I tested it out, and the rendering was perfect, but the performance abysmal. That needed fixing.

Then I recalled that my client had mentioned something about implementing Quake-style optimizations, since they had been mentioned on a web page that demonstrated perspective texture mapping in C. After looking it up, I realized quite soon what the idea was (I vaguely remember reading about it in Michael Abrash's Graphics Programming Black Book).

In a nutshell: instead of doing the mapping math on every pixel, do it every N pixels and use linear interpolation for the texture coordinates in between. Over a small region such as 16x16 pixels the error is not visible, so it's a great compromise between speed and accuracy. The funny thing is that if we increase the step from N to, say, the entire X span width, we end up with exactly the original incorrect warped output!

Then I did a few vanilla optimizations: loop hoisting, padding the texture image's width to a power of two (a simple trick that turns i = y * w + x into i = (y << n) + x when accessing the canvas data buffer), and reducing variable accesses in the inner loop. After implementing all this, the performance ended up at about 60ms per frame to render an 800x600 image texture. In a pinch, the step N can be increased to about 64 to double the performance, and it still looks good.

So there you have it! Quite amazing that we can do this stuff in real time in Javascript.

Monday, December 5, 2011

Reinventing the wheel for fun and profit

A few months back, I was involved in an image processing project which dealt with image binarization for OCR. One of the steps in that process was to segment an image into regions labeled red, white and black. They represented black-on-white text, white-on-black text and blank regions respectively. The input consisted of three monochrome images with irregular regions filled with white. The segmentation was done as follows:
  • A rectangle is chosen (initially the entire image), and for both the vertical and horizontal directions, every possible split location is tested.
  • At each split point, the pixel counts for the two sub-rectangles are taken for each of the three input images.
  • The split point is assigned a score based on how uniform the two sub-rectangles are in terms of the labeled pixels.
  • The best split point is chosen, and we now have two rectangles.
  • The process repeats recursively for both of these rectangles, until they reach a certain score or a certain minimum size.
Suffice it to say that most of the work involved counting the number of "on" pixels in an arbitrary rectangular region of an image. Here is a visualization of how the splitting ends up for a web page image: the regions that are predominantly white-on-black are green, those that are black-on-white are marked red, and the split lines are blue.

After I got the initial implementation working, it was horrendously slow, because for every rectangle, at every horizontal or vertical split position, the number of "on" pixels was counted by iterating entirely over the two split rectangles. The first optimization that came to mind was to do the count for the whole rectangle initially, maintaining the pixel count at every split line too; the pixel counts for both sides could then be incrementally calculated as the split point moved across the rectangle. However, this process would have to be repeated for the sub-rectangles, so it would still be slow. I scratched my head for about a minute and then suddenly I had it: if we could calculate and cache the counts for every possible rectangle starting at (0, 0), then we could get the count for any arbitrary rectangle with a handful of arithmetic ops, thusly:

To get the count for the green rectangle K, we take the sum of the entire image up to the bottom-right corner of K, and from that we subtract the top region and the left region. We have now subtracted the purple top-left region twice, so we add it back once. All these rectangles start at (0, 0) and can be read directly from our precomputed cache. So, in essence, for any rectangle in the image we can get the pixel count with four memory accesses, two subtractions and an addition.

Blazing fast! So fast in fact, that this becomes the fastest part of the whole algorithm.

A few days later, I found out that this is a known technique that goes by the fancy name of "integral images", and it can be used in a number of applications where arrays of any number of dimensions have to be summed up.

I'm thinking of implementing a 2D LZ77 like compression technique based on this that would be suitable for screen-shot like images. The idea would be to quickly identify two equal sized rectangular regions that sum up to the same value. Two patches that had the same sum would be a potential match. Differing sums would mean the patches could never match exactly, so we avoid a lot of pixel-wise comparison. That means we can test every possible rectangle against every other one to find matches, and do it quite fast.

What was a bit funny to me was that this technique seems so obvious (as I said before, it took me all of a minute to come up with it and realize it would work), yet it has a pretty fancy name and was one of the key "innovations" of the famous Viola-Jones face detection paper, which further convinces me that academic literature tends to make quite simple stuff seem complex and path-breaking.

Moral of the story:

Invent as many wheels as you can - many of them will be re-inventions, but the process of invention is more fun and beneficial than that of using a pre-invented wheel.