Monday, December 12, 2011

The third circle


In my last post, I had mentioned circle drawing algorithms and how I'd rediscovered one after a couple of decades. "Why rasterize circles in this era of GPUs and 3d accelerators?", the gentle reader may ask...
Simply because :
  • It's always fun to re-visit fundamentals.
  • One may wish to merely iterate a point over a circular path and a GPU can't help you there.
  • One may have never heard or thought of this method, so it's fun!
Let's look at the ways you can draw circles on a raster array.

The first method
The math here is the simplest...Which is why it made most sense to me when I was 9 years old.
A circle has the equation : x^2 + y^2 = r^2 - So, if you iterate x over the range 0 to r, you can solve for y at each x.
Drawing lines between these points sequentially gives you an arc that's quarter of a circle. One can then mirror these points about the origin, both vertically and horizontally to get the other three quarters of the pi(e).

Code follows...
procedure TForm1.DrawCircle1(cx, cy, r: Integer);
var
  x, y, rSquared : Integer;
  oldy, oldx : integer;
begin
  PaintBox1.Canvas.MoveTo(cx, cy + r);
  rSquared := r * r;

  oldx := 0;
  oldy := r;

  for x := 0 to r do with PaintBox1.Canvas do
  begin
    y := Round(Sqrt(rSquared - (x * x)));

    MoveTo(cx - oldx, cy - oldy);
    LineTo(cx - x, cy - y);

    MoveTo(cx - oldx, cy + oldy);
    LineTo(cx - x, cy + y);

    MoveTo(cx + oldx, cy - oldy);
    LineTo(cx + x, cy - y);

    MoveTo(cx + oldx, cy + oldy);
    LineTo(cx + x, cy + y);

    oldx := x;
    oldy := y;
  end;
end;

Why Delphi?
Because it is based on Object Pascal, the simplest, type safe, statically compiled language that I know of - It has all the low level power of C, and most of the convenient features of OO, decent type safe generics, more akin to C++ templates than those of Java or C#, plus a superb IDE and RAD designer. It takes very little time to create fully working native GUI applications with it, which are very easily modified.

The language is verbose, and sometimes it requires a lot more code to do some stuff than in C++ - but it is very readable - besides, Wirth is my all time favorite computer science chap, and so I've a soft spot for Pascal based languages.

So in the code, we iterate x across the range 0..r because we can compute one quarter of the circle, and draw it mirrored across the X and Y axes.

Actually we can even compute just 1/8th of the circle and generate the other 7 sections by using all combinations of mirroring and swapping X and Y values.

Note that since we iterate over X linearly, The exact shape drawn is not a perfect circle - At the points where Y gets incremented a lot of a change of 1 in X, there are quite straight lines. Doing it by mirroring four or eight pieces does it much better, since a quarter or eighth of 90 degrees is not a big angle and the maximum dx/dy for that range is less than 1

Method 2
I know, all you academic minded ones were dying to read this one.
x = r cos Q and y = r sin Q

Not much to say about this - When I first learned of sines and cosines, I was like - "Ah OK, it's just a look-up table of magic values for the ratios of x and y with h for every angle" - And for the most part it is....
Inside your FPU is a lookup table and some interpolation magic to give you sin() and cos()

procedure TForm1.DrawCircle2(cx, cy, r: Integer);
var
  angle, angleInc, angleMax : Extended;
  x, y : Extended;
begin
  PaintBox1.Canvas.MoveTo(cx + r, cy);

  angleinc := 1 / r;
  angle := 0;
  angleMax := (2 * 3.14159265) + angleinc;

  while angle <= angleMax do
  begin
    x := r * Cos(angle);
    y := r * Sin(angle);
    PaintBox1.Canvas.LineTo(Round(x) + cx, Round(y) + cy);
    angle := angle + angleinc;
  end;
end;

We iterate over the 360 degrees, but we do not choose an arbitrary increment. Too small an increment and we redraw the same pixel, too big and our circle becomes a polygon.

Since the perimeter of a circle is 2 * PI * r and we want that many pixels drawn at most and at least, the angle we increment by is (2 * PI) / (2 * PI * r) which is 1/r

This means the target pixel drawn at each iteration is one pixel distance away from the last one. This will certainly result in some overdraw, but it is a reasonable value to use. Any more and we will get gaps where the small line segments that form our circle are parallel to the X and Y axes.

Method 3
This is a surprisingly ingenious way to avoid trigonometric functions in the loop. Back in the old days, such operations were really heavy, and I dare say this method would have been the fastest. In my test code, it just about beats the second method, but the margin is too small to be of much significance. The reason is that in this day and age, sin() and cos() are not so expensive after all, they are probably single cycle instructions on all current processors.

procedure TForm1.DrawCircle3(cx, cy, r: Integer);
var
  angle, angleInc : Extended;
  i : integer;
  x, y, sina, sinb, cosa, cosb, sinab, cosab : Extended;
begin
  PaintBox1.Canvas.MoveTo(cx + r, cy);

  angleinc := 360 / r;
  sinb := Sin(angleinc / (180.0 / 3.14159));
  cosb := Cos(angleinc / (180.0 / 3.14159));

  sina := 0;
  cosa := 1;

  i := Round(360 / angleinc);
  while i > 0 do
  begin
    sinab := sina * cosb + cosa * sinb;
    cosab := cosa * cosb - sina * sinb;
    x := r * cosab;
    y := r * sinab;
    sina := sinab;
    cosa := cosab;
    Dec(i);
    PaintBox1.Canvas.LineTo(Round(x) + cx, Round(y) + cy);
  end;
end;

Once again, we use the sine and cosine relations while iterating over all the angles, however, a simple trigonometric identity lets us avoid calling sin() and cos() more than once.
sin(a + b) = sin(a) * cos(b) + cos(a) * sin(b)
cos(a + b) = cos(a) * cos(b) - sin(a) * sin(b)
If A be our initial angle namely 0, and B be the increment, we can get sin(A + B) and cos(A + B) quite simply. After that A gets the new incremented angle and we can repeat the process.

We ended up replacing the sin() and cos() function calls with four multiplications and an addition and a subtraction.

In a static compiled or JIT compiled language sin() and cos() go down as single FPU instructions, and this doesn't make much of a difference in speed - However, in an interpreted language, I believe a function call will occur, and might end up being hundreds of times more expensive than the arithmetic.

It might even be possible to write this code using integer arithmetic alone, which you can't do with sin() and cos() - This may be useful in some embedded systems.

There is something quite similar (or is it exactly same?) on HAKMEM item 149 to 152, but that doesn't give a universal circle drawing algorithm.

One thing to bear in mind is that errors might get accumulated in this method, since it is incremental in nature, but we're working with integer co-ordinate output, so I doubt if you would get enough error to plot an incorrect pixel.

So that was the mystery of the third circle!

Saturday, December 10, 2011

Perspective correct texture mapping

Recently I worked on an interesting project, which involved creating rudimentary 3d rendering on a 2d HTML5 canvas using javascript.

Why not WebGL or CSS 3d transforms?

Because this was meant to run on some smart phones that did not support those in their browsers. The client required a way to display an image rotated in 3d space, with perspective projection. Now, while the canvas object supports 2d transformations like rotation, scaling and shearing, it is not possible to do perspective projection using these alone, so a from scratch implementation was needed.

Considering that it had to be coded in javascript, the implementation had to be as fast as possible so that it could be used for real time animation effects.

How it's done

The first thing to do was to create a representation of a 3d co-ordinate, that could be rotated, scaled and translated in any dimension.

Now I'm not a big fan of plugging in formulas into code directly, so I usually rework the math from scratch every time. I opted not to use the so called simple approach of homogenous transformation matrices, since (at least to me), the real meaning of what happens to a 3d point when transformed gets obscured in the matrix. Instead I chose a more direct approach. When we consider a 3d rotation, we see that it can be represented by rotation along two axes.

I opted to use the intuitive representation of yaw and pitch - Imagine a sphere like the earth with a point on it - Yaw makes the point orbit along its latitude, while pitch moves it along its longitude.

After some scribbled diagrams, a little sine here and a little cosine there, I got the rotation math right. Translation and scaling are very simply addition and multiplication of the co ordinate values by some factor.

While coding this, I realized one long forgotten trick - About two decades ago, I remember a book I read titled Electron Graphics and Sound which first taught me the fundamentals of computer graphics.

Now the Electron microcomputer, and its more advanced cousin the BBC micro were 8 bit machines from the early 80s and ran at 1 Mhz with 32KB of RAM. In the book, the author detailed three methods of drawing circles, each faster than the previous.

The first two were quite simple :
  • Solve Y for every X, given R and the Pythagoras theorem.
  • Use X = R cos(theta) and Y = R sin(theta)
Both these approaches were very slow, because square-roots and trigonometric operations were very expensive on these machines.

The third method calculated a sine and cosine only once outside the loop, and had nothing but arithmetic inside the loop. It was blazing fast. I never fully understood how that worked back then, but knew that there was something to it. For years afterwards I tried to remember what the code was meant to do, but I could not, and I didn't have the book anymore.

Now suddenly, while I was implementing the rotation code for this 3d stuff, it dawned on me, and I rediscovered the third algorithm finally. I'll detail that in the next post.

So now that I had the 3d co-ordinates representation worked out, I created an abstraction for an image floating around in 3d space, with each of its corners represented as a 3d point. The next step is...

2d Projection

We need to map those image edge points onto the 2d screen, correcting for perspective. This is quite a simple operation...

In most 3d scenes represented on a 2d view, there is a vanishing point and as things get further away, they get closer to the vanishing point. At infinite distance, all 3d points map onto the vanishing point. So it's a simple matter of moving the X and Y co-ordinates towards the vanishing point based on the Z.

You need to choose some projection parameters such as the view angle/scale, how many pixels represents one unit of space along the Z axis, and where along the Z axis the observers eye is located. If we call the unit zunit, the scale factor zscale and the eye position zeye then : The object appears zscale times smaller for every zunit away from zeye ...

That sounds very German when spoken out loud but it's English! Now we end up with a trapezoid that represents how the image edges would look in 3d space, when drawn on the 2d screen. Next we move on to...

Texture mapping

We need to map the image texture from its original rectangular shape onto the projected trapezoid. The naive way to do this would be :

  • Loop over every image co-ordinate
  • Transform the point according to the current pitch/yaw/scale
  • Project the point onto 2d
  • Draw that pixel with the image pixel

This would be embarrassingly slow and moreover it would also be wrong! First of all, doing several trigonometric ops per pixel will defenestrate performance completely and secondly this assumes that every single source pixel on the image rectangle maps to a single pixel on the destination - which is wrong.

An image pixel can actually map to any number of pixels on the trapezoid, since a region of the trapezoid can be bigger or smaller than the corresponding region on the image. Thus, we need to use the well established technique of doing mappings in reverse - the logic is as follows :
  • We need to draw every pixel on the destination trapezoid
  • We need to draw each pixel once and only once!
  • Thus we should iterate over the destination pixels, find the corresponding source point on the image, and then paint that on
OK that deals with one of the problems, above but what about performance? First of all, how do we iterate over all the pixels on the quad? The fast way to do this is to always is scan-line by scan-line, which ensures that memory access is as linear as possible when we access the pixel data. So the first step is to find out the vertical span of the trapezoid - which is nothing but the minimum and maximum Y of all of the four corners. Now we iterate over the Y span and figure out the X span at each Y thusly:


Now that we have a Y and a range of X to draw on the destination, we need to find which point on the image maps here.

Say we are on point P on the red scan-line... we are a fraction PR/RS along the line RS. R is a fraction AR/AB along line AB and S is a fraction AS/AC along line AC. We could find out where P would end up on the source image using the same relation, interpolating linearly, and using the image vertices for A, B and C. This ought to be the pixel that's drawn at P on the trapezoid...

Elementary... Or so I thought!

Except that things are not so linear - The results looked good on photographic images, but when I used a grid image as a test, the output looked extremely warped when the perspective was high. There was some curvature induced on horizontal lines like this :


Looks somewhat like relativistic contraction effects! But we're supposed to be stationary! This approach was super-fast - It was down to about 13ms for rendering a 800x600 image on Chrome.

There were only about 8 arithmetic operations in the inner loop. The warping is quite undetectable on photographic images, but still, it's wrong!

The right way

So I decided to rework the math to solve the following problem :


Once we had f1 and f2, we could look up the same point on the image rectangle easily. I started out on paper, wrote the equations for f1 and f2, and filled several sheets, getting nowhere.

I started again, this time using a text editor, so I could avoid mistakes when substituting expressions. I also wrote a C program to assert the validity of each step of the equation as I progressed, using actual values, and this helped me catch some errors I had made. This seems like a good way to do algebra, since later on, you can simply copy and paste the code into your program!

I won't bore you with the algebra, since it's quite involved... but finally I ended up with a quadratic equation, that could be solved for f1, and f2 could be derived easily from f1.

Now, I knew this was going to be quite slow! Solving a quadratic for every pixel is heavy, since there is a square root involved, and besides the co-efficients were complex expressions involving many of the input co-ordinates and distances. More variables accessed in the inner loop always means more code-sloth. I tested it out and the rendering was perfect, but the performance abysmal. That needed fixing.

Then I recalled my client had mentioned something about implementing Quake style optimizations, since it had been mentioned in a web page that demonstrated perspective texture mapping in C. After looking it up, I realized quite soon what the idea was (vaguely remember reading about this in Michael Abrash's Black Book Of Graphics Programming).

In a nutshell, instead of doing the mapping math on every pixel, do it every N pixels and in between use linear interpolation for the texture co-ordinates. Over a small region such as 16x16, the error is not visible, so it's a great compromise between speed and accuracy. The funny thing is that if we increase the step from N to say, the entire X span width, we end up with exactly the original incorrect warped output!

Then I did a few vanilla optimizations - loop hoisting, making the texture image's width padded to a power of two (a simple trick to make i = y * w + x into i = y << n + x while accessing the canvas data buffer), and reducing variable access in the inner loop. After implementing all this, the performance ended up at about 60ms per frame to render a 800x600 image texture. In a pinch, the step N can be increased to about 64 to double performance and it still looks good.

So there you have it! Quite amazing that we can do this stuff in real time in Javascript.

Monday, December 5, 2011

Reinventing the wheel for fun and profit

A few months back, I had been involved in an image processing project, which dealt with image binarization for OCR. One of the steps during that process was to segment an image into regions labeled red, white and black. They represented black on white text, white on black text and blank regions respectively. The input consisted of three monochrome images with non regular regions filled with white. The segmentation was done as follows:
  • A rectangle is chosen (initially the entire image), and for both vertical and horizontal directions, at every possible location to split was tested.
  • At each split point, the pixel counts for the two sub rectangles were taken for each of the three input images.
  • The split point was assigned a score based on how uniform the two sub rectangles were in terms of the labeled pixels.
  • The best split point was chosen and we now had two rectangles.
  • The process repeated recursively for both these rectangles, until they reached a certain score, or a certain minimum size.
Suffice to say that most of the work involved counting the number of “on” pixels in an arbitrary rectangular region of an image. Here is a visualization of how the splitting ends up for a web page image – The regions that are predominantly white on black are green, those that are black on white are marked red and the split lines are blue.


After I got the initial implementation working, it was horrendously slow, because for every rectangle, at every horizontal or vertical split position, the number of “on” pixels was counted by iterating entirely over the two split rectangles The first optimization that came to mind was to initially do the count for the whole rectangle and maintain the pixel count at every split line too. The pixel counts for both sides could then be incrementally calculated, as the split point moved across the rectangle. However, this process would have to be repeated for the sub rectangles, so it would still be slow. I scratched my head for about a minute and then suddenly I had it – If we could calculate and cache the counts for every possible rectangle starting at (0, 0) then we could get the count for any arbitrary rectangle with a handful of arithmetic ops – thusly :


To get the count for the green rectangle K, we take the sum of the entire image up to the bottom right corner of K and from that we subtract the top region and the left region. We ended up subtracting the purple top-left region twice, so we add it back once. All these rectangles start at (0, 0) can be read directly from our precomputed cache.So in essence for any rectangle in the image, we can get the pixel count by doing 4 memory accesses and three subtractions.

Blazing fast! So fast in fact, that this becomes the fastest part of the whole algorithm.

After a few days, I find out that this technique is a known one called by the fancy name “integral images”, and can be used in a number of applications where arrays of any number of dimensions have to be summed up.

I'm thinking of implementing a 2D LZ77 like compression technique based on this that would be suitable for screen-shot like images. The idea would be to quickly identify two equal sized rectangular regions that sum up to the same value. Two patches that had the same sum would be a potential match. Differing sums would mean the patches could never match exactly, so we avoid a lot of pixel-wise comparison. That means we can test every possible rectangle against every other one to find matches, and do it quite fast.

What was a bit funny to me was that this technique seems so obvious (as I said before, it took me all of a minute to come up with this and realize it would work), but has a pretty fancy name and was one of the key “innovations” of the famous Viola-Jones face detection paper, which convinces me further of how academic literature tends to make quite simple stuff seem complex and path-breaking.


Moral of the story:

Invent as many wheels as you can - many of them will be re-inventions, but the process of invention is more fun and beneficial than that of using a pre-invented wheel.

Saturday, December 3, 2011

Overdue optimization is the root of all bloat

"Premature optimization is the root of all evil"

It's an oft-repeated mantra among people in the industry – everybody rightly presumes that slow working code is better than fast broken code.

However this is based on the dubious assumption that a given programmers code is bound to be less ridden with bugs if he codes a slow approach. And the further weak premise that code is ever fully correct.

Let's look at another industry, the automotive one – Any serious vehicle is designed with a certain performance metric from the start, and unless it meets that, it is a failure. In the software world, fast code is often seen as complex code, but that is a very naive view of optimization. A true optimization, in theory, reduces the amount of work being done, and often the amount of code. Any complexity induced is more of a syntactic nature, depending on the expressive skill of the programmer.

Compilers are damn smart these days, but then again things haven't really changed that much – Take a look at the following code that copies C style NUL terminated strings.
 
    int j = 0;
    while(src[j])
    {
        dst[j] = src[j];
        ++j;
    }

Then this :
 
    char *pSrc = src, *pDst = dst;
    while(*pDst++ = *pSrc++);

The second version is full 6% faster measured over a million iterations for 64KB strings, compiled with MSVC 2010, all optimizations on.

One would assume that the compiler is capable of generating the same code for both, but in the first case it reloads the value for src[j] in order to check it for NUL in the while loop. In the second version,  --which generates the exact same code as an inline call to strcpy() --, one memory access is avoided.

6% is not an insignificant fraction, for a trivial optimization like this.

Of course, this is a contrived example, but the same principles hold in a number of scenarios.

Some would argue that the second version is less readable. I would counter that it indicates a lack of fluency in the language involved.
Any non-trivial piece of code requires a certain amount of programming skill, and in most cases, someone who is qualified enough to write a correct implementation is capable of writing a correct and fast version too.

If someone makes errors during optimization, that is essentially carelessness that would have happened even while writing a vanilla implementation. Sloppy coding won't be cured by using naive approaches.

Starting out with the intent that code must be fast as well as correct, one must gather good habits and make it almost reflex like to perform small optimizations and diligently work to understand fast ways of doing common tasks. You don't become a boxer by first concentrating on accuracy, then strength, then speed – You do all three at once. An awareness of what is happening behind the scenes in the generated machine code is very important; not only for speed optimization, but also for figuring out what is wrong when things go south. This applies to even non-compiled languages – Understanding how memory is organized and how data structures end up on the metal leads to good coding approaches.

Dumbing down the code initially will most probably lock it down in such a fashion that reworking it to be faster at a higher level will become impractically difficult, and doing low level optimizations later, may not be of much value.

It's the difference between a super car built ground up for performance versus a clunker with bolt on supercharging.

There is a lot of running code out there, almost all of it is buggy, but very little is efficient – Bloat abounds, since hardware cost goes down and performance goes up regularly. However, the brick wall is near, since improving serial code execution speeds has become more and more difficult. The only way out is parallelism, and that is being marketed aggressively; the only problem is that no one got the memo on the inapplicability of parallelism for most code.

You can have a zillion cores and parallelize all the possible parts of your program, but the serial path will remain, and the only choice is to go back to school and read some Michael Abrash

TANSTAFC - There ain't no such thing as the fastest code