Saturday, July 5, 2014

A DOM of many hovered class

So last week has been entriely about optimization. The website I'm working on had serious performance issues and it needed to be fast, to qualify as working.

The problem

The problem is simple - we have a back end database that supplies data about cloud computing jobs, and we need to display these in a table, allowing the user to view and control the jobs.
The data is refreshed every 30 seconds and each job element has about nine pieces of information. Four of them are shown in a table row, and the rest appear in an expandable row when the main row is clicked.

The initial implementation had the server generate the whole HTML, and we simply fetched that via AJAX and assiged it using innerHTML
This was unacceptable - despite gzip encoding, the amount of data per request is huge - For my test case, with ~900 jobs, the compacted HTML is about 900 KB, which gzips to ablut 40 KB. Each table row has a total of 32 HTML tags, so we have about 30000 styled HTML elements.

This may not seem like much - you can easily generate and manipulate millions of rows in a HTML table, however, when you have nested tags, CSS, attributes, and event handlers, it starts to bog down.

With Chrome, the innerHTML approach was passable, but with IE 11 it was simply too much for the browser to handle. A factor of 6x to 8x slower.
Any action by the user that affects the jobs, causes a refresh, which made for a very laggy UI.

Moreover the Django backend also spent too much time generating the HTML - Upto 0.6 seconds just for this. Inconceivable!

The table rows needed to have filtering - only certain rows would be visible, based on the selected tab in a tab control. The user needed to be able to select any number of jobs to perform job actions on them. Both selection and tab switching were too slow.

Something had to be done! After much profiling and reading numerous blogs about javascript performance, I managed to speed up various things, by a factor of anywhere from 2x to 100x

Lesson 1 - Templating on client side

The first approach was to scrap the generation of HTML serverside, and send the job data as JSON to the client, to be rendered there instead.

Now the JS would render the HTML for each job, using a very basic templating function. Using innerHTML to construct the table was still inevitable.

The server side time was reduced, and the simple javascript templater I wrote seemed so much faster, with negligible times to subtitute values into the HTML template. It was a simple search and replace function :

function template(dict, str)
    for(key in dict)
       str.replaceAll('<<' + key + '>>', dict[key])

The reason for Django's slowness seems to be that its template engine is not so much a glorified printf but quite a complicated parser/interpreter ( given that you can embed code inside a Django template ).

Overkill for a printf() like requirement

Still, this was not ideal - every refresh still led to rebuilding the HTML table even if nothing changed from the previous refresh.

Lesson 2 - Use incremental updates

Obviously, some kind of incremental update was needed. Ideally we would like only a delta to be fetched from the server, but practically there were difficulties implementing this serverside.

The time taken by the server to send off the entire job data was negligible, as was the time taken by the browser to deserialize it into JSON, so there was not much gain to be had from trying to d/dt the data on the server side.

Instead we could do the diff on the client side. It was fairly simple and fast to do this - given the latest job data, and the previous one, we can generate the diff as additions, deletions and changes ( pseudo code below )

for(i in old) 
    if(not i in new) 

for(i in new)
    if(not i in old) 

for(i in new)
    if((i in old) and old[i] != new[i])

The data are in the form of a dictionary keyed by a job number, and each value is in turn a dictionary.(Yes, we know JS doesn't really have dictionaries, only objects or whatever!)

The data looks like this :

{ 1234 : { job_name: "abcd_xyz8765",  job_time : "10:24:05" ... } },
{ 1230 : { job_name: "abcd_xyz9645",  job_time : "01:04:39" ... } } ...

The best way to compare two objects in JS seems to be to JSON.stringify() them and do a string compare. Checking each field individually doesn't seem to be any faster, and may be even slower, considering that each member access will cause a dictionary lookup.

Lesson 3 - Know your data

There are 5 scenarios that happen :

  • Initial page load, any number of rows get added - this happens only at login, but makes the most psychological impact in terms of how snappy the site seems.
  • Periodic refresh, a few rows change state - but if the UI is busy during the update, it feels sluggish.
  • User adds a job, one row gets changed.
  • User terminates one or more jobs, any number of rows chage state.
  • User removes one or more jobs.

The add case is the most important even though it happens only once per session.

The table is always sorted based on a job index, in descending order, so it becomes necessary to insert any new rows into the right locations.
The following simple approach acts like an insertion sort, which is extremely slow.

for(job in newjob)
    for(row in rows)
        if(row.jobindex < job.index)
            insertBefore(row, makerow(job));

For every new row, its insertion location needs to be found in the table. Sure, we could conceivably make the above into a binary search. Not non-trivial, but still a mass of code.
Even then, we have to insert elements in the middle of a table, which is slow.

After much head scratching, I realized, I don't even need to do this - The job indices in the server are unique and always increasing. The data is also always sorted ( because it ends up in a JSON dictionary sorted by job index ).

If we just ensure that the job data is in the same sort order as the table, we can avoid all the above slowness/complexity.

So for case 1, just appending each row is fine.

For case 3, a new job will always end up right at the top of the table, since a new job will always have an index greater than any job previously seen.

So we can write something like :

if(table.isempty())  // case 1
    for(job in newjob)
        row = makerow(job);
else                // case 3
    for(job in newjob)
        row = makerow(job);

A huge amount of unnecessary work in finding where a job fits in the table is avoided and appending, rather than inserting elements into the DOM is way faster too.
This was good for a speedup of more than 2x for case 1.
Case 3 conceivably never requires more than one row insertion - jobs are started one at a time, so the time is negligible.

Deletion also initially was very slow, because the initial code was something like this

for(job in deljobs)
    row = findRowWithJobIndex(job.index);

findRowWithJobIndex() used jQuery with the find('[key=value]') selector, which is terribly slow, since it has to be a linear search.

One again, knowing that the deljobs array has descending job indices and so do the table rows, we end up with a much faster delete :

i = 0;
for(row in rows)
    if(i < deljobs.length)
        if(row.jobindex == deljobs[i])

We are essentially doing only one linear pass on the rows and picking off each one that needs to be removed.

More than 5x speedup when deleting all the rows, and close to 10x on IE 11

Simple fast code, resulting from using knowledge about the data, rather than using a catch-all generic algorithm.

Lesson 4 - Profile

You can only get so far with logic optimization, eventually you have to profile, IE 11 has a very nice profiler, as does Chrome.

I believed the JSON parsing and delta generation code, the HTML template rendering etc. would be very slow, but all that took very little time!

After profiling, the major performance bottlenecks were (surprisingly) :
  • DOM node creation
  • jQuery in loops
  • Adding, removing and testing for class names in DOM elements
Assigning HTML strings to innerHTML is very slow on IE - It stands to reason, the browser has to parse HTML and make it into DOM objects, and pasring fully spec'd HTML is going to be super slow (Chrome does it real fast though!)

The alternative is to create the DOM elements using code like this :

var tr = document.createElement('tr');
var td1 = document.createElement('td');
td1.setAttribute('class', 'job-detail');

var td1a = document.createElement('a');
td1a.setAttribute('class', 'job-link');
td1a.setAttribute('href', '#');

var td2 = document.createElement('td');
td2.setAttribute('class', 'text-center');

var td3 = document.createElement('td');
td3.setAttribute('class', 'text-center');

var td4 = document.createElement('td');
td4.setAttribute('class', 'text-center');


However, this is just as slow as innerHTML even though a number of synthetic benchmarks show otherwise on the web. But the profiler does not lie.

The answer is that we need to create this node just once, and clone copies of it with cloneNode() when we need more.

In fact doing that means we can possibly keep the template inside the easily tweakable HTML, instead of in the above opaque JS code.
Using this approach improves the node creation performance extremely, leaving appendChild() as the main laggard.

There are hordes of JS performance blogs (including one by John Resig) which hint that great speedup can be had by using createDocumentFragment(), appending many nodes to that, and then appending the fragment to the main DOM.

However on both Chrome and IE 11, this approach has worse performance. Browsers have evolved a lot since the days of IE 8 and we're not in Kansas anymore.
The profiler is the ultimate authority, despite any number of 3rd party benchmarks showing the contrary.

After a lot of experimentation, I found that there is probably no straightforward way to speed up appendChild().
Now it's all the browsers fault!

The second bottleneck is jQuery.

jQuery's $(elem) function does a huge amount of work. Whenever you need to do it more than once, it makes sense to use the raw DOM methods instead, even if it involves more verbose code. jQuery makes for compact neat code, but performance is dismal.

$('.someclass') can be terribly slow - it is much faster to iterate over all the elements and check for the class.
The reasoning is that jQuery (or the corresponding DOM method) is going to iterate over some data structure anyway, and do some work with each element so it can give it back to you.
I am not quite sure, but it seems like this is linear on the number of elements - It could conceivably be a log N operation, but who knows?

Instead you can iterate yourself over the elements using elem.children[] and do the work in one shot if the classname matches. Even more so if you need to test for things other than the classname, like attribute values.

$(elem).find('[key=value]') should be avoided like the plague, if elem has hundreds of child elements

The ideal framework to use for performance is Vannilla.js where possible.

The third bottleneck is dealing with DOM element classnames.

  • $(elem).addClass('someclass') is slow
  • elem.className += ' someclass' is not much faster
  • elem.classList.add('someclass') is the fastest method
  • elem.classList[n] != null is the best way to test for a class name, if the position of that classname is fixed
  • elem.classList.has('classname') is the next best

This was used in the table row selection code, and this, along with some other tweaks mentioned below took the speed of selection from 14 seconds for 900 rows on IE 11 to 0.12 seconds - a 100x speedup!

On Chrome it was not as dramatic, but the times dropped out to below whats measurable accurately, but certainly from 120 ms to about 5 ms.

Lesson 5 - Old tricks always work

There are a few simple habits that you develop when writing high performance code in static languages, these actually spill over even to dynamic languages, with greater effect

  • If you use an expression more than once : make it a variable

For example when you see:


You should realize that every [] and . is a potential bottleneck.

var cls = row[i].classList

will be faster

  • Arrays are fast : when in doubt use an array.
  • Cache it if you can : If you have something generated by a large chunk of code, hang on to it for later - its easier to know if an answer changed than recalculate it. Applies very much to dynamic langauges since they are bound by CPU speed rather than memory access speed.
  • Pull invariant code out of loops : The compiler does this for you in static languages. In dynamic languages you need to do it yourself.
  • Avoid dynamic allocation : Hard to tell in a dynamic language when it doesn't happen, but be aware of when it can happen - reuse objects instead of making new ones. Allocation may be very fast, but constrution can be very heavy in a language like JS - This is why jQuery $(elem) is so bad.
  • Optimize the worst case : If the worst case is frequent enough to be noticed, optimize that at the expense of the average or best case. Mike Abrash said the same.
  • Make it look fast, even if it is incurably slow : This is psychology - a console tool that shows progress with a fast \|/-\ spinner appears way faster than something that shows a 0 to 100 progress indicator. Microsoft knows this, which is why progress bars animate as well as move forward.
  • Optimize multiple condition if() statements : When testing for various things with && and ||, try to order the conditions in a way that maximizes the short circuit of the operator.

For example when you have:

if(leastLikely || lessLikely || likely)

if(!likely || !lessLikely || !leastLikely)

The conditions are evaluated maximum number of times, instead the following :

if(likely ||  lessLikely || leastLikely)

if(!leastLikely || !lessLikely || !likely)

will be faster, especially if the three conditions are expressions rather than values, and in any case, due to branch prediction.

All these little tweaks shave off a few dozen milliseconds here and there and it adds up. The trick is to make this simple stuff a reflex.


It seems like the effects of basic optimization are amplified highly when using dynamic languages. You can get away with a lot of inefficiency in static languages - the compiler works really hard to offset your stupidity, but in dynamic languages, it gets thrown back in your face.

Saturday, June 21, 2014

Embracing the hate : python, JS, node.js,

For a few months now, I have been working mostly with web based code, migrating from node.js/coffeescript to django/python, and extending the website.

The abomination that is node.js and the ugliness that is coffeescript needs another post in itself. And the monstrosity that is Bootstrap, I've learned to live with, since I just copied the entire CSS and HTML for the new portal.

The first few weeks I had to add certain features to the existing website, to somehow grok through the coffeescript and understand what does what.

The first stumbling block was the "almost but not quite" javascript syntax of coffeescript. I can understand inventing a new syntax if your programming paradigm is extremely different from existing languages, but a whitespace significant, functional looking style is absolutely mismatched for a very imperative, side effect filled application like a web portal.

The second one was the fact that it's hard to tell in a node application which part executes on the server and which on the client. In making it seem seamless across client and server, you have no idea what's happening at the HTTP level.

All that is history however, since I moved everything to django, but then jade/less came to bite me - all the CSS was in the form of .less and all the HTML in the form of .jade files. I tried many tools to try converting the code, but everything failed. Even the server on which the actual node application ran, I was not able to get the jade/less files convert to HTML/CSS due to various incomprehensible reasons.
Finally I found a windows based tool and managed to batch convert everything.

Now I was all set to tango with django, but after programming in static typed C++ for more than a decade, it seemed like building out of green bamboo after concrete and steel, dealing with JS and python.

The saving grace was that the model is simple and straightforward - you have URLs that map to python functions that render the HTML you need. None of that over-complicated enterprise stuff that you find in high level frameworks.

Database ORM was reasonably reasonable, although getting existing database queries to map to django's model is quite a head scratcher. Thank heavens for stackoverflow!

Intellisense is pathetic for dynamic languages.

I was trained under a regimen of Delphi, Visual Studio and Eclipse and expect code to write 40% of itself. Half the time, if intellisense works right, you don't have to look at the documentation for APIs, otherwise it degenerates into a continuous cycle of looking on google for methods you believe "should" exist and often you may never realize a particular functionality exists.

For e.g. who would think that the method used to remove elements from an array is called splice() in JS? How do you add something to an array, is it push(), append(), add() or what? Intellisense, when it works, tells you what methods an object has. However intellisense is theoretically impossible to do perfectly in a dynamic language.

In some ways JS and python are deviously similar, but it's a trap: 
  • Is it ' '.join['1','2','3'] or is it [1, 2, 3].join(' ') 
  • Is it arr.length or is it len(arr) ?
  • Is it for(i in items) or is it for i in items 
  • Is whitespace significant or not? Surprisingly, in JS whitespace is not significant, however, newlines are significant and a misplaced one can cause syntax errors or even code that doesn't do what it seems to.
It's frustrating and annoying....

Of course some things are cool, like the [:] array slicing syntax in python but then there are numerous annoyances mostly because of the dynamic nature of the language
  • Numbers are floating point in JS, you have to bitwise or it with 0 to get an integer, and this is idiomatic!
  • You misspell a variable being assigned to and it simply refers to a new one. The last time I saw that was in BASIC and a it's a wellspring of errors. Silly mistakes and typos should not require runtime debugging! 
  • Variable scoping is weird in JS
  • Pass arguments by value but pass object arguments by reference - But, strings are objects, yet they act like values. The only thing worse is passing everything by reference like in FORTRAN. 
  • Performance anxiety - you can't really get a sense for how fast something is going to run. It may be super-optimized due to the VM, or it may be horribly inefficient. There's a continuous worry about slowdowns (and it is a valid worry, when we're talking like 5 to 20 times slower than C++). I had no idea that using map() is faster in python than a loop. There would be no valid reason in a static language for something like that - iteration is iteration
The saving grace is that the latest ECMAScript 6 standard brings a lot of progress to JS and makes it possible to use a much more saner subset of the language.

Python 3 promises much but since they break compatibility it looks bleak.

For now I simply follow grumpy cats advice :

Monday, July 23, 2012

Tales from the crypt - Cross linked partition tables

Back in 1999 I happened to own a DX4 100 ( That's a clock tripled 486 processor) based system. I ran a dual boot with Windows 98 and Redhat Linux 6.1 on a small 2 GB hard drive.

The system had 32 MB of RAM and I used about 100 MB of page file for Windows on a separate partition. A huge annoyance was the fact that I needed to also allocate precious space for Linux to swap on. I could have made Linux use the same file as a swap space, but there was some article about how it's better to use a whole partition. Using that partition as swap for Linux would make Windows no longer recognize it as a file system.

So I decided to do something quite dangerous, at the risk of destroying all my data. I decided to create a cross linked partition table.

Using Peter Norton's Diskedit (Wasn't it just amazing how it brought up a graphics mouse pointer while still appearing to be in text mode?) I added a new partition entry, and pointed it to a range that lived within the page file of the Windows installation.

After booting into Linux, I used dd to backup a sector from that new partition (just in case), and then overwrote that with 0s. Reading the page file on the Windows swap partition showed the same block of 0s I had just written.

I added a mkswap and swapon command to the init scripts and and things worked perfectly on Linux.
Windows ran perfectly too, oblivious to the fact that the page file was doing double duty.

A dangerous and crazy trick, but it did the job for a long time, until I got a 4.3 gig drive.

Monday, February 6, 2012

C++ and beyond - The power of abstractions

Modern C++
I've been primarily a C++ programmer, even though I have dabbled with Assembler, C, Object Pascal, Java, C#, and Javascript (in that order, chronologically). Over the years, I learned to adapt my style as much as possible from the "C with classes" style to "C++ with STL" to "Modern C++".
There's a lot of awesome material online from Andrei Alexandrescu, Herb Sutter, Bjarne Stroustrup and Scott Meyers, that mostly popular since the C++0x standard came into the light and after looking at those, my understanding of the C++ toolbox improved quite a bit.

One of the oft repeated tenets in the modern C++ community is that well written generic code using type-rich templates will produce faster code than C style if used properly. I always took this on faith, since the people who claim this (like the above 4 gentlemen) have forgotten more C++ than I ever learned.

Now that I started work on BABEL, I decided to test this out and see with my own eyes, if modern C++ with abstract,type-rich idioms can beat plain old C style code, while remaining extremely high level and readable.

A numerical sequence class
I often deal with problems that involve the use of intervals or ranges... I wanted to create an abstraction that lets me deal with these in a simple manner.

Thus I want a type that holds a range like [begin, end) and define operator overloads and other member functions for slicing, disjunction and so on. One useful feature would be to have it support an iterator interface, so that it plugs into STL numeric algorithms easily.

Here's the basic public interface I ended up with :
template<typename t> struct TNumSeq
    friend inline bool operator<(const TNumSeq &range, const T &val);
    friend inline bool operator>(const TNumSeq &range, const T &val);
    friend inline TNumSeq operator&(const TNumSeq &seq1, const TNumSeq &seq2);
    friend inline TNumSeq operator&(const TNumSeq &seq, T val);
    friend inline TNumSeq operator&(T val, const TNumSeq &seq);

    inline TNumSeq();
    inline TNumSeq(const T &b, const T &e);
    inline explicit TNumSeq(const T &b);
    inline TNumSeq &operator=(const TNumSeq &that);
    inline TNumSeq& operator&=(const TNumSeq &that);
    inline TNumSeq operator&(const TNumSeq &that);

    inline bool has(const T& val) const;
    inline operator bool() const;

    class const_iterator : public std::iterator<std::forward_iterator_tag, T>
        const_iterator(const TNumSeq &seq, const T &inc, const T &val);
        inline T operator*() const;
        inline const_iterator operator++(int);
        inline const_iterator &operator++();
        inline bool operator==(const const_iterator &that) const;
        inline bool operator!=(const const_iterator &that) const;

    const_iterator begin(T inc = 1) const;
    const_iterator end(T inc = 1) const;

It's quite simple, you can define a numeric range, use the < and > operators to check if a value or another range is outside the range on either side, you can use & to get the part of a range that overlaps with another.

Empty ranges work fine, implicit bool conversion(Danger! Will Robinson) checks for emptiness. The has() member gives you membership test. And there is a forward const_iterator that lets this class masquerade as if the sequence were a real set of stored integers. The begin() and end() functions take a stepping increment for iteration by any "jump".

So to test this out... I tested the following code :

int i = 0, j = 1 << 30, sum = 0;
for(int n = i; n < j; n ++)
    sum += n;


int i = 0, j = 1 << 30;
TNumSeq<int> s(i, j);
int sum = std::accumulate(s.begin(), s.end(), 0);

Ignore the fact that the default int will overflow the summation, for the range 0 to 2^30, that's immaterial for this test. One (and certainly, yours truly) would imagine that by no stretch of imagination, the generated code for the second example would be faster that the bare bones loop in the first - Look at all that is happening in the accumulate() call :
  • Two method invocations to get the iterators, which are at least 12 byte objects, have a constructor, and passed by value to accumulate, thus copying them twice each (apparently).
  • A loop inside, which calls the comparison,increment and dereference operators on the iterator object, both of whose implementations contain member access.
Seems like an awfully complicated code path for a very trivial task - but when I timed this I found that the two versions run at the same speed. Go figure!
Here is the exact assembly code generated by the compiler for the two cases :

009A18E0  add         ecx,eax
009A18E2  inc         eax
009A18E3  cmp         eax,40000000h
009A18E8  jl          testNumSeq+60h (9A18E0h)
009A1BE0  add         ecx,eax
009A1BE2  add         eax,esi
009A1BE4  cmp         eax,40000000h
009A1BE9  jne         testNumSeq+360h (9A1BE0h)

And for increment by an arbitrary number :

00BA18E4  add         ecx,eax
00BA18E6  add         eax,2
00BA18E9  cmp         eax,40000000h
00BA18EE  jl          testNumSeq+64h (0BA18E4h)
00BA1BF4  add         ecx,eax
00BA1BF6  add         eax,edx
00BA1BF8  cmp         eax,40000000h
00BA1BFD  jne         testNumSeq+374h (0BA1BF4h)                                                     

No difference except the fact that the for loop uses a constant as the increment, whereas the TNumSeq uses a variable on a register as the increment. How the hell does this happen? It seems like magic, the way that the compiler took away all that 50 odd lines of code in my class and came up with the succinct loop above. The answer is that it is due to being able to express intention more clearly to the compiler. There are many little things that happen, I will detail a few...
  • RVO and move semantics - The iterator pass by value copying is elided by RVO (return value optimization). With C++11, the move semantics feature takes care of most cases of this at the language level, and whenever an object is moved, it actually reuses the storage and state, rather than copy to a new place and destroy the old one. That means, you can treat any old object, even if it is large as if it were like an int and pass it around by value with impunity, knowing that the language/compiler is taking care of efficiency concerns. No overhead induced or change to existing code!
  • const correctness - since the parent sequence of the iterator is stored within the iterator as a const reference, and the iterator dereference operator is const, The compiler knows that the state of the iterator cannot change durin operator* and that the sequence cannot change during the that as well as operator++. This means it can keep those values in registers.
  • template inlining - templates are not code, they are actually code generators with zero overhead. All the above code becomes a single chunk of some meta-C++ which, unlike the C style code, has information alongside about what the code means. The compiler is able to use this extra information to generate the best possible machine code. Trust me, it's way more smarter than you or me, when it comes to this. As an assembly programmer I often observe compiler generated machine code to see if there is any scope of improvement and 9 times out of 10, there is none, unless I radically change the algorithm or use vectorization instructions...
So, in conclusion, we can see that writing idiomatic C++ code in a modern way sacrifices nothing, in performance while still providing high level abstractions. This is the future! To quote Sutter from GoingNative2012 "Performance has always been the king, since the entire history of programming, except the last decade,but now it's 'The Return of the king'

More on this as i progress with BABEL

Sunday, February 5, 2012

The library of babel

If you haven't heard of that library, I recommend that you go and read the short story by that name ASAP, pronto, on the double, etc. etc.

This post however is one of a series, where I will document the creation of my own utility library in C++.


One of the biggest and most important steps in creating software, whether a product or a framework is to name it properly, such that it makes some sense. A recursive acronym as the name is even better.

Since this library will contain everything (in terms of code that I ever needed in more than one project), I call it babel - To expand the recursive acronym - Babel Attempts Being Everything Literally
I don't intend anyone except me to use it, it's my little toolbox to do things the way I want them done.

It will be a heterogeneous collection of classes and modules which will include stuff extending the STL, wrappers and helpers for OpenGL, CUDA, COM, Directshow and DES.
Stuff for Data compression, Image and text handling, machine learning and so on.

Whatever I feel I want to include! Anything and Everything.


Why one more library? Especially since there is Loki and Boost and QT and what not, written by far more smarter programmers?

For several reasons :
  • Writing a high quality library in C++ is a great exercise to improve coding and abstraction skills. Especially since we have C++11 now and the game is changing fast.
  • If an existing powerful library like boost does not do exactly what I wish, it is probably beyond my capability to modify it to do so. And much less fun studying and meddling with existing code, than writing it from scratch.
  • There are some things --some trivial and some not-- that BABEL is planned to implement that no one anywhere else seems to have implemented. BABEL will be like my own personal workshop - the set of things it will do is probably not present in any one single library. And piecing together parts from different libraries in a program seems very inelegant sometimes.


Essentially with MS Visual Studio 2010, using as many C++0x/11 features that I can use to make the code modern.
The NVCC CUDA compiler also figures in the mix.

My process for developing features, is to look at a very common task that I often code in my projects, and make an abstraction for that.
Then I look at how to generalize that and match it to the other parts that already exist.
I keep repeating this process, sometimes rewriting and extending older code to match the newer stuff better.
Once in a while I refactor, and try to see how each class or module could utilize the others.


I started building parts of this about a year ago, and there are numerous fragments of useful code that I have in diverse locations, from several years back, which I will slowly assimilate into this.

The idea is after a point, there is such a good toolkit that my projects become a snap to write and I have more time to extend the library.


I'll eventually put up a git repository on github, hopefully at least parts of it will be of interest to others.
I will post about the design and implementation of various parts of it on this blog, as I make progress...

Saturday, February 4, 2012

Big end failure

Recently I did a project - which involved debugging a mixed .NET/C++ application that was failing on a 64 bit build. Now this was a complicated non-trivial app, that used the Microsoft DirectShow multimedia framework, and one of the reasons that I was hired, was because I have pretty decent experience with this.

My first line of attack was obviously to check the directshow stuff and while the code wasn't really to my taste, It seemed to work in 32 bit just fine. The 64 bit build just refused to work. I immediately jumped to the conclusion that there was something to do with the interaction of a 64 bit process and 32 bit filters. I checked whether a DShow filter works out of the box after rebuilding with 64 bit, and it did. So I had nothing.

Then I had no choice but to run both versions in separate debugger instances simultaneously, stepping to the point where something differed. The debugging was made quite annoying by the fact that for some reason, the 32 bit app would throw an access violation everytime I single stepped, perhaps the environment was broken(I was doing all this over a remote connection), perhaps attaching to a .NET app calling a C++ dll was messing up the debugger. The only way I could step was to add new breakpoints and resume the execution each time! It finally led to a decryption function, and there was a lot of bit twiddling going on and some typedef'd types. Once again I thought I'd found the issue - I made sure every instance of int in that code was replaced with _int32 so that the code would not be wrong in assuming every int was 32 bits. But the problem still remained.

Once again I did the side-by-side debugging and painstakingly expanded the macros in the crytpo code to compare intermediate values of expressions, and suddenly the light shone through - A macro that returned the last byte of a DWORD gave different results on x86 and x64. In other words, the code was using the big-endian version of things - Looking through a few headers, it became obvious. The code comes from a very old C program, written in the 90s : Though it checked for various architectures, it somehow assumed that anything not x86 was a big-endian platform (which may have been true for the platforms it had been intended for) - However this made it also assume that x64 was a big-endian, and so we had our mess-up.
So in the end it was just a line or two of code actually changed, but who would have "thunk it" eh?

PS : The title of this post is a pun!

Monday, December 12, 2011

The third circle

In my last post, I had mentioned circle drawing algorithms and how I'd rediscovered one after a couple of decades. "Why rasterize circles in this era of GPUs and 3d accelerators?", the gentle reader may ask...
Simply because :
  • It's always fun to re-visit fundamentals.
  • One may wish to merely iterate a point over a circular path and a GPU can't help you there.
  • One may have never heard or thought of this method, so it's fun!
Let's look at the ways you can draw circles on a raster array.

The first method
The math here is the simplest...Which is why it made most sense to me when I was 9 years old.
A circle has the equation : x^2 + y^2 = r^2 - So, if you iterate x over the range 0 to r, you can solve for y at each x.
Drawing lines between these points sequentially gives you an arc that's quarter of a circle. One can then mirror these points about the origin, both vertically and horizontally to get the other three quarters of the pi(e).

Code follows...
procedure TForm1.DrawCircle1(cx, cy, r: Integer);
  x, y, rSquared : Integer;
  oldy, oldx : integer;
  PaintBox1.Canvas.MoveTo(cx, cy + r);
  rSquared := r * r;

  oldx := 0;
  oldy := r;

  for x := 0 to r do with PaintBox1.Canvas do
    y := Round(Sqrt(rSquared - (x * x)));

    MoveTo(cx - oldx, cy - oldy);
    LineTo(cx - x, cy - y);

    MoveTo(cx - oldx, cy + oldy);
    LineTo(cx - x, cy + y);

    MoveTo(cx + oldx, cy - oldy);
    LineTo(cx + x, cy - y);

    MoveTo(cx + oldx, cy + oldy);
    LineTo(cx + x, cy + y);

    oldx := x;
    oldy := y;

Why Delphi?
Because it is based on Object Pascal, the simplest, type safe, statically compiled language that I know of - It has all the low level power of C, and most of the convenient features of OO, decent type safe generics, more akin to C++ templates than those of Java or C#, plus a superb IDE and RAD designer. It takes very little time to create fully working native GUI applications with it, which are very easily modified.

The language is verbose, and sometimes it requires a lot more code to do some stuff than in C++ - but it is very readable - besides, Wirth is my all time favorite computer science chap, and so I've a soft spot for Pascal based languages.

So in the code, we iterate x across the range 0..r because we can compute one quarter of the circle, and draw it mirrored across the X and Y axes.

Actually we can even compute just 1/8th of the circle and generate the other 7 sections by using all combinations of mirroring and swapping X and Y values.

Note that since we iterate over X linearly, The exact shape drawn is not a perfect circle - At the points where Y gets incremented a lot of a change of 1 in X, there are quite straight lines. Doing it by mirroring four or eight pieces does it much better, since a quarter or eighth of 90 degrees is not a big angle and the maximum dx/dy for that range is less than 1

Method 2
I know, all you academic minded ones were dying to read this one.
x = r cos Q and y = r sin Q

Not much to say about this - When I first learned of sines and cosines, I was like - "Ah OK, it's just a look-up table of magic values for the ratios of x and y with h for every angle" - And for the most part it is....
Inside your FPU is a lookup table and some interpolation magic to give you sin() and cos()

procedure TForm1.DrawCircle2(cx, cy, r: Integer);
  angle, angleInc, angleMax : Extended;
  x, y : Extended;
  PaintBox1.Canvas.MoveTo(cx + r, cy);

  angleinc := 1 / r;
  angle := 0;
  angleMax := (2 * 3.14159265) + angleinc;

  while angle <= angleMax do
    x := r * Cos(angle);
    y := r * Sin(angle);
    PaintBox1.Canvas.LineTo(Round(x) + cx, Round(y) + cy);
    angle := angle + angleinc;

We iterate over the 360 degrees, but we do not choose an arbitrary increment. Too small an increment and we redraw the same pixel, too big and our circle becomes a polygon.

Since the perimeter of a circle is 2 * PI * r and we want that many pixels drawn at most and at least, the angle we increment by is (2 * PI) / (2 * PI * r) which is 1/r

This means the target pixel drawn at each iteration is one pixel distance away from the last one. This will certainly result in some overdraw, but it is a reasonable value to use. Any more and we will get gaps where the small line segments that form our circle are parallel to the X and Y axes.

Method 3
This is a surprisingly ingenious way to avoid trigonometric functions in the loop. Back in the old days, such operations were really heavy, and I dare say this method would have been the fastest. In my test code, it just about beats the second method, but the margin is too small to be of much significance. The reason is that in this day and age, sin() and cos() are not so expensive after all, they are probably single cycle instructions on all current processors.

procedure TForm1.DrawCircle3(cx, cy, r: Integer);
  angle, angleInc : Extended;
  i : integer;
  x, y, sina, sinb, cosa, cosb, sinab, cosab : Extended;
  PaintBox1.Canvas.MoveTo(cx + r, cy);

  angleinc := 360 / r;
  sinb := Sin(angleinc / (180.0 / 3.14159));
  cosb := Cos(angleinc / (180.0 / 3.14159));

  sina := 0;
  cosa := 1;

  i := Round(360 / angleinc);
  while i > 0 do
    sinab := sina * cosb + cosa * sinb;
    cosab := cosa * cosb - sina * sinb;
    x := r * cosab;
    y := r * sinab;
    sina := sinab;
    cosa := cosab;
    PaintBox1.Canvas.LineTo(Round(x) + cx, Round(y) + cy);

Once again, we use the sine and cosine relations while iterating over all the angles, however, a simple trigonometric identity lets us avoid calling sin() and cos() more than once.
sin(a + b) = sin(a) * cos(b) + cos(a) * sin(b)
cos(a + b) = cos(a) * cos(b) - sin(a) * sin(b)
If A be our initial angle namely 0, and B be the increment, we can get sin(A + B) and cos(A + B) quite simply. After that A gets the new incremented angle and we can repeat the process.

We ended up replacing the sin() and cos() function calls with four multiplications and an addition and a subtraction.

In a static compiled or JIT compiled language sin() and cos() go down as single FPU instructions, and this doesn't make much of a difference in speed - However, in an interpreted language, I believe a function call will occur, and might end up being hundreds of times more expensive than the arithmetic.

It might even be possible to write this code using integer arithmetic alone, which you can't do with sin() and cos() - This may be useful in some embedded systems.

There is something quite similar (or is it exactly same?) on HAKMEM item 149 to 152, but that doesn't give a universal circle drawing algorithm.

One thing to bear in mind is that errors might get accumulated in this method, since it is incremental in nature, but we're working with integer co-ordinate output, so I doubt if you would get enough error to plot an incorrect pixel.

So that was the mystery of the third circle!