Saturday, October 3, 2009

An Idiosyncratic Analogical Overview of Some Programming Languages from an Evolutionary Biologist's Perspective

R

R is like a microwave oven. It is capable of handling a wide range of pre-packaged tasks, but can be frustrating or inappropriate when trying to do even simple things that are outside of its (admittedly vast) library of functions. Ever tried to make toast in a microwave? There has been a push to start using R for simulations and phylogenetic analysis, and I am actually rather ambivalent about this. On the one hand, I would much rather an open source platform like R be used than some proprietary commercial platform such as Mathematica or Matlab. On the other hand, I do not think that R is the most suitable choice for the full spectrum of these applications. There are some serious limitations on its capability and performance when handling large datasets (mainly for historical design reasons), and, to be frank, I find many aspects of its syntax and idiom quite irritating. I primarily use it for what it was designed for: numerical and statistical computing, as well as data visualization and plotting, and in this context I am quite happy with it. For almost anything else, I look elsewhere. R has an excellent and very friendly community of people actively developing and supporting it, and I might change my view as R evolves. But, as things stand, it simply is not my first choice for the majority of my programming needs.

Python

If R is like a microwave oven, then Python is a full-fledged modern kitchen. You can produce almost anything you want, from toast to multi-course banquets, but you probably need to do some extra work relative to R. With R, if you want an apple pie, all you need to do is pick up a frozen one from a supermarket and heat it up, and you'll be feasting in a matter of minutes. With Python, you will need to knead your own dough, cook your own apple filling, etc., but you will get your apple pie, and, to be honest, programming in Python is such a pleasure that you will probably enjoy the extra work. And what happens if you instead want a pie with strawberries and dark chocolate and a hint of chili in the filling? With R, if you cannot find an appropriate instant pie in your supermarket, you are out of luck, or you might be in for a very painful adventure in trying to mash together some chimeric concoction that will look horrible and taste worse. But with Python, any pie you can dream of is completely within reach, and probably will not take too much more effort than plain old apple pie. From data wrangling and manipulation (prepping data files, converting file formats, relabeling sequences, etc.) to pipelining workflows, and even to many types of analyses and simulations, Python is the ideal choice for the majority of tasks that an evolutionary biologist carries out.

C++

If R is like a microwave, and Python is like a modern kitchen, then C++ is like an antique Victorian kitchen in the countryside, far, far, far away from any supermarket. You want an apple pie? With C++, you can consider yourself lucky if you do not actually have to grow the apples and harvest the wheat yourself. You will certainly need to mill and sift the flour, churn the butter, pluck the apples, grind the spices, etc. And none of this "set the oven to 400°" business: you will need to chop up the wood for the fuel, start the fire and keep tending the heat while it is baking. You will eventually get the apple pie, but you are going to have to work very, very, very hard to get it, and most of it will be tedious and painful work. And you will probably have at least one or two bugs in the pie when all is done. More likely than not, memory bugs ...

Stepping out of the cooking/kitchen analogy, if I had to point out the single aspect of programming in C++ that makes it such a pain, I would say "memory management". Much of the time programming in C++ is spent coding up the overhead involved in managing memory (in the broadest sense, from declaration, definition and initialization of stack variables, to allocation and deallocation of heap variables, to tracking and maintaining both types through their lifecycles), and even more is spent in tracking down bugs caused by memory leaks. The Standard Template Library certainly helps, and I've come to find it indispensable, but it still exacts its own cost in terms of overhead and its own price in terms of chasing down memory leaks and bugs.
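As a rough illustration of that bookkeeping, here is a minimal sketch that builds the same list of numbers twice: once as a raw heap array that the caller has to remember to free, and once as a std::vector that cleans up after itself. The function names (make_squares_raw, make_squares_stl, sum_raw, sum_stl) are made up for this example and do not come from any real project.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: allocates on the heap and hands ownership to the caller.
long* make_squares_raw(std::size_t n) {
    long* values = new long[n];                  // manual allocation
    for (std::size_t i = 0; i < n; ++i) {
        values[i] = static_cast<long>(i * i);
    }
    return values;                               // caller must delete[] this
}

// Hypothetical helper: the vector owns its storage and frees it automatically.
std::vector<long> make_squares_stl(std::size_t n) {
    std::vector<long> values;
    values.reserve(n);                           // one allocation up front
    for (std::size_t i = 0; i < n; ++i) {
        values.push_back(static_cast<long>(i * i));
    }
    return values;
}

long sum_raw(std::size_t n) {
    long* values = make_squares_raw(n);
    long total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += values[i];
    }
    delete[] values;                             // leave this out and the memory leaks
    return total;
}

long sum_stl(std::size_t n) {
    const std::vector<long> values = make_squares_stl(n);
    long total = 0;
    for (std::vector<long>::const_iterator i = values.begin();
            i != values.end();
            ++i) {
        total += *i;
    }
    return total;                                // storage is released when values goes out of scope
}
```

Both versions compute the same sum. The difference is that the raw version depends on every caller remembering the matching delete[], including on early returns and error paths.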

For an example of the overhead, compare looping through elements of a container in Python:

for i in data:
    print(i)

vs. C++ using the Standard Template Library:

for (std::vector<long>::const_iterator i = data.begin();
        i != data.end();      
        ++i) {
    std::cout << *i << std::endl;
}

And for an example of an insidious memory bug even with the Standard Template Library, consider this: what might happen sometimes to a pointer that you have been keeping to an element in a vector, when some part of your code appends a new element to the vector? It can be quite ugly.
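The answer to that question: when a vector outgrows its capacity, it moves its elements to a new block of memory, and every pointer, reference and iterator into the old block is left dangling. The sketch below is one way to observe this without ever touching a stale pointer. The two helper functions, append_moves_storage and element_after_append, are hypothetical names invented for this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: appends `value` and reports whether the vector's
// storage moved. If it did, any pointer, reference or iterator obtained
// before the call now dangles.
bool append_moves_storage(std::vector<long>& data, long value) {
    // Keep the old address as an integer, so no stale pointer is ever used.
    // (data() requires C++11.)
    const std::uintptr_t before = reinterpret_cast<std::uintptr_t>(data.data());
    data.push_back(value);
    const std::uintptr_t after = reinterpret_cast<std::uintptr_t>(data.data());
    return before != after;
}

// Hypothetical helper showing one safer habit: remember the position of an
// element rather than its address. An index survives reallocation.
long element_after_append(std::vector<long>& data, std::size_t index, long value) {
    data.push_back(value);
    return data[index];
}
```

Calling append_moves_storage on a vector whose size() equals its capacity() returns true, while calling it after a reserve() that leaves spare room returns false. That is why this kind of bug only shows up "sometimes": the same line of code is harmless until the one append that happens to trigger a reallocation.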

So what does all that extra work and pain get you?

Performance.

When it comes to performance, C++ rocks. My initial attempt at a complex phylogeography simulator was in Python. It took me a week to get it working. I could manage population sizes of about 1000 on a 1 GB machine, and it could complete 10000 generations in about a week. I rewrote it in C++. Instead of a week, it took me two and a half months to get it to the same level of complexity. When completed, however, it could manage population sizes of over 1 million on a 1 GB machine, and run 2.5 million generations in 24 hours.

After that experience, when I am thinking of coding up something that might be computationally intensive or push the memory limits of my machines, the language that comes to mind is C++. More likely than not, however, I would probably still try to code up the initial solution in Python, and only turn to C++ when it becomes clear that Python's performance is not up to the task.

Java

Java, like Python, is a modern kitchen, allowing for a full range of operations with all the sanity-preserving conveniences and facilities (e.g., garbage-collection/memory-management). But it is a large, industrial kitchen, with an executive chef, sous chefs, and a full team of chefs de partie running things. And so, while you can do everything from making toast to multi-course meals, even the simplest tasks take a certain minimal investment of organization and overhead. At the end of the day, for many simpler things, such as scrambled eggs and toast, you would get just as good results a lot quicker using Python.

I think that Java is a really nice language, and I do like its idioms and syntax, which, by design, enforce many good programming practices. It is also probably much more suited for enterprise-level application development than Python. But I find it very top-heavy for the vast majority of things that I do, and the extra investment in programming overhead that it imposes (think: getters and setters) does not buy me any performance benefit at all. As a result, I have not found a need to code a single application in Java since I started using Python several years ago.


6 comments:

  1. Thanks for the analogy. You motivated me to start back up on my Python lessons. I must say, having started to learn scripting in R and VBA, Python sometimes seems funky to me. But this is almost certainly a residue from my starting point, and no fault of Python's. Drop me a line when all your pies are ready...

  2. Jeet,

    Thanks for your perspective on these languages. I currently use Unix for a lot of my basic scripting needs, but Python would probably be much more efficient. Do you know of a good "Intro to Python" book for someone already familiar with other languages?

    Jeremy

  3. Jeremy,

    To be honest, I think that if you have some sort of programming background, you do not really need a book to get started. Between the on-line documentation and basic examples/tutorials scattered around the web, you can be off and running within an hour. If I recall correctly, it took me about an afternoon to get my head around the basics of Python and start cranking out useful scripts (as opposed to the "Hello, World" variety).

    However, it took me quite a bit longer to actually start to pick up the Zen of Python, i.e. learn the Pythonique way of doing things and lose all the habits picked up from my previous history programming in Java and C#. The Pythonique way is usually not only more elegant and "poetic", but often is more efficient, robust and faster. Now, after 4-5 years of programming in Python, I am still learning new idioms and aspects of the language that surprise and delight.

    To this end, i.e. learning the Pythonique way of doing things, I would recommend two books: Mark Pilgrim's "Dive into Python" (http://diveintopython.org/) and Bruce Eckel's "Thinking in Python" (http://www.mindview.net/Books/TIPython). I think that people without programming experience might find these a little challenging, but those with even moderate experience in programming will find them very useful.

  4. p.s. regarding Python being more efficient than the UNIX tools such as sed, awk, grep, etc. ... I'm not so sure of that. Measured in terms of computational efficiency, for a lot of things that *can* be done with sed/awk/grep, these would probably be on par with Python, and, perhaps, in some cases, even better.

    However, Python is a lot more pleasant and easier to work with, especially as the complexity scales up, and so maybe measured in programmer efficiency, Python would be better.

  5. The limiting factor for many of my scripting tasks is the time it takes me to write the script, so I think that improving programming efficiency (as opposed to computational efficiency) is probably well worth the switch.

  6. Nice blog! It seems like you are the perfect profile for someone to test and help advance pypy. It probably won't get you C++ speed, but my observation is that it's making good progress.
